Database Document: PDB-REPRDB V6.0

PDB-REPRDB
DATABASE OF REPRESENTATIVE PROTEIN CHAINS IN PDB (PROTEIN DATA BANK)

Tamotsu NOGUCHI, Kentaro ONIZUKA, Yutaka AKIYAMA, and Minoru SAITO
Parallel Application Laboratory, Tsukuba Research Center, Real World Computing Partnership

                       Version 6.0 (PDB Rel. #83)   May 1998
                     (This document was updated on 20 May 1998)

Reference :
Noguchi T., Onizuka K., Akiyama Y., Saito M. (1997).
"PDB-REPRDB: A Database of Representative Protein Chains in PDB (Protein Data Bank)".
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA.

1. INTRODUCTION

The PDB-REPRDB consists of a list of representative protein chains. The criteria for selecting the representatives are:

1) quality of the atomic coordinate data,
2) sequence uniqueness, and
3) conformation uniqueness, particularly local conformation uniqueness.

The first version of PDB-REPRDB consisted of 763 representative chains from the PDB Release 70 (Oct. 1994) at Brookhaven National Laboratory and was released on the GenomeNet WWW server (http://www.genome.ad.jp/htbin/show_pdbreprdb) in July 1995.

The second version (PDB-REPRDB Ver. 2.0) used data from the PDB Release 78 (Oct. 1996) and was released in April 1997 on our server (http://mpap1.trc.rwcp.or.jp/pdbreprdb).

The third version (PDB-REPRDB Ver. 3.0) used data from the PDB Release 80 (Apr. 1997) and has been available on the server (http://pdap1.trc.rwcp.or.jp/pdbreprdb).

The fourth version (PDB-REPRDB Ver. 4.0) uses data from PDB Release 81 (Jul. 1997) and includes NMR structures.

In this version, the selection criteria remain essentially the same as in the first one, but the selection procedure has been almost completely automated by the use of a parallelized algorithm for quick selection of representative chains.

2. COPYRIGHT NOTICE

Copyright 1998, 1999 by Real World Computing Partnership (RWCP). All rights reserved.
For further information regarding permission for use or reproduction, please contact Tamotsu Noguchi, Real World Computing Partnership, Parallel Application Laboratory, Tsukuba Research Center, Tsukuba Mitsui Bldg., 1-6-1 Takezono, Tsukuba 305, Japan. TEL: 81(298)53-1707, FAX: 81(298)53-1680, E-mail: noguchi@trc.rwcp.or.jp.

3. METHOD

The representative protein chains are selected as follows.

(1)

Exclude the following entries from the selection:

(a): DNA and RNA data
(b): theoretically modeled data
(c): short chains (l < 40 residues)
(d): data with incomplete backbone coordinates
(e): data with incomplete side chain coordinates
(f): data without refinement (by X-PLOR, TNT, etc.)
*): Data from NMR spectroscopy have been included since version 4.0.
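
The exclusion rules (a) to (f) can be sketched as a simple filter. This is only an illustration: the `ChainInfo` record and its field names are hypothetical, and how "incomplete" coordinates or "refinement" are actually detected in a PDB entry is not specified in this document, so they are represented here as precomputed flags.

```python
from dataclasses import dataclass

MIN_LENGTH = 40  # rule (c): chains with fewer residues are excluded


@dataclass
class ChainInfo:
    """Hypothetical per-chain summary; all field names are illustrative."""
    entry_id: str
    chain_id: str
    is_nucleic_acid: bool        # rule (a): DNA or RNA data
    is_theoretical_model: bool   # rule (b): theoretically modeled data
    length: int                  # rule (c): number of residues
    backbone_complete: bool      # rule (d)
    side_chains_complete: bool   # rule (e)
    is_refined: bool             # rule (f): refined by X-PLOR, TNT, etc.


def is_candidate(chain: ChainInfo) -> bool:
    """Return True if the chain survives exclusion rules (a)-(f)."""
    return not (
        chain.is_nucleic_acid
        or chain.is_theoretical_model
        or chain.length < MIN_LENGTH
        or not chain.backbone_complete
        or not chain.side_chains_complete
        or not chain.is_refined
    )


def filter_candidates(chains):
    """Keep only the chains that pass every exclusion rule."""
    return [c for c in chains if is_candidate(c)]
```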

(2)

The selected chains in the entries are sorted according to their data quality as follows:
First, the selected chains are classified into three classes. Class A chains are those with both good resolution (<= 3.0 angstrom) and good R-Factor (<= 0.3). Class B chains are those with resolution (>= 3.0 angstrom) and R-Factor (>= 0.3). The chains derived by NMR spectroscopy are classified into class C.
Second, we sort the chains within each of classes A and B by the resolution of the structure determination, and then append the class C chains.
Chains with the same resolution are further sorted by R-Factor value. When several chains have the same resolution and R-Factor, they are further sorted by:

(a): the number of chain breaks (the fewer the better)
(b): the number of non-standard amino acid residues (the fewer the better)
(c): the number of residues without backbone coordinates (the fewer the better)
(d): the number of residues without side chain coordinates (the fewer the better)
(e): whether mutant or wild type (the wild type has priority)
(f): whether complex or not (the non-complex has priority)
(g): alphabetical order of the entry name (e.g. 1MCD < 1MCE, 5AT1A < 5AT1C)
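
The ordering described in this step can be expressed as a single sort key. The sketch below is not the program used to build the database. The `ChainQuality` record, its field names and the "NMR" method label are hypothetical. It also assumes that every non-NMR chain outside class A falls into class B, that a missing resolution or R-Factor counts as the worst possible value, and that class C chains are ordered among themselves by criteria (a) to (g) only.

```python
from dataclasses import dataclass
from typing import Optional

INF = float("inf")


@dataclass
class ChainQuality:
    """Hypothetical quality summary of one chain; names are illustrative."""
    name: str                      # entry name plus chain ID, e.g. "5AT1A"
    method: str                    # "NMR" marks class C (label is assumed)
    resolution: Optional[float]    # in angstrom, None if unknown
    r_factor: Optional[float]      # None if unknown
    n_breaks: int = 0
    n_nonstandard: int = 0
    n_missing_backbone: int = 0
    n_missing_side_chain: int = 0
    is_mutant: bool = False
    is_complex: bool = False


def quality_class(c: ChainQuality) -> int:
    """Return 0 for class A, 1 for class B, 2 for class C."""
    if c.method == "NMR":
        return 2
    res = INF if c.resolution is None else c.resolution
    rfac = INF if c.r_factor is None else c.r_factor
    if res <= 3.0 and rfac <= 0.3:
        return 0
    return 1  # assumption: all remaining non-NMR chains are class B


def sort_key(c: ChainQuality):
    cls = quality_class(c)
    # Resolution and R-Factor only order classes A and B.
    res = 0.0 if cls == 2 or c.resolution is None and False else c.resolution
    if cls == 2:
        res, rfac = 0.0, 0.0
    else:
        res = INF if c.resolution is None else c.resolution
        rfac = INF if c.r_factor is None else c.r_factor
    return (
        cls, res, rfac,
        c.n_breaks,              # (a)
        c.n_nonstandard,         # (b)
        c.n_missing_backbone,    # (c)
        c.n_missing_side_chain,  # (d)
        c.is_mutant,             # (e) False (wild type) sorts first
        c.is_complex,            # (f) False (non-complex) sorts first
        c.name,                  # (g) alphabetical order
    )


def sort_by_quality(chains):
    """Return the chains ordered from best to worst data quality."""
    return sorted(chains, key=sort_key)
```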

(3)

Similar chains are eliminated from the sorted list as follows:
The first chain in the sorted list has the best quality and is the first representative. Then, suppose we have already selected N representatives from the sorted list. At this point, the sorted list no longer contains the chains

1): that have already been selected and taken out as representatives,
2): that have already been eliminated through the selection procedure of the N representatives.

Thus, the first chain remaining in the sorted list has the highest priority to be the next representative. We check the "similarity" between the first chain of the list and each of the other chains remaining in the sorted list. The first chain then becomes the (N+1)-st representative and is taken out of the list, and every chain that is similar to it is eliminated from the list. The next remaining chain then moves to the head of the sorted list. This procedure repeats until the sorted list is empty.
We consider two chains NOT similar in either of the following cases:

(a): if the sequence identity is less than a certain threshold value (<= 95%, 85%, 75%, 65%, 55%, 45%, 35% or 25%), where the sequence identity is measured after a pairwise sequence alignment, or
(b): if the maximum distance between the "CA" atoms when the two structures are superimposed is greater than a certain threshold (>= 10, 20, 30, 40, 50 angstrom or infinity).
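
The two conditions above combine into one predicate, sketched here under stated assumptions. The function names are hypothetical. The document does not say how the percentage is normalized, so `sequence_identity` assumes identity is counted over aligned columns in which neither sequence has a gap. The alignment itself and the superposition that yields the maximum CA distance are taken as given inputs.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns without a gap in either row.

    Assumption: the normalization (non-gap columns) is illustrative only.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matched = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        matched += 1
        if a == b:
            identical += 1
    return 100.0 * identical / matched if matched else 0.0


def is_similar(identity_pct: float, max_ca_distance: float,
               id_threshold: float, dmax_threshold: float) -> bool:
    """Two chains are NOT similar if (a) or (b) holds; otherwise similar."""
    not_similar = (
        identity_pct < id_threshold          # condition (a)
        or max_ca_distance > dmax_threshold  # condition (b)
    )
    return not not_similar
```

With `dmax_threshold=float("inf")`, condition (b) can never hold, so only the sequence identity decides, which matches the "infinity" setting in the list of thresholds.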

Before superimposing the two structures, we align the two sequences by pairwise sequence alignment to check their sequence identity. The matched residues in the alignment are superimposed by the least-squares fitting procedure. Finally, all the chains (in classes A, B and C) are classified into protein-chain groups, where each chain is assigned to the group whose representative chain is nearest to it in sequence.
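
The elimination loop of step (3) and the final grouping can be sketched as follows. This is a serial illustration, not the parallelized program used to build the database. The function names are hypothetical, the `similar` and `identity` callables stand in for the alignment and superposition steps described above, and "nearest in sequence" is assumed to mean highest sequence identity, with ties going to the earlier representative.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def select_representatives(sorted_chains: Sequence[T],
                           similar: Callable[[T, T], bool]) -> List[T]:
    """Greedy selection over a list already sorted by data quality."""
    remaining = list(sorted_chains)
    representatives: List[T] = []
    while remaining:
        head = remaining.pop(0)       # best remaining chain
        representatives.append(head)  # becomes the next representative
        # Eliminate every remaining chain that is similar to the new one.
        remaining = [c for c in remaining if not similar(head, c)]
    return representatives


def assign_groups(chains: Sequence[T], representatives: Sequence[T],
                  identity: Callable[[T, T], float]) -> Dict[int, List[T]]:
    """Put each chain in the group of its nearest representative.

    Groups are keyed by the index of the representative.
    """
    groups: Dict[int, List[T]] = {i: [] for i in range(len(representatives))}
    for chain in chains:
        best = max(range(len(representatives)),
                   key=lambda i: identity(representatives[i], chain))
        groups[best].append(chain)
    return groups
```

Because the input is sorted by quality, each representative is always the best chain of its neighborhood, and no two representatives are similar to each other under the chosen thresholds.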

(Since R-factor and resolution values are not written in a fixed format in PDB files, the program that reads the PDB files may in some cases find an incorrect value, or no value at all, for the R-factor and/or resolution.)
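
To illustrate why such values can be missed or misread, here is a hedged sketch of a pattern-based reader. It is not the program used for PDB-REPRDB. The function name and both regular expressions are assumptions: the first expects a "REMARK 2 RESOLUTION." line followed by a number, and the second takes the first decimal number after the words "R VALUE" or "R FACTOR" in a REMARK 3 line. Any entry that words these remarks differently yields `None`, and a REMARK 3 line that mentions another R value first would be picked up instead.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed patterns; real entries may phrase these remarks differently.
_RESOLUTION = re.compile(r"^REMARK\s+2\s+RESOLUTION\.\s+([0-9]+\.?[0-9]*)")
_R_VALUE = re.compile(
    r"^REMARK\s+3\b.*?\bR[- ]?(?:VALUE|FACTOR)\b[^0-9]*([0-9]*\.[0-9]+)"
)


def read_quality(lines: Iterable[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (resolution, r_value); either may be None if not found."""
    resolution: Optional[float] = None
    r_value: Optional[float] = None
    for line in lines:
        if resolution is None:
            m = _RESOLUTION.match(line)
            if m:
                resolution = float(m.group(1))
        if r_value is None:
            m = _R_VALUE.match(line)
            if m:
                r_value = float(m.group(1))
    return resolution, r_value
```

A caller would then decide how to rank chains whose values come back as `None`, for example by treating them as the worst possible value.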

4. FORMAT OF THE DATABASE

The database of representative protein chains in PDB, selected by the above method, is presented as a Table of PDB-REPRDB Version 3.0, which shows the number of selected chains at each threshold of sequence identity (ID%) and of maximum distance between corresponding atom pairs of the two superimposed structures (Dmax).

The values in this table are hyperlinked to the list of PDB-REPRDB at the corresponding threshold of sequence identity (ID%) and structure similarity (Dmax).

The list contains PDB entry IDs, chain IDs, "*", number of amino acids (naa), resolution (Res), R-factor (Rfac), experimental method (Methd), the number of residues with side chain coordinates (n_sid), the number of residues with backbone coordinates (n_bck), the number of residues with CA coordinates (n_ca), the number of non-standard amino acid residues (n_naa), EC number and header lines in PDB.

The IDs of the protein chains (PDB entry IDs and chain IDs) in the list are arranged alphabetically and hyperlinked to the list of similar proteins that were not selected as representative chains. Clicking the "*" displays the protein 3D view with "RasMol".

In addition, the EC number in the list is hyperlinked to the corresponding LIGAND entry.

5. ACKNOWLEDGMENTS

We thank Dr. Susumu Goto and Prof. Minoru Kanehisa at the Institute for Chemical Research, Kyoto University, for useful discussions and suggestions.