PDB-REPRDB
DATABASE OF REPRESENTATIVE PROTEIN CHAINS IN PDB (PROTEIN DATA BANK)
Tamotsu NOGUCHI, Kentaro ONIZUKA, Yutaka AKIYAMA, and Minoru SAITO
Parallel Application Laboratory, Tsukuba Research Center
Real World Computing Partnership
Version 6.0 (PDB Rel. #83) May 1998
( This document was updated on 20 May 1998 )
Reference :
Noguchi T., Onizuka K., Akiyama Y., Saito M. (1997).
"PDB-REPRDB: A Database of Representative Protein Chains in PDB (Protein Data
Bank)".
In Proceedings of the Fifth International Conference on Intelligent Systems
for Molecular Biology, AAAI Press, Menlo Park, CA.
1. INTRODUCTION
The PDB-REPRDB consists of a list of representative protein chains. The
criteria for selecting the representatives are:
  1) quality of the atomic coordinate data,
  2) sequence uniqueness, and
  3) conformation uniqueness, particularly local conformation uniqueness.
The first version of PDB-REPRDB consisted of 763 representative chains from
PDB Release 70 (Oct. 1994) at Brookhaven National Laboratory and was released
on the GenomeNet WWW server (http://www.genome.ad.jp/htbin/show_pdbreprdb) in
July 1995.
The second version (PDB-REPRDB Ver. 2.0) used data from the PDB Release 78
(Oct. 1996) and
was released in April 1997 on our server
(http://mpap1.trc.rwcp.or.jp/pdbreprdb).
The third version (PDB-REPRDB Ver. 3.0) used data from the PDB Release 80
(Apr. 1997) and has been available on the server
(http://pdap1.trc.rwcp.or.jp/pdbreprdb).
The fourth version (PDB-REPRDB Ver. 4.0) uses data from PDB Release 81 (Jul.
1997) and includes NMR structures.
In this version, the selection criteria remain essentially the same as in
the first one, but the selection procedure has been almost completely
automated by means of a parallelized algorithm for quick selection of
representative chains.
2. COPYRIGHT NOTICE
Copyright 1998,1999 by Real World Computing Partnership (RWCP). All rights
reserved.
For further information regarding permission for use or reproduction, please
contact Tamotsu Noguchi, Real World Computing Partnership, Parallel
Application Laboratory, Tsukuba Research Center, Tsukuba Mitsui Bldg.,
1-6-1 Takezono, Tsukuba 305, Japan.
TEL: 81(298)53-1707, FAX: 81(298)53-1680, E-mail: noguchi@trc.rwcp.or.jp.
3. METHOD
The representative protein chains are selected as follows.
(1) Exclude the following entries from the selection:
  (a) DNA and RNA data
  (b) theoretically modeled data
  (c) short chains (l < 40 residues)
  (d) data with incomplete backbone coordinates
  (e) data with incomplete side chain coordinates
  (f) data without refinement (by X-PLOR, TNT, etc.)
  *) Data from NMR spectroscopy are included in version 4.0.
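The exclusion step above can be sketched as a simple predicate over per-chain summary data. This is only an illustration, not the authors' program: the ChainInfo record, its field names, and the entry names in the example are hypothetical, and the sketch assumes the flags have already been derived from the PDB files by some other means.

```python
from dataclasses import dataclass

MIN_LENGTH = 40  # criterion (c): chains with fewer than 40 residues are excluded


@dataclass
class ChainInfo:
    """Hypothetical per-chain summary; the field names are illustrative only."""
    name: str
    length: int
    is_nucleic_acid: bool = False       # (a) DNA and RNA data
    is_theoretical_model: bool = False  # (b) theoretically modeled data
    backbone_complete: bool = True      # (d) False if backbone coordinates are incomplete
    side_chains_complete: bool = True   # (e) False if side chain coordinates are incomplete
    is_refined: bool = True             # (f) False if no refinement was reported


def is_excluded(chain: ChainInfo) -> bool:
    """Return True if the chain matches any of the exclusion criteria (a)-(f)."""
    return (
        chain.is_nucleic_acid
        or chain.is_theoretical_model
        or chain.length < MIN_LENGTH
        or not chain.backbone_complete
        or not chain.side_chains_complete
        or not chain.is_refined
    )


# Made-up placeholder entries, not real PDB data.
candidates = [
    ChainInfo("0AAA", length=120),
    ChainInfo("0BBB", length=35),
    ChainInfo("0CCC", length=200, is_theoretical_model=True),
]
kept = [c.name for c in candidates if not is_excluded(c)]
print(kept)
```

Only the first placeholder chain survives the filter here; the second is too short and the third is a theoretical model.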
(2) The selected chains in the entries are sorted according to their data
quality as follows:
First, the selected chains are classified into three classes.
Class A chains are those with both good resolution (<= 3.0 angstrom)
and good R-factor (<= 0.3). Class B chains are those with resolution
(>= 3.0 angstrom) and R-factor (>= 0.3). The chains derived by NMR
spectroscopy are classified into class C.
Second, we sort the chains by the resolution of the structure
determination within each class (A and B), and then append the class C
chains.
Chains with the same resolution are further sorted by R-factor value.
When several chains have the same resolution and R-factor, they are further
sorted by:
  (a) the number of chain breaks (the fewer the better)
  (b) the number of non-standard amino acid residues (the fewer the better)
  (c) the number of residues without backbone coordinates (the fewer the
      better)
  (d) the number of residues without side chain coordinates (the fewer the
      better)
  (e) whether mutant or wild type (the wild type has priority)
  (f) whether complex or not (the non-complex has priority)
  (g) alphabetical order of the entry name
      (e.g. 1MCD < 1MCE, 5AT1A < 5AT1C)
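The quality ordering described in step (2) maps naturally onto a tuple sort key. The following is a minimal sketch under stated assumptions, not the authors' code: the ChainQuality record and its field names are hypothetical, every non-NMR chain that fails the class A test is treated as class B, and class C chains (which have no resolution or R-factor here) are ordered only by the tie-breakers (a) to (g). Entries other than 1MCD and 1MCE are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChainQuality:
    """Hypothetical quality summary for one chain; field names are illustrative."""
    name: str
    method: str                        # "XRAY" or "NMR"
    resolution: Optional[float] = None
    r_factor: Optional[float] = None
    chain_breaks: int = 0
    nonstandard_residues: int = 0
    missing_backbone: int = 0
    missing_side_chain: int = 0
    is_mutant: bool = False
    is_complex: bool = False


def quality_class(c: ChainQuality) -> str:
    """Classify a chain as 'A', 'B' or 'C'."""
    if c.method == "NMR":
        return "C"
    if (c.resolution is not None and c.r_factor is not None
            and c.resolution <= 3.0 and c.r_factor <= 0.3):
        return "A"
    return "B"  # assumption: all remaining non-NMR chains fall into class B


def sort_key(c: ChainQuality):
    """Smaller keys sort first, i.e. better quality comes earlier."""
    inf = float("inf")
    return (
        quality_class(c),                                   # A before B before C
        c.resolution if c.resolution is not None else inf,  # lower resolution value first
        c.r_factor if c.r_factor is not None else inf,      # then lower R-factor
        c.chain_breaks,                                     # (a)
        c.nonstandard_residues,                             # (b)
        c.missing_backbone,                                 # (c)
        c.missing_side_chain,                               # (d)
        c.is_mutant,                                        # (e) False (wild type) first
        c.is_complex,                                       # (f) False (non-complex) first
        c.name,                                             # (g) alphabetical order
    )


chains = [
    ChainQuality("0BBBA", "XRAY", 3.2, 0.25),
    ChainQuality("1MCE", "XRAY", 1.8, 0.19),
    ChainQuality("0NNN", "NMR"),
    ChainQuality("1MCD", "XRAY", 1.8, 0.19),
    ChainQuality("0CCC", "XRAY", 1.8, 0.17),
]
ordered = [c.name for c in sorted(chains, key=sort_key)]
print(ordered)
```

Using one composite key keeps the whole ordering in a single place, so adding or reordering a tie-breaker only changes the tuple.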
(3) Similar chains are eliminated from the sorted list as follows:
The first chain in the sorted list has the best quality and is the first
representative. Now suppose we have already selected N representatives
from the sorted list.
At this point the sorted list no longer contains the chains
  1) that have already been selected and taken out as representatives, or
  2) that have already been eliminated during the selection of the
     N representatives.
Thus, the first chain remaining in the sorted list has the highest priority
to be the next representative, and it becomes the (N+1)-st representative.
We check the "similarity" between this chain and each of the other chains
remaining in the sorted list, and every chain that is similar to it is
eliminated from the list. The next remaining chain then moves to the head of
the sorted list. This procedure is repeated until the sorted list is empty.
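The elimination loop just described is a greedy selection, and it can be sketched in a few lines. This is a serial illustration only, not the parallelized algorithm mentioned in the introduction; the function name select_representatives and the toy similarity rule in the example are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_representatives(
    sorted_chains: Sequence[T],
    is_similar: Callable[[T, T], bool],
) -> List[T]:
    """Greedy selection over a list already sorted from best to worst quality."""
    remaining = list(sorted_chains)
    representatives: List[T] = []
    while remaining:
        # The first remaining chain has the highest priority.
        head = remaining.pop(0)
        representatives.append(head)
        # Drop every remaining chain that is similar to the new representative.
        remaining = [c for c in remaining if not is_similar(head, c)]
    return representatives


# Toy example: two names count as "similar" when they share a first letter.
sorted_names = ["a1", "a2", "b1", "a3", "c1", "b2"]
reps = select_representatives(sorted_names, lambda x, y: x[0] == y[0])
print(reps)
```

Because the input is sorted by quality, each group of mutually similar chains is represented by its best-quality member.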
We consider two chains NOT similar if either
  (a) the sequence identity is less than a certain threshold value (95%,
      85%, 75%, 65%, 55%, 45%, 35% or 25%), where the sequence identity
      is measured after a pairwise sequence alignment, or
  (b) the maximum distance between corresponding "CA" atoms when the two
      structures are superimposed is greater than a certain threshold
      (10, 20, 30, 40, 50 angstrom or infinity).
Before superimposing the two structures, we align the two sequences by
the pairwise sequence alignment to check their sequence identity. The
matched residues in the alignment are superimposed by the least-squares
fitting procedure.
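The two-part similarity test can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the sequence identity is assumed to be counted over aligned (non-gap) residue pairs, the CA coordinates are assumed to be already superimposed (the least-squares fitting itself is not shown), and the handling of values exactly at a threshold is an assumption.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned residue pairs (assumed definition)."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def max_ca_distance(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Largest distance between matched CA atoms of two superimposed structures."""
    return max(math.dist(p, q) for p, q in zip(coords_a, coords_b))


def is_similar(identity_pct: float, dmax: float,
               id_threshold: float, dmax_threshold: float) -> bool:
    """Two chains are NOT similar if identity is below the threshold
    or the maximum CA distance is above the threshold."""
    not_similar = identity_pct < id_threshold or dmax > dmax_threshold
    return not not_similar


identity = sequence_identity("ACDE-FG", "ACDQKFG")    # 5 matches out of 6 aligned pairs
dmax = max_ca_distance([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                       [(0.0, 0.0, 0.0), (1.0, 0.0, 3.0)])
print(is_similar(90.0, dmax, 85.0, 10.0))        # identity high and structures close
print(is_similar(90.0, 1000.0, 85.0, math.inf))  # infinite Dmax threshold: sequence only
```

With the Dmax threshold set to infinity, condition (b) can never hold, so the decision depends on sequence identity alone.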
Finally, all the chains (in classes A, B and C) are classified into
protein-chain groups, where each chain is assigned to the group
whose representative chain is nearest to it in sequence.
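The final grouping step can be sketched as a nearest-representative assignment. This is an illustration under assumptions: "nearest in sequence" is read here as "highest pairwise sequence identity", ties go to the earlier representative, and the function name assign_groups, the toy identity table and the chain names are all hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def assign_groups(
    chains: Sequence[Hashable],
    representatives: Sequence[Hashable],
    identity: Callable[[Hashable, Hashable], float],
) -> Dict[Hashable, List[Hashable]]:
    """Assign each chain to the representative with the highest sequence identity."""
    groups: Dict[Hashable, List[Hashable]] = {rep: [] for rep in representatives}
    for chain in chains:
        # max() returns the first representative on ties.
        nearest = max(representatives, key=lambda rep: identity(rep, chain))
        groups[nearest].append(chain)
    return groups


# Toy identity table (made-up values, in percent).
toy_table = {("R1", "x"): 90.0, ("R2", "x"): 40.0,
             ("R1", "y"): 30.0, ("R2", "y"): 70.0}


def toy_identity(rep, chain):
    """A chain is 100% identical to itself; unknown pairs default to 0%."""
    if rep == chain:
        return 100.0
    return toy_table.get((rep, chain), 0.0)


groups = assign_groups(["R1", "R2", "x", "y"], ["R1", "R2"], toy_identity)
print(groups)
```

In this sketch each representative ends up in its own group, since it is trivially nearest to itself.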
( Since R-factor and resolution values are written in an unformatted form in
PDB files, the PDB file reading program may in some cases find an incorrect
value, or no value at all, for the R-factor and/or resolution. )
4. FORMAT OF THE DATABASE
The database of representative protein chains in PDB, selected by the above
method, consists of a Table of PDB-REPRDB Version 3.0, which shows the number
of selected chains at several thresholds of sequence identity (ID%) and of
the maximum distance between pairs of corresponding atoms in the two
structures (Dmax).
The values in this table are hyperlinked to the list of PDB-REPRDB at
the corresponding thresholds of sequence identity (ID%) and structure
similarity (Dmax).
The list contains PDB entry IDs, chain IDs, "*", number of amino acids
(naa), resolution (Res), R-factor (Rfac), experimental method (Methd), the
number of residues with side chain coordinates (n_sid), the number of
residues with backbone coordinates (n_bck), the number of residues with CA
coordinates (n_ca), the number of non-standard amino acid residues (n_naa),
EC number and header lines in PDB.
The IDs of protein chains (PDB entry IDs and chain IDs) in the list are
arranged alphabetically and hyperlinked to the list of similar proteins that
were not selected as representative chains. If the "*" is clicked, a 3D view
of the protein is displayed by "Rasmol".
In addition, the EC number in the list is hyperlinked to the corresponding
LIGAND entry.
5. ACKNOWLEDGMENTS
We thank Dr. Susumu Goto and Prof. Minoru Kanehisa at the Institute
for Chemical Research, Kyoto University, for useful discussions and
suggestions.