Database Document: PDB-REPRDB V4.0


PDB-REPRDB
DATABASE OF REPRESENTATIVE PROTEIN CHAINS IN PDB(PROTEIN DATA BANK)

Tamotsu NOGUCHI, Kentaro ONIZUKA, Yutaka AKIYAMA, and Minoru SAITO
Parallel Application Laboratory, Tsukuba Research Center
Real World Computing Partnership

                        Version 4.0 (PDB Rel. #81)   December 1997
                     ( This document was updated on 6 January 1998 )

Reference: Noguchi T., Onizuka K., Akiyama Y., Saito M. (1997).
 "PDB-REPRDB: A Database of Representative Protein Chains in PDB (Protein Data Bank)".
 In Proceedings of the Fifth International Conference on Intelligent Systems for 
 Molecular Biology, AAAI press, Menlo Park, CA.

1. INTRODUCTION

 The database of representative protein chains consists of the representative
list of protein chains. The criteria for selecting the representatives are
a) quality of atomic coordinate data, b) sequence uniqueness, and c) conformation
uniqueness, particularly local conformation.
 The first version of PDB-REPRDB consists of 763 representative chains from PDB
Release 70 (Oct. 1994) at Brookhaven National Laboratory and was released in
July 1995 on the GenomeNet WWW server (http://www.genome.ad.jp/htbin/show_pdbreprdb).
 The second version (PDB-REPRDB Ver. 2.0), selected from PDB Release 78 (Oct. 1996),
was released in April 1997 on our server (http://mpap1.trc.rwcp.or.jp/pdbreprdb).
 Since the third version (PDB-REPRDB Ver. 3.0), selected from PDB Release 80 (Apr. 1997),
"PDB-REPRDB" has been available on the server (http://pdap1.trc.rwcp.or.jp/pdbreprdb).
 Since the fourth version (PDB-REPRDB Ver. 4.0), selected from PDB Release 81 (Jul. 1997),
NMR data have been included in "PDB-REPRDB".
 The selection policy remains almost the same, while the selection procedure, which
was almost completely automated by sophisticated algorithms, has been parallelized
in this version for quick selection of representative chains.

2. COPYRIGHT NOTICE

 Copyright 1998,1999 by Real World Computing Partnership (RWCP). All rights reserved.
 For further information regarding permission for use or reproduction, please 
contact Tamotsu Noguchi, Real World Computing Partnership, Parallel Application 
Laboratory, Tsukuba Research Center, Tsukuba Mitsui Bldg., 1-6-1 Takezono, Tsukuba
305, Japan. 
TEL: 81(298)53-1707, FAX: 81(298)53-1680, E-mail: noguchi@trc.rwcp.or.jp.

3. METHOD

  The representative protein chains are selected as follows.

(1) Exclude the following entries from the selection
   (a) DNA and RNA data
   (b) theoretically modeled data
   (c) short chains (l < 40 residues)
   (d) data without backbone coordinates at all residues
   (e) data without side chain coordinates at all residues
   (f) data without refinement (by X-PLOR, TNT, etc.)

    *) NMR spectroscopy data have been included since version 4.0
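
 The exclusion step above can be sketched as a simple filter. This is only
an illustration: the Candidate record and its field names are hypothetical,
and how each flag would be derived from a PDB entry is not described in this
document.

```python
from dataclasses import dataclass

MIN_CHAIN_LENGTH = 40  # criterion (c): chains shorter than this are excluded


@dataclass
class Candidate:
    """Hypothetical per-chain summary; field names are illustrative only."""
    is_nucleic_acid: bool        # (a) DNA or RNA data
    is_theoretical_model: bool   # (b) theoretically modeled data
    length: int                  # (c) number of residues in the chain
    has_backbone_coords: bool    # (d) False if backbone coordinates are absent
    has_side_chain_coords: bool  # (e) False if side chain coordinates are absent
    is_refined: bool             # (f) refined by X-PLOR, TNT, etc.


def is_excluded(c: Candidate) -> bool:
    """Return True if any of the exclusion criteria (a)-(f) applies."""
    return (
        c.is_nucleic_acid
        or c.is_theoretical_model
        or c.length < MIN_CHAIN_LENGTH
        or not c.has_backbone_coords
        or not c.has_side_chain_coords
        or not c.is_refined
    )
```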

(2) All chains are extracted from each selected entry and are sorted
   according to the data quality.
    First, the selected chains are classified into three classes.
   Class A chains are those with good resolution (<= 3.0 angstrom)
   and good R-Factor (<= 0.3). Class B chains are those with resolution 
   (>= 3.0 angstrom) and R-Factor (>= 0.3). The chains derived by NMR
   spectroscopy are classified into class C.
    Second, we sort the chains by the resolution of
   structure determination within each class (A and B), and append
   the class C chains. 
    Chains with the same resolution are further sorted by R-Factor value.
   When several chains have the same resolution and R-Factor, they are sorted by:
   (a) the number of chain breaks (the fewer the better)
   (b) the number of non-standard amino acid residues (the fewer the better)
   (c) the number of residues without backbone coordinates (the fewer the better)
   (d) the number of residues without side chain coordinates (the fewer the better)
   (e) whether mutant or wild type (the wild type has priority)
   (f) whether complex or not  (the non-complex has priority)
   (g) alphabetical order of the entry name
        (e.g. 1MCD < 1MCE, 5AT1A < 5AT1C)
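
 The classification and sorting described in step (2) can be expressed as a
single sort key. The sketch below is a minimal illustration, not the actual
program. The Chain record, its field names and the method labels are
hypothetical. It also assumes that any non-NMR chain that does not meet the
class A condition (including one whose resolution or R-Factor could not be
read) falls into class B, and that class C chains are ordered only by the
tie-breaking criteria (a) to (g).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chain:
    """Hypothetical chain record; field names are illustrative only."""
    name: str                          # entry name plus chain ID, e.g. "5AT1A"
    method: str                        # "XRAY" or "NMR" (assumed labels)
    resolution: Optional[float] = None
    r_factor: Optional[float] = None
    chain_breaks: int = 0
    non_standard: int = 0
    missing_backbone: int = 0
    missing_side_chain: int = 0
    is_mutant: bool = False
    is_complex: bool = False


def quality_class(c: Chain) -> str:
    """Classify a chain as 'A', 'B' or 'C' following step (2)."""
    if c.method == "NMR":
        return "C"
    if (c.resolution is not None and c.r_factor is not None
            and c.resolution <= 3.0 and c.r_factor <= 0.3):
        return "A"
    return "B"  # assumption: every remaining chain is treated as class B


def sort_key(c: Chain):
    """Smaller keys mean better quality; missing values sort last."""
    inf = float("inf")
    return (
        "ABC".index(quality_class(c)),
        c.resolution if c.resolution is not None else inf,
        c.r_factor if c.r_factor is not None else inf,
        c.chain_breaks,
        c.non_standard,
        c.missing_backbone,
        c.missing_side_chain,
        c.is_mutant,    # False (wild type) sorts before True (mutant)
        c.is_complex,   # False (non-complex) sorts before True (complex)
        c.name,         # alphabetical order of the entry name
    )
```

 A list of chains would then be ordered with sorted(chains, key=sort_key).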
 
(3) In this elimination phase, chains with better quality have
   priority to become representatives. The first chain in the sorted
   list becomes the first representative because it has the best
   quality.
    Suppose we have already selected N representatives from the sorted
   list. The sorted list then no longer contains the chains 1) that
   have already been selected and taken out as representatives, or 2) that
   have already been eliminated during the selection
   of the N representatives.
    Thus, the first chain remaining in the sorted list has the highest priority
   to be the next representative. We check the "similarity" between the first
   chain of the list and each of the other chains in the sorted list.
   After that, the first chain becomes the (N+1)-st representative, and
   every chain that is similar to it is eliminated
   from the sorted list. The first of the remaining chains then moves to the
   head of the sorted list.  This procedure repeats until the sorted list
   is empty.
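
    The elimination loop above amounts to a greedy selection over the sorted
   list. A minimal sequential sketch follows. The function name and the
   is_similar callback are hypothetical, and the parallelized procedure used
   for this version is not reproduced here.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_representatives(
    sorted_chains: Sequence[T],
    is_similar: Callable[[T, T], bool],
) -> List[T]:
    """Greedy selection: take the best remaining chain, drop its neighbours.

    sorted_chains must already be ordered from best to worst quality.
    is_similar(a, b) is a caller-supplied similarity check (hypothetical).
    """
    remaining = list(sorted_chains)
    representatives: List[T] = []
    while remaining:
        head = remaining.pop(0)          # best-quality chain still in the list
        representatives.append(head)     # it becomes the next representative
        # Eliminate every remaining chain that is similar to the new representative.
        remaining = [c for c in remaining if not is_similar(head, c)]
    return representatives
```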
     In the similarity check, we consider two chains NOT similar

   (a) if the sequence identity is less than a certain threshold value (95%,
      90%, 85%, 80%, 75%, 65%, 55%, 45%, 35% or 25%), where the sequence identity
      is measured after the pairwise sequence alignment,
   (b) or if the maximum distance between the superimposed pairs of atoms, one
      from each of the two structures being compared, is greater than a certain
      threshold (10, 20, 30, 40, 50 angstrom or infinity).

     Before superimposing the two structures, we align the two sequences by
    the pairwise sequence alignment used for the check of sequence identity. The
    matched sites in the alignment are superimposed by the least-squares fitting
    procedure.
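
     The two-part similarity check can be sketched as follows. This is an
    illustration under stated assumptions: the function names are
    hypothetical, the sequence identity is taken as already computed from
    the alignment, and the coordinates passed to max_pair_distance are
    assumed to be the matched sites after least-squares superposition. The
    alignment and the fitting themselves are not shown. Strict comparisons
    are used at the threshold values, which is also an assumption.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def max_pair_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance between matched atoms of two superimposed structures."""
    if len(a) != len(b):
        raise ValueError("matched coordinate lists must have equal length")
    return max((math.dist(p, q) for p, q in zip(a, b)), default=0.0)


def is_similar_pair(
    identity_pct: float,
    dmax: float,
    id_threshold: float,
    dmax_threshold: float = math.inf,
) -> bool:
    """Apply criteria (a) and (b): a pair is NOT similar if either holds."""
    not_similar = identity_pct < id_threshold or dmax > dmax_threshold
    return not not_similar
```

     With the distance threshold set to infinity, criterion (b) can never
    hold, so the check reduces to the sequence identity criterion alone.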

     Finally, all the chains (in class A, B and C) are classified into
    protein-chain groups, where each chain is assigned to the group
    whose representative chain is nearest to it in sequence.
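
     The final grouping can be sketched as a nearest-representative
    assignment. The function name and the identity callback are
    hypothetical. "Nearest in sequence" is taken here to mean the highest
    pairwise sequence identity, with ties going to the earlier
    representative. Both points are assumptions.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def assign_groups(
    chains: Sequence[Hashable],
    representatives: Sequence[Hashable],
    identity: Callable[[Hashable, Hashable], float],
) -> Dict[Hashable, List[Hashable]]:
    """Assign every chain to the representative with the highest identity."""
    groups: Dict[Hashable, List[Hashable]] = {rep: [] for rep in representatives}
    for chain in chains:
        # max() returns the first maximum, so ties go to the earlier representative.
        nearest = max(representatives, key=lambda rep: identity(chain, rep))
        groups[nearest].append(chain)
    return groups
```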

( Since R-factor and resolution values are written in unformatted form in PDB files,
 the program that reads the PDB files may in some cases find an incorrect value,
 or fail to find a value, for R-factor and/or resolution. )
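
 To illustrate why a value may be missed, the sketch below searches REMARK
lines for a resolution value with one regular expression. It is not the
reader program used for PDB-REPRDB. The function name and the wording the
pattern expects ("RESOLUTION." followed by a number and "ANGSTROM") are
assumptions, and any remark worded differently yields no value.

```python
import re
from typing import Iterable, Optional

# Assumed wording of a resolution remark; other wordings will not match.
_RESOLUTION = re.compile(r"RESOLUTION\.\s+(\d+(?:\.\d+)?)\s+ANGSTROM")


def find_resolution(lines: Iterable[str]) -> Optional[float]:
    """Return the first resolution value found in REMARK lines, else None."""
    for line in lines:
        if not line.startswith("REMARK"):
            continue
        match = _RESOLUTION.search(line)
        if match:
            return float(match.group(1))
    return None  # no value found: the caller must handle the missing case
```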

4. FORMAT OF THE DATABASE 

 The database of representative protein chains in PDB, which are
selected by the above method, consists of a Table of
PDB-REPRDB Version 3.0, which shows the number of selected chains at several
threshold values of sequence identity (ID%) and of the maximum distance between
the pairs of atoms, one from each of the two structures (Dmax).
 The values in this table are hyperlinked to the list of PDB-REPRDB at
the corresponding thresholds of sequence identity (ID%) and structure similarity
(Dmax). 
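
 The threshold values listed in the METHOD section can be enumerated as
(ID%, Dmax) pairs. The sketch below assumes that the table has one cell for
every combination, which this document does not state explicitly. The
constant and function names are hypothetical.

```python
import math
from itertools import product

# Threshold values as listed in the METHOD section.
ID_THRESHOLDS = [95, 90, 85, 80, 75, 65, 55, 45, 35, 25]   # ID%
DMAX_THRESHOLDS = [10, 20, 30, 40, 50, math.inf]           # Dmax in angstrom


def threshold_grid():
    """Return every (ID%, Dmax) combination, one per assumed table cell."""
    return list(product(ID_THRESHOLDS, DMAX_THRESHOLDS))
```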
 The list contains PDB entry IDs, chain IDs, "*", number of amino acids (naa),
resolution (Res), R-factor (Rfac), experimental method (Methd), the number of 
residues with side chain coordinates (n_sid), the number of residues with 
backbone coordinates (n_bck), the number of residues with CA coordinates (n_ca), 
the number of non-standard amino acid residues (n_naa), EC number and header lines 
in PDB.
 The IDs of protein chains (PDB entry IDs and chain IDs) in the list are arranged
alphabetically and hyperlinked to the list of similar proteins that were not 
selected as representative chains. If the "*" is clicked, the 3D view of the
protein is displayed by "Rasmol".
 The EC number in the list is hyperlinked to the corresponding LIGAND entry.
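
 One row of the list could be held in a record such as the one sketched
below. The attribute names naa, res, rfac, methd, n_sid, n_bck, n_ca and
n_naa follow the column labels given above. The class name, the remaining
attribute names and the types are assumptions, and the actual layout of the
list is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ListEntry:
    """Hypothetical record for one representative chain in the list."""
    pdb_id: str                  # PDB entry ID
    chain_id: str                # chain ID (may be empty)
    naa: int                     # number of amino acids
    res: Optional[float]         # resolution, None if not available
    rfac: Optional[float]        # R-factor, None if not available
    methd: str                   # experimental method
    n_sid: int                   # residues with side chain coordinates
    n_bck: int                   # residues with backbone coordinates
    n_ca: int                    # residues with CA coordinates
    n_naa: int                   # non-standard amino acid residues
    ec_number: Optional[str] = None
    header: str = ""             # header line from the PDB entry

    @property
    def chain_name(self) -> str:
        """Entry ID followed by chain ID, e.g. "5AT1A"."""
        return self.pdb_id + self.chain_id


def sort_entries(entries: List[ListEntry]) -> List[ListEntry]:
    """Arrange entries alphabetically by chain name, as in the list."""
    return sorted(entries, key=lambda e: e.chain_name)
```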

5. ACKNOWLEDGMENTS

 We thank Dr. Susumu Goto and Prof. Minoru Kanehisa at the Institute
for Chemical Research, Kyoto University, for useful discussions and suggestions.