PDB-REPRDB: DATABASE OF REPRESENTATIVE PROTEIN CHAINS IN PDB (PROTEIN DATA BANK)
Tamotsu NOGUCHI, Kentaro ONIZUKA, Yutaka AKIYAMA, and Minoru SAITO
Parallel Application Laboratory, Tsukuba Research Center
Real World Computing Partnership

Version 4.0 (PDB Rel. #81) December 1997
(This document was updated on 6 January 1998)

Reference:
Noguchi T., Onizuka K., Akiyama Y., Saito M. (1997). "PDB-REPRDB: A Database of Representative Protein Chains in PDB (Protein Data Bank)". In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA.

1. INTRODUCTION

The database of representative protein chains consists of a list of representative protein chains. The criteria for selecting the representatives are a) quality of the atomic coordinate data, b) sequence uniqueness, and c) conformation uniqueness, particularly of local conformation.

The first version of PDB-REPRDB consisted of 763 representative chains from PDB Release 70 (Oct. 1994) at Brookhaven National Laboratory and was released in July 1995 on the GenomeNet WWW server (http://www.genome.ad.jp/htbin/show_pdbreprdb). The second version (PDB-REPRDB Ver. 2.0), selected from PDB Release 78 (Oct. 1996), was released in April 1997 on our server (http://mpap1.trc.rwcp.or.jp/pdbreprdb). Since the third version (PDB-REPRDB Ver. 3.0), selected from PDB Release 80 (Apr. 1997), PDB-REPRDB has been available on the server (http://pdap1.trc.rwcp.or.jp/pdbreprdb). Since the fourth version (PDB-REPRDB Ver. 4.0), selected from PDB Release 81 (Jul. 1997), NMR data have been included in PDB-REPRDB.

The selection policy remains almost the same, while the selection procedure, which was already almost completely automated by sophisticated algorithms, has been parallelized in this version for quick selection of representative chains.

2. COPYRIGHT NOTICE

Copyright 1998,1999 by Real World Computing Partnership (RWCP). All rights reserved.
For further information regarding permission for use or reproduction, please contact Tamotsu Noguchi, Real World Computing Partnership, Parallel Application Laboratory, Tsukuba Research Center, Tsukuba Mitsui Bldg., 1-6-1 Takezono, Tsukuba 305, Japan. TEL: 81(298)53-1707, FAX: 81(298)53-1680, E-mail: noguchi@trc.rwcp.or.jp.

3. METHOD

The representative protein chains are selected as follows.

(1) Exclude the following entries from the selection:
    (a) DNA and RNA data
    (b) theoretically modeled data
    (c) short chains (l < 40 residues)
    (d) data without backbone coordinates at all residues
    (e) data without side chain coordinates at all residues
    (f) data without refinement (by X-PLOR, TNT, etc.)
    *) NMR spectroscopy data have been included since Version 4.0.

(2) All chains are extracted from each selected entry and sorted according to data quality.

First, the selected chains are classified into three classes. Class A chains are those with good resolution (<= 3.0 angstrom) and good R-Factor (<= 0.3). Class B chains are those with resolution (>= 3.0 angstrom) and R-Factor (>= 0.3). Chains derived by NMR spectroscopy are classified into class C.

Second, we sort the chains by the resolution of the structure determination within each class (A and B), and then append the class C chains. Chains with the same resolution are further sorted by R-Factor value. When several chains have the same resolution and R-Factor, they are sorted by:
    (a) the number of chain breaks (the fewer the better)
    (b) the number of non-standard amino acid residues (the fewer the better)
    (c) the number of residues without backbone coordinates (the fewer the better)
    (d) the number of residues without side chain coordinates (the fewer the better)
    (e) whether mutant or wild type (the wild type has priority)
    (f) whether complex or not (the non-complex has priority)
    (g) alphabetical order of the entry name (e.g.
1MCD < 1MCE, 5AT1A < 5AT1C)

(3) In this elimination phase, chains of better quality have priority to become representatives. The first chain in the sorted list becomes the first representative because it has the best quality. Suppose we have already selected N representatives from the sorted list. At this point the sorted list no longer contains the chains that 1) have already been selected and taken out as representatives, or 2) have already been eliminated during the selection of those N representatives. Thus, the first chain remaining in the sorted list has the highest priority to become the next representative. We check the "similarity" between this first chain and each of the other chains in the sorted list. The first chain then becomes the (N+1)-st representative and is taken out of the list, every chain found to be similar to it is eliminated from the sorted list, and the next remaining chain moves to the head of the list. This procedure is repeated until the sorted list is empty.

In the similarity check, two chains are considered NOT similar
    (a) if the sequence identity is less than a certain threshold value (<= 95%, 90%, 85%, 80%, 75%, 65%, 55%, 45%, 35% or 25%), where the sequence identity is measured after pairwise sequence alignment, or
    (b) if the maximum distance between superimposed pairs of atoms, one from each of the two structures being compared, is greater than a certain threshold (>= 10, 20, 30, 40, 50 angstrom or infinity).

Before the two structures are superimposed, the two sequences are aligned by the pairwise sequence alignment used for the sequence identity check. The matched sites in the alignment are then superimposed by least-squares fitting.

Finally, all the chains (in classes A, B and C) are classified into protein-chain groups, where each chain is assigned to the group whose representative chain is nearest to it in sequence.
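The sorting in step (2) and the elimination loop in step (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' parallelized program. The names Chain, quality_class, sort_key, make_is_similar and select_representatives are hypothetical, as are the "XRAY"/"NMR" method labels. The sketch assumes that class B holds every non-NMR chain that does not qualify for class A, that a missing resolution or R-factor counts as worst, that the tie-breakers (a) to (g) are also applied within class C, and that a value exactly at a threshold counts as similar. The sequence alignment and least-squares superposition are not implemented; the caller supplies them as the functions identity and max_distance. The final assignment of eliminated chains to groups is omitted.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Chain:
    chain_id: str                      # entry name plus chain ID, e.g. "5AT1A"
    method: str                        # "XRAY" or "NMR" (labels are assumptions)
    resolution: Optional[float] = None
    r_factor: Optional[float] = None
    n_breaks: int = 0                  # number of chain breaks
    n_nonstandard: int = 0             # non-standard amino acid residues
    n_no_backbone: int = 0             # residues without backbone coordinates
    n_no_sidechain: int = 0            # residues without side chain coordinates
    is_mutant: bool = False
    is_complex: bool = False


def _or_inf(value: Optional[float]) -> float:
    # Assumption: a value that could not be read from the PDB file is worst.
    return math.inf if value is None else value


def quality_class(chain: Chain) -> str:
    if chain.method == "NMR":
        return "C"
    if _or_inf(chain.resolution) <= 3.0 and _or_inf(chain.r_factor) <= 0.3:
        return "A"
    return "B"  # assumption: all remaining non-NMR chains


def sort_key(chain: Chain):
    # Class first, then resolution, R-factor, and tie-breakers (a) to (g).
    # False sorts before True, so wild type and non-complex come first.
    return (
        "ABC".index(quality_class(chain)),
        _or_inf(chain.resolution),
        _or_inf(chain.r_factor),
        chain.n_breaks,
        chain.n_nonstandard,
        chain.n_no_backbone,
        chain.n_no_sidechain,
        chain.is_mutant,
        chain.is_complex,
        chain.chain_id,
    )


def make_is_similar(
    identity: Callable[[Chain, Chain], float],      # percent identity, supplied by caller
    max_distance: Callable[[Chain, Chain], float],  # Dmax in angstrom, supplied by caller
    id_threshold: float,
    dmax_threshold: float,
) -> Callable[[Chain, Chain], bool]:
    def is_similar(a: Chain, b: Chain) -> bool:
        if identity(a, b) < id_threshold:           # criterion (a): not similar
            return False
        return max_distance(a, b) <= dmax_threshold  # criterion (b)
    return is_similar


def select_representatives(
    chains: List[Chain], is_similar: Callable[[Chain, Chain], bool]
) -> List[Chain]:
    remaining = sorted(chains, key=sort_key)
    representatives = []
    while remaining:
        rep = remaining[0]             # best-quality chain still in the list
        representatives.append(rep)
        # Eliminate every remaining chain that is similar to the new representative.
        remaining = [c for c in remaining[1:] if not is_similar(rep, c)]
    return representatives
```

Passing dmax_threshold=math.inf disables criterion (b), which corresponds to the "infinity" setting in the threshold list above.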
(Since R-factor and resolution values are written in an unformatted form in PDB files, the program that reads the PDB files may in some cases find an incorrect value, or no value at all, for the R-factor and/or resolution.)

4. FORMAT OF THE DATABASE

The database of representative protein chains in PDB, selected by the above method, consists of a Table of PDB-REPRDB Version 3.0, which shows the number of selected chains at several thresholds of sequence identity (ID%) and of maximum distance between pairs of atoms, one from each of the two structures (Dmax). The values in this table are hyperlinked to the list of PDB-REPRDB at the corresponding thresholds of sequence identity (ID%) and structure similarity (Dmax).

The list contains PDB entry IDs, chain IDs, "*", the number of amino acids (naa), resolution (Res), R-factor (Rfac), experimental method (Methd), the number of residues with side chain coordinates (n_sid), the number of residues with backbone coordinates (n_bck), the number of residues with CA coordinates (n_ca), the number of non-standard amino acid residues (n_naa), EC number, and header lines in PDB.

The IDs of the protein chains (PDB entry IDs and chain IDs) in the list are arranged alphabetically and hyperlinked to the list of similar proteins that were not selected as the representative chain. If the "*" is clicked, the 3D view of the protein is displayed by "Rasmol". The EC number in the list is hyperlinked to the corresponding LIGAND entry.

5. ACKNOWLEDGMENTS

We thank Dr. Susumu Goto and Prof. Minoru Kanehisa at the Institute for Chemical Research, Kyoto University, for useful discussions and suggestions.