Swiss-Prot release 21.0
Published March 1, 1992
SWISS-PROT RELEASE 21.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 21.0 of SWISS-PROT contains 23742 sequence entries, comprising
7'866'596 amino acids abstracted from 23919 references. This represents
an increase of 5% over release 20. The recent growth of the data bank is
summarized below.
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
20.0 11/91 22654 7 500 130
21.0 03/92 23742 7 866 596
1.2 Source of data
Release 21.0 has been updated using protein sequence data from release
31.0 of the PIR (Protein Identification Resource) protein data bank, as
well as translation of nucleotide sequence data from release 29.0 of the
EMBL Nucleotide Sequence Database.
<PAGE>
As an indication to the source of the sequence data in the SWISS-PROT
data bank we list here the statistics concerning the DR (Database cross-
references) pointer lines:
Entries with pointer(s) to only PIR entri(es): 4198
Entries with pointer(s) to only EMBL entri(es): 3031
Entries with pointer(s) to both EMBL and PIR entri(es): 16003
Entries with no pointers lines: 510
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 20
2.1 Sequences and annotations
About 1100 sequences have been added since release 20, the sequence data
of 150 existing entries has been updated and the annotations of 2860
entries have been revised. In particular we have used reviews articles
to update the annotations of the following groups or families of
proteins:
- Acid phosphatases
- Acylphosphatases
- Bacterial regulatory proteins, luxR family
- Cyclins
- Cytochromes P450
- C-type lectin domain proteins
- Histidinol dehydrogenases
- Indole-3-glycerol phosphate synthases
- Microviridae sequences
- Myoglobins
- Osteonectin domain proteins
- Snakes venom phospholipases A2
- RecF proteins
- Scorpions venom beta-toxins
- PTN/MK heparin-binding protein family
- Tissue factor
2.2 Change in the format of the entry names
The dollar sign `$' in entry names has been replaced by the underscore
character `_'. This change is made on the behalf of users of sequence
analysis software running under the Unix operating system, where the
dollar sign is a reserved symbol. Example: the entry name `CYC$HUMAN'
has been changed to `CYC_HUMAN'.
2.3 New line type GN
The GN (Gene Name) line is a new line that is used to indicate the
name(s) of the gene(s) that encodes for the protein being described.
<PAGE>
Previously this information used to be found in the DE line as shown in
the following example.
In previous releases:
DE SERUM ALBUMIN PRECURSOR (GENE NAME: ALB).
In the current release:
DE SERUM ALBUMIN PRECURSOR.
GN ALB.
The format of the GN line is:
GN NAME1[ AND|OR NAME2...].
Examples:
GN ALB.
GN REX-1.
It often occurs that more than one gene name has been assigned to an
individual locus. In that case all the synonyms are listed. The word
`OR' separates the different designations. The first name in the list is
assumed to be the most correct (or most current) designation. Example:
GN HNS OR DRDX OR OSMZ OR BGLY.
In a few cases, multiple genes encode for an identical protein sequence.
In that case all the different gene names are listed. The word `AND'
separates the designations. Example:
GN CECA1 AND CECA2.
In very rare cases (only one occurrence has been found in the current
release) `AND' and `OR' could be both present. In that case parenthesis
are used as shown in the following example:
GN GVPA AND (GVPB OR GVPA2).
2.4 New line type RM
The RM (Reference Medline) line is used to indicate the Medline Unique
Identifier (UID) of a reference. Previously this information was listed
in the RC line using the `MEDLINE' token as shown in the following
example.
In previous releases:
RC MEDLINE=90205618;
<PAGE>
In the current release:
RM 90205618
The format of the RM line is:
RM nnnnnnnn
where `nnnnnnnn' is the eight digit Medline Unique Identifier (UID).
2.5 Secondary structure information
Thanks to the help of Chris Sander and Reinhard Schneider of the
Biocomputing group at EMBL we have added to the feature table of
sequence entries of proteins whose tertiary structure is known
experimentally, the secondary structure information corresponding to
that protein. The secondary structure assignment is made according to
DSSP (see Kabsch W., Sander C.; Biopolymers, 22:2577-2637(1983)) and the
information is extracted from the coordinate data sets of the Protein
Data Bank (PDB).
In the feature table only three types of secondary structure are
specified : helices (HELIX), beta-strand (STRAND) and turns (TURN).
Residues not specified in one of these classes are in a `loop' or
`random-coil' structure). Because the DSSP assignment has more than the
three common secondary structure classes, we have converted the
following DSSP assignments to HELIX, STRAND and TURN:
DSSP DSSP definition SWISS-PROT
code assignment
---- --------------------------------------------- --------------
H Alpha-helix HELIX
G 3(10) helix HELIX
I Pi-helix HELIX
E Hydrogen bonded beta-strand (extended strand) STRAND
B Residue in an isolated beta-bridge STRAND
T H-bonded turn (3-turn, 4-turn or 5-turn) TURN
S Bend (five-residue bend centered at residue i) Not specified
One should be aware of the following facts:
a) Segment Length. For helices (alpha and 3-10), the residue just before
and just after the helix as given by DSSP participates in the helical
hydrogen bonding pattern with a single H-bond. For some practical
purposes, one can therefore extend the HELIX range by one residue on
each side. E.g. HELIX 25-35 instead of HELIX 26-34. Also, the ends of
secondary structure segments are less well defined for lower
resolution structures. A fluctuation of +/- one residue is common.
b) Missing segments. In low resolution structures, badly formed helices
or strands may be omitted in the DSSP definition.
<PAGE>
c) Special helices and strands. Helices of length three are 3-10
helices, those of length four and longer are either alpha-helices or
3-10 helices (pi helices are extremely rare). A strand of length one
corresponds to a residue in an isolated beta-bridge. Such bridges can
be structurally important.
d) Missing secondary structure. No secondary structure is currently
given in the feature table in the following cases:
- No sequence data in the PDB entry.
- Structure for which only C-alpha coordinates are in PDB.
- NMR structure with more than one coordinate data set.
- Model (i.e. theoretical) structure.
2.6 Feature key name change
The secondary structure description feature key `BETA' has been renamed
`STRAND' (see the section above for its current definition).
2.7 Alu-derived warning entries
Following the advice and in collaboration with Jean-Michel Claverie of
the National Center for Biotechnology Information (NCBI, Washington
D.C.) we have added to SWISS-PROT Alu-derived "warning" entries. These
entries are provided in order to avoid the further 'pollution' of
protein sequence databases with Alu-derived amino acid sequences.
Alu repetitive sequences are interspersed in human and primate genomes
with an average spacing of 3 Kb. Some of them are actively transcribed
by pol III. Normal transcripts may contain Alu-derived sequences in 5'
or 3' untranslated regions. however, cDNA libraries also contain partial
and/or rearranged cDNAs ligated with Alu-derived sequence in any
orientation. This has been overlooked in several occasions, with the
consequence of erroneous Alu-derived amino acid sequences being
reported.
Various analyses indicate that Alu repeats fall into six classes (A to
F). Therefore six "warning" entries have been constituted with all six
frames conceptual translations of one random member of each of these
classes of Alu repeats. Any significant similarity of a putative protein
sequence with an Alu-translated entry must be taken as a warning that a
part of Alu repeat may have been artifactually included in the coding
nucleotide sequence.
These sequences have been assigned accession numbers P23959 (ALUA_HUMAN)
to P23964 (ALUF_HUMAN).
<PAGE>
2.8 Feature lines `spring cleaning'
We are in the process of `cleaning' up the comments part of feature
lines to homogenize the description of specific domains and sites.
For example regions enriched in one or more types of amino acid are now
described using the general format:
FT DOMAIN xxx xxx AA1[/AA2/.../AAN]-RICH.
Where AA1, AA2, ... AAN are valid amino-acid three letter codes (the
twenty standard codes with the addition of `GLA' for gamma-carboxylic
acid).
Examples:
FT DOMAIN 12 45 PRO-RICH.
FT DOMAIN 123 456 ASP/GLU-RICH (ACIDIC).
FT DOMAIN 246 678 SER/THR-RICH (LINKER REGION).
Many other changes of this nature have either been completed in this
release or are in the process of being carried out.
Also note that `non-experimental' derived features are now only
indicated by the qualifiers `PROBABLE', `POTENTIAL', or `BY SIMILARITY';
the use of qualifiers such as `PUTATIVE', `POSSIBLE', `TENTATIVE', etc.
has been discontinued.
This cleaning process will continue in the next two or three releases.
3. FORTHCOMING CHANGES
The following changes will be implemented starting with release 22.
3.1 A new feature table key: UNSURE
The UNSURE key will be used to describe region(s) of a sequence for
which the authors are unsure about the sequence assignment.
3.2. Others
Other changes are planned, but we are already past our deadline to
prepare this release so hare are some very brief notes!
- An ASN.1 version of SWISS-PROT will soon be officially available
(thanks to Mark Cavanaugh of the NCBI). Software developers that are
interested in such a version can already obtain a beta-test release
of SWISS-PROT 21 in ASN.1 format (For details contact me at
bairoch@cmu.unige.ch).
- We are thinking of some new topics for the CC lines.
- New developments concerning the integration of SWISS-PROT with other
data banks is in the 'pipeline'.
<PAGE>
4. ENZYME AND PROSITE
4.1 The ENZYME data bank
Release 8.0 of the ENZYME data bank is distributed along with release 21
of SWISS-PROT. ENZYME release 8.0 contains information relative to 3073
enzymes. The data bank is complete and up to date. Until new enzyme
nomenclature data is published we only plan to update the SWISS-PROT
pointers at each release of the protein sequence data bank, correct
eventual errors, and complete the information concerning synonyms and
cofactors using the literature.
4.2 The PROSITE data bank
Release 8.10 of the PROSITE data bank is distributed along with release
21 of SWISS-PROT. Release 8.10 contains 530 documentation chapters that
describes 605 different patterns. Release 8.10 does not really represent
a new release; the only changes between release 8.0 and 8.10 are
updating of the pointers to the SWISS-PROT entries whose name have been
modified between release 20 and 21. The next release of PROSITE (9.0)
will be distributed with release 22 of SWISS-PROT.
5. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, as for example if the function
of a protein has been clarified or if new post-translational information
has become available.
<PAGE>
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.65 Gln (Q) 4.07 Leu (L) 9.15 Ser (S) 7.08
Arg (R) 5.23 Glu (E) 6.26 Lys (K) 5.83 Thr (T) 5.84
Asn (N) 4.45 Gly (G) 7.10 Met (M) 2.33 Trp (W) 1.30
Asp (D) 5.24 His (H) 2.27 Phe (F) 3.97 Tyr (Y) 3.22
Cys (C) 1.81 Ile (I) 5.46 Pro (P) 5.08 Val (V) 6.49
Asx (B) 0.01 Glx (Z) 0.01 Xaa (X) 0.03
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 3159
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 1373
2x: 572
3x: 319
4x: 191
5x: 135
6x: 101
7x: 72
8x: 49
9x: 64
10x: 33
11- 20x: 128
21-100x: 95
>100x: 27
<PAGE>
A.2.2 Table of the most represented species
Number Frequency Species
1 1891 Human
2 1697 Escherichia coli
3 1116 Mouse
4 1057 Rat
5 803 Baker's yeast (Saccharomyces cerevisiae)
6 519 Bovine
7 443 Fruit fly (Drosophila melanogaster)
8 392 Chicken
9 347 Bacillus subtilis
10 279 African clawed frog (Xenopus laevis)
11 271 Rabbit
12 253 Pig
13 251 Vaccinia virus (strain Copenhagen)
14 218 Salmonella typhimurium
15 193 Human cytomegalovirus (strain AD169)
16 170 Maize
17 167 Bacteriophage T4
18 151 Vaccinia virus (strain WR)
19 135 Rice
20 125 Tobacco
21 121 Wheat
22 113 Pea
23 112 Staphylococcus aureus
24 104 Pseudomonas aeruginosa
25 103 Slime mold (Dictyostelium discoideum)
103 Barley
27 101 Sheep
28 100 Fission yeast (Schizosaccharomyces pombe)
29 95 Spinach
95 Dog
95 Caenorhabditis elegans
32 94 Soybean
33 92 Neurospora crassa
34 90 Pseudomonas putida
35 89 Agrobacterium tumefaciens
<PAGE>
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 1539 1001-1100 220
51- 100 2602 1101-1200 138
101- 150 3672 1201-1300 115
151- 200 2281 1301-1400 68
201- 250 1932 1401-1500 57
251- 300 1745 1501-1600 32
301- 350 1566 1601-1700 26
351- 400 1537 1701-1800 27
401- 450 1177 1801-1900 31
451- 500 1265 1901-2000 26
501- 550 926 2001-2100 10
551- 600 644 2101-2200 26
601- 650 463 2201-2300 31
651- 700 329 2301-2400 11
701- 750 310 2401-2500 12
751- 800 236 >2500 54
801- 850 194
851- 900 201
901- 950 122
951-1000 117
Currently the ten largest sequences are:
RYNR$RABIT 5037 a.a.
RYNR$HUMAN 5032 a.a.
APB$HUMAN 4563 a.a.
APOA$HUMAN 4548 a.a.
DYHC$TRIGR 4466 a.a.
POLG$BVDV 3988 a.a.
POLG$HCVA 3898 a.a.
POLG$HCVB 3898 a.a.
TRX$DROME 3759 a.a.
ACVA$PENCH 3746 a.a.
<PAGE>
APPENDIX B: ON-LINE EXPERTS
B.1 List of on-line experts for PROSITE and SWISS-PROT
Field of expertise Name Email address
--------------------------- ------------------ ----------------------------
African swine fever virus Yanez R.J. ryanez@cbm2.uam.es
Alcohol dehydrogenases Joernvall H. hans.jornvall@k1m.ki.se
Persson B. bengt@medfys.ki.se
Aldehyde dehydrogenases Joernvall H. hans.jornvall@k1m.ki.se
Persson B. bengt@medfys.ki.se
Alpha-crystallins/HSP-20 Leunissen J.A.M. jackl@caos.caos.kun.nl
de Jong W. u629000@hnykun11.bitnet
Alpha-2-macroglobulins Van Leuven F. fred@blekul13.bitnet
Apolipoproteins Boguski M.S. boguski@ncbi.nlm.nih.gov
Arrestins Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Bacteriophage P4 proteins Halling C. chh9@midway.uchicago.edu
Beta-lactamases Brannigan J. jab5@vaxa.york.ac.uk
Chitinases Henrissat B. cermav@frgren81.bitnet
Clusterin Peitsch M.C. peitsch@ulbio1.unil.ch
CTF/NF-I Mermod N. nmermod@ulys.unil.ch
Cytochromes P450 Holsztynska E.J. ela@netcom.uucp
netcom!ela@apple.com
DEAD-box helicases Linder P. linder@urz.unibas.ch
EF-hand calcium-binding Cox J.A. cox@sc2a.unige.ch
Kretsinger R.H. rhk5i@virginia.bitnet
Enoyl-CoA hydratase Hofmann K.O. khofmann@cipvax.biolan.uni-koeln.de
fruR/lacI family HTH proteins Reizer J. jreizer@ucsd.edu
GATA-type zinc-fingers Boguski M.S. boguski@ncbi.nlm.nih.gov
Glucanases Henrissat B. cermav@frgren81.bitnet
Beguin P. phycel@pasteur.bitnet
G-protein coupled receptors Chollet A. chollet@clients.switch.ch
Attwood T.K. bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins Boguski M.S. boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17 Landsman D. landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases Kolakowski L.F.Jr. kolakowski@helix.mgh.harvard.edu
Integrases Roy P.H. 2020000@lavalvx1.bitnet
Lipocalins Boguski M.S. boguski@ncbi.nlm.nih.gov
Peitsch M.C. peitsch@ulbio1.unil.ch
MAC components / perforin Peitsch M.C. peitsch@ulbio1.unil.ch
Myelin proteolipid protein Hofmann K.O. khofmann@cipvax.biolan.uni-koeln.de
PEP requiring enzymes Reizer J. jreizer@ucsd.edu
Phytochromes Partis M.D. partis@gcri.afrc.ac.uk
Prokaryotic carbohydrate Reizer J. jreizer@ucsd.edu
kinases
Protein kinases Hanks S. hanks@vuctrvax.bitnet
Hunter T. hunter@salk.bitnet
PTS proteins Reizer J. jreizer@ucsd.edu
Restriction-modification Bickle T. bickle@urz.unibas.ch
enzymes Roberts R.J. roberts@cshl.org
<PAGE>
Ribosomal protein S3 Hallick R. hallick%biotec@arizona.edu
Ribosomal protein S15 Ellis S.R. srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases Harayama S. harayama@cmu.unige.ch
Sodium symporters Reizer J. jreizer@ucsd.edu
Subtilases Brannigan J. jab5@vaxa.york.ac.uk
Thiol proteases Turk B. turk@ijs.ac.mail.yu
Thiol proteases inhibitors Turk B. turk@ijs.ac.mail.yu
TPR repeats Boguski M.S. boguski@ncbi.nlm.nih.gov
Transit peptides von Heijne G. gunnar@cbts.sunet.se
Type-II membrane antigens Levy S. levy@cellbio.stanford.edu
Uracil-DNA glycosylase Aasland R. aasland@bio.uib.no
Xylose isomerase Jenkins J. jenkins@frira.afrc.ac.uk
WAP-type domain Claverie J.-M. jmc@ncbi.nlm.nih.gov
B.2 Requirements to fulfill to become an on-line expert
An expert should be a scientist working with specific famili(es) of
proteins (or specific domains) and which would:
a) Review the protein sequences in SWISS-PROT and the patterns/matrices
in PROSITE relevant to their field of research.
b) Agree to be contacted by people that have obtained new sequence(s)
which seem to belong to "their" familie(s) of proteins.
c) Have access to electronic mail and be willing to use it to send and
receive data.
If you are willing to be part of this scheme please contact Amos Bairoch
at one of the following electronic mail addresses:
bairoch@cmu.unige.ch
bairoch@cgecmu51.bitnet
<PAGE>
APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES
The current status of the relationships (cross-references) between some
biomolecular databases is shown in the following schematic:
**********************
*********************** <----- * EPD [Euk. Promot.] *
* EMBL Nucleotide * -----> **********************
* Sequence Data *
***************** ----> * Library * **********************
* FLYBASE * <---- *********************** <----- * ECD [E. coli map] *
* [Drosophila * ^ | ^ **********************
* genetic maps] * --------+ | | |
***************** <-----+ | | | +--------- **********************
| | | | +--------- * TFD [Trans. fact.] *
| | | | | +------> **********************
| | | | | |
***************** | v | v v | **********************
* REBASE * *********************** * ENZYME [Nomencl.] *
* [Restriction * <---- * SWISS-PROT * <----- **********************
* enzymes] * * Protein Sequence * |
***************** * Data Bank * v
*********************** **********************
***************** | ^ | ^ | ^ | | * OMIM [Diseases] *
* PROSITE * <-------+ | | | | | | +--------> **********************
* [Patterns] * ----------+ | | | | |
***************** | | | | +-----------> **********************
| | | | +-------------- * E. coli 2D gels *
| | | | **********************
| | | |
| | | +----------------> **********************
| | +------------------- * EcoGene/EcoSeq *
| v **********************
| ***********************
+--------> * PDB [3D structures] *
***********************
<PAGE>
