Swiss-Prot release 19.0
Published August 1, 1991
SWISS-PROT RELEASE 19.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 19.0 of SWISS-PROT contains 21795 sequence entries, comprising
7'173'785 amino acids abstracted from 21773 references. This represents
an increase of 6% over release 18. The recent growth of the data bank is
summarized below.
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
1.2 Source of data
Release 19.0 has been updated using protein sequence data from release
28.0 of the PIR (Protein Identification Resource) protein data bank, as
well as translation of nucleotide sequence data from release 27.0 of the
EMBL Nucleotide Sequence Database.
As an indication to the source of the sequence data in the SWISS-PROT
data bank we list here the statistics concerning the DR (Database cross-
references) pointer lines:
Entries with pointer(s) to only PIR entri(es): 3912
Entries with pointer(s) to only EMBL entri(es): 3542
Entries with pointer(s) to both EMBL and PIR entri(es): 13835
Entries with no pointers lines: 506
<PAGE>
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 18
2.1 Sequences and annotations
About 1040 sequences have been added since release 18, the sequence data
of 194 existing entries has been updated and the annotations of 2220
entries have been revised. In particular we have used reviews articles
to update the annotations of the following groups or families of
proteins:
- Adenylosuccinate synthetase
- Aldose 1-epimerase
- Argininosuccinate synthase
- Amidases
- Bacterial regulatory proteins, lacI family
- Bacterial regulatory proteins, merR family
- Bacterioferritin
- Delta 1-pyrroline-5-carboxylate reductase
- DNA polymerase family X
- Eukaryotic molybdopterin-dependent oxidoreductases
- Eukaryotic porin
- Ferrochelatase
- Fibrillarin
- FMN-dependent alpha-hydroxy acid dehydrogenases
- Hemerythrins
- HlyD family secretion proteins
- Hok/gef family cell toxic proteins
- Matrixins
- Methylmalonyl-CoA mutase
- Phosphoribulokinase
- Plant viruses icosahedral capsid proteins
- Porphobilinogen deaminase
- Pyrokinins
- Ribonuclease T2 family
- Stathmin family
- Transglutaminases
- Ubiquitin-activating enzyme
- Zinc finger RFP/RPT-1 family
2.2 New line types: RP and RC
The following change has been implemented in release 19; the RN line has
been replaced by three line types: a modified RN (Reference Number) line
type containing just the reference number, a new RP (Reference Position)
line type containing the extent of the work carried out by the authors
of the reference, and a new RC (Reference Comment) line type containing
comments relevant to the reference (strain, tissue, etc.). Three
examples of the usage of these new line types are given below.
<PAGE>
RN [1]
RP SEQUENCE FROM N.A., AND SEQUENCE OF 1-23.
RC STRAIN=K12;
RN [1]
RP SEQUENCE OF 24-56 AND 67-89.
RC STRAIN=BALB/C; TISSUE=BRAIN;
RN [1]
RP X-RAY CRYSTALLOGRAPHY, 1.8 ANGSTROMS.
RC MEDLINE=91002678;
Within a reference block the RN and RP lines occur once, the RC line
occurs zero or more times.
The format of the RC line is:
RC TOKEN1=Text; TOKEN2=Text;....
Where the following tokens are currently defined: MEDLINE, PLASMID,
SPECIES, STRAIN, TISSUE, and TRANSPOSON.
The `SPECIES' token is only used when an entry describes a sequence
which is identical in more than one species; similarly the `PLASMID' is
only used if an entry describes a sequence identical in more than one
plasmid.
2.3 MEDLINE unique identifiers
Starting with release 19 each journal reference listed in SWISS-PROT
which exists in the National Library of Medicine (NLM) MEDLINE
bibliographic data bank includes the `Unique Identifier' (UI) of that
reference. This information is stored in the new RC line type using the
`MEDLINE' token. Example:
RC MEDLINE=90205618;
It is planned that, in a few months, MEDLINE will add cross-references
to SWISS-PROT.
<PAGE>
2.4 New cross-references
We have added cross-references to the Transcription Factors Database
(TFD) of David Ghosh (for a description see: Nucleic Acids Res. (1990)
18:1749-1756); as well as to the Drosophila Genetic Maps database (DMAP)
prepared by Michael Ashburner at the Department of Genetics in
Cambridge, England. These cross-references are present in the DR lines:
Data bank identifier: DMAP
Primary identifier : Gene unique identifier number (UID).
Secondary identifier: Latest release of DMAP that was used to derive
the cross-references.
Example : DR DMAP; 00055; RELEASE 9107.
Data bank identifier: TFD
Primary identifier : Unique identifier for the corresponding TFD
POLYPEPTIDES table entry (the TFD_ID field).
Secondary identifier: Latest release of TFD that was used to derive
the cross-references.
Example : DR TFD; P00040; RELEASE 3.0.
2.5 Minor change in the DT line format
There is now a single space character between the date and the comment
part of a DT line instead of the two spaces that used to exist in
previous releases. Example:
DT 01-MAY-1991 (REL. 18, CREATED)
has been changed to:
DT 01-MAY-1991 (REL. 18, CREATED)
2.6 Minor change in the RL lines for submissions
References for sequence information submitted to the international
nucleic acid databases (DDBJ, EMBL, Genbank) were represented by the
following subtype of RL lines:
RL SUBMITTED (JAN-1991) TO EMBL/GENBANK DATA BANKS.
Starting with release 19, these RL lines use the following format:
RL SUBMITTED (JAN-1991) TO EMBL/GENBANK/DDBJ DATA BANKS.
<PAGE>
2.7 Status of cross-references to PIR
We have continued adding cross-references to entries in the unannotated
sections of PIR (known as PIR2 and PIR3); currently we have cross-
references to 14078 sequence entries in PIR2/3 out of a total of 20265
entries in those sections in release 28 of PIR.
3. FORTHCOMING CHANGES
3.1 Change in the format of the entry names
Starting with release 21 we will replace the dollar sign `$' in entry
names by the underscore character `_'. This change is made on the behalf
of users of sequence analysis software running under the Unix operating
system, where the dollar sign is a reserved symbol. Example: the entry
name `CYC$HUMAN' will be changed to `CYC_HUMAN'.
4. ENZYME AND PROSITE
4.1 The ENZYME data bank
Release 6.0 of the ENZYME data bank is distributed along with release 19
of SWISS-PROT. ENZYME release 6.0 contains information relative to 3072
enzymes. The data bank is complete and up to date. Until new enzyme
nomenclature data is published we only plan to update the SWISS-PROT
pointers at each release of the protein sequence data bank, correct
eventual errors, and complete the information concerning synonyms and
cofactors using the literature.
4.2 The PROSITE data bank
Release 7.10 of the PROSITE data bank is distributed along with release
19 of SWISS-PROT. Release 7.10 contains 441 documentation chapters that
describes 508 different patterns. Release 7.10 does not really represent
a new release; the only changes between release 7.0 and 7.1 are updating
of the pointers to the SWISS-PROT entries whose name have been modified
between release 18 and 19. The next release of PROSITE (8.0) will be
distributed with release 20 of SWISS-PROT.
5. WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, as for example if the function
of a protein has been clarified or if new post-translational information
has become available.
<PAGE>
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.64 Gln (Q) 4.09 Leu (L) 9.11 Ser (S) 7.10
Arg (R) 5.23 Glu (E) 6.27 Lys (K) 5.85 Thr (T) 5.85
Asn (N) 4.46 Gly (G) 7.11 Met (M) 2.32 Trp (W) 1.30
Asp (D) 5.24 His (H) 2.27 Phe (F) 3.96 Tyr (Y) 3.21
Cys (C) 1.82 Ile (I) 5.45 Pro (P) 5.09 Val (V) 6.49
Asx (B) 0.01 Glx (Z) 0.01 Xaa (X) 0.03
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 2986
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 1302
2x: 543
3x: 303
4x: 184
5x: 128
6x: 93
7x: 78
8x: 40
9x: 60
10x: 32
11- 20x: 111
21-100x: 88
>100x: 24
<PAGE>
A.2.2 Table of the most represented species
Number Frequency Species
1 1790 Human
2 1512 Escherichia coli
3 1043 Mouse
4 972 Rat
5 735 Baker's yeast (Saccharomyces cerevisiae)
6 499 Bovine
7 414 Fruit fly (Drosophila melanogaster)
8 362 Chicken
9 286 Bacillus subtilis
10 261 Rabbit
11 252 African clawed frog (Xenopus laevis)
12 251 Vaccinia virus (strain Copenhagen)
13 238 Pig
14 201 Salmonella typhimurium
15 193 Human cytomegalovirus (strain AD169)
16 166 Bacteriophage T4
17 154 Maize
18 132 Rice
19 113 Vaccinia virus (strain WR)
20 112 Tobacco
Pea
22 110 Wheat
23 105 Staphylococcus aureus
24 101 Slime mold (Dictyostelium discoideum)
25 98 Sheep
26 97 Barley
27 90 Fission yeast (Schizosaccharomyces pombe)
28 89 Spinach
29 87 Pseudomonas aeruginosa
30 85 Soybean
31 84 Liverwort (Marchantia polymorpha)
32 81 Agrobacterium tumefaciens
33 80 Dog
Klebsiella pneumoniae
35 79 Neurospora crassa
<PAGE>
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 1458 1001-1100 203
51- 100 2439 1101-1200 124
101- 150 3466 1201-1300 97
151- 200 2112 1301-1400 63
201- 250 1757 1401-1500 49
251- 300 1558 1501-1600 27
301- 350 1395 1601-1700 24
351- 400 1365 1701-1800 26
401- 450 1052 1801-1900 29
451- 500 1121 1901-2000 22
501- 550 855 2001-2100 9
551- 600 583 2101-2200 24
601- 650 422 2201-2300 30
651- 700 304 2301-2400 11
701- 750 284 2401-2500 13
751- 800 226 >2500 51
801- 850 178
851- 900 190
901- 950 116
951-1000 112
Currently the ten largest sequences are:
RYNR$RABIT 5037 a.a.
RYNR$HUMAN 5032 a.a.
APB$HUMAN 4563 a.a.
APOA$HUMAN 4548 a.a.
POLG$BVDV 3988 a.a.
POLG$HCVA 3898 a.a.
POLG$HCVB 3898 a.a.
TRX$DROME 3759 a.a.
ACVA$PENCH 3746 a.a.
DMD$HUMAN 3685 a.a.
<PAGE>
APPENDIX B: ON-LINE EXPERTS
B.1 List of on-line experts for PROSITE and SWISS-PROT
Field of expertise Name Email address
----------------------------- ------------------ --------------------------
African swine fever virus Yanez R.J. ryanez@cbm2.uam.es
Alcohol dehydrogenases Bengt P. bengt@medfys.ki.se
Aldehyde dehydrogenases Bengt P. bengt@medfys.ki.se
Alpha-crystallins/HSP-20 Leunissen J.A.M. jackl@caos.caos.kun.nl
Alpha-2-macroglobulins Van Leuven F. fred@blekul13.bitnet
Apolipoproteins Boguski M.S. boguski@ncbi.nlm.nih.gov
Arrestins Kolakowski L.F.Jr. lfk@athena.mit.edu
Bacteriophage P4 proteins Halling C. chh9@midway.uchicago.edu
Beta-lactamases Brannigan J. jab5@vaxa.york.ac.uk
Chitinases Henrissat B. cermav@frgren81.bitnet
CTF/NF-I Mermod N. nmermod@clsuni51.bitnet
Cytochromes P450 Holsztynska E.J. ela@netcom.uucp
netcom!ela@apple.com
EF-hand calcium-binding Cox J.A. cox@cgeuge52.bitnet
Kretsinger R.H. rhk5i@virginia.bitnet
Eryf1-type zinc-fingers Boguski M.S. boguski@ncbi.nlm.nih.gov
fruR/lacI family HTH proteins Reizer J. jreizer@ucsd.edu
Glucanases Henrissat B. cermav@frgren81.bitnet
Beguin P. phycel@pasteur.bitnet
G-protein coupled receptors Chollet A. chollet@clients.switch.ch
Attwood T.K. bph6tka@biovax.leeds.ac.uk
GTPase-activating proteins Boguski M.S. boguski@ncbi.nlm.nih.gov
HMG1/2 and HMG-14/17 Landsman D. landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases Kolakowski L.F.Jr. lfk@athena.mit.edu
Integrases Roy P.H. 2020000@lavalvx1.bitnet
Phytochromes Partis M.D. partis@gcri.afrc.ac.uk
Prokaryotic carbohydrate Reizer J. jreizer@ucsd.edu
kinases
Protein kinases Hanks S. hanks@vuctrvax.bitnet
Restriction-modification Bickle T. bickle@urz.unibas.ch
enzymes Roberts R.J. roberts@cshl.org
Ribosomal protein S3 Hallick R. hallick%biotec@arizona.edu
Ribosomal protein S15 Ellis S.R. srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases Harayama S. harayama@cgecmu51.bitnet
Sodium symporters Reizer J. jreizer@ucsd.edu
Subtilisin family proteases Brannigan J. jab5@vaxa.york.ac.uk
Thiol proteases Turks B. turk@ijs.ac.mail.yu
Thiol proteases inhibitors Turks B. turk@ijs.ac.mail.yu
TPR repeats Boguski M.S. boguski@ncbi.nlm.nih.gov
Transit peptides von Heijne G. gunnar@cbts.sunet.se
Type-II membrane antigens Levy S. levy@cellbio.stanford.edu
Xylose isomerase Jenkins J. jenkins@frira.afrc.ac.uk
<PAGE>
B.2 Requirements to fulfill to become an on-line expert
An expert should be a scientist working with specific famili(es) of
proteins (or specific domains) and which would:
a) Review the protein sequences in SWISS-PROT and the patterns/matrices
in PROSITE relevant to their field of research.
b) Agree to be contacted by people that have obtained new sequence(s)
which seem to belong to "their" familie(s) of proteins.
c) Have access to electronic mail and be willing to use it to send and
receive data.
If you are willing to be part of this scheme please contact Amos Bairoch
at one of the following electronic mail addresses:
bairoch@cgecmu51.bitnet
bairoch@cmu.unige.ch
<PAGE>
