Swiss-Prot release 33.0
Published February 1, 1996
SWISS-PROT RELEASE 33.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 33.0 of SWISS-PROT contains 52'205 sequence entries, comprising
18'531'384 amino acids abstracted from 45'351 references. This
represents an increase of 6.5% over release 32. The growth of the data
bank is summarized below.
Release  Date   Number of entries  Number of amino acids
2.0      09/86               3939                900 163
3.0      11/86               4160                969 641
4.0      04/87               4387              1 036 010
5.0      09/87               5205              1 327 683
6.0      01/88               6102              1 653 982
7.0      04/88               6821              1 885 771
8.0      08/88               7724              2 224 465
9.0      11/88               8702              2 498 140
10.0     03/89              10008              2 952 613
11.0     07/89              10856              3 265 966
12.0     10/89              12305              3 797 482
13.0     01/90              13837              4 347 336
14.0     04/90              15409              4 914 264
15.0     08/90              16941              5 486 399
16.0     11/90              18364              5 986 949
17.0     02/91              20024              6 524 504
18.0     05/91              20772              6 792 034
19.0     08/91              21795              7 173 785
20.0     11/91              22654              7 500 130
21.0     03/92              23742              7 866 596
22.0     05/92              25044              8 375 696
23.0     08/92              26706              9 011 391
24.0     12/92              28154              9 545 427
25.0     04/93              29955             10 214 020
26.0     07/93              31808             10 875 091
27.0     10/93              33329             11 484 420
28.0     02/94              36000             12 496 420
29.0     06/94              38303             13 464 008
30.0     10/94              40292             14 147 368
31.0     02/95              43470             15 335 248
32.0     11/95              49340             17 385 503
33.0     02/96              52205             18 531 384
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 32
2.1 Sequences and annotations
2'910 sequences have been added since release 32, the sequence data of
1'085 existing entries has been updated and the annotations of 6'340
entries have been revised.
Major annotation and sequence updates have been made in preparation for
the changes that will take place in release 34 (see section 3.1 of these
notes).
2.2 What's happening with the model organisms
We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:
- Be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates;
- Provide a higher level of annotation;
- Provide cross-references to specialized database(s) that contain,
among other data, some genetic information about the genes that code
for these proteins;
- Provide specific indices or documents.
What was done since the last release or in preparation for the next
release concerning model organisms:
- We have added Mycoplasma genitalium to the list of model organisms.
It is the second bacterial genome to be completely sequenced. We have
already annotated 344 of the 470 putative proteins encoded by this
small genome.
- We have started a major effort in catching up with the backlog of
sequences from eukaryotic model organisms. In particular we added 262
entries from yeast, 194 from human, 180 from S.pombe, 82 from
C.elegans, 68 from A.thaliana and 50 from Drosophila.
- We have added to SWISS-PROT all the sequences from yeast chromosome
X. We plan to integrate data from chromosome XIII very soon.
Here is the current status of the model organisms:
Organism        Database               Index file      Number of
                cross-referenced                       sequences
--------------  ---------------------  --------------  ---------
A.thaliana      None yet               In preparation        500
B.subtilis      SubtiList              SUBTILIS.TXT         1389
C.albicans      None yet               CALBICAN.TXT          106
C.elegans       WormPep                CELEGANS.TXT         1006
D.discoideum    DictyDB                DICTY.TXT             213
D.melanogaster  FlyBase                In preparation        818
E.coli          EcoGene                ECOLI.TXT            3471
H.influenzae    None yet               HAEINFLU.TXT         1577
H.sapiens       MIM                    MIMTOSP.TXT          3475
M.genitalium    None yet               In preparation        344
S.cerevisiae    LISTA/SGD              YEAST.TXT            3653
S.typhimurium   StyGene                SALTY.TXT             603
S.pombe         None yet               POMBE.TXT             640
S.solfataricus  None yet               None yet               61
2.3 Major changes to the cross-references to EMBL
In this release, the format of the DR (Database cross-Reference) lines
pointing to EMBL Nucleotide Sequence Database entries has been changed
from:
DR EMBL; ACCESSION_NUMBER; ENTRY_NAME.
to:
DR EMBL; ACCESSION_NUMBER; PID; STATUS_IDENTIFIER.
Where 'PID' stands for the "Protein IDentification" number. It is a
number that you find in EMBL and GenBank in a qualifier called
"/db_xref" which is tagged to every CDS in the nucleotide database.
Example:
FT CDS 54..1382
FT /note="ribulose-1,5-bisphosphate carboxylase/
FT oxygenase activase precursor"
FT /db_xref="PID:g1006835"
When an EMBL database CDS exists as a sequence report in SWISS-PROT, the
DR lines of the corresponding SWISS-PROT entry have been updated by
citing the PID as secondary identifier. In all cases where a PID has
been integrated into SWISS-PROT, a "/db_xref" qualifier citing the
corresponding SWISS-PROT entry has been added to the EMBL database CDS
labeled with this PID. Example:
FT CDS 14556..15696
FT /gene="cytochrome b"
FT /codon_start=1
FT /product="apoprotein"
FT /db_xref="PID:g463170"
FT /db_xref="SWISS-PROT:P12778"
This approach enables us to point precisely from a given SWISS-PROT
entry to one of potentially many CDS in the corresponding EMBL entry and
vice versa. This change also allows the development of software tools
that automatically retrieve the part of a nucleotide sequence entry that
codes for a specific protein. This is especially useful in the context
of the World-Wide Web, as it will make it unnecessary to retrieve, for
example, the complete sequence of a yeast chromosome when one only wants
the nucleotide sequence coding for a specific protein encoded on that
chromosome.
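As an illustration of the kind of software tool mentioned above, the
following Python sketch collects the location and the "/db_xref"
qualifiers of each CDS from a list of feature table (FT) lines. It is
only a sketch under stated assumptions: the name parse_cds_xrefs and the
two regular expressions are ours, and a feature key line is recognized
simply as a key followed by a single location token, which is a
simplification of the real EMBL feature table layout.

```python
import re

# Assumption: a feature key line is "FT", a key that does not start with "/",
# and one location token containing at least one digit.
FEATURE_RE = re.compile(r"^FT\s+([^/\s]\S*)\s+(\S*\d\S*)$")
# A /db_xref qualifier of the form /db_xref="DATABASE:IDENTIFIER".
XREF_RE = re.compile(r'^FT\s+/db_xref="([^:"]+):([^"]+)"$')


def parse_cds_xrefs(ft_lines):
    """Return one dict per CDS with its location and db_xref qualifiers."""
    cds_list = []
    current = None  # the CDS being read, or None inside any other feature
    for raw in ft_lines:
        line = raw.rstrip()
        feature = FEATURE_RE.match(line)
        if feature:
            key, location = feature.groups()
            if key == "CDS":
                current = {"location": location, "db_xref": {}}
                cds_list.append(current)
            else:
                current = None
            continue
        xref = XREF_RE.match(line)
        if xref and current is not None:
            current["db_xref"][xref.group(1)] = xref.group(2)
    return cds_list


example = [
    "FT   CDS             14556..15696",
    'FT                   /gene="cytochrome b"',
    "FT                   /codon_start=1",
    'FT                   /product="apoprotein"',
    'FT                   /db_xref="PID:g463170"',
    'FT                   /db_xref="SWISS-PROT:P12778"',
]
cds = parse_cds_xrefs(example)
```

With the example lines above, the single CDS found carries both its PID
and the SWISS-PROT accession number, which is the information needed to
go from a protein entry to the exact coding region.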
An additional important principle of the PID system is that whenever a
change made to the nucleotide entry, or to the annotations of that
entry, modifies the translated protein sequence, the PID number
corresponding to the modified CDS is replaced by a completely new
number. The old number will be kept in a special field tagged to the
CDS. The exact syntax of this field is under discussion at the
international nucleotide databases.
The new cross-referencing system will allow a much closer
interconnection between SWISS-PROT and the international nucleotide
sequence databases. For example, it will allow us to automatically take
into account sequence updates made to the nucleotide entry when these
updates have an impact on the derived protein sequence(s).
It should also be noted that the "PID" numbers in the context of GenBank
replace the "NCBI gi" numbering system which was present in the "/note"
qualifier. The "gi" identifiers for the nucleic acid sequences have been
replaced by "NID" (nucleic acid identifier) numbers.
The 'STATUS_IDENTIFIER' provides information about the relationship
between the sequence in the SWISS-PROT entry and the CDS in the
corresponding EMBL entry.
a) In most cases the translation of the EMBL nucleotide sequence CDS
results in the same sequence as shown in the corresponding SWISS-PROT
entry or the differences are mentioned in the SWISS-PROT feature (FT)
lines as CONFLICT, VARIANT or VARSPLIC and in the RP lines. In these
cases the status identifier shows a dash ("-").
Example:
DR EMBL; Y00312; G63880; -.
b) In some cases the translation of the EMBL nucleotide sequence CDS
results in a sequence different from the sequence shown in the
corresponding SWISS-PROT entry and the differences are either not
mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or
VARSPLIC and in the RP lines, or simply do not meet the criteria for
such situations.
1) If the difference is due to a different start of the sequence (e.g.
SWISS-PROT believes that the start of the sequence is upstream or
downstream of the site annotated as the start of the sequence in the
EMBL database), the status identifier shows the comment "ALT_INIT".
Example:
DR EMBL; L29151; G466334; ALT_INIT.
2) If the difference is due to a different termination of the sequence
(e.g. SWISS-PROT believes that the termination of the sequence is
upstream or downstream of the site annotated as the end of the
sequence in the EMBL database), the status identifier shows the
comment "ALT_TERM". Example:
DR EMBL; L20562; G398099; ALT_TERM.
3) If the difference is due to frameshifts in the EMBL sequence, the
status identifier shows the comment "ALT_FRAME". Example:
DR EMBL; M95935; G146416; ALT_FRAME.
4) If the difference is not due to the cases mentioned above (e.g. wrong
intron-exon boundaries given in the EMBL entry) or to a mixture of
the cases mentioned above, the status identifier shows the comment
"ALT_SEQ". Example:
DR EMBL; X79206; G809602; ALT_SEQ.
c) In some cases the nucleotide sequence of a complete CDS is divided
into exons present in different EMBL entries. We point to the
exon-containing EMBL entries by citing the PID as secondary identifier
and adding the comment "JOINED" in the status identifier. These EMBL
entries do not contain a CDS feature; they contain exons joined to a CDS
feature which is labeled with the given PID.
Example:
DR EMBL; M63397; G177196; -.
DR EMBL; M63395; G177196; JOINED.
DR EMBL; M63396; G177196; JOINED.
In the above example the SWISS-PROT sequence is derived from the CDS
labeled with the PID G177196. This CDS feature can be found in the EMBL
entry M63397. Exons belonging to this CDS are not only found in EMBL
entry M63397, but also in the EMBL entries M63395 and M63396.
d) In some cases there is no CDS feature key annotating a protein
translation in an EMBL entry and thus no PID for that CDS. Therefore it
is not possible for us to point to a PID as a secondary identifier. In
these cases we point to the relevant EMBL entries by including a dash
("-") in the position of the missing PID and "NOT_ANNOTATED_CDS" in
the status identifier.
Example:
DR EMBL; J04126; -; NOT_ANNOTATED_CDS.
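For software developers, the new DR line format described in this
section can be read along the following lines. This is a minimal Python
sketch, not an official parser: the names EmblXref and parse_embl_dr are
ours, and the set of accepted status identifiers is limited to the
values listed in this section.

```python
from typing import NamedTuple, Optional

# Status identifiers described in section 2.3 of these notes.
STATUS_VALUES = {
    "-", "ALT_INIT", "ALT_TERM", "ALT_FRAME", "ALT_SEQ",
    "JOINED", "NOT_ANNOTATED_CDS",
}


class EmblXref(NamedTuple):
    accession: str
    pid: Optional[str]  # None when the EMBL entry has no annotated CDS
    status: str


def parse_embl_dr(line):
    """Parse 'DR EMBL; ACCESSION_NUMBER; PID; STATUS_IDENTIFIER.'"""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    # Drop the line code and the final period, then split on semicolons.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "EMBL":
        raise ValueError("not a new-format EMBL cross-reference: %r" % line)
    _, accession, pid, status = fields
    if status not in STATUS_VALUES:
        raise ValueError("unknown status identifier: %r" % status)
    return EmblXref(accession, None if pid == "-" else pid, status)


xref = parse_embl_dr("DR   EMBL; L29151; G466334; ALT_INIT.")
```

Old-format lines (DR EMBL; ACCESSION_NUMBER; ENTRY_NAME.) have only three
fields and are rejected by this sketch, which makes it easy to detect
files that predate this release.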
2.4 New cross-references
We have added cross-references from SWISS-PROT to the Harefield Hospital
2D gel protein databases prepared under the supervision of Mike Dunn
(see Corbett J.M., Wheeler C.H., Baker C.S., Yacoub M.H. and Dunn M.J.;
Electrophoresis 15:1459-1465(1994)). These cross-references are present
in the DR lines:
Data bank identifier: HSC-2DPAGE
Primary identifier: The protein spot unique identifier [1]
Secondary identifier: The species of origin [2]
Example: HSC-2DPAGE; P47985; HUMAN.
[1] The Harefield 2D databases use SWISS-PROT primary accession numbers
as the alphanumeric designation of spots that are linked to SWISS-PROT
entries.
[2] Currently only 'HUMAN' is used, but 'RAT' and 'DOG' will be added
in the next release.
2.5 Introduction of a new CC line-type topic (MASS SPECTROMETRY)
We have introduced a new 'topic' for the comments (CC) line-type: MASS
SPECTROMETRY. This topic is used to report the exact molecular weight of
a protein or part of a protein as determined by mass spectrometric
methods. The syntax of this new topic is:
CC -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX]; METHOD=XX[;
RANGE=XX-XX].
Where
- "MW=XXX" is the determined molecular weight (MW);
- "MW_ERR=XX" (optional) is the accuracy or error range of the MW
measurement;
- "METHOD=XX" is the mass spectrometric method;
- "RANGE=XX-XX" (optional) is used to indicate what part of the protein
sequence entry corresponds to the molecular weight. If this qualifier
is not present, the MW value corresponds to the full length of the
protein sequence.
Examples of its usage:
CC -!- MASS SPECTROMETRY: MW=13423.3; METHOD=ELECTROSPRAY.
CC -!- MASS SPECTROMETRY: MW=71890; MW_ERR=7; METHOD=ELECTROSPRAY.
CC -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY;
CC RANGE=40-119.
It should be noted that the syntax of this topic may evolve in future
releases as we expect feedback from groups using MS for protein
identification on 2D gels, MW determination and characterization of
post-translational modifications.
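The syntax given above can be read with a few lines of code. The
following Python sketch is only an illustration: the function name
parse_mass_spectrometry and the keys of the returned dictionary are
ours, and it assumes that the CC lines of a single MASS SPECTROMETRY
topic (including any continuation line) are passed together.

```python
def parse_mass_spectrometry(cc_lines):
    """Parse the CC lines of one MASS SPECTROMETRY topic into a dict."""
    # Remove the "CC" line code and join continuation lines.
    text = " ".join(line[2:].strip() for line in cc_lines)
    prefix = "-!- MASS SPECTROMETRY:"
    if not text.startswith(prefix):
        raise ValueError("not a MASS SPECTROMETRY topic")
    body = text[len(prefix):].strip().rstrip(".")
    # Each qualifier has the form NAME=VALUE and is separated by ';'.
    fields = dict(
        part.strip().split("=", 1) for part in body.split(";") if part.strip()
    )
    result = {"mw": float(fields["MW"]), "method": fields["METHOD"]}
    if "MW_ERR" in fields:  # optional qualifier
        result["mw_err"] = float(fields["MW_ERR"])
    if "RANGE" in fields:  # optional; absent means the full-length sequence
        start, end = fields["RANGE"].split("-")
        result["range"] = (int(start), int(end))
    return result


ms = parse_mass_spectrometry([
    "CC   -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY;",
    "CC       RANGE=40-119.",
])
```

Since the syntax of this topic may still evolve, a parser of this kind
should be prepared to meet qualifiers other than the four described
here.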
2.6 Change in the syntax of the SQ line
The SQ (SeQuence header) line marks the beginning of the sequence data
and gives a quick summary of its content. The format of the SQ line used
to be:
SQ SEQUENCE XXXX AA; XXXXX MW; XXXXX CN;
The line contains the length of the sequence in amino acids (AA)
followed by the molecular weight (MW) rounded to the nearest dalton and
a checking number (CN) as shown in the example:
SQ SEQUENCE 104 AA; 11530 MW; 54319 CN;
Starting with this release, we have replaced the checking number (CN) by
a 32-bit CRC (Cyclic Redundancy Check) value. The new syntax is:
SQ SEQUENCE XXXX AA; XXXXX MW; XXXXXXXX CRC32;
Example:
SQ SEQUENCE 104 AA; 11530 MW; 7A70363C CRC32;
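Programs that read SWISS-PROT may need to accept both the old and the
new SQ line during the transition. The following Python sketch shows one
way to do this. The name parse_sq_line and the returned dictionary keys
are ours; the sketch only reads the value stored on the line and does
not attempt to recompute the CRC, since the details of the computation
are not described in these notes.

```python
import re

# Matches both "... XXXXX CN;" (old) and "... XXXXXXXX CRC32;" (new).
SQ_RE = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;\s+([0-9A-F]+)\s+(CN|CRC32);$"
)


def parse_sq_line(line):
    """Return length, molecular weight and check value of an SQ line."""
    match = SQ_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized SQ line: %r" % line)
    length, mw, check, kind = match.groups()
    # The CRC32 value is written in hexadecimal, the old CN in decimal.
    value = int(check, 16) if kind == "CRC32" else int(check)
    return {
        "length": int(length),
        "mw": int(mw),
        "check_type": kind,
        "check_value": value,
    }


new_sq = parse_sq_line("SQ   SEQUENCE   104 AA;  11530 MW;  7A70363C CRC32;")
old_sq = parse_sq_line("SQ   SEQUENCE   104 AA;  11530 MW;  54319 CN;")
```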
2.7 Status of the documentation files
SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user
manual, release notes, the various indices for authors, citations,
keywords, etc.), but many have been created recently and we are
continuously adding new files. Since release 32, we have added 2 new
document files. The following table lists all the documents that are
either currently available or that we plan to add in the next few
months.
USERMAN .TXT User manual
RELNOTES.TXT Release notes
SHORTDES.TXT Short description of entries in SWISS-PROT
JOURLIST.TXT List of abbreviations for journals cited
KEYWLIST.TXT List of keywords in use
SPECLIST.TXT List of organism identification codes
EXPERTS .TXT List of on-line experts for PROSITE and SWISS-PROT
SUBMIT .TXT Submission of sequence data to the SWISS-PROT data bank
ACINDEX .TXT Accession number index
AUTINDEX.TXT Author index
CITINDEX.TXT Citation index
KEYINDEX.TXT Keyword index
SPEINDEX.TXT Species index
7TMRLIST.TXT List of 7-transmembrane G-linked receptors entries
AATRNASY.TXT List of aminoacyl-tRNA synthetases
ALLERGEN.TXT Nomenclature and index of allergen sequences
CALBICAN.TXT Index of Candida albicans entries and their corresponding
gene designations
CDLIST .TXT CD nomenclature for surface proteins of human leucocytes
CELEGANS.TXT Index of Caenorhabditis elegans entries and their
corresponding gene designations and WormPep cross-references
DICTY .TXT Index of Dictyostelium discoideum entries and their
corresponding gene designations and DictyDB cross-references
EC2DTOSP.TXT Index of Escherichia coli Gene-protein database entries
referenced in SWISS-PROT
ECOLI .TXT Index of Escherichia coli K12 chromosomal entries and
their corresponding EcoGene cross-reference
EMBLTOSP.TXT Index of EMBL Database entries referenced in SWISS-PROT
[3]
EXTRADOM.TXT Nomenclature of extracellular domains
GLYCOSYL.TXT Classification of glycosyl hydrolases families and index
of glycosyl hydrolase entries [1]
HAEINFLU.TXT Index of Haemophilus influenzae RD chromosomal entries
HOXLIST .TXT Vertebrate homeotic Hox proteins: nomenclature and index
HUMCHR21.TXT Index of protein sequence entries encoded on human
chromosome 21
HUMCHR22.TXT Index of protein sequence entries encoded on human
chromosome 22
HUMCHRY .TXT Index of protein sequence entries encoded on human
chromosome Y
MIMTOSP .TXT Index of MIM entries referenced in SWISS-PROT
MYGENIT .TXT Index of Mycoplasma genitalium chromosomal entries [2]
NOMLIST .TXT List of nomenclature related references for proteins
PDBTOSP .TXT Index of Brookhaven PDB entries referenced in SWISS-PROT
PEPTIDAS.TXT Classification of peptidase families and index of
peptidases entries
PLASTID .TXT List of chloroplast and cyanelle encoded proteins
POMBE .TXT Index of Schizosaccharomyces pombe entries in SWISS-PROT
and their corresponding gene designations
RESTRIC .TXT List of restriction enzymes and methylases entries
RIBOSOMP.TXT Index of ribosomal proteins classified by families on the
basis of sequence similarities [2]
SALTY .TXT Index of Salmonella typhimurium LT2 chromosomal entries
and their corresponding StyGene cross-references
SUBTILIS.TXT Index of Bacillus subtilis 168 chromosomal entries and
their corresponding SubtiList cross-references
YEAST .TXT Index of Saccharomyces cerevisiae entries and their
corresponding gene designations
YEAST1 .TXT Yeast Chromosome I entries
YEAST2 .TXT Yeast Chromosome II entries
YEAST3 .TXT Yeast Chromosome III entries
YEAST5 .TXT Yeast Chromosome V entries
YEAST6 .TXT Yeast Chromosome VI entries
YEAST8 .TXT Yeast Chromosome VIII entries
YEAST9 .TXT Yeast Chromosome IX entries
YEAST10 .TXT Yeast Chromosome X entries [1]
YEAST11 .TXT Yeast Chromosome XI entries
YEAST13 .TXT Yeast Chromosome XIII entries [2]
Notes:
[1] New in release 33.
[2] Will be available starting with release 34 of October 1996.
[3] The format of that file was completely changed to take into account
the new format of cross-references to EMBL that includes the "PID"
(see section 2.3).
We have continued to include in some SWISS-PROT document files the
references of World-Wide Web sites relevant to the subject under
consideration. There are now 11 documents that include such links.
2.8 The ExPASy World-Wide Web server
2.8.1 Background information
The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
the World-Wide Web (WWW) molecular biology server ExPASy. WWW is a
global information retrieval system merging the power of world-wide
networks, hypertext and multimedia. Through hypertext links, it gives
access to documents and information available on thousands of servers
around the world. To access a WWW server one needs a WWW browser.
Currently, the most popular browser is Netscape Navigator(TM) from
Netscape Communications Corp. (available from ftp.netscape.com). Using a
WWW browser, one has access to all the hypertext documents stored on the
ExPASy server as well as many other WWW servers.
The ExPASy server was made available to the public in September 1993. In
February 1996 a cumulative total of 4 million connections was attained.
It may be accessed through its Uniform Resource Locator (URL - the
addressing system defined in WWW), which is:
http://expasy.hcuge.ch/
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and SWISS-
3DIMAGE databases and, through any SWISS-PROT protein sequence entry, to
other databases such as EMBL, EcoCyc, FlyBase, GCRDb, LISTA, MaizeDB,
SubtiList, OMIM, PDB, HSSP, ProDom, REBASE, SGD, YEPD and Medline. Using
a browser which is able to display images one can also remotely access
2D gel image data from SWISS-2DPAGE. ExPASy also offers many tools for
the analysis of protein sequences and 2D gels.
For more information on the ExPASy WWW server, you can read the
following article:
Appel R.D., Bairoch A., Hochstrasser D.F.
A new generation of information retrieval tools for biologists: the
example of the ExPASy WWW server.
Trends Biochem. Sci. 19:258-260(1994).
Or you can contact Dr. Ron Appel:
Email: ron.appel@dim.hcuge.ch
Fax: +41-22-372 61 98
2.8.2 SWISS-SHOP
Thanks to the work of Manuel Peitsch from the Geneva Glaxo Institute for
Molecular Biology, we can provide, on ExPASy, a service called
SWISS-SHOP. SWISS-SHOP allows any user of SWISS-PROT to indicate which
proteins he/she is interested in. This can be done using various
criteria that can be combined:
- By entering one or more words that should be present in the
description line;
- By entering one or more species name(s) or taxonomic division(s);
- By entering one or more keywords;
- By entering one or more author names;
- By entering the accession number (or entry name) of a PROSITE pattern
or a user-defined sequence pattern;
- By entering the accession number (or entry name) of an existing
SWISS-PROT entry or by entering a "private" sequence.
Every week, the new sequences entered in SWISS-PROT are automatically
compared with all the criteria that have been defined by the users. If a
sequence corresponds to the selection criteria defined by a user, that
sequence is sent by electronic mail.
2.8.3 What is new on ExPASy
Since the last release, there has been a large number of new
developments on the ExPASy WWW server. Here are some highlights of these
changes:
- ProtScale is a new tool that computes and displays the profile
produced by an amino acid scale on a selected protein, either taken
from SWISS-PROT or entered by the user. 50 scales are provided,
including 'classics' such as the Kyte and Doolittle hydrophobicity
scale.
- We have added a new tool, SIM, which computes a user-defined number of
best non-intersecting alignments between two sequences. The results
of the alignment can be viewed graphically using the LALNVIEW program
developed by Laurent Duret (duret@dim.hcuge.ch), which can be
downloaded directly from ExPASy and is available for PC under
MS-Windows, Macs and UNIX.
- We have recently started to create a list of biomolecular servers for
our own use. This list is available on the ExPASy top page or
directly from:
http://expasy.hcuge.ch/www/amos_www_links.html
- WWW links have been implemented between some SWISS-PROT entries and
HSC-2DPAGE (see section 2.4).
- Many other changes have been made to all parts of the server.
2.9 Weekly updates of SWISS-PROT
Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are updated at each update:
new_seq.dat Contains all the new entries since the last full release;
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release;
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Currently these files are available on the following anonymous ftp
servers:
Organization ExPASy (Geneva University Expert Protein Analysis System)
Address expasy.hcuge.ch (or 129.195.254.61)
Directory /databases/swiss-prot/updates
Organization National Center for Biotechnology Information (NCBI)
Address ncbi.nlm.nih.gov (or 130.14.20.1)
Directory /repository/swiss-prot/updates
Organization European Bioinformatics Institute (EBI)
Address ftp.ebi.ac.uk (or 193.62.196.6)
Directory /pub/databases/swissprot/new
Organization Bioinformatics Unit, Weizmann Institute of Science (WIS)
Address bioinformatics.weizmann.ac.il (or 132.76.55.12)
Directory /pub/databases/swiss-prot/updates
!!! Important notes !!!
Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks will elapse
between two updates.
Due to the current mechanism used to build a release, the entries that
are provided in these updates are not guaranteed to be error-free.
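The three update files use the same flat-file format as the full
release, so they can be scanned entry by entry. The Python sketch below
relies on two conventions of the SWISS-PROT format described in the user
manual rather than in these notes: each entry starts with an ID line and
ends with a "//" line. The function names iter_entries and entry_name
are ours, and the two miniature entries are invented for the
demonstration.

```python
def iter_entries(lines):
    """Yield each entry of a SWISS-PROT flat file as a list of lines."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):  # entry terminator
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a truncated file without a final terminator
        yield entry


def entry_name(entry):
    """Return the entry name found on the ID line, or None."""
    for line in entry:
        if line.startswith("ID"):
            return line.split()[1]
    return None


# Invented two-entry sample standing in for the content of new_seq.dat.
sample = [
    "ID   TEST1_HUMAN    STANDARD;      PRT;   104 AA.\n",
    "SQ   SEQUENCE   104 AA;  11530 MW;  7A70363C CRC32;\n",
    "//\n",
    "ID   TEST2_YEAST    STANDARD;      PRT;    50 AA.\n",
    "//\n",
]
names = [entry_name(e) for e in iter_entries(sample)]
```

In practice the sample list would be replaced by an open file object for
new_seq.dat, upd_seq.dat or upd_ann.dat.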
3. IMPORTANT FORTHCOMING CHANGE
3.1 TREMBL - a supplement to SWISS-PROT
The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
SWISS-PROT. Since we do not want to dilute the quality standards of
SWISS-PROT by incorporating sequences without proper sequence analysis
and annotation, we cannot speed up the incorporation of new incoming
data indefinitely. But as we also want to make the sequences available
as fast as possible, we will introduce with SWISS-PROT a
computer-annotated supplement to SWISS-PROT. This supplement consists of
entries in SWISS-PROT-like format derived from the translation of all
coding sequences (CDS) in the EMBL nucleotide sequence database, except
the CDS already included in SWISS-PROT.
We name this supplement TREMBL (TRanslation from EMBL), since the
translation tools used to create the translations of the CDS are based
on the program 'trembl' written by Thure Etzold at the EMBL in
Heidelberg.
We will translate all CDSs in the EMBL Nucleotide Sequence Database
into TREMBL preentries. The preentries already present as sequence
reports in SWISS-PROT will be excluded from TREMBL. Then the remaining
entries will be automatically merged whenever possible to reduce
redundancy in TREMBL.
We will split TREMBL into two main sections, SP-TREMBL and REM-TREMBL:
SP-TREMBL (SWISS-PROT TREMBL) will contain the entries which should be
incorporated into SWISS-PROT. SP-TREMBL will be partially redundant
with SWISS-PROT, since approximately half of these SP-TREMBL entries
will only be additional sequence reports of proteins already in
SWISS-PROT. We will try to merge these sequence reports as fast as
possible with the already existing SWISS-PROT entries for these
proteins, so as to make SWISS-PROT and TREMBL completely nonredundant.
REM-TREMBL (REMaining TREMBL) will contain the entries that we do not
want to include in SWISS-PROT. This section will be organized in four
subsections:
1) Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
We stopped entering immunoglobulins and T-cell receptors into
SWISS-PROT, because we only want to keep the germ line gene derived
translations of these proteins in SWISS-PROT and not all known
somatically recombined variants of these proteins. We are expecting
more than 10'000 immunoglobulins and T-cell receptors in TREMBL. We
would like to create a specialized database dealing with these
sequences as a further supplement to SWISS-PROT and keep only a
representative cross-section of these proteins in SWISS-PROT.
2) Another category of data which will not be included in SWISS-PROT are
synthetic sequences. Again, we do not want to leave these entries in
TREMBL. Ideally one should build a specialized database for
artificial sequences as a further supplement to SWISS-PROT.
3) A third subsection consists of fragments with fewer than seven amino
acids.
4) The last subsection consists of CDS translations where we have strong
evidence to believe that these CDS do not code for real proteins.
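The partition described above can be summarized as a simple decision
procedure. The Python sketch below is purely illustrative and is not the
software used to build TREMBL: the function name classify_preentry, the
dictionary keys and the section labels are all hypothetical, and the
order in which the rules are tested is an assumption of ours.

```python
def classify_preentry(pre):
    """Return the TREMBL section of a translated CDS, or None if excluded.

    'pre' is a dict with hypothetical keys: 'sequence' (str) and optional
    boolean flags set by earlier analysis steps.
    """
    if pre.get("in_swissprot"):
        return None  # already a sequence report in SWISS-PROT
    if pre.get("immunoglobulin_or_tcr"):
        return "REM-TREMBL 1"  # immunoglobulins and T-cell receptors
    if pre.get("synthetic"):
        return "REM-TREMBL 2"  # synthetic sequences
    if pre.get("fragment") and len(pre["sequence"]) < 7:
        return "REM-TREMBL 3"  # fragments with fewer than seven residues
    if pre.get("not_real_protein"):
        return "REM-TREMBL 4"  # CDS unlikely to code for a real protein
    return "SP-TREMBL"  # candidate for incorporation into SWISS-PROT


section = classify_preentry({"sequence": "MKTAYIAKQR"})
```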
The first full release of TREMBL will be distributed with release 34 of
SWISS-PROT. However we are making available, with release 33, a beta
release so that users and software developers can send us feedback about
this new supplement to SWISS-PROT.
4. ENZYME AND PROSITE
4.1 The ENZYME data bank
Release 20.0 of the ENZYME data bank is distributed with release 33 of
SWISS-PROT. ENZYME release 20.0 contains information relative to 3601
enzymes.
4.2 The PROSITE data bank
Release 13.1 of the PROSITE data bank is distributed with release 33 of
SWISS-PROT. This release of PROSITE contains 889 documentation entries
that describe 1'167 different patterns, rules and profiles/matrices.
Release 13.1 does not really represent a new release; the only changes
between releases 13.0 and 13.1 are updates of the pointers to the
SWISS-PROT entries whose names have been modified between releases 32
and 33. The next release of PROSITE (14.0) will be distributed with
release 35 of SWISS-PROT.
WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate
being notified if you find that sequences belonging to your field of
expertise are missing from the data bank. We would also like to be
notified about annotations to be updated, for example if the function
of a protein has been clarified or if new post-translational
information has become available.
========================================================================
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.54 Gln (Q) 4.02 Leu (L) 9.31 Ser (S) 7.19
Arg (R) 5.15 Glu (E) 6.31 Lys (K) 5.94 Thr (T) 5.76
Asn (N) 4.54 Gly (G) 6.86 Met (M) 2.36 Trp (W) 1.26
Asp (D) 5.29 His (H) 2.23 Phe (F) 4.06 Tyr (Y) 3.21
Cys (C) 1.70 Ile (I) 5.72 Pro (P) 4.91 Val (V) 6.52
Asx (B) 0.001 Glx (Z) 0.001 Xaa (X) 0.02
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
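The classification above follows directly from the percentages given in
A.1.1. As a small check, the following Python sketch sorts the twenty
standard amino acids by decreasing frequency; the variable names are
ours and the values are copied from the table in A.1.1.

```python
# Composition in percent, copied from section A.1.1 (standard residues only).
COMPOSITION = {
    "Ala": 7.54, "Gln": 4.02, "Leu": 9.31, "Ser": 7.19,
    "Arg": 5.15, "Glu": 6.31, "Lys": 5.94, "Thr": 5.76,
    "Asn": 4.54, "Gly": 6.86, "Met": 2.36, "Trp": 1.26,
    "Asp": 5.29, "His": 2.23, "Phe": 4.06, "Tyr": 3.21,
    "Cys": 1.70, "Ile": 5.72, "Pro": 4.91, "Val": 6.52,
}

# Sort the three-letter codes by decreasing percentage.
ranking = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranking))
```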
A.2 Distribution of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 5020
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 2250
2x: 808
3x: 446
4x: 285
5x: 209
6x: 189
7x: 129
8x: 96
9x: 105
10x: 44
11- 20x: 204
21- 50x: 154
51-100x: 42
>100x: 59
A.2.2 Table of the most represented species
Number Frequency Species
1 3653 Baker's yeast (Saccharomyces cerevisiae)
2 3475 Human
3 3471 Escherichia coli
4 2137 Mouse
5 1866 Rat
6 1577 Haemophilus influenzae
7 1389 Bacillus subtilis
8 1006 Caenorhabditis elegans
9 833 Bovine
10 818 Fruit fly (Drosophila melanogaster)
11 642 Chicken
12 640 Fission yeast (Schizosaccharomyces pombe)
13 603 Salmonella typhimurium
14 508 African clawed frog (Xenopus laevis)
15 500 Arabidopsis thaliana (Mouse-ear cress)
16 469 Rabbit
17 397 Pig
18 344 Mycoplasma genitalium
19 326 Maize
20 275 Bacteriophage T4
21 256 Rice
22 253 Vaccinia virus (strain Copenhagen)
23 240 Pseudomonas aeruginosa
24 214 Slime mold (Dictyostelium discoideum)
25 213 Tobacco
26 203 Pea
27 193 Human cytomegalovirus (strain AD169)
28 187 Wheat
29 184 Vaccinia virus (strain WR)
30 176 Soybean
31 175 Barley
32 171 Staphylococcus aureus
171 Dog
34 165 Pseudomonas putida
165 Neurospora crassa
36 159 Sheep
37 158 Rhodobacter capsulatus
38 154 Autographa californica nuclear polyhedrosis virus
39 150 Marchantia polymorpha (Liverwort)
150 Klebsiella pneumoniae
41 146 Variola virus
146 Bacillus stearothermophilus
43 142 Spinach
142 Cyanophora paradoxa
45 141 Potato
46 139 Tomato
47 130 Rhizobium meliloti
48 123 Odontella sinensis
49 122 Mycobacterium leprae
50 119 Lactococcus lactis (subsp. lactis)
51 117 Agrobacterium tumefaciens
52 112 Synechocystis sp. (strain PCC 6803)
53 108 Chlamydomonas reinhardtii
54 106 Candida albicans
55 105 Guinea pig
56 104 Streptomyces coelicolor
104 Horse
58 101 Trypanosoma brucei brucei
101 Aspergillus nidulans
A.3 Distribution of the sequences by size
From To Number From To Number
1- 50 2706 1001-1100 471
51- 100 4851 1101-1200 340
101- 150 6660 1201-1300 258
151- 200 5047 1301-1400 169
201- 250 4552 1401-1500 146
251- 300 4075 1501-1600 88
301- 350 3857 1601-1700 68
351- 400 3897 1701-1800 63
401- 450 2963 1801-1900 69
451- 500 2974 1901-2000 41
501- 550 2141 2001-2100 24
551- 600 1521 2101-2200 53
601- 650 1120 2201-2300 56
651- 700 824 2301-2400 24
701- 750 761 2401-2500 31
751- 800 607 >2500 156
801- 850 477
851- 900 481
901- 950 345
951-1000 289
A.4 Longest sequences
The longest sequences (>=4000 residues) are listed here:
HTS1_COCCA 5217
FAT_DROME 5147
RYNR_RABIT 5037
RYNR_PIG 5035
RYNR_HUMAN 5032
RYNC_RABIT 4969
DYHC_DICDI 4725
DYHC_RAT 4644
DYHC_DROME 4639
APB_HUMAN 4563
APOA_HUMAN 4548
RRPA_CVMJH 4488
DYHC_ANTCR 4466
DYHC_TRIGR 4466
GRSB_BACBR 4451
PKSK_BACSU 4447
PKSL_BACSU 4427
YP73_CAEEL 4385
DYHC_NEUCR 4367
DYHC_EMENI 4344
PLEC_RAT 4140
DYHC_YEAST 4092
RRPA_CVH22 4085
A.5 Statistics for journal citations
Total number of journals cited in this release of SWISS-PROT: 710
A.5.1 Table of the frequency of journal citations
Journals cited 1x: 275
2x: 99
3x: 43
4x: 28
5x: 28
6x: 14
7x: 10
8x: 13
9x: 13
10x: 10
11- 20x: 54
21- 50x: 45
51-100x: 21
>100x: 57
A.5.2 List of the most cited journals in SWISS-PROT
Citations Journal abbreviation
--------- ----------------------------------
5010 J. BIOL. CHEM.
3191 NUCLEIC ACIDS RES.
3152 PROC. NATL. ACAD. SCI. U.S.A.
2136 J. BACTERIOL.
1828 GENE
1706 FEBS LETT.
1584 EUR. J. BIOCHEM.
1436 EMBO J.
1392 BIOCHEM. BIOPHYS. RES. COMMUN.
1359 NATURE
1300 BIOCHEMISTRY
1092 BIOCHIM. BIOPHYS. ACTA
1023 J. MOL. BIOL.
996 CELL
956 MOL. CELL. BIOL.
811 MOL. GEN. GENET.
756 PLANT MOL. BIOL.
713 VIROLOGY
708 BIOCHEM. J.
636 SCIENCE
585 MOL. MICROBIOL.
575 J. BIOCHEM.
458 J. VIROL.
407 J. GEN. VIROL.
367 GENOMICS
335 J. CELL BIOL.
299 GENES DEV.
291 PLANT PHYSIOL.
286 YEAST
266 CURR. GENET.
255 J. IMMUNOL.
255 BIOL. CHEM. HOPPE-SEYLER
240 ARCH. BIOCHEM. BIOPHYS.
233 INFECT. IMMUN.
221 MOL. BIOCHEM. PARASITOL.
213 HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
204 HUM. MOL. GENET.
202 J. GEN. MICROBIOL.
193 MOL. ENDOCRINOL.
182 ONCOGENE
177 J. CLIN. INVEST.
169 FEMS MICROBIOL. LETT.
167 AM. J. HUM. GENET.
149 DNA
140 J. EXP. MED.
140 GENETICS
137 J. MOL. EVOL.
134 DEVELOPMENT
123 BLOOD
120 HUM. MUTAT.
117 HUM. GENET.
116 NEURON
114 DNA CELL BIOL.
110 NAT. GENET.
110 APPL. ENVIRON. MICROBIOL.
109 HEMOGLOBIN
104 AGRIC. BIOL. CHEM.
========================================================================
APPENDIX B: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES
The current status of the relationships (cross-references) between some
biomolecular databases is shown in the following schematic:
***********************
****************** * EMBL Nucleotide * **********************
* EPD [Euk.Prom] * <---> * Sequence Database * <---- * ECDC [E.coli map] *
****************** * [EBI] * **********************
***********************
^ ^ ^ ^ ^ ^ ^ ^
****************** | | | I | | | |
* FlyBase * <------+ | | I | | | | **********************
* [D.melanogas.] * | | | I | | | +--------> * GCRDb [7TM recep.] *
****************** | | | I | | | | **********************
| | | I | | | |
****************** | | | I | | | | **********************
* SubtiList * <---------+ | I | | +-----------> * EcoGene [E.coli] *
* [B.subtilis] * | | | I | | | | **********************
****************** | | | I | | | |
| | | I | | | | **********************
****************** | | | I +---------------> * LISTA [Yeast] *
* MaizeDb * <-----------+ I | | | | **********************
* [Zea mays] * | | | I | | | |
****************** | | | I | | | | **********************
| | | I | +-------------> * SGD [Yeast] *
****************** | | | I | | | | **********************
* WormPep * | | | I | | | |
* [C.elegans] * <----+ | | | I | | | | **********************
****************** | | | | I | | | | +------> * DictyDB [D.disco.] *
| | | | I | | | | | **********************
****************** | v v v v v v v v v
* REBASE * *********************** **********************
* [Restriction * <--- * SWISS-PROT * <----- * ENZYME [Nomencl.] *
* enzymes] * * Protein Sequence * **********************
****************** * Data Bank * v
*********************** **********************
****************** ^ ^ ^ ^ ^ ^ ^ | ^ ^ | * OMIM [Human] *
* StyGene * | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
****************** | | | | | | | | | **********************
| | | | | | | | +----------> * ECO2DBASE [2D] *
****************** | | | | | | | | **********************
* Transfac * <------+ | | | | | | |
****************** | | | | | | | **********************
| | | | | | +------------> * SWISS-2DPAGE [2D] *
****************** | | | | | | **********************
* Harefield [2D] * <--------+ | | | | |
****************** | | | | | **********************
| | | | +--------------> * Aarhus/Ghent [2D] *
****************** | | | | **********************
* PROSITE * | | | |
* [Patterns and * <----------+ | | +----------------> **********************
* profiles] * | | * YEPD [Yeast] [2D] *
****************** | +----------------+ **********************
| v |
| *********************** +-> **********************
+--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
*********************** **********************
=End=of=SWISS-PROT=release=33=notes=====================================
