Swiss-Prot release 42.0
Published October 1, 2003
------------------------------------------------------------------------
Swiss-Prot Protein Knowledgebase
Release Notes
Release 42, October 2003
------------------------------------------------------------------------
Table of contents
1) Warning
2) Introduction
3) Status of the model organisms
4) We need your help
5) Some statistics
1) WARNING
Please note that the format of the release notes has changed. We now
also make updated documents available between major Swiss-Prot releases.
The distinct sections of this document have moved to the following sites:
* Description of the changes made to Swiss-Prot since the last
release: http://www.expasy.org/sprot/relnotes/sp_news.html. This
new document lists all recent modifications in Swiss-Prot,
including minor changes with no impact on the work of software
developers. It therefore contains more information than is
announced in the document 'sp_soon.html' (see below).
* Forthcoming changes: all modifications that have an impact on
the Swiss-Prot format are announced in the document
http://www.expasy.org/sprot/relnotes/sp_soon.html.
* Status of the documentation files:
http://www.expasy.org/sprot/userman.html#documentation
* The ExPASy World-Wide Web server:
o Explicit general and continuously updated documentation:
http://www.expasy.org/doc/expasy.pdf
o History of changes, improvements and new features:
http://www.expasy.org/history.html
o Swiss-Flash, a service that reports news of databases,
software and service developments:
http://www.expasy.org/swiss-flash/
* TrEMBL - a supplement to Swiss-Prot:
ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt
* FTP access to Swiss-Prot and TrEMBL:
http://www.expasy.org/sprot/userman.html#ftp and
http://www.expasy.org/sprot/download.html
* PROSITE release notes:
http://www.expasy.org/prosite/psrelnot.html
* Release statistics: http://www.expasy.org/sprot/relnotes/relstat.html
* Relationships between Swiss-Prot and some biomolecular databases:
http://www.expasy.org/sprot/userman.html#relship
2) INTRODUCTION
Release 42.0 of Swiss-Prot contains 135'850 sequence entries, comprising
50'046'799 amino acids abstracted from 109'694 references. This
represents an increase of 11% over release 41.0. The growth of the
database is summarized below.
Release   Date    Number of     Number of
                    entries   amino acids
2.0       09/86       3'939       900'163
3.0       11/86       4'160       969'641
4.0       04/87       4'387     1'036'010
5.0       09/87       5'205     1'327'683
6.0       01/88       6'102     1'653'982
7.0       04/88       6'821     1'885'771
8.0       08/88       7'724     2'224'465
9.0       11/88       8'702     2'498'140
10.0      03/89      10'008     2'952'613
11.0      07/89      10'856     3'265'966
12.0      10/89      12'305     3'797'482
13.0      01/90      13'837     4'347'336
14.0      04/90      15'409     4'914'264
15.0      08/90      16'941     5'486'399
16.0      11/90      18'364     5'986'949
17.0      02/91      20'024     6'524'504
18.0      05/91      20'772     6'792'034
19.0      08/91      21'795     7'173'785
20.0      11/91      22'654     7'500'130
21.0      03/92      23'742     7'866'596
22.0      05/92      25'044     8'375'696
23.0      08/92      26'706     9'011'391
24.0      12/92      28'154     9'545'427
25.0      04/93      29'955    10'214'020
26.0      07/93      31'808    10'875'091
27.0      10/93      33'329    11'484'420
28.0      02/94      36'000    12'496'420
29.0      06/94      38'303    13'464'008
30.0      10/94      40'292    14'147'368
31.0      02/95      43'470    15'335'248
32.0      11/95      49'340    17'385'503
33.0      02/96      52'205    18'531'384
34.0      10/96      59'021    21'210'389
35.0      11/97      69'113    25'083'768
36.0      07/98      74'019    26'840'295
37.0      12/98      77'977    28'268'293
38.0      07/99      80'000    29'085'965
39.0      05/00      86'593    31'411'114
40.0      10/01     101'602    37'315'215
41.0      02/03     122'564    44'986'459
42.0      10/03     135'850    50'046'799
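The 11% increase quoted in the introduction can be reproduced from the
rows for releases 41.0 and 42.0. The following is a minimal Python
sketch, not part of any Swiss-Prot tool: the helper names parse_count
and percent_growth are illustrative, and the code assumes that the
apostrophe is used purely as a thousands separator.

```python
def parse_count(text: str) -> int:
    """Convert a count such as "135'850" to an integer."""
    return int(text.replace("'", ""))


def percent_growth(old: int, new: int) -> float:
    """Return the relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# Values copied from the rows for releases 41.0 and 42.0.
entries_41 = parse_count("122'564")
entries_42 = parse_count("135'850")
residues_41 = parse_count("44'986'459")
residues_42 = parse_count("50'046'799")

entry_growth = percent_growth(entries_41, entries_42)
residue_growth = percent_growth(residues_41, residues_42)
print(f"entries: {entry_growth:.1f}%  amino acids: {residue_growth:.1f}%")
```

Both values round to 11%, whether growth is measured in entries or in
amino acids.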
3) STATUS OF THE MODEL ORGANISMS
We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:
* be as complete as possible. All sequences available at a given
time should be immediately included in Swiss-Prot. This also
includes sequence corrections and updates;
* provide a higher level of annotation;
* provide cross-references to specialized database(s) that contain,
among other data, some information about the genes that code for
these proteins;
* provide specific indexes and documents.
From our efforts to annotate human sequence entries as completely as
possible arose the HPI project <http://www.expasy.org/sprot/hpi/>, and
the bacterial model organisms became the focus of the HAMAP project
<http://www.expasy.org/sprot/hamap/>. Here is the current status of the
model organisms which are not covered by these two projects:
Organism         Database           Index file       Number of
                 cross-references                    sequences
--------------   ----------------   --------------   ---------
A.thaliana       None yet           arath.txt            2'294
C.albicans       None yet           calbican.txt           272
C.elegans        Wormpep            celegans.txt         2'405
D.discoideum     DictyDB            dicty.txt              317
D.melanogaster   FlyBase            fly.txt              1'907
M.musculus       MGD                mgdtosp.txt          6'890
S.cerevisiae     SGD                yeast.txt            4'920
S.pombe          GeneDB_SPombe      pombe.txt            2'267
4) WE NEED YOUR HELP
We welcome feedback from our users. We would especially appreciate your
notifying us if you find that sequences belonging to your field of
expertise are missing from the database. We also would like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new information about
post-translational modifications has become available. To facilitate
this feedback we offer, on the ExPASy WWW server, a form that allows the
submission of updates and/or corrections to Swiss-Prot:
http://www.expasy.org/sprot/update.html
It is also possible, from any entry in Swiss-Prot displayed by the
ExPASy server, to submit updates and/or corrections for that particular
entry. Finally, you can also send your comments by electronic mail to
the address:
swiss-prot@expasy.org
Note that all update requests are assigned a unique identifier of the
form UR-Xnnnn (example: UR-A0123). This identifier is used internally by
the Swiss-Prot staff at SIB and EBI to track requests and is also used
in e-mail exchanges with the persons who have submitted a request.
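For users who want to locate such identifiers in their own
correspondence, the following Python sketch shows one possible pattern.
It rests on an assumption that the text above does not state explicitly:
that "X" stands for a single uppercase letter and "nnnn" for exactly
four digits, as in the example UR-A0123. The function names are
illustrative only.

```python
import re

# Assumption: "X" is one uppercase ASCII letter and "nnnn" is exactly
# four decimal digits, as in the example UR-A0123.
UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text: str) -> bool:
    """Return True if text is exactly one identifier of the assumed form."""
    return UPDATE_REQUEST_ID.fullmatch(text) is not None


def find_update_request_ids(message: str) -> list[str]:
    """Extract all identifiers of the assumed form from a message body."""
    return re.findall(r"\bUR-[A-Z][0-9]{4}\b", message)
```

If the actual identifier scheme is broader than assumed here, the
character classes in the pattern would need to be relaxed accordingly.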
5) SOME STATISTICS
1. AMINO ACID COMPOSITION
1.1 Composition in percent for the complete database
Ala (A) 7.76 Gln (Q) 3.92 Leu (L) 9.60 Ser (S) 6.94
Arg (R) 5.25 Glu (E) 6.55 Lys (K) 5.94 Thr (T) 5.49
Asn (N) 4.25 Gly (G) 6.90 Met (M) 2.37 Trp (W) 1.18
Asp (D) 5.29 His (H) 2.27 Phe (F) 4.05 Tyr (Y) 3.11
Cys (C) 1.57 Ile (I) 5.90 Pro (P) 4.87 Val (V) 6.67
Asx (B) 0.000 Glx (Z) 0.000 Xaa (X) 0.01
Legend: gray = aliphatic, red = acidic, green = small hydroxy,
blue = basic, black = aromatic, white = amide, yellow = sulfur
1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp
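The ranking in 1.2 follows directly from the percentages in 1.1. As an
illustration, the short Python sketch below sorts the twenty standard
amino acids by decreasing percentage; the dictionary is simply a
transcription of table 1.1 and the variable names are illustrative.

```python
# Percentages copied from table 1.1 (three-letter code -> percent).
composition = {
    "Ala": 7.76, "Gln": 3.92, "Leu": 9.60, "Ser": 6.94,
    "Arg": 5.25, "Glu": 6.55, "Lys": 5.94, "Thr": 5.49,
    "Asn": 4.25, "Gly": 6.90, "Met": 2.37, "Trp": 1.18,
    "Asp": 5.29, "His": 2.27, "Phe": 4.05, "Tyr": 3.11,
    "Cys": 1.57, "Ile": 5.90, "Pro": 4.87, "Val": 6.67,
}

# Sort by decreasing percentage to obtain the ranking of section 1.2.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))

# The percentages are rounded, so their sum need not be exactly 100.
total = sum(composition.values())
```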
2. TAXONOMIC ORIGIN
Total number of species represented in this release of Swiss-Prot: 8294
The first twenty species represent 55464 sequences: 40.8 % of the total
number of entries.
2.1 Table of the frequency of occurrence of species
Species represented 1x: 4024
2x: 1258
3x: 656
4x: 417
5x: 269
6x: 257
7x: 195
8x: 153
9x: 131
10x: 74
11- 20x: 342
21- 50x: 246
51-100x: 90
>100x: 182
2.2 Table of the most represented species
------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
1 10159 Homo sapiens (Human)
2 6890 Mus musculus (Mouse)
3 4920 Saccharomyces cerevisiae (Baker's yeast)
4 4832 Escherichia coli
5 3592 Rattus norvegicus (Rat)
6 2653 Bacillus subtilis
7 2405 Caenorhabditis elegans
8 2294 Arabidopsis thaliana (Mouse-ear cress)
9 2267 Schizosaccharomyces pombe (Fission yeast)
10 1907 Drosophila melanogaster (Fruit fly)
11 1773 Haemophilus influenzae
12 1603 Escherichia coli O157:H7
13 1591 Methanococcus jannaschii
14 1425 Bos taurus (Bovine)
15 1385 Mycobacterium tuberculosis
16 1367 Salmonella typhimurium
17 1234 Escherichia coli O6
18 1083 Gallus gallus (Chicken)
19 1082 Shigella flexneri
20 1002 Mycobacterium bovis
21 934 Synechocystis sp. (strain PCC 6803)
22 933 Salmonella typhi
23 929 Pseudomonas aeruginosa
24 916 Archaeoglobus fulgidus
25 865 Xenopus laevis (African clawed frog)
26 831 Sus scrofa (Pig)
27 746 Rhizobium meliloti (Sinorhizobium meliloti)
28 730 Aquifex aeolicus
29 714 Oryctolagus cuniculus (Rabbit)
30 710 Vibrio cholerae
31 687 Mycoplasma pneumoniae
32 650 Yersinia pestis
33 615 Pasteurella multocida
34 600 Treponema pallidum
35 600 Mycobacterium leprae
36 578 Streptomyces coelicolor
37 572 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
38 560 Buchnera aphidicola (subsp. Schizaphis graminum)
39 551 Helicobacter pylori (Campylobacter pylori)
40 543 Bacillus halodurans
41 541 Rickettsia prowazekii
42 532 Helicobacter pylori J99 (Campylobacter pylori J99)
43 522 Vibrio parahaemolyticus
44 521 Methanobacterium thermoautotrophicum
45 516 Anabaena sp. (strain PCC 7120)
46 499 Zea mays (Maize)
47 486 Mycoplasma genitalium
48 473 Vibrio vulnificus
49 463 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
50 443 Neisseria meningitidis (serogroup B)
51 440 Neisseria meningitidis (serogroup A)
52 435 Thermotoga maritima
53 432 Ralstonia solanacearum (Pseudomonas solanacearum)
54 422 Oryza sativa (Rice)
55 419 Rhizobium loti (Mesorhizobium loti)
56 418 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
57 417 Listeria monocytogenes
58 416 Caulobacter crescentus
59 414 Chlamydia trachomatis
60 411 Xanthomonas campestris (pv. campestris)
61 411 Borrelia burgdorferi (Lyme disease spirochete)
62 410 Listeria innocua
63 407 Clostridium acetobutylicum
64 403 Rhizobium sp. (strain NGR234)
65 402 Canis familiaris (Dog)
66 401 Chlamydia pneumoniae (Chlamydophila pneumoniae)
67 398 Xylella fastidiosa
68 392 Chlamydia muridarum
69 391 Pyrococcus horikoshii
70 389 Streptococcus pneumoniae
71 387 Xylella fastidiosa (strain Temecula1 / ATCC 700964)
72 384 Pyrococcus abyssi
73 375 Xanthomonas axonopodis (pv. citri)
74 374 Deinococcus radiodurans
75 370 Staphylococcus aureus (strain N315)
76 368 Staphylococcus aureus (strain Mu50 / ATCC 700699)
77 353 Clostridium perfringens
78 352 Nicotiana tabacum (Common tobacco)
79 351 Campylobacter jejuni
80 347 Staphylococcus aureus (strain MW2)
81 347 Corynebacterium glutamicum (Brevibacterium flavum)
82 346 Halobacterium sp. (strain NRC-1 / ATCC 700922 / JCM 11081)
83 341 Ovis aries (Sheep)
84 332 Sulfolobus solfataricus
85 329 Brucella melitensis
86 329 Rickettsia conorii
87 325 Pyrococcus furiosus
88 317 Dictyostelium discoideum (Slime mold)
89 317 Streptococcus pyogenes
90 315 Thermoanaerobacter tengcongensis
91 314 Methanosarcina mazei (Methanosarcina frisia)
92 307 Aeropyrum pernix
93 306 Neurospora crassa
94 305 Methanosarcina acetivorans
95 291 Pisum sativum (Garden pea)
96 285 Staphylococcus aureus
97 276 Chlorobium tepidum
98 272 Candida albicans (Yeast)
99 270 Streptococcus pyogenes (serotype M18)
100 269 Bradyrhizobium japonicum
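The statement at the start of section 2, that the first twenty species
represent 55464 sequences (40.8 % of all entries), can be checked
against the first twenty rows of this table. A minimal Python sketch,
with illustrative variable names:

```python
# Frequencies of the twenty most represented species, copied from table 2.2.
top_twenty = [
    10159, 6890, 4920, 4832, 3592, 2653, 2405, 2294, 2267, 1907,
    1773, 1603, 1591, 1425, 1385, 1367, 1234, 1083, 1082, 1002,
]
total_entries = 135850  # number of entries in release 42.0

top_twenty_total = sum(top_twenty)
share = top_twenty_total / total_entries * 100.0
print(f"{top_twenty_total} sequences, {share:.1f} % of all entries")
```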
2.3 Taxonomic distribution of the sequences
Kingdom      sequences (% of the database)
Archaea           7773 (  6%)
Bacteria         54879 ( 40%)
Eukaryota        64641 ( 48%)
Viruses           8557 (  6%)
Within Eukaryota:
Category          sequences (% of Eukaryota) (% of the complete database)
Human                 10159 ( 16%)           (  7%)
Other Mammalia        17276 ( 27%)           ( 13%)
Other Vertebrata       6106 (  9%)           (  4%)
Viridiplantae         10227 ( 16%)           (  8%)
Fungi                  9648 ( 15%)           (  7%)
Insecta                3568 (  6%)           (  3%)
Nematoda               2633 (  4%)           (  2%)
Other                  5024 (  8%)           (  4%)
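The "Within Eukaryota" table uses two different denominators: the
number of eukaryotic sequences and the size of the complete database.
The Python sketch below recomputes both percentage columns from the
counts; the variable names are illustrative, and rounding to the
nearest integer is assumed to match the table.

```python
total_entries = 135850      # all entries in release 42.0
eukaryota_entries = 64641   # from the kingdom table

# Counts copied from the "Within Eukaryota" table.
categories = {
    "Human": 10159, "Other Mammalia": 17276, "Other Vertebrata": 6106,
    "Viridiplantae": 10227, "Fungi": 9648, "Insecta": 3568,
    "Nematoda": 2633, "Other": 5024,
}

for name, count in categories.items():
    pct_eukaryota = round(count / eukaryota_entries * 100)
    pct_database = round(count / total_entries * 100)
    print(f"{name:<17}{count:>6} ({pct_eukaryota:>3}%) ({pct_database:>3}%)")
```

The category counts add up to the Eukaryota total, which is a quick
consistency check on the transcription.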
3. SEQUENCE SIZE
Distribution of the sequences by size (excluding fragments)
From   To   Number     From   To   Number
   1-  50     2456     1001-1100     1205
  51- 100     9273     1101-1200      849
 101- 150    13503     1201-1300      610
 151- 200    12518     1301-1400      413
 201- 250    12918     1401-1500      331
 251- 300    11315     1501-1600      236
 301- 350    11678     1601-1700      179
 351- 400    11033     1701-1800      132
 401- 450     8414     1801-1900      138
 451- 500     7190     1901-2000      120
 501- 550     5591     2001-2100       68
 551- 600     3670     2101-2200      103
 601- 650     3131     2201-2300      100
 651- 700     2227     2301-2400       58
 701- 750     1974     2401-2500       57
 751- 800     1649         >2500      362
 801- 850     1278
 851- 900     1280
 901- 950      901
 951-1000      794
The average sequence length in Swiss-Prot is 368 amino acids.
The shortest sequence is LUXE_VIBFI (P24272): 3 amino acids.
The longest sequence is SNE1_HUMAN (Q8NF91): 8797 amino acids.
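Because the size table excludes fragments, its bin counts should add up
to the total number of entries minus the number of fragments reported
in section 6. The Python sketch below performs this cross-check; the
list and variable names are illustrative.

```python
# Bin counts copied from the size table, in order of increasing length
# (1-50, 51-100, ..., 951-1000, then 1001-1100, ..., 2401-2500, >2500).
bins_by_50 = [
    2456, 9273, 13503, 12518, 12918, 11315, 11678, 11033, 8414, 7190,
    5591, 3670, 3131, 2227, 1974, 1649, 1278, 1280, 901, 794,
]
bins_by_100_and_tail = [
    1205, 849, 610, 413, 331, 236, 179, 132, 138, 120,
    68, 103, 100, 58, 57, 362,
]

non_fragment_total = sum(bins_by_50) + sum(bins_by_100_and_tail)

total_entries = 135850  # entries in release 42.0
fragments = 8096        # "Number of fragments" in section 6
assert non_fragment_total == total_entries - fragments
print(non_fragment_total)
```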
4. JOURNAL CITATIONS
Note: the following citation statistics reflect the number of distinct
journal citations.
Total number of journals cited in this release of Swiss-Prot: 1392
4.1 Table of the frequency of journal citations
Journals cited 1x: 511
2x: 186
3x: 97
4x: 66
5x: 49
6x: 35
7x: 31
8x: 27
9x: 20
10x: 14
11- 20x: 111
21- 50x: 104
51-100x: 40
>100x: 101
4.2 List of the most cited journals in Swiss-Prot
Nb Citations Journal name
-- --------- -------------------------------------------------------------
1 9693 Journal of Biological Chemistry
2 5221 Proceedings of the National Academy of Sciences of the U.S.A.
3 3748 Journal of Bacteriology
4 3667 Nucleic Acids Research
5 3493 Gene
6 2744 FEBS Letters
7 2707 Biochemical and Biophysical Research Communications
8 2489 Biochemistry
9 2486 European Journal of Biochemistry
10 2279 The EMBO Journal
11 2143 Nature
12 2093 Biochimica et Biophysica Acta
13 1893 Journal of Molecular Biology
14 1831 Genomics
15 1668 Cell
16 1633 Molecular and Cellular Biology
17 1309 Biochemical Journal
18 1232 Science
19 1132 Molecular and General Genetics
20 1131 Plant Molecular Biology
21 1109 Molecular Microbiology
22 874 Journal of Biochemistry
23 851 Virology
24 790 Human Molecular Genetics
25 748 Journal of Cell Biology
26 696 Nature Genetics
27 651 Genes and Development
28 628 Journal of Virology
29 615 Plant Physiology
30 604 Human Mutation
31 597 Oncogene
32 587 The American Journal of Human Genetics
33 551 Infection and Immunity
34 546 Journal of Immunology
35 542 Yeast
36 510 Journal of General Virology
37 489 Archives of Biochemistry and Biophysics
38 486 Structure
39 466 FEMS Microbiology Letters
40 458 Microbiology
41 448 Development
42 404 Nature Structural Biology
43 399 Human Genetics
44 397 Genetics
45 391 Current Genetics
46 362 Molecular and Biochemical Parasitology
47 359 Blood
48 335 Applied and Environmental Microbiology
49 321 Journal of Clinical Investigation
50 310 Molecular Endocrinology
51 304 Mammalian Genome
52 299 Developmental Biology
53 297 Protein Science
54 291 DNA and Cell Biology
55 287 Immunogenetics
56 286 Journal of Molecular Evolution
57 271 Biological Chemistry Hoppe-Seyler
58 270 Cancer Research
59 263 Neuron
60 263 Journal of Experimental Medicine
61 254 Mechanisms of Development
62 248 Molecular Biology of the Cell
63 240 Acta Crystallographica, Section D
64 238 The Plant Cell
65 237 Endocrinology
66 232 DNA Sequence
67 231 Journal of General Microbiology
68 228 Journal of Cell Science
69 213 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
70 211 The Plant Journal
71 209 Molecular Biology and Evolution
72 204 Brain Research. Molecular Brain Research
73 202 Journal of Neuroscience
74 202 Journal of Neurochemistry
75 189 The Journal of Clinical Endocrinology and Metabolism
76 177 Cytogenetics and Cell Genetics
77 167 Comparative Biochemistry and Physiology
78 163 Toxicon
79 162 Bioscience, Biotechnology, and Biochemistry
80 156 DNA
81 153 Molecular Pharmacology
82 151 American Journal of Physiology
83 150 Antimicrobial Agents and Chemotherapy
84 141 Current Biology
85 140 Tissue Antigens
86 135 Molecular Cell
87 133 Biochimie
88 132 Proteins
89 132 Virus Research
90 132 DNA Research
91 130 Molecular Plant-Microbe Interactions
92 127 Bioorganicheskaia Khimiia
93 124 Journal of Investigative Dermatology
94 124 Peptides
95 120 Hemoglobin
96 115 Molecular and Cellular Endocrinology
97 115 Genome Research
98 114 Agricultural and Biological Chemistry
99 112 American Journal of Medical Genetics
100 108 Journal of Medical Genetics
101 102 European Journal of Immunology
5. STATISTICS FOR SOME LINE TYPES
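The following table gives, for selected Swiss-Prot line types, the total
number of lines, the number of entries with at least one such line, and
the average number of lines per entry. For the four line-type totals
(RL, CC, FT and DR), only the total number and the average per entry are
listed.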
                                     Total Number of   Average
Line type / subtype                 number   entries per entry
--------------------------------- -------- --------- ---------
References (RL) 262333 1.93
Journal 228944 125635 1.69
Submitted to EMBL/GenBank/DDBJ 30739 25552 0.23
Unpublished observations 521 517 <0.01
Submitted to Swiss-Prot 517 515 <0.01
Plant Gene Register 485 473 <0.01
Book citation 465 453 <0.01
Thesis 248 246 <0.01
Submitted to other databases 198 197 <0.01
Unpublished results 122 120 <0.01
Patent 92 91 <0.01
Worm Breeder's Gazette 2 2 <0.01
Comments (CC) 463600 3.41
SIMILARITY 134388 116832 0.99
FUNCTION 85939 84536 0.63
SUBCELLULAR LOCATION 63137 63137 0.46
CATALYTIC ACTIVITY 45791 43051 0.34
SUBUNIT 39346 39346 0.29
PATHWAY 21282 20412 0.16
TISSUE SPECIFICITY 15006 15006 0.11
COFACTOR 14929 14929 0.11
MISCELLANEOUS 8291 7646 0.06
PTM 8152 7401 0.06
ALTERNATIVE PRODUCTS 4673 4673 0.03
DOMAIN 4127 3805 0.03
CAUTION 3947 3715 0.03
INDUCTION 3856 3856 0.03
DEVELOPMENTAL STAGE 3667 3667 0.03
DISEASE 2401 1885 0.02
ENZYME REGULATION 1913 1913 0.01
MASS SPECTROMETRY 1021 918 0.01
DATABASE 884 812 0.01
POLYMORPHISM 435 426 <0.01
ALLERGEN 313 313 <0.01
BIOTECHNOLOGY 55 55 <0.01
PHARMACEUTICAL 47 47 <0.01
Features (FT) 720924 5.31
DOMAIN 107190 32438 0.79
TRANSMEM 85992 18759 0.63
CONFLICT 52547 18442 0.39
METAL 49380 12298 0.36
CARBOHYD 48156 11914 0.35
DISULFID 43940 11664 0.32
TURN 39112 2952 0.29
STRAND 36250 2640 0.27
ACT_SITE 28670 17375 0.21
HELIX 27705 2842 0.20
VARIANT 26136 4877 0.19
CHAIN 25029 20292 0.18
REPEAT 24898 4114 0.18
NP_BIND 17064 12177 0.13
SIGNAL 15697 15695 0.12
MOD_RES 14364 8131 0.11
NON_TER 10462 7968 0.08
BINDING 8851 6875 0.07
VARSPLIC 8719 3948 0.06
ZN_FING 8643 3067 0.06
SITE 8594 5257 0.06
INIT_MET 5915 5873 0.04
MUTAGEN 5538 1665 0.04
PROPEP 4936 4178 0.04
DNA_BIND 4398 4139 0.03
LIPID 3812 2533 0.03
PEPTIDE 2714 1085 0.02
TRANSIT 2693 2670 0.02
CA_BIND 1799 764 0.01
NON_CONS 823 425 0.01
CROSSLNK 438 345 <0.01
UNSURE 304 128 <0.01
SE_CYS 155 101 <0.01
Cross-references (DR) 1236729 9.10
EMBL 260342 129417 1.92
InterPro 245720 120841 1.81
Pfam 156422 115928 1.15
PROSITE 117863 74642 0.87
PIR 86853 79511 0.64
PRINTS 45909 40532 0.34
GO 44199 13630 0.33
SMART 40856 30862 0.30
HSSP 38180 38180 0.28
TIGRFAMs 35999 33456 0.26
ProDom 34326 32956 0.25
HAMAP 32758 32659 0.24
PDB 19661 5456 0.14
TIGR 12981 12906 0.10
Genew 8961 8913 0.07
MIM 8833 7540 0.07
MGD 6547 6528 0.05
SGD 4964 4910 0.04
GermOnline 4924 4874 0.04
EcoGene 4228 4226 0.03
MEROPS 3434 3319 0.03
TRANSFAC 2629 2357 0.02
WormPep 2619 2383 0.02
SubtiList 2608 2607 0.02
FlyBase 2425 2351 0.02
GeneDB_SPombe 2282 2252 0.02
TubercuList 1414 1377 0.01
StyGene 1323 1320 0.01
SWISS-2DPAGE 949 948 0.01
ListiList 828 769 0.01
PIRSF 679 679 <0.01
GK 671 671 <0.01
Leproma 604 600 <0.01
Gramene 415 413 <0.01
MaizeDB 411 406 <0.01
HIV 370 354 <0.01
REBASE 358 353 <0.01
ECO2DBASE 351 299 <0.01
DictyDb 320 317 <0.01
GlycoSuiteDB 259 259 <0.01
ZFIN 245 245 <0.01
PHCI-2DPAGE 212 212 <0.01
MypuList 145 145 <0.01
SagaList 134 134 <0.01
Aarhus/Ghent-2DPAGE 128 98 <0.01
Siena-2DPAGE 103 103 <0.01
HSC-2DPAGE 85 85 <0.01
PhosSite 53 53 <0.01
COMPLUYEAST-2DPAGE 50 50 <0.01
PMMA-2DPAGE 47 47 <0.01
Maize-2DPAGE 39 39 <0.01
ANU-2DPAGE 13 13 <0.01
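The "Average per entry" column appears to be the total number of lines
divided by the total number of entries in the release (135850). The
Python sketch below reproduces that column under this reading. The
function name is illustrative, and the rule that averages rounding to
0.00 are printed as "<0.01" is an assumption inferred from the table,
not a documented convention.

```python
TOTAL_ENTRIES = 135850  # entries in release 42.0


def average_per_entry(total_lines: int, entries: int = TOTAL_ENTRIES) -> str:
    """Format the "Average per entry" value for a given line total."""
    average = total_lines / entries
    # Assumption: averages that would round to 0.00 are shown as "<0.01".
    if average < 0.005:
        return "<0.01"
    return f"{average:.2f}"


# Line totals copied from the table above.
for label, total in [("References (RL)", 262333), ("PIRSF", 679)]:
    print(f"{label}: {average_per_entry(total)}")
```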
6. MISCELLANEOUS STATISTICS
Total number of distinct authors cited in Swiss-Prot: 173952
Total number of entries encoded on a chloroplast: 3319
Total number of entries encoded on a mitochondrion: 2765
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 2705
Number of fragments: 8096
Number of additional sequences encoded on splice variants: 6909
--End of document--
