Swiss-Prot release 41.0
Published February 1, 2003
------------------------------------------------------------------------
Swiss-Prot Protein Knowledgebase
Release Notes
Release 41, February 2003
------------------------------------------------------------------------

Table of contents

1 Introduction
2 Description of the changes made to Swiss-Prot since release 40
3 Forthcoming changes
4 Status of the documentation files
5 The ExPASy World-Wide Web server
6 TrEMBL - a supplement to Swiss-Prot
7 FTP access to Swiss-Prot and TrEMBL
8 ENZYME and PROSITE
9 We need your help!
A Appendix A

1 Introduction

Release 41.0 of Swiss-Prot contains 122'564 sequence entries, comprising 44'986'459 amino acids abstracted from 103'486 references. This represents an increase of 20% over release 40.0. The growth of the database is summarized below.

Release  Date   Number of    Number of
                  entries  amino acids
-------  -----  ---------  -----------
    2.0  09/86      3'939      900'163
    3.0  11/86      4'160      969'641
    4.0  04/87      4'387    1'036'010
    5.0  09/87      5'205    1'327'683
    6.0  01/88      6'102    1'653'982
    7.0  04/88      6'821    1'885'771
    8.0  08/88      7'724    2'224'465
    9.0  11/88      8'702    2'498'140
   10.0  03/89     10'008    2'952'613
   11.0  07/89     10'856    3'265'966
   12.0  10/89     12'305    3'797'482
   13.0  01/90     13'837    4'347'336
   14.0  04/90     15'409    4'914'264
   15.0  08/90     16'941    5'486'399
   16.0  11/90     18'364    5'986'949
   17.0  02/91     20'024    6'524'504
   18.0  05/91     20'772    6'792'034
   19.0  08/91     21'795    7'173'785
   20.0  11/91     22'654    7'500'130
   21.0  03/92     23'742    7'866'596
   22.0  05/92     25'044    8'375'696
   23.0  08/92     26'706    9'011'391
   24.0  12/92     28'154    9'545'427
   25.0  04/93     29'955   10'214'020
   26.0  07/93     31'808   10'875'091
   27.0  10/93     33'329   11'484'420
   28.0  02/94     36'000   12'496'420
   29.0  06/94     38'303   13'464'008
   30.0  10/94     40'292   14'147'368
   31.0  02/95     43'470   15'335'248
   32.0  11/95     49'340   17'385'503
   33.0  02/96     52'205   18'531'384
   34.0  10/96     59'021   21'210'389
   35.0  11/97     69'113   25'083'768
   36.0  07/98     74'019   26'840'295
   37.0  12/98     77'977   28'268'293
   38.0  07/99     80'000   29'085'965
   39.0  05/00     86'593   31'411'114
   40.0  10/01    101'602   37'315'215
   41.0  02/03    122'564   44'986'459

2 Description of the changes made to Swiss-Prot since release 40

2.1 Sequences and annotations

21'133 sequences have been added since release 40, the sequence data of 3'251 existing entries has been updated and the annotations of 57'525 entries have been revised.

2.2 The HPI project

The Human Proteomics Initiative (HPI) is a major effort to annotate all known human sequences according to the quality standards of Swiss-Prot. This means that, for each known protein, a wealth of information is provided, including the description of its function, its domain structure, subcellular location, post-translational modifications (PTMs), variants, similarities to other proteins, etc. This implies not only the annotation of newly detected proteins, but also the integration of new research data into existing entries by specialized biologists, who are in close contact with experts all over the world.

There are currently 9'172 annotated human sequences in Swiss-Prot. Up-to-date detailed statistics concerning the HPI project are available at:

  http://www.expasy.org/sprot/hpi/hpi_stat.html

Two related efforts have been stepped up at the same time: describing the human diseases associated with deficiencies in a protein, and annotating mammalian orthologs of human proteins at a level equivalent to that of the cognate human sequences.

For all aspects of the HPI project, we would appreciate the help and collaboration of the scientific community. Information concerning the human proteome is highly critical to a large section of the life science community. We therefore appeal to the user community to participate fully in this initiative by providing all the information necessary to define and to speed up the comprehensive annotation of the human proteome.

For a detailed description of the HPI project please consult:

  http://www.expasy.org/sprot/hpi/

2.3 The HAMAP project

The first complete microbial genome sequence was that of the bacterium Haemophilus influenzae, which became available in 1995.
Since then, more than 100 bacterial and archaeal genomes have been sequenced and many more sequencing projects of pathogenic and nonpathogenic microbes are in progress. To date, the publicly available microbial genomes encode more than 230'000 different proteins.

In order to handle the large amount of "raw" data coming from microbial genome sequencing, the High quality Automated Microbial Annotation of Proteomes (HAMAP) project was initiated. The project aims to automatically annotate a significant percentage of the protein sequences that originate from microbial genome sequencing projects. To maintain a high level of annotation quality, specific tools are being developed to deal with two completely separate subsets of bacterial and archaeal proteins: proteins that have no recognizable similarity to any other microbial or non-microbial proteins ("ORFans"), and proteins that are part of well-defined families or subfamilies. Annotation relies on a rule system that describes the level and extent of annotation that can be assigned by similarity to a manually annotated prototype entry. The result is a curated entry whose quality is identical to that produced manually by an expert annotator.

Programs under development are designed to recognize protein peculiarities, and only proteins which match the defined criteria are processed automatically. Protein sequences which fail to fit into the rule system are further analyzed by Swiss-Prot expert annotators.

For a detailed description of the HAMAP project and its current status please consult:

  http://www.expasy.org/sprot/hamap/

and:

  Gattiker A., Michoud K., Rivoire C., Auchincloss A.H., Coudert E., Lima T., Kersey P., Pagni M., Sigrist C.J.A., Lachaize C., Veuthey A.-L., Bairoch A.
  Automatic annotation of microbial proteomes in Swiss-Prot.
  Comput. Biol. Chem. 27:49-58(2003).

2.4 What's happening with the model organisms?

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

  * be as complete as possible. All sequences available at a given time should be immediately included in Swiss-Prot. This also includes sequence corrections and updates;
  * provide a higher level of annotation;
  * provide cross-references to specialized database(s) that contain, among other data, some information about the genes that code for these proteins;
  * provide specific indexes and documents.

From our efforts to annotate human sequence entries as completely as possible arose the HPI project (see 2.2), and the bacterial model organisms became the focus of the HAMAP project (see 2.3).

Here is the current status of the model organisms which are not covered by these two projects:

Organism        Database          Index file      Number of
                cross-references                  sequences
--------------  ----------------  --------------  ---------
A.thaliana      None yet          arath.txt           1'952
C.albicans      None yet          calbican.txt          264
C.elegans       Wormpep           celegans.txt        2'291
D.discoideum    DictyDB           dicty.txt             316
D.melanogaster  FlyBase           fly.txt             1'764
M.musculus      MGD               mgdtosp.txt         6'169
S.cerevisiae    SGD               yeast.txt           4'892
S.pombe         GeneDB_SPombe     pombe.txt           2'116

2.5 'Nucleomorph' added to the OrGanelle (OG) line

The OG (OrGanelle) line indicates from which genome a gene for a protein originates. Until now, the defined terms in the OG line were 'Chloroplast', 'Cyanelle', 'Mitochondrion' and 'Plasmid'. The term 'Nucleomorph' has been added; a nucleomorph is the residual nucleus of an algal endosymbiont that resides inside its host cell.

2.6 Progress in the conversion of Swiss-Prot to mixed-case characters

We are gradually converting Swiss-Prot entries from all 'UPPER CASE' to 'MiXeD CaSe'. With this release the RC (Reference Comment) line topic STRAIN and the CC line topic 'CATALYTIC ACTIVITY' have been converted. As described in section 3.2, the process of converting all of Swiss-Prot to mixed case continues.
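Because the conversion to mixed case is gradual, a program that reads controlled terms such as the OG values listed in section 2.5 may meet both upper-case and mixed-case spellings. The following Python sketch shows one way to cope with that by comparing terms case-insensitively. It is an illustration only and not part of the Swiss-Prot distribution: the function name is hypothetical, and it assumes that the text after the 'OG' line code begins with the controlled term.

```python
from typing import Optional

# The five OG terms named in section 2.5 (four existing terms plus 'Nucleomorph').
OG_TERMS = ("Chloroplast", "Cyanelle", "Mitochondrion", "Plasmid", "Nucleomorph")


def organelle_term(og_line: str) -> Optional[str]:
    """Return the controlled OG term found on one OG line, or None.

    Assumption: the value after the 'OG' line code starts with the term.
    The comparison ignores case, so 'NUCLEOMORPH' and 'Nucleomorph' are
    treated alike while entries are being converted to mixed case.
    """
    if not og_line.startswith("OG"):
        return None
    value = og_line[2:].strip().lower()
    for term in OG_TERMS:
        if value.startswith(term.lower()):
            return term
    return None
```

Returning the term in its canonical mixed-case spelling lets downstream code compare against a single form, whichever spelling the entry used.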
2.7 Multiple RP lines

Starting with release 41, there can be more than one RP (Reference Position) line per reference in a Swiss-Prot entry. The RP line describes the extent of the work carried out by the authors of the reference, e.g. the type of molecule that has been sequenced, protein characterization, PTM characterization, protein structure analysis, variation detection, etc. As the number of experimental results per publication has increased over the years, a single RP line per reference can no longer hold all the information while maintaining a consistent format. Therefore we decided to permit multiple RP lines.

Example:

RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
RP   CHARACTERIZATION.

2.8 Changes concerning cross-references (DR line)

2.8.1 Schizosaccharomyces pombe GeneDB database

We have added cross-references to the Schizosaccharomyces pombe GeneDB database (available at http://www.genedb.org/genedb/pombe/index.jsp), which contains all known and predicted S. pombe protein coding genes, pseudogenes and tRNAs. It is hosted by the Sanger Institute. The identifiers of the appropriate DR line are:

Data bank identifier: GeneDB_SPombe
Primary identifier:   GeneDB's unique identifier for a S. pombe gene.
Secondary identifier: None; a dash '-' is stored in that field.

Example:

DR   GeneDB_SPombe; SPAC9E9.12c; -.

2.8.2 Genew

We have added cross-references to the Human Gene Nomenclature Database Genew (available at http://www.gene.ucl.ac.uk/nomenclature/searchgenes.pl), which provides data for all human genes which have approved symbols. It is managed by the HUGO Gene Nomenclature Committee (HGNC). The identifiers of the appropriate DR line are:

Data bank identifier: Genew
Primary identifier:   HGNC's unique identifier for a human gene.
Secondary identifier: HGNC's approved gene symbol.

Example:

DR   Genew; HGNC:5217; HSD3B1.
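The DR lines described above share one layout: semicolon-separated fields closed by a period, with a dash standing in for an empty secondary identifier. As an illustration, the following Python sketch splits such a line into its fields. It is a minimal sketch rather than an official parser: the class and function names are hypothetical, and collecting any further fields in a tuple is an assumption made so that lines with more than three fields are not rejected.

```python
from typing import NamedTuple, Optional, Tuple


class CrossReference(NamedTuple):
    database: str              # data bank identifier, e.g. "Genew"
    primary: str               # primary identifier
    secondary: Optional[str]   # None when the field holds the '-' placeholder
    extra: Tuple[str, ...]     # any further fields on the line


def parse_dr_line(line: str) -> CrossReference:
    """Split one DR line into its identifier fields."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]       # drop only the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields: {line!r}")
    secondary = None if fields[2] == "-" else fields[2]
    return CrossReference(fields[0], fields[1], secondary, tuple(fields[3:]))
```

Only the final period is removed, so identifiers that themselves contain a period, such as SPAC9E9.12c, are kept intact.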
2.8.3 Gramene

We have added cross-references to the Gramene database, a comparative mapping resource for grains (available at http://www.gramene.org/). The identifiers of the appropriate DR line are:

Data bank identifier: Gramene
Primary identifier:   Unique identifier for a protein, which is identical to the Swiss-Prot primary AC number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.

Example:

DR   Gramene; Q06967; -.

2.8.4 HAMAP

We have added cross-references to the collection of orthologous microbial protein families, generated manually by expert curators of the HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) project in the framework of the Swiss-Prot protein knowledgebase. The data is accessible at http://www.expasy.org/sprot/hamap/families.html. The identifiers of the appropriate DR line are:

Data bank identifier: HAMAP
Primary identifier:   HAMAP unique identifier for a microbial protein family.
Secondary identifier: The values are either '-', 'fused', 'atypical' or 'atypical/fused'. The value '-' is a placeholder for an empty field; the value 'fused' indicates that the family rule does not cover the entire protein; the value 'atypical' points out that the protein is divergent in sequence or has mutated functional sites, and should not be included in family datasets. The value 'atypical/fused' indicates both of the latter findings.
Tertiary identifier:  Number of domains found in the protein, generally '1', rarely '2' for the fusion of 2 identical domains.

Example:

DR   HAMAP; MF_00012; -; 1.

2.8.5 Phosphorylation Site Database

We have added cross-references to the Phosphorylation Site Database, PhosSite (available at http://vigen.biochem.vt.edu/xpd/xpd.htm), which provides access to information from scientific literature concerning prokaryotic proteins that undergo covalent phosphorylation on the hydroxyl side chains of serine, threonine or tyrosine residues. The identifiers of the appropriate DR line are:

Data bank identifier: PhosSite
Primary identifier:   Unique identifier for a phosphoprotein, which is identical to the Swiss-Prot primary AC number of that protein.
Secondary identifier: None; a dash '-' is stored in that field.

Example:

DR   PhosSite; P00955; -.

2.8.6 TIGRFAMs

We have added cross-references to TIGRFAMs, a protein family database available at http://www.tigr.org/TIGRFAMs/. The identifiers of the appropriate DR line are:

Data bank identifier: TIGRFAMs
Primary identifier:   TIGRFAMs unique identifier for a protein family.
Secondary identifier: TIGRFAMs entry name for a protein family.
Tertiary identifier:  Number of hits found in the sequence.

Example:

DR   TIGRFAMs; TIGR00630; uvra; 1.

2.8.7 CarbBank

We have removed the Swiss-Prot cross-references to CarbBank.

2.8.8 GCRDb

We have removed the Swiss-Prot cross-references to GCRDb.

2.8.9 Mendel

We have removed the Swiss-Prot cross-references to Mendel.

2.8.10 YEPD

We have removed the Swiss-Prot cross-references to the yeast electrophoresis protein database (YEPD).

2.9 Explicit links to dbSNP in FT VARIANT lines of human sequence entries

In human protein sequence entries we have introduced explicit links to the Single Nucleotide Polymorphism database (dbSNP) from the feature description of FT VARIANT keys. The format of such links is:

FT   VARIANT      from     to       description (IN dbSNP:accession_number).
FT                                  /FTId=VAR_number.

Example:

FT   VARIANT      65     65       T -> I (IN dbSNP:1065419).
FT                                /FTId=VAR_012009.

2.10 Feature key 'SIMILAR' became obsolete

The feature key 'SIMILAR' was used to describe the extent of a similarity with another protein sequence. Nowadays, most domains with similarity to other proteins are known regions described in domain and family databases, which are annotated in Swiss-Prot with the feature key 'DOMAIN' or 'REPEAT' and the comment (CC) line topic 'SIMILARITY'. The feature key 'SIMILAR' has therefore become obsolete and will not be used again.
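The dbSNP link format introduced in section 2.9 can be picked out of an FT VARIANT line with a regular expression. The Python sketch below is one possible way to do so and is not an official tool: the names are hypothetical, it assumes numeric 'from' and 'to' positions, and it looks at a single FT line only, so a description that continues on the following FT line is not reassembled.

```python
import re
from typing import NamedTuple, Optional


class VariantFeature(NamedTuple):
    start: int
    end: int
    description: str
    dbsnp: Optional[str]   # dbSNP accession number, if a link is present


# 'FT', the key VARIANT, two numeric positions, then the description.
_VARIANT_LINE = re.compile(r"^FT\s+VARIANT\s+(\d+)\s+(\d+)\s+(.*)$")
# The explicit link, e.g. '(IN dbSNP:1065419)'; case is ignored because
# entries are being converted to mixed case.
_DBSNP_LINK = re.compile(r"\(IN dbSNP:([^)\s]+)\)", re.IGNORECASE)


def parse_variant_line(line: str) -> Optional[VariantFeature]:
    """Parse one FT VARIANT line; return None for any other line."""
    match = _VARIANT_LINE.match(line.strip())
    if match is None:
        return None
    description = match.group(3).strip()
    link = _DBSNP_LINK.search(description)
    return VariantFeature(
        start=int(match.group(1)),
        end=int(match.group(2)),
        description=description,
        dbsnp=link.group(1) if link else None,
    )
```

Lines that are not VARIANT features, including the '/FTId=' continuation line, simply yield None, so the function can be applied to every FT line of an entry.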
2.11 Version of SP in XML format

A distribution version of Swiss-Prot and TrEMBL in XML format is being developed. The first draft of the XML specification was released for public review on February 21, 2002. For more information see http://www.ebi.ac.uk/swissprot/SP-ML/. Please send comments and suggestions by electronic mail to firstname.lastname@example.org.

3 Forthcoming changes

Please note that these are the last release notes in this format. In future, forthcoming changes and recent modifications will also be announced to users between major Swiss-Prot releases. The distinct sections of this document will move to the following sites:

  * 2. Description of the changes made to Swiss-Prot since the last release: http://www.expasy.org/sprot/relnotes/sp_news.html. This new document contains all recent modifications in Swiss-Prot, including minor changes with no impact on the work of software developers. It therefore contains more information than is announced in the document 'sp_soon.html' (see below).
  * 3. Forthcoming changes: http://www.expasy.org/sprot/relnotes/sp_soon.html. All modifications which have an impact on the Swiss-Prot format are announced in this document.
  * 4. Status of the documentation files: http://www.expasy.org/sprot/userman.html#documentation
  * 5. The ExPASy World-Wide Web server:
      o Explicit general and continuously updated documentation: http://www.expasy.org/doc/expasy.pdf
      o History of changes, improvements and new features: http://www.expasy.org/history.html
      o Swiss-Flash, a service that reports news of databases, software and service developments: http://www.expasy.org/swiss-flash/
  * 6. TrEMBL - a supplement to Swiss-Prot: ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt
  * 7. FTP access to Swiss-Prot and TrEMBL: http://www.expasy.org/sprot/userman.html#ftp and http://www.expasy.org/sprot/download.html
  * 8. ENZYME and PROSITE: ENZYME release notes (not yet available) and http://www.expasy.org/prosite/psrelnot.html
  * Appendix A (Release statistics): http://www.expasy.org/sprot/relnotes/relstat.html
  * Appendix B (Relationships between Swiss-Prot and some biomolecular databases): http://www.expasy.org/sprot/userman.html#relship

3.1 Extension of the entry name format

We endeavor to assign meaningful entry names that facilitate the identification of the proteins and the species of origin. Currently the entry name consists of up to ten uppercase alphanumeric characters. Swiss-Prot uses a general purpose naming convention that can be symbolized as X_Y, where X is a mnemonic code of at most 4 alphanumeric characters representing the protein name, the '_' sign serves as a separator, and Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein.

We are planning to extend the mnemonic code for the protein name from up to 4 characters to up to 5 characters. For example, the mnemonic code for the meiotic recombination protein rec10 is currently 'RE10'. After the introduction of extended entry names it could be modified to the 5-letter code 'REC10'.

3.2 Continuation of the conversion of Swiss-Prot to mixed-case characters

We will continue to convert Swiss-Prot entries from all 'UPPER CASE' to 'MiXeD CaSe'. The conversion of the CC (Comment) lines is proceeding and we will start to convert the GN (Gene Name) lines to mixed case, but any other line type might also be affected.

3.3 Reference Comment (RC) line topics may span lines

The RC (Reference Comment) line stores comments relevant to the reference cited, currently under 5 distinct topics: PLASMID, SPECIES, STRAIN, TISSUE and TRANSPOSON. It is not always possible to list all the information within one line. Therefore we will allow multiple RC lines, and a single topic may span more than one line.
Example:

RC   STRAIN=Various strains;

could become

RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,
RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,
RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;

3.4 New format of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would like the majority of the information stored to be usable by computer programs (while remaining human-readable). We are therefore standardizing the format of the topics.

3.4.1 ALTERNATIVE PRODUCTS

We are gradually restructuring the CC (comment) line topic ALTERNATIVE PRODUCTS and introducing unique identifiers for each described isoform. The qualifiers which will be introduced are described in the table below:

Topic           Description
--------------  -----------------------------------------------------------
Event           Biological process that results in the production of the
                alternative forms (Alternative promoter, Alternative
                splicing, Alternative initiation).
                Format:  Event=controlled vocabulary;
                Example: Event=Alternative splicing;

Named isoforms  Number of isoforms listed in the topics 'Name' below the
                topic 'Event=Alternative splicing'.
                Format:  Named isoforms=number;
                Example: Named isoforms=6;

Comment         Any comments concerning one or more isoforms; optional;
                may be longer than 1 line.
                Format:  Comment=free text;
                Example: Comment=Experimental confirmation may be lacking
                         for some isoforms;

Name            A common name for an isoform used in the literature or
                assigned by Swiss-Prot (currently only available for
                spliced isoforms).
                Format:  Name=common name;
                Example: Name=Alpha;

Synonyms        Synonyms for an isoform as used in the literature;
                optional.
                Format:  Synonyms=synonym_1[, synonym_n];
                Example: Synonyms=B, KL5;

IsoId           Unique identifier for an isoform, consisting of the
                Swiss-Prot accession number, followed by a dash and an
                identifier for this isoform.
                Format:  IsoId=acc#-isoform_number[,acc#-isoform_number];
                Example: IsoId=P05067-1;

Sequence        Lists all FT VARSPLIC identifiers (VSP_#) which are needed
                to build the sequence for a specific isoform. If the
                accession number of the IsoId does not correspond to the
                accession number of the current entry, this topic contains
                the term 'External'.
                Format:  Sequence=VSP_#[,VSP_#]|Displayed|External|Not described;
                Example: Sequence=Displayed;
                Example: Sequence=VSP_000013, VSP_000014;

Note            Notes concerning the current isoform; optional.
                Format:  Note=free text;
                Example: Note=Predicted;

In the case of 'Alternative initiation' the topic 'Event' can be followed by a 'Comment' of free text.

Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative initiation;
CC       Comment=Optional free text with information on alternative
CC       initiation or the products retrieved from this event. In the
CC       case of alternative initiation there will be no other topics;

In the case of 'Alternative splicing' the topic 'Event' can be followed by a 'Comment' of free text and a listing of all described isoforms.

Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing;
CC       Comment=Optional free text with information on alternative
CC       splicing or the products retrieved from this event;
CC       Name=isoform_1; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_1;
CC       Name=isoform_n; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_n;

Example for the new format of the CC lines and the corresponding FT lines for an entry with alternative splicing:

...
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=9;
CC       Comment=Additional isoforms seem to exist. APP695, APP751 and
CC       APP770 are the major isoforms. The L-isoforms are referred to as
CC       appicans. Experimental confirmation may be lacking for some
CC       isoforms;
CC       Name=APP770; Synonyms=Prea4 770;
CC         IsoId=P05067-1; Sequence=Displayed;
CC       Name=APP305;
CC         IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;
CC       Name=L-APP677;
CC         IsoId=P05067-3; Sequence=VSP_000002, VSP_000004, VSP_000009;
CC       Name=APP695; Synonyms=Prea4 695;
CC         IsoId=P05067-4; Sequence=VSP_000002, VSP_000004;
CC       Name=L-APP696;
CC         IsoId=P05067-5; Sequence=VSP_000002, VSP_000003, VSP_000009;
CC       Name=APP714;
CC         IsoId=P05067-6; Sequence=VSP_000002, VSP_000003;
CC       Name=L-APP733;
CC         IsoId=P05067-7; Sequence=VSP_000007, VSP_000008, VSP_000009;
CC       Name=APP751; Synonyms=Prea4 751;
CC         IsoId=P05067-8; Sequence=VSP_000007, VSP_000008;
CC       Name=L-APP752;
CC         IsoId=P05067-9; Sequence=VSP_000009;
...
FT   VARSPLIC    289    289       E -> V (in isoform APP695, isoform
FT                                L-APP696, isoform L-APP677 and isoform
FT                                APP714).
FT                                /FTId=VSP_000002.
FT   VARSPLIC    290    345       Missing (in isoform L-APP696 and isoform
FT                                APP714).
FT                                /FTId=VSP_000003.
FT   VARSPLIC    290    364       Missing (in isoform APP695 and isoform
FT                                L-APP677).
FT                                /FTId=VSP_000004.
FT   VARSPLIC    290    305       VCSEQAETGPCRAMIS -> KWYKEVHSGQARWLML (in
FT                                isoform APP305).
FT                                /FTId=VSP_000005.
FT   VARSPLIC    306    770       Missing (in isoform APP305).
FT                                /FTId=VSP_000006.
FT   VARSPLIC    345    345       M -> I (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000007.
FT   VARSPLIC    346    364       Missing (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000008.
FT   VARSPLIC    637    654       Missing (in isoform L-APP677, isoform
FT                                L-APP696, isoform L-APP733 and isoform
FT                                L-APP752).
FT                                /FTId=VSP_000009.
...

3.4.2 PATHWAY

We are gradually structuring the comment line topic PATHWAY. To describe the biochemical pathway in which the protein is involved, we use the following format:

CC   -!- PATHWAY: biochemical pathway; nth step.[ Comment.]

Example:

CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.
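The structured ALTERNATIVE PRODUCTS format of section 3.4.1 is intended to be usable by programs. As an illustration, the Python sketch below reads such a block into a dictionary with one record per 'Name' topic. It is a minimal sketch under stated assumptions, not an official parser: the function name is hypothetical, the topic header is assumed to sit on its own CC line, and free text is assumed to contain no semicolon, since the semicolon is used here to separate topics.

```python
from typing import Any, Dict, List, Optional


def parse_alternative_products(cc_lines: List[str]) -> Dict[str, Any]:
    """Parse the CC lines of one ALTERNATIVE PRODUCTS block."""
    parts = []
    for line in cc_lines:
        content = line[2:].strip() if line.startswith("CC") else line.strip()
        if content.startswith("-!-"):
            continue                      # skip the topic header line
        parts.append(content)
    text = " ".join(parts)                # rejoin topics that span lines

    result: Dict[str, Any] = {
        "Event": None, "Named isoforms": None, "Comment": None, "isoforms": [],
    }
    current: Optional[Dict[str, Any]] = None
    for piece in text.split(";"):
        key, sep, value = piece.strip().partition("=")
        key, value = key.strip(), value.strip()
        if not sep:
            continue                      # empty or malformed piece
        if key == "Name":                 # each Name opens a new isoform record
            current = {"Name": value, "Synonyms": [], "IsoId": [],
                       "Sequence": [], "Note": None}
            result["isoforms"].append(current)
        elif key in ("Synonyms", "IsoId", "Sequence") and current is not None:
            current[key] = [item.strip() for item in value.split(",")]
        elif key == "Note" and current is not None:
            current["Note"] = value
        elif key == "Named isoforms":
            result[key] = int(value)
        elif key in ("Event", "Comment"):
            result[key] = value
    return result
```

For each isoform, the 'Sequence' list then holds either the VSP identifiers to look up among the FT VARSPLIC features or one of the terms 'Displayed', 'External' and 'Not described'.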
3.4.3 COFACTOR

The comment line topic COFACTOR is gradually being modified to the following format:

CC   -!- COFACTOR: cofactor1[, cofactor2 and cofactor3].[ Comment.]

Examples:

CC   -!- COFACTOR: Magnesium.
CC   -!- COFACTOR: Copper, Manganese and Nickel.

3.5 Changes concerning cross-references (DR line)

We will add cross-references to the Gene Ontology (GO) database (available at http://www.geneontology.org/), which provides controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products. The identifiers of the appropriate DR line are:

Data bank identifier: GO
Primary identifier:   GO's unique identifier for a GO term.
Secondary identifier: A 1-letter abbreviation for one of the 3 ontology aspects, separated from the GO term by a colon. If the term is longer than 45 characters, the first 43 characters are indicated followed by 3 dots ('...'). The abbreviations for the 3 distinct aspects of the ontology are P (biological Process), F (molecular Function) and C (cellular Component).
Tertiary identifier:  3-character GO evidence code.

Example:

DR   GO; GO:0003677; F:DNA binding; TAS.

3.6 Modifications concerning the feature table (FT line)

We are investing a major effort in the annotation of posttranslational modifications, which has an effect on various feature keys and feature descriptions. Major format changes are described below.

3.6.1 New feature key 'CROSSLNK'

The feature key 'CROSSLNK' will be introduced to describe bonds between amino acids which are formed posttranslationally within a peptide or between peptides, such as isopeptidic bonds, carbon-carbon linkages, carbon-nitrogen linkages and backbone condensations. It will also include the description of thioether bonds and thiolester bonds, and thus the feature keys 'THIOETH' and 'THIOLEST' will be removed.

Note: Disulfide bonds occur so often in proteins that we will keep the special feature key 'DISULFID' to describe this kind of linkage.
Format:

FT   CROSSLNK    from     to       Description.

3.6.2 Removal of the feature key 'THIOETH'

See section 3.6.1.

3.6.3 Removal of the feature key 'THIOLEST'

See section 3.6.1.

4 Status of the documentation files

Swiss-Prot is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indexes for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files, and updating and modifying existing files. Please note that the header in many documentation files has changed. The following table lists all the documents that are currently available. See also section 7.3 for information on how to access updated versions of all documents between major releases.

userman.txt    User manual
relnotes.txt   Release notes for the current release (41)
shortdes.txt   Short description of entries in Swiss-Prot
jourlist.txt   List of cited journals
keywlist.txt   List of keywords
plasmid.txt    List of plasmids
speclist.txt   List of organism (species) identification codes
tisslist.txt   List of tissues
experts.txt    List of on-line experts for PROSITE and Swiss-Prot
dbxref.txt     List of databases cross-referenced in Swiss-Prot
submit.txt     Submission of sequence data to Swiss-Prot

acindex.txt    Accession number index
autindex.txt   Author index
citindex.txt   Citation index
keyindex.txt   Keyword index
speindex.txt   Species index
deleteac.txt   Deleted accession number index

7tmrlist.txt   List of 7-transmembrane G-linked receptor entries
aatrnasy.txt   List of aminoacyl-tRNA synthetases
allergen.txt   Nomenclature and index of allergen sequences
annbioch.txt   Swiss-Prot annotation: how is biochemical information assigned to sequence entries
arath.txt      Index of Arabidopsis thaliana entries and their corresponding gene designations [see 2]
bacsu.txt      Index of Bacillus subtilis strain 168 chromosomal entries and their corresponding SubtiList cross-references [see 1]
bloodgrp.txt   Blood group antigen proteins
bucai.txt      Index of Buchnera aphidicola (subsp. Acyrthosiphon pisum) entries [see 2]
bucap.txt      Index of Buchnera aphidicola (subsp. Schizaphis graminum) entries [see 2]
calbican.txt   Index of Candida albicans entries and their corresponding gene designations
cdlist.txt     CD nomenclature for surface proteins of human leucocytes
celegans.txt   Index of Caenorhabditis elegans entries and their corresponding gene designations and WormPep cross-references
dicty.txt      Index of Dictyostelium discoideum entries and their corresponding gene designations and DictyDB cross-references
ec2dtosp.txt   Index of Escherichia coli Gene-protein database (ECO2DBASE) entries referenced in Swiss-Prot
ecoli.txt      Index of Escherichia coli strain K12 chromosomal entries and their corresponding EcoGene cross-references
embltosp.txt   Index of EMBL Nucleotide Sequence Database entries referenced in Swiss-Prot
extradom.txt   Nomenclature of extracellular domains
fly.txt        Index of Drosophila entries and their corresponding FlyBase cross-references
glycosid.txt   Classification of glycosyl hydrolase families and index of glycosyl hydrolase entries in Swiss-Prot
haein.txt      Index of Haemophilus influenzae strain Rd chromosomal entries [see 1]
helpy.txt      Index of Helicobacter pylori strain 26695 chromosomal entries [see 1]
hoxlist.txt    Vertebrate homeotic Hox proteins: nomenclature and index
humchr01.txt   Index of proteins encoded on human chromosome 1
humchr02.txt   Index of proteins encoded on human chromosome 2
humchr03.txt   Index of proteins encoded on human chromosome 3
humchr04.txt   Index of proteins encoded on human chromosome 4
humchr05.txt   Index of proteins encoded on human chromosome 5
humchr06.txt   Index of proteins encoded on human chromosome 6
humchr07.txt   Index of proteins encoded on human chromosome 7
humchr08.txt   Index of proteins encoded on human chromosome 8
humchr09.txt   Index of proteins encoded on human chromosome 9
humchr10.txt   Index of proteins encoded on human chromosome 10
humchr11.txt   Index of proteins encoded on human chromosome 11
humchr12.txt   Index of proteins encoded on human chromosome 12
humchr13.txt   Index of proteins encoded on human chromosome 13
humchr14.txt   Index of proteins encoded on human chromosome 14
humchr15.txt   Index of proteins encoded on human chromosome 15
humchr16.txt   Index of proteins encoded on human chromosome 16
humchr17.txt   Index of proteins encoded on human chromosome 17
humchr18.txt   Index of proteins encoded on human chromosome 18
humchr19.txt   Index of proteins encoded on human chromosome 19
humchr20.txt   Index of proteins encoded on human chromosome 20
humchr21.txt   Index of proteins encoded on human chromosome 21
humchr22.txt   Index of proteins encoded on human chromosome 22
humchrx.txt    Index of proteins encoded on human chromosome X
humchry.txt    Index of proteins encoded on human chromosome Y
humpvar.txt    Index of human proteins with sequence variants
initfact.txt   List and index of translation initiation factors
intein.txt     Index of intein-containing entries referenced in Swiss-Prot
metallo.txt    Classification of metallothioneins and index of the entries in Swiss-Prot
metja.txt      Index of Methanococcus jannaschii entries [see 1]
mgdtosp.txt    Index of MGD entries referenced in Swiss-Prot
mimtosp.txt    Index of MIM entries referenced in Swiss-Prot
mycge.txt      Index of Mycoplasma genitalium strain G-37 chromosomal entries [see 1]
mycpn.txt      Index of Mycoplasma pneumoniae strain M129 chromosomal entries [see 2]
ngr234.txt     Table of predicted proteins in Rhizobium plasmid pNGR234a
nomlist.txt    List of nomenclature related references for proteins
pdbtosp.txt    Index of Protein Data Bank (PDB) entries referenced in Swiss-Prot
peptidas.txt   Classification of peptidase families and index of peptidase entries in Swiss-Prot
plastid.txt    List of chloroplast and cyanelle encoded proteins
pombe.txt      Index of Schizosaccharomyces pombe entries and their corresponding gene designations
restric.txt    List of restriction enzyme and methylase entries
ribosomp.txt   Index of ribosomal proteins classified by families on the basis of sequence similarities
ricpr.txt      Index of Rickettsia prowazekii strain Madrid E entries [see 1]
salty.txt      Index of Salmonella typhimurium strain LT2 chromosomal entries and their corresponding StyGene cross-references
syny3.txt      Index of Synechocystis sp. strain PCC 6803 entries [see 1]
upflist.txt    List of UPF (Uncharacterized Protein Families) and index of members
yeast.txt      Index of Saccharomyces cerevisiae entries in Swiss-Prot and their corresponding gene designations
yeast1.txt     Yeast chromosome I entries
yeast2.txt     Yeast chromosome II entries
yeast3.txt     Yeast chromosome III entries
yeast5.txt     Yeast chromosome V entries
yeast6.txt     Yeast chromosome VI entries
yeast7.txt     Yeast chromosome VII entries
yeast8.txt     Yeast chromosome VIII entries
yeast9.txt     Yeast chromosome IX entries
yeast10.txt    Yeast chromosome X entries
yeast11.txt    Yeast chromosome XI entries
yeast13.txt    Yeast chromosome XIII entries
yeast14.txt    Yeast chromosome XIV entries

Notes:

1) The indexes of microbe-specific entries have been renamed; the filename is now composed of the 5-letter code used for the species in the Swiss-Prot entry name and the extension 'txt'. This modification concerns the following files: 'bacsu.txt' (formerly: 'subtilis.txt'), 'haein.txt' (formerly: 'haeinflu.txt'), 'helpy.txt' (formerly: 'hpylori.txt'), 'metja.txt' (formerly: 'mjannasc.txt'), 'mycge.txt' (formerly: 'mgenital.txt'), 'ricpr.txt' (formerly: 'rprowaze.txt'), 'syny3.txt' (formerly: 'pcc6803.txt').

2) The files 'arath.txt', 'bucai.txt', 'bucap.txt' and 'mycpn.txt' are new documents introduced since release 40.

We have continued to include in some Swiss-Prot documentation files the references to Web sites relevant to the subject under consideration. There are now 89 documents that include such links.
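Scripts that still fetch the microbe-specific indexes under their former names need to be adapted to the renaming described in note 1. As an illustration, the Python sketch below translates old filenames with a lookup table and builds a filename from a species code. The table is taken from note 1; the function names are hypothetical, and the second helper applies only to indexes that follow the new rule.

```python
# Former filename -> current filename, as listed in note 1 of section 4.
RENAMED_INDEX_FILES = {
    "subtilis.txt": "bacsu.txt",
    "haeinflu.txt": "haein.txt",
    "hpylori.txt": "helpy.txt",
    "mjannasc.txt": "metja.txt",
    "mgenital.txt": "mycge.txt",
    "rprowaze.txt": "ricpr.txt",
    "pcc6803.txt": "syny3.txt",
}


def current_index_filename(filename: str) -> str:
    """Map a former index filename to its current name; other names pass through."""
    return RENAMED_INDEX_FILES.get(filename, filename)


def index_filename_for_species(species_code: str) -> str:
    """Build a microbe-specific index filename from a 5-letter species code.

    Follows the rule in note 1: the species code from the entry name,
    in lower case, plus the extension 'txt'.
    """
    return species_code.lower() + ".txt"
```

For example, the species code BACSU used in Bacillus subtilis entry names gives 'bacsu.txt', the name that replaced 'subtilis.txt'.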

5 New features of the ExPASy World-Wide Web server related to Swiss-Prot

Explicit general and continuously updated documentation about the ExPASy
server is available at http://www.expasy.org/doc/expasy.pdf.

ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server you can either:

 * Read the document 'History of changes, improvements and new features'
   which is available at the address: http://www.expasy.org/history.html

 * Subscribe to Swiss-Flash, a service that reports news of databases,
   software and service developments. By subscribing to this service, you
   will automatically get Swiss-Flash bulletins by electronic mail. To
   subscribe, use the address: http://www.expasy.org/swiss-flash/.

Among all the improvements and the new features introduced since the last
Swiss-Prot release, here are those that we believe are specifically
useful to Swiss-Prot users:

1. The NiceProt view of Swiss-Prot has been further improved: access to
   documentation has been facilitated by adding "mouse-over" hypertext
   links from various sections in NiceProt to the corresponding
   information in the user manual. Those hypertext links, which give
   access to documentation rather than the data related to the protein
   entry, are visually different from the ordinary hyperlinks. While they
   are not immediately recognizable as such, the user can see that they
   are clickable by moving the mouse pointer over section headings such
   as "References" or "Keywords". A short description of the linked
   information appears at the bottom of the web browser, and when the
   link is clicked, a small additional window opens with related
   information extracted from the user manual. Similarly, in the
   "Cross-references" section, the names of the databases to which an
   entry is cross-referenced are linked to the corresponding sections in
   the document dbxref.txt (List of databases cross-referenced in
   Swiss-Prot).

2. Implicit links have been added to the resources AraC-XylS, Ensembl and
   ModBase. We have removed the implicit links to DOMO, which is no
   longer maintained. For more details on Swiss-Prot cross-references,
   implicit and explicit links, you can read:

   Gasteiger E., Jung E., Bairoch A.
   Swiss-Prot: connecting biological knowledge via a protein database.
   Curr. Issues Mol. Biol. 3:47-55(2001)

3. A few improvements have been applied to the pages describing the Human
   Proteomics Initiative (HPI). For each human chromosome a link is
   provided to the corresponding index of Swiss-Prot entries, to relevant
   information in the EBI Proteome database, in Ensembl, in the Human
   Genome Resources at NCBI and in euGenes at Indiana University. The HPI
   status report has been modified to include, for each of the counted
   items (e.g. splice variants, variants, references), not only the
   absolute number, but also the maximal and average number of
   occurrences per entry, and the number of entries concerned by the
   counted item.

6 TrEMBL - a supplement to Swiss-Prot

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
Swiss-Prot. Since we do not want to dilute the quality standards of
Swiss-Prot by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation of
new incoming data indefinitely. But as we also want to make the sequences
available as quickly as possible, we introduced in 1995 a computer
annotated supplement to Swiss-Prot. This supplement consists of entries
in Swiss-Prot-like format derived from the translation of all coding
sequences (CDS) in the EMBL nucleotide sequence database, except those
already included in Swiss-Prot.

This supplement is named TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of Swiss-Prot. This Swiss-Prot
release is supplemented by TrEMBL release 21.

TrEMBL is available by FTP from the EBI and ExPASy servers in the
directory '/databases/trembl'. It can be queried on WWW by the EBI and
ExPASy SRS servers. It is distributed with its own set of release notes.

7 FTP access to Swiss-Prot and TrEMBL

7.1 Generalities

Swiss-Prot is available for download on the following anonymous FTP
servers:

Organization   Swiss Institute of Bioinformatics (SIB)
Address        ftp.expasy.org, au.expasy.org, bo.expasy.org,
               ca.expasy.org, cn.expasy.org, kr.expasy.org,
               tw.expasy.org, us.expasy.org
Directory      /databases/swiss-prot/

Organization   European Bioinformatics Institute (EBI)
Address        ftp.ebi.ac.uk
Directory      /pub/databases/swissprot/

7.2 Non-redundant database

On the ExPASy and EBI FTP servers we distribute files that make up a
non-redundant and complete protein sequence database consisting of three
components:

1) Swiss-Prot
2) TrEMBL
3) New entries to be integrated later into TrEMBL (hereafter known as
   TrEMBL_New)

Every week three files are completely rebuilt. These files are named:
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when
decompressed, produce ASCII files in Swiss-Prot format.

Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz) which are compressed 'fasta' format sequence files
useful for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this stripped-down
format.

The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in
the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server
(ftp.ebi.ac.uk).

Additional notes:

 * The Swiss-Prot file continuously grows as new annotated sequences are
   added.

 * The TrEMBL file decreases in size as sequences are moved out of that
   section after being annotated and moved into Swiss-Prot. Four times a
   year a new release of TrEMBL is built at EBI; at this point the TrEMBL
   file increases in size as it then includes all of the new data (see
   next section) that has accumulated since the last release.

 * The TrEMBL_New file starts as a very small file and grows in size
   until a new release of TrEMBL is available.

 * Swiss-Prot and TrEMBL share the same system of accession numbers.
   Therefore you will not find any primary accession number duplicated
   between the two sections. A TrEMBL entry (and its associated accession
   number(s)) can either move to Swiss-Prot as a new entry or be merged
   with an existing Swiss-Prot entry. In the latter case, the accession
   number(s) of that TrEMBL entry are added to those of the Swiss-Prot
   entry.

 * TrEMBL_New does not have real accession numbers. However it was
   necessary to have an 'AC' line so as to be able to use it with
   different software products. This AC line contains a temporary
   identifier which consists of the protein_ID (protein sequence
   identifier) of the coding sequence in the parent nucleotide sequence.

 * TrEMBL_New is quite messy! You will of course find new sequence
   entries but you will also encounter sequences that are going to be
   used to update existing TrEMBL or Swiss-Prot entries. None of the
   "cleaning" steps that are applied to produce a TrEMBL release are run
   on TrEMBL_New, nor are any of the computer-annotation software tools
   that are used to enhance the information content of TrEMBL. TrEMBL_New
   is provided only so that users can be sure not to miss any important
   new sequences when they run similarity searches.

 * While these three files allow you to build what we call a
   'non-redundant' database, it must be noted that this description is
   not entirely accurate. Without going into a long explanation we can
   say that this is currently the best attempt at providing a complete
   selection of protein sequence entries while trying to eliminate
   redundancies. While Swiss-Prot is completely (well 99.994% !)
   non-redundant, TrEMBL is far from being non-redundant and the
   combination of Swiss-Prot and TrEMBL is even less so.

 * To describe to your users the version of the non-redundant database
   that you are providing them with, you should use a statement of the
   form:

   Swiss-Prot release 41.x of xx-yyy-2003;
   TrEMBL release 23.x of xx-yyy-2003;
   TrEMBL_New of xx-yyy-2003.

7.3 Weekly updates of Swiss-Prot documents

Until now the ExPASy FTP server only offered the Swiss-Prot documents and
indexes as they stood at the time of the last full release. All documents
are now updated with every weekly release of Swiss-Prot. They are
available for FTP download from the directory
/databases/swiss-prot/updated_doc/.

7.4 Weekly updates of Swiss-Prot

Weekly updates of Swiss-Prot are available by anonymous FTP. Three files
are generated at each update:

new_seq.dat   Contains all the new entries since the last full release;
upd_seq.dat   Contains the entries for which the sequence data has been
              updated since the last release;
upd_ann.dat   Contains the entries for which one or more annotation
              fields have been updated since the last release.

Important notes

 * Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks may elapse
   between two updates.

 * Instead of using the above files, you can, every week, download an
   updated copy of the Swiss-Prot database. This file is available in the
   directory containing the non-redundant database (see section 7.2).

8 ENZYME and PROSITE

8.1 The ENZYME nomenclature database

Release 30.0 of the ENZYME nomenclature database is distributed with
release 41 of Swiss-Prot. ENZYME release 30.0 contains information
relative to 4'136 enzymes.
In this release, we have added a significant number of new entries and we
also updated many entries.

8.2 The PROSITE database

PROSITE now comes with its own release notes.

9 We need your help!

We welcome feedback from our users. We would especially appreciate your
notifying us if you find that sequences belonging to your field of
expertise are missing from the database. We also would like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new information about
post-translational modifications has become available.

To facilitate this feedback we offer, on the ExPASy WWW server, a form
that allows the submission of updates and/or corrections to Swiss-Prot:

   http://www.expasy.org/sprot/update.html

It is also possible, from any entry in Swiss-Prot displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the
address:

   email@example.com

Note that all update requests are assigned a unique identifier of the
form UR-Xnnnn (example: UR-A0123). This identifier is used internally by
the Swiss-Prot staff at SIB and EBI to track requests and is also used in
e-mail exchanges with the persons who have submitted a request.
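
If you file many update requests and want to pick their identifiers out
of your e-mail exchanges, the UR-Xnnnn form quoted above can be matched
with a short Python sketch. Note the assumptions: we read 'X' as one
uppercase ASCII letter and 'nnnn' as exactly four decimal digits, which
is inferred only from the example UR-A0123; the function names are
hypothetical.

```python
import re
from typing import List

# Assumption: 'X' is a single uppercase letter and 'nnnn' is four digits,
# as suggested by the example UR-A0123 in the release notes.
_UR_EXACT = re.compile(r"UR-[A-Z][0-9]{4}")
_UR_IN_TEXT = re.compile(r"\bUR-[A-Z][0-9]{4}\b")


def is_update_request_id(text: str) -> bool:
    """Return True if the whole string has the form UR-Xnnnn."""
    return _UR_EXACT.fullmatch(text) is not None


def find_update_request_ids(message: str) -> List[str]:
    """Return every UR-Xnnnn identifier found in a free-text message."""
    return _UR_IN_TEXT.findall(message)


if __name__ == "__main__":
    print(is_update_request_id("UR-A0123"))
    print(find_update_request_ids("Re: UR-A0123 (see also UR-B4567)"))
```

The word-boundary anchors keep longer digit runs such as 'UR-A01234'
from being reported as a truncated identifier.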

APPENDIX A: Some statistics

A.1 Amino acid composition

A.1.1 Composition in percent for the complete database

Ala (A) 7.72   Gln (Q) 3.92   Leu (L) 9.56   Ser (S) 6.98
Arg (R) 5.24   Glu (E) 6.54   Lys (K) 5.96   Thr (T) 5.51
Asn (N) 4.28   Gly (G) 6.90   Met (M) 2.36   Trp (W) 1.18
Asp (D) 5.27   His (H) 2.26   Phe (F) 4.06   Tyr (Y) 3.13
Cys (C) 1.60   Ile (I) 5.88   Pro (P) 4.88   Val (V) 6.66

Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01

A.1.2 Classification of the amino acids by their frequency

Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
Gln, Tyr, Met, His, Cys, Trp

A.2 Taxonomic origin

Total number of species represented in this release of Swiss-Prot: 7'778

The first twenty species represent 51'656 sequences: 42.1% of the total
number of entries.

A.2.1 Table of the frequency of occurrence of species

Species represented 1x: 3679
                    2x: 1206
                    3x:  619
                    4x:  403
                    5x:  273
                    6x:  251
                    7x:  192
                    8x:  146
                    9x:  120
                   10x:   66
               11- 20x:  331
               21- 50x:  250
               51-100x:   84
                 >100x:  158

A.2.2 Table of the most represented species

------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
     1      9172 Homo sapiens (Human)
     2      6169 Mus musculus (Mouse)
     3      4892 Saccharomyces cerevisiae (Baker's yeast)
     4      4832 Escherichia coli
     5      3442 Rattus norvegicus (Rat)
     6      2402 Bacillus subtilis
     7      2291 Caenorhabditis elegans
     8      2116 Schizosaccharomyces pombe (Fission yeast)
     9      1952 Arabidopsis thaliana (Mouse-ear cress)
    10      1773 Haemophilus influenzae
    11      1764 Drosophila melanogaster (Fruit fly)
    12      1529 Methanococcus jannaschii
    13      1485 Escherichia coli O157:H7
    14      1389 Bos taurus (Bovine)
    15      1371 Mycobacterium tuberculosis
    16      1240 Salmonella typhimurium
    17      1062 Gallus gallus (Chicken)
    18       942 Shigella flexneri
    19       919 Synechocystis sp. (strain PCC 6803)
    20       914 Escherichia coli O6
    21       876 Archaeoglobus fulgidus
    22       839 Pseudomonas aeruginosa
    23       838 Xenopus laevis (African clawed frog)
    24       822 Sus scrofa (Pig)
    25       771 Salmonella typhi
    26       716 Aquifex aeolicus
    27       704 Oryctolagus cuniculus (Rabbit)
    28       687 Mycoplasma pneumoniae
    29       670 Rhizobium meliloti (Sinorhizobium meliloti)
    30       609 Vibrio cholerae
    31       599 Treponema pallidum
    32       581 Mycobacterium leprae
    33       572 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
    34       560 Buchnera aphidicola (subsp. Schizaphis graminum)
    35       536 Helicobacter pylori (Campylobacter pylori)
    36       535 Rickettsia prowazekii
    37       524 Yersinia pestis
    38       519 Helicobacter pylori J99 (Campylobacter pylori J99)
    39       519 Streptomyces coelicolor
    40       494 Bacillus halodurans
    41       491 Zea mays (Maize)
    42       491 Methanobacterium thermoautotrophicum
    43       486 Mycoplasma genitalium
    44       480 Pasteurella multocida
    45       454 Anabaena sp. (strain PCC 7120)
    46       432 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
    47       419 Thermotoga maritima
    48       416 Oryza sativa (Rice)
    49       405 Borrelia burgdorferi (Lyme disease spirochete)
    50       404 Chlamydia trachomatis
    51       403 Rhizobium sp. (strain NGR234)
    52       393 Canis familiaris (Dog)
    53       391 Chlamydia pneumoniae (Chlamydophila pneumoniae)
    54       390 Neisseria meningitidis (serogroup B)
    55       386 Neisseria meningitidis (serogroup A)
    56       381 Chlamydia muridarum
    57       366 Caulobacter crescentus
    58       365 Pyrococcus horikoshii
    59       359 Listeria monocytogenes
    60       359 Clostridium acetobutylicum
    61       357 Pyrococcus abyssi
    62       354 Ralstonia solanacearum (Pseudomonas solanacearum)
    63       352 Listeria innocua
    64       352 Rhizobium loti (Mesorhizobium loti)
    65       350 Streptococcus pneumoniae
    66       346 Agrobacterium tumefaciens (strain C58 / ATCC 33970)
    67       341 Nicotiana tabacum (Common tobacco)
    68       337 Xylella fastidiosa
    69       335 Deinococcus radiodurans
    70       332 Ovis aries (Sheep)
    71       326 Xanthomonas campestris (pv. campestris)
    72       325 Halobacterium sp. (strain NRC-1)
    73       320 Staphylococcus aureus (strain N315)
    74       320 Campylobacter jejuni
    75       317 Staphylococcus aureus (strain Mu50 / ATCC 700699)
    76       316 Dictyostelium discoideum (Slime mold)
    77       311 Clostridium perfringens
    78       299 Sulfolobus solfataricus
    79       297 Staphylococcus aureus (strain MW2)
    80       290 Corynebacterium glutamicum (Brevibacterium flavum)
    81       288 Pisum sativum (Garden pea)
    82       287 Xanthomonas axonopodis (pv. citri)
    83       285 Streptococcus pyogenes
    84       283 Aeropyrum pernix
    85       278 Pyrococcus furiosus
    86       278 Staphylococcus aureus
    87       269 Brucella melitensis
    88       268 Bacteriophage T4
    89       266 Neurospora crassa
    90       265 Triticum aestivum (Wheat)
    91       264 Candida albicans (Yeast)
    92       261 Rickettsia conorii
    93       258 Hordeum vulgare (Barley)
    94       254 Vaccinia virus (strain Copenhagen)
    95       251 Glycine max (Soybean)
    96       250 Lycopersicon esculentum (Tomato)
    97       248 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
    98       247 Thermoanaerobacter tengcongensis
    99       246 Solanum tuberosum (Potato)
   100       244 Pseudomonas putida

A.2.3 Taxonomic distribution of the sequences

Kingdom      Sequences (% of the database)
Archaea           7119 (  6%)
Bacteria         46344 ( 38%)
Eukaryota        60623 ( 49%)
Viruses           8478 (  7%)

Within Eukaryota:

Category          sequences (% of Eukaryota) (% of the complete database)
Human                  9172 ( 15%)           (  7%)
Other Mammalia        16041 ( 26%)           ( 13%)
Other Vertebrata       5806 ( 10%)           (  5%)
Viridiplantae          9581 ( 16%)           (  8%)
Fungi                  9337 ( 15%)           (  8%)
Insecta                3352 (  6%)           (  3%)
Nematoda               2504 (  4%)           (  2%)
Other                  4830 (  8%)           (  4%)

A.3 Sequence size

A.3.1 Repartition of the sequences by size (excluding fragments)

 From   To  Number        From   To  Number
    1-  50    2283        1001-1100    1127
   51- 100    8420        1101-1200     796
  101- 150   12542        1201-1300     550
  151- 200   11267        1301-1400     379
  201- 250   11387        1401-1500     305
  251- 300   10019        1501-1600     213
  301- 350   10039        1601-1700     166
  351- 400    9804        1701-1800     118
  401- 450    7435        1801-1900     128
  451- 500    6547        1901-2000     106
  501- 550    5067        2001-2100      59
  551- 600    3400        2101-2200      96
  601- 650    2753        2201-2300      99
  651- 700    2015        2301-2400      57
  701- 750    1766        2401-2500      56
  751- 800    1474            >2500     326
  801- 850    1101
  851- 900    1142
  901- 950     817
  951-1000     704

A.3.2 Longest and shortest sequences

The shortest sequence is GRWM_HUMAN (P24272): 3 amino acids.
The longest sequence is NEBU_HUMAN (P20929): 6669 amino acids.

A.4 Journal citations

Note: the following citation statistics reflect the number of distinct
journal citations.

Total number of journals cited in this release of Swiss-Prot: 1'316

A.4.1 Table of the frequency of journal citations

Journals cited 1x:  496
               2x:  167
               3x:   84
               4x:   61
               5x:   46
               6x:   47
               7x:   26
               8x:   25
               9x:   22
              10x:   11
          11- 20x:   98
          21- 50x:   98
          51-100x:   39
            >100x:   96

A.4.2 List of the most cited journals in Swiss-Prot

Nb Citations Journal name
-- --------- -------------------------------------------------------------
 1      9138 Journal of Biological Chemistry
 2      5013 Proceedings of the National Academy of Sciences of the U.S.A.
 3      3631 Nucleic Acids Research
 4      3612 Journal of Bacteriology
 5      3381 Gene
 6      2663 FEBS Letters
 7      2598 Biochemical and Biophysical Research Communications
 8      2429 European Journal of Biochemistry
 9      2383 Biochemistry
10      2171 The EMBO Journal
11      2045 Nature
12      2024 Biochimica et Biophysica Acta
13      1821 Journal of Molecular Biology
14      1752 Genomics
15      1579 Cell
16      1542 Molecular and Cellular Biology
17      1243 Biochemical Journal
18      1146 Science
19      1123 Plant Molecular Biology
20      1117 Molecular and General Genetics
21      1068 Molecular Microbiology
22       855 Journal of Biochemistry
23       830 Virology
24       748 Human Molecular Genetics
25       693 Journal of Cell Biology
26       645 Nature Genetics
27       597 Journal of Virology
28       588 Plant Physiology
29       582 Human Mutation
30       579 Genes and Development
31       550 Oncogene
32       538 The American Journal of Human Genetics
33       530 Infection and Immunity
34       529 Yeast
35       516 Journal of Immunology
36       494 Journal of General Virology
37       469 Archives of Biochemistry and Biophysics
38       454 Structure
39       446 FEMS Microbiology Letters
40       433 Microbiology
41       394 Development
42       379 Human Genetics
43       376 Current Genetics
44       376 Nature Structural Biology
45       347 Genetics
46       343 Molecular and Biochemical Parasitology
47       335 Blood
48       317 Applied and Environmental Microbiology
49       313 Journal of Clinical Investigation
50       299 Molecular Endocrinology
51       283 DNA and Cell Biology
52       282 Protein Science
53       281 Journal of Molecular Evolution
54       276 Developmental Biology
55       276 Mammalian Genome
56       271 Biological Chemistry Hoppe-Seyler
57       251 Cancer Research
58       248 Journal of Experimental Medicine
59       246 Neuron
60       241 Immunogenetics
61       240 Mechanisms of Development
62       229 Journal of General Microbiology
63       228 Endocrinology
64       221 DNA Sequence
65       217 Acta Crystallographica, Section D
66       213 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
67       209 Molecular Biology of the Cell
68       207 The Plant Cell
69       203 Journal of Cell Science
70       191 Molecular Biology and Evolution
71       190 Brain Research. Molecular Brain Research
72       187 The Plant Journal
73       183 Journal of Neurochemistry
74       180 Journal of Neuroscience
75       160 Comparative Biochemistry and Physiology
76       158 Cytogenetics and Cell Genetics
77       156 DNA
78       154 Bioscience, Biotechnology, and Biochemistry
79       152 The Journal of Clinical Endocrinology and Metabolism
80       145 Toxicon
81       144 Molecular Pharmacology
82       143 Antimicrobial Agents and Chemotherapy
83       140 American Journal of Physiology
84       131 Biochimie
85       127 Bioorganicheskaia Khimiia
86       125 Virus Research
87       125 Proteins
88       122 DNA Research
89       121 Molecular Plant-Microbe Interactions
90       119 Hemoglobin
91       116 Peptides
92       114 Agricultural and Biological Chemistry
93       112 Current Biology
94       111 Journal of Investigative Dermatology
95       110 Molecular and Cellular Endocrinology
96       106 Genome Research

A.5 Statistics for some line types

The following table summarizes the total number of some Swiss-Prot lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                     Total Number of   Average
Line type / subtype                 number   entries per entry
--------------------------------- -------- --------- ---------
References (RL)                     232571                1.90
  Journal                           195556    111991      1.60
  Submitted to EMBL/GenBank/DDBJ     34500     27873      0.28
  Unpublished observations             536       532     <0.01
  Submitted to Swiss-Prot              464       462     <0.01
  Plant Gene Register                  463       453     <0.01
  Book citation                        460       450     <0.01
  Thesis                               190       188     <0.01
  Submitted to other databases         190       189     <0.01
  Unpublished results                  123       121     <0.01
  Patent                                87        86     <0.01
  Worm Breeder's Gazette                 2         2     <0.01

Comments (CC)                       405433                3.31
  SIMILARITY                        117866    103489      0.96
  FUNCTION                           77092     75796      0.63
  SUBCELLULAR LOCATION               55038     55038      0.45
  CATALYTIC ACTIVITY                 39528     37138      0.32
  SUBUNIT                            33846     33846      0.28
  PATHWAY                            17449     16966      0.14
  TISSUE SPECIFICITY                 13626     13626      0.11
  COFACTOR                           12141     12141      0.10
  MISCELLANEOUS                       7816      7190      0.06
  PTM                                 7140      6571      0.06
  ALTERNATIVE PRODUCTS                3946      3946      0.03
  INDUCTION                           3558      3558      0.03
  DOMAIN                              3535      3241      0.03
  DEVELOPMENTAL STAGE                 3362      3362      0.03
  CAUTION                             3342      3172      0.03
  DISEASE                             2244      1868      0.02
  ENZYME REGULATION                   1753      1753      0.01
  MASS SPECTROMETRY                    893       810      0.01
  DATABASE                             818       751      0.01
  POLYMORPHISM                         343       334     <0.01
  BIOTECHNOLOGY                         50        50     <0.01
  PHARMACEUTICAL                        47        47     <0.01

Features (FT)                       655938                5.35
  DOMAIN                             95401     28727      0.78
  TRANSMEM                           77067     16988      0.63
  CONFLICT                           47337     16661      0.39
  CARBOHYD                           45507     11138      0.37
  DISULFID                           41846     10872      0.34
  TURN                               39177      2956      0.32
  METAL                              36827     10004      0.30
  STRAND                             36304      2644      0.30
  HELIX                              27742      2845      0.23
  ACT_SITE                           24322     15216      0.20
  CHAIN                              23456     19176      0.19
  VARIANT                            23307      4423      0.19
  REPEAT                             22336      3704      0.18
  NP_BIND                            15500     10893      0.13
  SIGNAL                             14828     14826      0.12
  MOD_RES                            13336      7528      0.11
  NON_TER                            10321      7875      0.08
  BINDING                             8145      6285      0.07
  ZN_FING                             7821      2770      0.06
  VARSPLIC                            6951      3249      0.06
  SITE                                6265      4319      0.05
  INIT_MET                            5574      5545      0.05
  PROPEP                              4686      4026      0.04
  MUTAGEN                             4273      1337      0.03
  DNA_BIND                            4193      3949      0.03
  CA_BIND                             4049      1149      0.03
  LIPID                               2946      2395      0.02
  TRANSIT                             2582      2562      0.02
  PEPTIDE                             2517      1001      0.02
  NON_CONS                             804       411      0.01
  UNSURE                               290       123     <0.01
  SE_CYS                               111        73     <0.01
  THIOETH                               94        32     <0.01
  THIOLEST                              23        23     <0.01

Cross-references (DR)               999237                8.15
  EMBL                              230657    116257      1.88
  InterPro                          195677    104236      1.60
  Pfam                              133012     99557      1.09
  PROSITE                           105218     66696      0.86
  PIR                                47040     35736      0.38
  PRINTS                             39413     34822      0.32
  SMART                              38729     29473      0.32
  HSSP                               38069     38069      0.31
  TIGRFAMs                           31394     29063      0.26
  ProDom                             30120     28820      0.25
  HAMAP                              23868     23778      0.19
  PDB                                11737      3547      0.10
  TIGR                               11065     11020      0.09
  MIM                                 8171      7086      0.07
  Genew                               7836      7788      0.06
  MGD                                 5820      5805      0.05
  SGD                                 4936      4882      0.04
  EcoGene                             4228      4226      0.03
  MEROPS                              3316      3222      0.03
  TRANSFAC                            2464      2214      0.02
  WormPep                             2413      2239      0.02
  SubtiList                           2362      2361      0.02
  FlyBase                             2236      2173      0.02
  GeneDB_SPombe                       2131      2101      0.02
  TubercuList                         1400      1363      0.01
  StyGene                             1196      1193      0.01
  SWISS-2DPAGE                         810       809      0.01
  ListiList                            712       658      0.01
  Leproma                              585       581     <0.01
  Gramene                              411       411     <0.01
  MaizeDB                              405       401     <0.01
  HIV                                  370       354     <0.01
  REBASE                               358       353     <0.01
  ECO2DBASE                            351       299     <0.01
  DictyDb                              319       316     <0.01
  GlycoSuiteDB                         259       259     <0.01
  ZFIN                                 225       225     <0.01
  PHCI-2DPAGE                          211       211     <0.01
  MypuList                             131       131     <0.01
  Aarhus/Ghent-2DPAGE                  128        98     <0.01
  Siena-2DPAGE                         104       104     <0.01
  HSC-2DPAGE                            85        85     <0.01
  PhosSite                              53        53     <0.01
  COMPLUYEAST-2DPAGE                    50        50     <0.01
  PMMA-2DPAGE                           47        47     <0.01
  Maize-2DPAGE                          39        39     <0.01
  SagaList                              25        25     <0.01
  ANU-2DPAGE                            15        15     <0.01

A.6 Miscellaneous statistics

Total number of distinct authors cited in Swiss-Prot:        164'410
Total number of chloroplast-encoded sequences:                 3'131
Total number of mitochondrial-encoded sequences:               2'385
Total number of cyanelle-encoded sequences:                      145
Total number of plasmid-encoded sequences:                     2'624
Number of additional sequences encoded in splice variants:     5'661

--End of document--