Swiss-Prot release 40.0
Published October 1, 2001
-------------------------------------------------------------------------
                    SWISS-PROT Protein Knowledgebase
                             Release Notes
                        Release 40, October 2001
-------------------------------------------------------------------------

Table of contents

   1  Introduction
   2  Description of the changes made to SWISS-PROT since release 38
   3  Forthcoming changes
   4  Status of the documentation files
   5  The ExPASy World-Wide Web server
   6  TrEMBL - a supplement to SWISS-PROT
   7  FTP access to SWISS-PROT and TrEMBL
   8  ENZYME and PROSITE
   9  We need your help!
   A  Appendix A


1  Introduction

Release 40.0 of SWISS-PROT contains 101'602 sequence entries, comprising
37'315'215 amino acids abstracted from 91'880 references. This represents
an increase of 18% over release 39. The growth of the data bank is
summarized below.

   Release    Date     Number of entries    Number of amino acids
   -------    -----    -----------------    ---------------------
       2.0    09/86                3'939                  900'163
       3.0    11/86                4'160                  969'641
       4.0    04/87                4'387                1'036'010
       5.0    09/87                5'205                1'327'683
       6.0    01/88                6'102                1'653'982
       7.0    04/88                6'821                1'885'771
       8.0    08/88                7'724                2'224'465
       9.0    11/88                8'702                2'498'140
      10.0    03/89               10'008                2'952'613
      11.0    07/89               10'856                3'265'966
      12.0    10/89               12'305                3'797'482
      13.0    01/90               13'837                4'347'336
      14.0    04/90               15'409                4'914'264
      15.0    08/90               16'941                5'486'399
      16.0    11/90               18'364                5'986'949
      17.0    02/91               20'024                6'524'504
      18.0    05/91               20'772                6'792'034
      19.0    08/91               21'795                7'173'785
      20.0    11/91               22'654                7'500'130
      21.0    03/92               23'742                7'866'596
      22.0    05/92               25'044                8'375'696
      23.0    08/92               26'706                9'011'391
      24.0    12/92               28'154                9'545'427
      25.0    04/93               29'955               10'214'020
      26.0    07/93               31'808               10'875'091
      27.0    10/93               33'329               11'484'420
      28.0    02/94               36'000               12'496'420
      29.0    06/94               38'303               13'464'008
      30.0    10/94               40'292               14'147'368
      31.0    02/95               43'470               15'335'248
      32.0    11/95               49'340               17'385'503
      33.0    02/96               52'205               18'531'384
      34.0    10/96               59'021               21'210'389
      35.0    11/97               69'113               25'083'768
      36.0    07/98               74'019               26'840'295
      37.0    12/98               77'977               28'268'293
      38.0    07/99               80'000               29'085'965
      39.0    05/00               86'593               31'411'114
      40.0    10/01              101'602               37'315'215


2  Description of the changes made to SWISS-PROT since release 38

The
name of the database changed from 'SWISS-PROT protein sequence database' to
'SWISS-PROT knowledgebase' to emphasize the fact that SWISS-PROT collects
far more than just information on protein sequences and that it is a
central linking and linked database which connects the various findings in
the diverse fields of proteomics research.

We apologize that, due to technical problems, we never posted the release
notes of release 39. Therefore this document describes not only the changes
that took place since release 39 but also those between releases 38 and 39.


2.1  Sequences and annotations

15'184 sequences have been added since release 39, the sequence data of
2'908 existing entries has been updated and the annotations of 44'684
entries have been revised. With this release SWISS-PROT has passed the
symbolic mark of 100 thousand entries.


2.2  The HPI project

The Human Proteomics Initiative (HPI) has been introduced to put a major
effort into the annotation of all known human sequences according to the
quality standards of SWISS-PROT. This means that, for each known protein, a
wealth of information is provided, which includes the description of its
function, its domain structure, subcellular location, posttranslational
modifications, variants, similarities to other proteins, etc. This implies
not only the annotation of newly detected proteins, but also the
integration of new research data into the existing entries by specialized
biologists, who are in close contact with experts all over the world.

There are currently 7'471 annotated human sequences in SWISS-PROT. These
entries are associated with 19'922 literature references, 18'974
experimental or predicted PTMs, 1'697 splice variants and 12'061
polymorphisms (most of which are linked with disease states).
Simultaneously, two further efforts were stepped up: human diseases
associated with deficiencies in a protein are described, and mammalian
orthologs of human proteins are annotated at a level equivalent to that of
the cognate human sequences.

For all aspects of the HPI project, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is highly critical to a large section of the life science
community. We therefore appeal to the user community to participate fully
in this initiative by providing all the information necessary to help and
to speed up the comprehensive annotation of the human proteome.

For a detailed description of the HPI project and its current status please
consult: http://www.expasy.org/sprot/hpi/


2.3  The HAMAP project

The first complete microbial genomic sequence was that of the bacterium
Haemophilus influenzae, which became available in 1995. Since then more
than 50 bacterial and archaeal genomes have been sequenced and many more
sequencing projects of pathogenic as well as nonpathogenic microbes are in
progress. To date, the publicly available microbial genomes collectively
encode more than 100'000 different proteins.

In order to handle the large amount of "raw" data coming from microbial
genomic sequencing, the High quality Automated Microbial Annotation of
Proteomes (HAMAP) project was initiated. It aims to automatically annotate
a significant percentage of the proteins which originate from microbial
genome sequencing projects. To maintain a high quality of annotation,
specific tools are developed to deal with two completely separate subsets
of bacterial and archaeal proteins: proteins that have no recognizable
similarity to any other microbial or non-microbial proteins ("ORFans") and
proteins that are part of well-defined families or subfamilies.
This is done by using a rule system that describes the level and extent of
annotations that can be assigned by similarity with a prototype
manually-annotated entry. The result is a curated entry whose quality is
identical to that produced manually by an expert annotator. The programs in
development are designed to recognize protein peculiarities, and only
proteins which match the defined criteria will be processed automatically.
Protein sequences which fail to fit into that rule system will be further
analyzed by SWISS-PROT expert annotators.

For a detailed description of the HAMAP project and its current status
please consult: http://www.expasy.org/sprot/hamap/


2.4  What's happening with the model organisms?

We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:

 * be as complete as possible. All sequences available at a given time
   should be immediately included in SWISS-PROT. This also includes
   sequence corrections and updates;
 * provide a higher level of annotation;
 * provide cross-references to specialized database(s) that contain, among
   other data, some genetic information about the genes that code for
   these proteins;
 * provide specific indices or documents.

The HPI project (see 2.2) arose from our efforts to annotate human sequence
entries as completely as possible, and the bacterial model organisms became
part of the HAMAP project (see 2.3).
Here is the current status of the model organisms which are not covered by
these two projects:

   Organism        Database            Index file        Number of
                   cross-references                      sequences
   --------------  ----------------    --------------    ---------
   A.thaliana      None yet            In preparation        1'409
   C.albicans      None yet            CALBICAN.TXT            256
   C.elegans       Wormpep             CELEGANS.TXT          2'184
   D.discoideum    DictyDB             DICTY.TXT               311
   D.melanogaster  FlyBase             FLY.TXT               1'514
   M.musculus      MGD                 MGDTOSP.TXT           4'816
   S.cerevisiae    SGD                 YEAST.TXT             4'859
   S.pombe         None yet            POMBE.TXT             1'782


2.5  Progress in the conversion of SWISS-PROT to mixed-case characters

We are gradually converting SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'. The line types that have been converted between releases 38
and 40 are: DE (DEscription), most RC (Reference Comment) topics (SPECIES,
TISSUE, PLASMID and TRANSPOSON) and DR (Database cross-Reference). The new
OX line (Organism cross-reference; see section 2.8) and the new CC topics
PHARMACEUTICAL and BIOTECHNOLOGY (described in section 2.11) have been
introduced in mixed case. The CC topic MASS SPECTROMETRY has been converted
to mixed case. As described in section 3.5, the process of converting all
of SWISS-PROT to mixed case continues.


2.6  Extension of the accession number system

With the creation of the TrEMBL database and the rapid increase in the
amount of sequence data, we were faced with a shortage of available
accession numbers. We used a system based on a one-letter prefix followed
by 5 digits. This system was also used by the nucleotide sequence
databases, which had originally reserved the prefix letters 'O', 'P' and
'Q' for SWISS-PROT. Having run out of space (due mainly to ESTs), the
nucleotide sequence databases have been forced to choose a new format,
which became a two-letter prefix followed by 6 digits. We have now used up
all possible numbers with 'O', 'P' and 'Q'.
As we believe that changing the format of the accession numbers to that now
used by the nucleotide databases would have created havoc in the numerous
software packages using SWISS-PROT, we decided to keep a system of
accession numbers based on a 6-character code, but with the following
format extension:

   1          2        3             4             5             6
   [O,P,Q]    [0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [0-9]

What the above means is that we kept a 6-character code, but that in
positions 3, 4 and 5 of this code any combination of letters and numbers
can be present. This format allows a total of 14 million accession numbers
(compared with only 300'000 with the former system). We only allow numbers
in positions 2 and 6 so that the SWISS-PROT accession numbers cannot be
mistaken for gene names, acronyms, other types of accession numbers or any
kind of word!

Examples: P0A3S2, Q2ASD4, O13YX2, P9B123.


2.7  Multiple AC lines

Starting from release 39, there can be more than one AC (ACcession) line
per SWISS-PROT entry. Strictly speaking this was not a format change, and
the SWISS-PROT user's manual always indicated that there could be more than
one AC line per entry. Until recently, a single line was sufficient and the
majority of entries contained only a single accession number. But, in the
process of providing an optimally non-redundant database, we are merging
information from TrEMBL entries into SWISS-PROT entries. When we merge a
TrEMBL entry into a SWISS-PROT one, we add to the latter the accession
number(s) of the TrEMBL entry. The repetition of such a process sometimes
produces an accession number list which can no longer fit in a single AC
line. Therefore there are now some entries with two, three (as shown below)
or more AC lines.

   AC   P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;
   AC   Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;
   AC   Q16208; Q16522;


2.8  Introduction of the new line type OX: Organism taxonomy cross-reference

The OX (Organism taXonomy cross-reference) line has been introduced to
indicate the identifier of a specific organism in a taxonomic database. The
number of taxonomic codes is identical to the number of species given in
the OS line. There can be more than one OX line in an entry and its format
is:

   OX   Taxonomy-database_Qualifier=Taxonomic code[, Taxonomic code...];

There are cross-references to the taxonomic database of NCBI, which is
associated with the qualifier 'TaxID' and a one- to six-digit taxonomic
code.

Examples of its usage:

   OX   NCBI_TaxID=10116;

   OX   NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,
   OX   9031, 8355, 7227, 7213, 7108, 7130;


2.9  Changes concerning the RC line

We are gradually implementing controlled vocabularies for the different
types of RC tokens. To complement the tissue list (TISSLIST.TXT), we have
now added a plasmid list (PLASMID.TXT) and are in the process of creating a
strain list. Controlled vocabularies are part of the SWISS-PROT
documentation files, which are all described in section 4.


2.10  Changes concerning the RX line

The RX line format has changed, and it now also provides identifiers for
the bibliographic database PubMed.

The old format was:

   RX   MEDLINE; unique_identifier.

The new format is:

   RX   BIBLIOGRAPHIC_DATABASE=IDENTIFIER[; BIBLIOGRAPHIC_DATABASE=IDENTIFIER...];

Examples of RX lines:

   RX   PubMed=9145897;
   RX   MEDLINE=79012484; PubMed=358200;


2.11  Introduction of two new CC line topics: BIOTECHNOLOGY and
      PHARMACEUTICAL

We have introduced two new 'topics' for the comments (CC) line type.

The topic 'BIOTECHNOLOGY' has been introduced to describe the use of a
specific protein in the biotechnological industry.
This topic contains the name(s) of the company(ies) that produce the
protein or the genetically manipulated organism as well as a short
description of the biotechnological function of the protein. The brand
name(s) under which a protein is available are added, if applicable.

Examples of the usage:

   CC   -!- BIOTECHNOLOGY: Introduced by genetic manipulation and
   CC       expressed in improved ripening tomato by Monsanto. ACC is the
   CC       immediate precursor of the phytohormone ethylene who is
   CC       involved in the control of ripening. ACC deaminase reduces
   CC       ethylene biosynthesis and thus extend the shelf life of fruits
   CC       and vegetables.

   CC   -!- BIOTECHNOLOGY: Used in the food industry for high temperature
   CC       liquefaction of starch-containing mashes and in the detergent
   CC       industry to remove starch. Sold under the name Termamyl by
   CC       Novozymes.

The topic 'PHARMACEUTICAL' has been introduced to describe the use of a
specific protein as a pharmaceutical drug. The information provided by such
a topic will include the brand name(s) under which a protein is available,
the name(s) of the company(ies) that produce it as well as a short
description of the therapeutic usage of the protein. It should be noted
that any entries containing such a comment field will also be tagged with
the keyword 'Pharmaceutical'.

Examples of the usage:

   CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
   CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment
   CC       of multiple sclerosis (MS). Betaseron is a slightly modified
   CC       form of IFNB1 with two residue substitutions.

   CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron).
   CC       Used in patients with renal cell carcinoma or metastatic
   CC       melanoma.


2.12  Cleaning up of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.
The two sub-formats of the topic ALTERNATIVE PRODUCTS:

   CC   -!- ALTERNATIVE PRODUCTS: isoforms; (shown here),
   CC       , and ; are produced by alternative splicing.
   CC       [Comment.]

   CC   -!- ALTERNATIVE PRODUCTS: isoforms; (shown here),
   CC       and ; are produced by alternative
   CC       initiation. [Comment.]

Examples:

   CC   -!- ALTERNATIVE PRODUCTS: At least 5 isoforms; 1 (shown here), 2, 3, 4
   CC       and 5; are produced by alternative splicing. They differ in their
   CC       acetylcholine receptor clustering activity.

   CC   -!- ALTERNATIVE PRODUCTS: 3 isoforms; TRAC-2 (shown here), TRAC-3 and
   CC       TRAC-4; are produced by alternative initiation.

We are gradually cleaning up the comment line topic SIMILARITY. To describe
the similarity of the protein to a protein family, we use the following
subformat:

   CC   -!- SIMILARITY: Belongs to the <family_name>[. <sub-family_name>].

Examples:

   CC   -!- SIMILARITY: Belongs to the 14-3-3 family.

   CC   -!- SIMILARITY: Belongs to the glucosamine/galactosamine-6-phosphate
   CC       isomerase family. 6-phosphogluconolactonase subfamily.

To describe conserved domains within a protein sequence, we use the
subformat:

   CC   -!- SIMILARITY: Contains n <domain_name>.

Examples:

   CC   -!- SIMILARITY: Contains 10 HEAT repeats.
   CC   -!- SIMILARITY: Contains 1 FKBP-type PPIase domain.


2.13  Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the following databases:

2.13.1  GlycoSuiteDB

GlycoSuiteDB, a database of glycan structures available at
http://www.glycosuite.com/ (see Cooper C.A., Harrison M.J., Wilkins M.R.
and Packer N.H.; Nucleic Acids Res. 29:332-335(2001)). The identifiers of
the appropriate DR line are:

   Data bank identifier: GlycoSuiteDB
   Primary identifier:   GlycoSuiteDB unique identifier for a glycoprotein,
                         which is identical to the SWISS-PROT primary AC
                         number of that protein.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   GlycoSuiteDB; P05067; -.
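
The DR line layout described above (data bank identifier, primary
identifier, secondary identifier, with a dash standing for an empty
secondary identifier) can be split mechanically. The following is a minimal
sketch in Python, not an official SWISS-PROT tool: the names `CrossRef` and
`parse_dr_line` are hypothetical, and the sketch assumes that fields are
separated by semicolons and that the line ends with a single period. Any
item after the secondary identifier is kept as raw text.

```python
from typing import NamedTuple, Optional, Tuple


class CrossRef(NamedTuple):
    """One parsed DR line (hypothetical helper type for this sketch)."""
    database: str
    primary: str
    secondary: Optional[str]   # None when the field holds a dash '-'
    extra: Tuple[str, ...]     # any further items, kept as raw strings


def parse_dr_line(line: str) -> CrossRef:
    """Split a DR line into data bank, primary and secondary identifiers."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    # Drop only the terminating period; identifiers may contain periods.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 3:
        raise ValueError(f"expected at least three fields: {line!r}")
    database, primary, secondary = fields[:3]
    return CrossRef(
        database=database,
        primary=primary,
        secondary=None if secondary == "-" else secondary,
        extra=tuple(fields[3:]),
    )


if __name__ == "__main__":
    print(parse_dr_line("DR   GlycoSuiteDB; P05067; -."))
    # CrossRef(database='GlycoSuiteDB', primary='P05067', secondary=None, extra=())
```

Mapping the dash to `None` keeps "no secondary identifier" distinct from an
identifier that happens to be a short string; callers that need the
original text can map `None` back to '-'.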

2.13.2  SMART

The Simple Modular Architecture Research Tool (SMART), a database of
functional sites available at http://smart.embl-heidelberg.de/ (see Schultz
J., Copley R.R., Doerks T., Ponting C.P. and Bork P.; Nucleic Acids Res.
28:231-234(2000)). The cross-references for this database are composed of
the following items:

   Data bank identifier: SMART
   Primary identifier:   SMART unique identifier for a domain.
   Secondary identifier: Abbreviation for the name of a domain or module.
   Fourth item:          Number of hits of the domain in the entry.
   Example:              DR   SMART; SM00370; LRR; 6.

2.13.3  Leproma

The Mycobacterium leprae genome database Leproma, which is available at
http://genolist.pasteur.fr/Leproma/. The information is available in the DR
line:

   Data bank identifier: Leproma
   Primary identifier:   Leproma unique identifier for an ORF.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   Leproma; ML0485; -.

2.13.4  MEROPS

MEROPS, the protease database available at http://www.merops.co.uk/ (see
Rawlings N.D. and Barrett A.J.; Nucleic Acids Res. 28:323-325(2000)). The
following information is available in the two identifiers of the DR line:

   Data bank identifier: MEROPS
   Primary identifier:   The MEROPS unique identifier for a peptidase.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   MEROPS; M41.001; -.

2.13.5  MypuList

The Mycoplasma pulmonis genome database MypuList, available at
http://genolist.pasteur.fr/MypuList/. The following information is
available in the two identifiers of the DR line:

   Data bank identifier: MypuList
   Primary identifier:   The MypuList unique identifier for an ORF.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   MypuList; MYPU_4900; -.

2.13.6  ProDom

Cross-references to the ProDom protein domain database used to be provided
as implicit links; links are now also available as explicit links:

   Data bank identifier: ProDom
   Primary identifier:   The ProDom unique identifier for a domain.
   Secondary identifier: The ProDom entry name.
   Fourth item:          Number of hits of the domain in the entry.
   Example for an
   explicit link:        DR   ProDom; PD000600; 14-3-3; 1.

2.13.7  ANU-2DPAGE

The Australian National University Two-Dimensional Polyacrylamide Gel
Electrophoresis Database (ANU-2DPAGE) is available at
http://semele.anu.edu.au/2d/2d.html (see Imin N., Kerim T., Weinman J.J.
and Rolfe B.G.; Proteomics 1:1149-1161(2001)). The following information is
available in the DR line:

   Data bank identifier: ANU-2DPAGE
   Primary identifier:   ANU-2DPAGE unique identifier, which is identical
                         to the SWISS-PROT primary AC number of that
                         protein.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   ANU-2DPAGE; Q9XEA8; -.

2.13.8  COMPLUYEAST-2DPAGE

The two-dimensional polyacrylamide gel electrophoresis database at
Universidad Complutense de Madrid (COMPLUYEAST-2DPAGE) is available at
http://babbage.csc.ucm.es/2d/2d.html. The following information is
available in the DR line:

   Data bank identifier: COMPLUYEAST-2DPAGE
   Primary identifier:   COMPLUYEAST-2DPAGE unique identifier, which is
                         identical to the SWISS-PROT primary AC number of
                         that protein.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   COMPLUYEAST-2DPAGE; P43067; -.

2.13.9  PHCI-2DPAGE

The Parasite Host Cell Interaction 2D-PAGE database (PHCI-2DPAGE) is
available at http://www.gram.au.dk/2d/2d.html. The cross-references for
this database are composed of the following items:

   Data bank identifier: PHCI-2DPAGE
   Primary identifier:   PHCI-2DPAGE unique identifier, which is identical
                         to the SWISS-PROT primary AC number of that
                         protein.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   PHCI-2DPAGE; Q9Z6V3; -.

2.13.10  PMMA-2DPAGE

The Purkyne Military Medical Academy 2D-PAGE database (PMMA-2DPAGE) is
available at http://www.pmma.pmfhk.cz/2d/2d.html.
The identifiers of the appropriate DR line are:

   Data bank identifier: PMMA-2DPAGE
   Primary identifier:   PMMA-2DPAGE unique identifier, which is identical
                         to the SWISS-PROT primary AC number of that
                         protein.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   PMMA-2DPAGE; Q01995; -.

2.13.11  Siena-2DPAGE

The 2D-PAGE database from the Department of Molecular Biology, University
of Siena, Italy, is available at http://www.bio-mol.unisi.it/2d/2d.html.
The components of the corresponding DR line are:

   Data bank identifier: Siena-2DPAGE
   Primary identifier:   Siena-2DPAGE unique identifier, which is identical
                         to the SWISS-PROT primary AC number of that
                         protein.
   Secondary identifier: None; a dash '-' is stored in that field.
   Example:              DR   Siena-2DPAGE; P01591; -.


2.14  Introduction of a new FT key: SE_CYS

Selenocysteine is the 21st 'natural' amino acid. It is now known to occur
in several prokaryotic and eukaryotic proteins. Its mRNA codon is UGA,
which usually serves as a stop codon; in the presence of a specific
downstream sequence forming a loop and a specific translational elongation
factor, however, it is recognized as the site of selenocysteine
incorporation into proteins.

The joint nomenclature committee of the IUPAC/IUBMB (see
http://www.chem.qmw.ac.uk/iupac/jcbn/) officially recommended
(http://www.chem.qmw.ac.uk/iubmb/newsletter/1999/item3.html) a three-letter
and a one-letter symbol for selenocysteine, namely 'Sec' and 'U'.

Introducing a new one-letter code in the sequence records would have
disrupted most, if not all, sequence analysis software. We therefore
decided to change, in SWISS-PROT, the rules used to annotate the presence
of selenocysteine residues in sequence entries in the manner described
below.

Selenocysteines were stored, in the sequence records, using the one-letter
symbol 'C' for cysteine and were indicated in the feature table (FT) by a
line of the type:

   FT   BINDING       x      x       SELENIUM.

The one-letter code has not been changed (for the reason explained above),
but we introduced a specific feature key (SE_CYS) to indicate the presence
of a selenocysteine at a given sequence position. The above example has
therefore been changed to:

   FT   SE_CYS        x      x

We also want to remind users that the keyword 'Selenocysteine' continues to
be used to tag sequence entries that contain at least one such residue.


2.15  Introduction of feature identifiers to the feature keys CARBOHYD and
      VARIANT

We have introduced unique and stable feature identifiers (FTId) which allow
links to be constructed directly from position-specific annotation in the
feature table to specialized protein-related databases. Examples are
databases specialized in certain types of posttranslational modifications
of proteins, or in mutations. The FTId is always the last component in the
feature description.

2.15.1  Feature identifiers in FT VARIANT lines of human sequence entries

The feature identifiers in the FT VARIANT lines of human sequence entries
make it possible to refer to a sequence variation and serve as anchors for
specifically directed links. A federated single human mutation database
(HmutDB; http://www2.ebi.ac.uk/mutations/central/proposal.html) has been
proposed, and the complete set of all FT VARIANT lines has been indexed for
SRS at EBI (http://srs.ebi.ac.uk/), under the name SWISSCHANGE. The
database SWISSCHANGE can be queried by SWISS-PROT FTIds.

The format of FT VARIANT lines with feature identifiers is:

   FT   VARIANT       x      x       Description.
   FT                                /FTId=VAR_number.

Example:

   FT   VARIANT       3      3       A -> L.
   FT                                /FTId=VAR_000001.

2.15.2  Feature identifiers in FT CARBOHYD lines

The same principle is used to further enhance the links to GlycoSuiteDB, an
annotated database of glycan structures (see section 2.13.1). So, in
addition to the explicit global link in the DR line, we create unique
feature identifiers for each of the FT CARBOHYD lines, which will allow
direct access to the glycan structure.

The format of FT CARBOHYD lines with feature identifiers is:

   FT   CARBOHYD      x      x       Description.
   FT                                /FTId=CAR_number.

Example:

   FT   CARBOHYD    251    251       N-LINKED (GLCNAC...).
   FT                                /FTId=CAR_000070.


2.16  Change in the syntax of the SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and
gives a quick summary of its content. The format of the SQ line was:

   SQ   SEQUENCE XXXX AA; XXXXXX MW; XXXXXXXX CRC32;

The last information item in the SQ line was a 32-bit CRC (Cyclic
Redundancy Check) value which is computed from the sequence. As the number
of available sequences is increasing rapidly, there are now a few cases
where two sequences share the same CRC32 (but none that also share the same
molecular weight 'MW' or number of amino acids 'AA'). To address this issue
we replaced the 32-bit CRC value by a 64-bit CRC. The format of the SQ line
therefore changed to:

   SQ   SEQUENCE XXXX AA; XXXXXX MW; XXXXXXXXXXXXXXXX CRC64;

Example:

   SQ   SEQUENCE 233 AA; 25630 MW; 146A1B48A1475C86 CRC64;


3  Forthcoming changes

3.1  Version of SP in XML format

A distribution version of SWISS-PROT and TrEMBL in XML format is being
developed. The specifications of this new format will be described when it
is first implemented in TrEMBL.


3.2  Extension of the entry name format

We endeavor to assign meaningful entry names that facilitate the
identification of the protein and the species of origin of an entry.
Currently the entry name consists of up to ten uppercase alphanumeric
characters. SWISS-PROT uses a general purpose naming convention that can be
symbolized as X_Y, where X is a mnemonic code of at most 4 alphanumeric
characters representing the protein name, the '_' sign serves as a
separator, and Y is a mnemonic species identification code of at most 5
alphanumeric characters representing the biological source of the protein.

We are planning to extend the mnemonic code for the protein name from up to
4 characters to up to 5 characters. E.g.
the mnemonic code for the meiotic recombination protein rec10 is currently
'RE10'. After the introduction of extended entry names it could be modified
to the 5-letter code 'REC10'.


3.3  Multiple RP lines

Starting with release 41, there can be more than one RP (Reference
Position) line per reference in a SWISS-PROT entry. The RP line describes
the extent of the work carried out by the authors of the reference, e.g.
the molecule type that has been sequenced, the characterization of the
protein, characterization of PTMs, analysis of the protein structure,
detection of variants, etc. As the number of experimental results per
publication has increased over the years, a single RP line per reference
has more and more often proved insufficient to hold all the information
while remaining consistent in format. So we decided to allow multiple RP
lines.

Example:

   RP   SEQUENCE FROM N.A., PARTIAL SEQUENCE, AND CHARACTERIZATION.

could become

   RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
   RP   CHARACTERIZATION.


3.4  Cleaning up of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.

We are gradually cleaning up the comment line topic PATHWAY. To describe
the biochemical pathway in which the protein is involved, we use the
following format:

   CC   -!- PATHWAY: biochemical pathway; nth step[. Comment].

Example:

   CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.

The comment line topic COFACTOR will be modified gradually to the following
format:

   CC   -!- COFACTOR: cofactor1[, cofactor2 and cofactor3][. Comment].

Examples:

   CC   -!- COFACTOR: Magnesium.
   CC   -!- COFACTOR: Copper, Manganese, and Nickel.


3.5  Continuation of the conversion of SWISS-PROT to mixed-case characters

We will continue to convert SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'.
In release 41 we are planning to convert the GN (Gene Name) line, the RC
(Reference Comment) line topic STRAIN, and the CC (Comment) line topics
CATALYTIC ACTIVITY and PATHWAY. Here is an example of what a SWISS-PROT
entry will look like in release 41:

   ID   GSA_ECOLI      STANDARD;      PRT;   426 AA.
   AC   P23893; P78277;
   DT   01-NOV-1991 (Rel. 20, Created)
   DT   01-NOV-1997 (Rel. 35, Last sequence update)
   DT   01-MAR-2002 (Rel. 41, Last annotation update)
   DE   Glutamate-1-semialdehyde 2,1-aminomutase (EC 5.4.3.8) (GSA)
   DE   (Glutamate-1-semialdehyde aminotransferase) (GSA-AT).
   GN   hemL or gsa or popC or B0154.
   OS   Escherichia coli.
   OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
   OC   Escherichia.
   OX   NCBI_TaxID=562;
   RN   [1]
   RP   SEQUENCE FROM N.A.
   RX   MEDLINE=91155920; PubMed=1900346;
   RA   Grimm B., Bull A., Breu V.;
   RT   "Structural genes of glutamate 1-semialdehyde aminotransferase for
   RT   porphyrin synthesis in a cyanobacterium and Escherichia coli.";
   RL   Mol. Gen. Genet. 225:1-10(1991).
   RN   [2]
   RP   SEQUENCE FROM N.A.
   RC   STRAIN=K12 / W3110;
   RX   MEDLINE=94261430; PubMed=8202364;
   RA   Fujita N., Mori H., Yura T., Ishihama A.;
   RT   "Systematic sequencing of the Escherichia coli genome: analysis of
   RT   the 2.4-4.1 min (110,917-193,643 bp) region.";
   RL   Nucleic Acids Res. 22:1637-1639(1994).
   RN   [3]
   RP   SEQUENCE FROM N.A.
   RC   STRAIN=K12 / MG1655;
   RX   MEDLINE=97426617; PubMed=9278503;
   RA   Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
   RA   Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
   RA   Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
   RA   Mau B., Shao Y.;
   RT   "The complete genome sequence of Escherichia coli K-12.";
   RL   Science 277:1453-1474(1997).
   RN   [4]
   RP   SEQUENCE FROM N.A.
   RA   Schramm S., Duncan M., Allen E., Araujo R., Aparicio A., Chung E.,
   RA   Davis K., Federspiel N., Hyman R., Kalman S., Komp C., Kurdi O.,
   RA   Lashkari D., Lew H., Lin D., Namath A., Oefner P., Roberts D.,
   RA   Davis R.W.;
   RL   Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases.
   RN   [5]
   RP   CHARACTERIZATION.
   RX   MEDLINE=91258321; PubMed=2045363;
   RA   Ilag L.L., Jahn D., Eggertsson G., Soell D.;
   RT   "The Escherichia coli hemL gene encodes glutamate 1-semialdehyde
   RT   aminotransferase.";
   RL   J. Bacteriol. 173:3408-3413(1991).
   RN   [6]
   RP   MUTAGENESIS OF LYS-265.
   RX   MEDLINE=92353044; PubMed=1643048;
   RA   Ilag L.L., Jahn D.;
   RT   "Activity and spectroscopic properties of the Escherichia coli
   RT   glutamate 1-semialdehyde aminotransferase and the putative active
   RT   site mutant K265R.";
   RL   Biochemistry 31:7143-7151(1992).
   CC   -!- CATALYTIC ACTIVITY: (S)-4-amino-5-oxopentanoate =
   CC       5-aminolevulinate.
   CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
   CC   -!- PATHWAY: Porphyrin biosynthesis by the C5 pathway; second step.
   CC   -!- SUBUNIT: HOMODIMER.
   CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC (POTENTIAL).
   CC   -!- SIMILARITY: BELONGS TO CLASS-III OF PYRIDOXAL-PHOSPHATE-DEPENDENT
   CC       AMINOTRANSFERASES.
   DR   EMBL; X53696; CAA37734.1; -.
   DR   EMBL; D26562; CAB20274.1; -.
   DR   EMBL; AE000125; AAC73265.1; -.
   DR   EMBL; U70214; AAB08584.1; -.
   DR   PIR; S13327; S13327.
   DR   PIR; S45223; S45223.
   DR   HSSP; P24630; 2GSA.
   DR   EcoGene; EG10432; hemL.
   DR   InterPro; IPR000954; Aminotran_3.
   DR   Pfam; PF00202; aminotran_3; 1.
   DR   PROSITE; PS00600; AA_TRANSFER_CLASS_3; 1.
   KW   Porphyrin biosynthesis; Isomerase; Pyridoxal phosphate;
   KW   Complete proteome.
   FT   BINDING     265    265       PYRIDOXAL PHOSPHATE (PROBABLE).
   FT   MUTAGEN     265    265       K->R: 2% OF WILD-TYPE ACTIVITY.
   FT   CONFLICT      2      2       S -> R (IN REF. 1 AND 2).
   FT   CONFLICT      9      9       S -> Q (IN REF. 1 AND 2).
   SQ   SEQUENCE   426 AA;  45366 MW;  BED817E100468CF2 CRC64;
        MSKSENLYSA ARELIPGGVN SPVRAFTGVG GTPLFIEKAD GAYLYDVDGK AYIDYVGSWG
        PMVLGHNHPA IRNAVIEAAE RGLSFGAPTE MEVKMAQLVT ELVPTMDMVR MVNSGTEATM
        SAIRLARGFT GRDKIIKFEG CYHGHADCLL VKAGSGALTL GQPNSPGVPA DFAKYTLTCT
        YNDLASVRAA FEQYPQEIAC IIVEPVAGNM NCVPPLPEFL PGLRALCDEF GALLIIDEVM
        TGFRVALAGA QDYYGVVPDL TCLGKIIGGG MPVGAFGGRR DVMDALAPTG PVYQAGTLSG
        NPIAMAAGFA CLNEVAQPGV HETLDELTTR LAEGLLEAAE EAGIPLVVNH VGGMFGIFFT
        DAESVTCYQD VMACDVERFK RFFHMMLDEG VYLAPSAFEA GFMSVAHSME DINNTIDAAR
        RVFAKL
   //


4  Status of the documentation files

SWISS-PROT is distributed with a large number of documentation files. Some
of these files have been available for a long time (the user manual,
release notes, the various indices for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new
files, and updating and modifying existing files. Please note that the
header in many documentation files has changed. The following table lists
all the documents that are currently available. See also section 7.3 for
information on how to access updated versions of all documents in-between
major releases.

USERMAN.TXT    User manual
RELNOTES.TXT   Release notes for the current release (40)
SHORTDES.TXT   Short description of entries in SWISS-PROT [see 1]
JOURLIST.TXT   List of cited journals
KEYWLIST.TXT   List of keywords
PLASMID.TXT    List of plasmids [see 2]
SPECLIST.TXT   List of organism (species) identification codes
TISSLIST.TXT   List of tissues
EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
DBXREF.TXT     List of databases cross-referenced in SWISS-PROT [see 2]
SUBMIT.TXT     Submission of sequence data to SWISS-PROT

ACINDEX.TXT    Accession number index
AUTINDEX.TXT   Authors index
CITINDEX.TXT   Citation index
KEYINDEX.TXT   Keywords index
SPEINDEX.TXT   Species index
DELETEAC.TXT   Deleted accession number index

7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries [see 1]
AATRNASY.TXT   List of aminoacyl-tRNA synthetases
ALLERGEN.TXT   Nomenclature and index of allergen sequences
ANNBIOCH.TXT   SWISS-PROT annotation: how is biochemical information
               assigned to sequence entries
BLOODGRP.TXT   Blood group antigen proteins
CALBICAN.TXT   Index of Candida albicans entries and their corresponding
               gene designations
CDLIST.TXT     CD nomenclature for surface proteins of human leucocytes
CELEGANS.TXT   Index of Caenorhabditis elegans entries and their
               corresponding gene designations and WormPep
               cross-references
DICTY.TXT      Index of Dictyostelium discoideum entries and their
               corresponding gene designations and DictyDB
               cross-references
EC2DTOSP.TXT   Index of Escherichia coli Gene-protein database
               (ECO2DBASE) entries referenced in SWISS-PROT
ECOLI.TXT      Index of Escherichia coli strain K12 chromosomal entries
               and their corresponding EcoGene cross-references
EMBLTOSP.TXT   Index of EMBL Nucleotide Sequence Database entries
               referenced in SWISS-PROT
EXTRADOM.TXT   Nomenclature of extracellular domains
FLY.TXT        Index of Drosophila entries and their corresponding
               FlyBase cross-references
GLYCOSID.TXT   Classification of glycosyl hydrolase families and index of
               glycosyl hydrolase entries in SWISS-PROT
HAEINFLU.TXT   Index of Haemophilus influenzae strain Rd chromosomal
               entries
HOXLIST.TXT    Vertebrate homeotic Hox proteins: nomenclature and index
HPYLORI.TXT    Index of Helicobacter pylori strain 26695 chromosomal
               entries
HUMCHR01.TXT   Index of proteins encoded on human chromosome 1 [see 2]
HUMCHR02.TXT   Index of proteins encoded on human chromosome 2 [see 2]
HUMCHR03.TXT   Index of proteins encoded on human chromosome 3 [see 2]
HUMCHR04.TXT   Index of proteins encoded on human chromosome 4 [see 2]
HUMCHR05.TXT   Index of proteins encoded on human chromosome 5 [see 2]
HUMCHR06.TXT   Index of proteins encoded on human chromosome 6 [see 2]
HUMCHR07.TXT   Index of proteins encoded on human chromosome 7 [see 2]
HUMCHR08.TXT   Index of proteins encoded on human chromosome 8 [see 2]
HUMCHR09.TXT   Index of proteins encoded on human chromosome 9 [see 2]
HUMCHR10.TXT   Index of proteins encoded on human chromosome 10 [see 2]
HUMCHR11.TXT   Index of proteins encoded on human chromosome 11 [see 2]
HUMCHR12.TXT   Index of proteins encoded on human chromosome 12 [see 2]
HUMCHR13.TXT   Index of proteins encoded on human chromosome 13
HUMCHR14.TXT   Index of proteins encoded on human chromosome 14 [see 2]
HUMCHR15.TXT   Index of proteins encoded on human chromosome 15 [see 2]
HUMCHR16.TXT   Index of proteins encoded on human chromosome 16
HUMCHR17.TXT   Index of proteins encoded on human chromosome 17
HUMCHR18.TXT   Index of proteins encoded on human chromosome 18
HUMCHR19.TXT   Index of proteins encoded on human chromosome 19
HUMCHR20.TXT   Index of proteins encoded on human chromosome 20
HUMCHR21.TXT   Index of proteins encoded on human chromosome 21
HUMCHR22.TXT   Index of proteins encoded on human chromosome 22
HUMCHRX.TXT    Index of proteins encoded on human chromosome X
HUMCHRY.TXT    Index of proteins encoded on human chromosome Y
HUMPVAR.TXT    Index of human proteins with sequence variants
INITFACT.TXT   List and index of translation initiation factors
INTEIN.TXT     Index of intein-containing entries referenced in
               SWISS-PROT [see 2]
METALLO.TXT    Classification of metallothioneins and index of the
               entries in SWISS-PROT
MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
MGENITAL.TXT   Index of Mycoplasma genitalium strain G-37 chromosomal
               entries
MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
MJANNASC.TXT   Index of Methanococcus jannaschii entries
NGR234.TXT     Table of predicted proteins in Rhizobium plasmid pNGR234a
NOMLIST.TXT    List of nomenclature related references for proteins
PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
PDBTOSP.TXT    Index of Protein Data Bank (PDB) entries referenced in
               SWISS-PROT
PEPTIDAS.TXT   Classification of peptidase families and index of
               peptidase entries in SWISS-PROT
PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
POMBE.TXT      Index of Schizosaccharomyces pombe entries and their
               corresponding gene designations
RESTRIC.TXT    List of restriction enzyme and methylase entries
RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
               basis of sequence similarities
RPROWAZE.TXT   Index of Rickettsia prowazekii strain Madrid E entries
               [see 2]
SALTY.TXT      Index of Salmonella typhimurium strain LT2 chromosomal
               entries and their corresponding StyGene cross-references
SUBTILIS.TXT   Index of Bacillus subtilis strain 168 chromosomal entries
               and their corresponding SubtiList cross-references
UPFLIST.TXT    UPF (Uncharacterized Protein Families) list and index of
               members
YEAST.TXT      Index of Saccharomyces cerevisiae entries in SWISS-PROT
               and their corresponding gene designations
YEAST1.TXT     Yeast Chromosome I entries
YEAST2.TXT     Yeast Chromosome II entries
YEAST3.TXT     Yeast Chromosome III entries
YEAST5.TXT     Yeast Chromosome V entries
YEAST6.TXT     Yeast Chromosome VI entries
YEAST7.TXT     Yeast Chromosome VII entries
YEAST8.TXT     Yeast Chromosome VIII entries
YEAST9.TXT     Yeast Chromosome IX entries
YEAST10.TXT    Yeast Chromosome X entries
YEAST11.TXT    Yeast Chromosome XI entries
YEAST13.TXT    Yeast Chromosome XIII entries
YEAST14.TXT    Yeast Chromosome XIV entries

Notes:

1  The '7TMRLIST.TXT' and 'SHORTDES.TXT' files have been converted to
   mixed-case characters.

2  The 'DBXREF.TXT', 'HUMCHR01.TXT', 'HUMCHR02.TXT', 'HUMCHR03.TXT',
   'HUMCHR04.TXT', 'HUMCHR05.TXT', 'HUMCHR06.TXT', 'HUMCHR07.TXT',
   'HUMCHR08.TXT', 'HUMCHR09.TXT', 'HUMCHR10.TXT', 'HUMCHR11.TXT',
   'HUMCHR12.TXT', 'HUMCHR14.TXT', 'HUMCHR15.TXT', 'INTEIN.TXT',
   'PLASMID.TXT', and 'RPROWAZE.TXT' files are new documents introduced
   since release 38.

We have continued to include, in some SWISS-PROT documentation files,
references to Web sites relevant to the subject under consideration. There
are now 89 documents that include such links.

5 The ExPASy World-Wide Web server

5.1 Background information

The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
the World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy
server was made available to the public in September 1993 and is reachable
at the following address:

   http://www.expasy.org/

The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT/TrEMBL, PROSITE, ENZYME, SWISS-2DPAGE,
SWISS-3DIMAGE and CD40Lbase databases and, through any SWISS-PROT protein
sequence entry, to other databases such as EMBL, Eco2DBASE, EcoCyc,
EcoGene, FlyBase, GCRDb, GlycoSuiteDB, MaizeDB, OMIM, PDB, HSSP, Pfam,
ProDom, REBASE, SGD, SubtiList, TRANSFAC, YPD, ZFIN and Medline. ExPASy
also offers many tools for the analysis of protein sequences and 2D gels.

There are currently five mirror sites of ExPASy, i.e. exact copies of the
server.
The ExPASy mirror sites are located in:

Australia  http://au.expasy.org/  at the Australian Proteome Analysis
                                  Facility (APAF), Sydney
Canada     http://ca.expasy.org/  at the Canadian Bioinformatics Resource
                                  (CBR), Halifax
China      http://cn.expasy.org/  at the Center of Bioinformatics, Peking
                                  University, Beijing
Korea      http://kr.expasy.org/  at the Yonsei Proteome Research Center
Taiwan     http://tw.expasy.org/  at the National Health Research
                                  Institutes (NHRI), Taipei

Explicit general and continuously updated documentation about the ExPASy
server is available at http://www.expasy.org/doc/expasy.pdf.

5.2 Swiss-Shop

We provide, on ExPASy, a service called Swiss-Shop
(http://www.expasy.org/swiss-shop/). Swiss-Shop is an automated sequence
alerting system which allows users to obtain, by email, new sequence
entries relevant to their field(s) of interest. Every week, the new
sequences entered in SWISS-PROT are automatically compared with all the
criteria that have been defined by the users. If a sequence corresponds to
the selection criteria defined by a user, that sequence is sent by
electronic mail.

Various criteria can be combined:

 * By entering one or more words that should be present in the
   description line;
 * By entering one or more species name(s) or taxonomic division(s);
 * By entering one or more keywords;
 * By entering one or more author names;
 * By entering the accession number (or entry name) of a PROSITE pattern
   or a user-defined sequence pattern. In this case, all new SWISS-PROT
   entries matching this pattern will be reported;
 * By entering the accession number (or entry name) of an existing
   SWISS-PROT entry or by entering a 'private' sequence. In this case,
   all new SWISS-PROT entries similar to that sequence will be reported.

5.3 What is new on ExPASy

ExPASy is constantly modified and improved.
If you wish to be informed of the changes made to the server you can
either:

 * Read the document 'History of changes, improvements and new features'
   which is available at the address:

      http://www.expasy.org/history.html

 * Subscribe to Swiss-Flash, a service that reports news of databases,
   software and service developments. By subscribing to this service, you
   will automatically get Swiss-Flash bulletins by electronic mail. To
   subscribe, use the address:

      http://www.expasy.org/swiss-flash/

Among all the improvements and the new features introduced since the last
SWISS-PROT release, here are those that we believe are specifically useful
to SWISS-PROT users:

 1. A new and improved version of the NiceProt view of SWISS-PROT is
    available and offers the following new features: a link to a
    printer-friendly view of a SWISS-PROT entry, display of the length of
    certain features in the FT lines, and access to a new tool, the
    'Feature aligner', which allows users to select features for
    submission to the ClustalW multiple alignment program.

 2. SWISS-PROT release statistics are now available for every update of
    the database (http://www.expasy.org/sprot/relnotes/relstat.html).
    Among other parameters, statistics about database growth, average
    sequence lengths and amino acid composition, taxonomic origin, journal
    citations and database cross-references are presented, including some
    graphics.

 3. A new view is available within the SRS Sequence Retrieval System. It
    displays, for each protein corresponding to a user query, gene name(s)
    and organism (in addition to the parameters ID, AC, description and
    sequence length which are displayed by the default view "Short
    description"). This new view is entitled "Long description" and is
    available from the menu "Use view ..." in the SRS query form.

 4. The SIB Blast interface (accessible also via "Quick BLAST" or from the
    bottom of every SWISS-PROT/TrEMBL entry) now offers the possibility to
    restrict the similarity search by using taxonomic criteria. A
    "Taxonomic View" of the results can also be obtained via the BLAST
    result page. The user can also select a number of matching sequences
    and directly submit them to a ClustalW search, or retrieve and
    download the corresponding SWISS-PROT/TrEMBL entries. An alternative
    view of the results, NiceBlast, is available; it consists of an HTML
    table giving complete descriptions of all matching proteins, including
    the full protein name, gene name, sequence length and organism.

 5. Explicit cross-references have been implemented between SWISS-PROT and
    BLOCKS, GlycoSuiteDB, InterPro, Leproma, MEROPS, MypuList, SMART,
    TubercuList, ANU-2DPAGE, PHCI-2DPAGE, PMMA-2DPAGE, COMPLUYEAST-2DPAGE,
    and Siena-2DPAGE. Implicit links have been added to the resources DIP,
    GeneCensus, GeneLynx, HUGE and NucleaRDB.

 6. A new tool has been added to the ExPASy suite of proteomics tools:
    FindPept (http://www.expasy.org/tools/findpept.html) can identify
    peptides that result from unspecific cleavage of proteins from their
    experimental masses, taking into account artefactual chemical
    modifications, post-translational modifications (PTM) and protease
    autolytic cleavage. This new tool has been closely integrated with the
    other proteomics tools on ExPASy, such as PeptIdent and FindMod.

 7. The Sulfinator (http://www.expasy.org/tools/sulfinator/) is a newly
    developed tool to predict tyrosine sulfation sites for a protein
    sequence, using four different Hidden Markov Models (HMM).

 8. Sequences of alternatively spliced isoforms of the same protein are
    documented in the feature table of that protein sequence record. In
    collaboration with the SWISS-PROT group at EBI, a program varsplic.pl
    has been written to generate additional records from SWISS-PROT and
    TrEMBL, one for each splice isoform of each protein. The resulting
    data sets for SWISS-PROT and TrEMBL are available on the ExPASy ftp
    server (ftp://ftp.expasy.org/databases/sp_tr_nrdb/), along with a more
    detailed description of the project and information on how to obtain a
    local copy of the varsplic.pl program. The additional isoform entries
    have been added to the SWISS-PROT/TrEMBL databases underlying the
    BLAST server at SIB Switzerland, ScanProsite, and PeptIdent.
    Gradually, all other tools on ExPASy will be modified to handle splice
    isoforms. The NiceProt view of SWISS-PROT/TrEMBL provides links from
    the isoform name in the feature table (example: Q01432) to a page
    displaying the sequence of the corresponding isoform.

 9. In the framework of the HAMAP project (see section 2.3), several new
    features and tools have been implemented on ExPASy:

    o The keyword "Complete Proteome" has been introduced to all
      SWISS-PROT/TrEMBL entries describing a protein which is thought to
      be expressed by an organism whose genome has been completely
      sequenced. This keyword is so far only used for microbial
      (bacterial and archaeal) proteins. A complete set of proteins from a
      microbial genome can therefore be obtained using this keyword across
      SWISS-PROT and TrEMBL.

    o We provide clean non-redundant SWISS-PROT/TrEMBL data sets for all
      completely sequenced microbial genomes. These files are available on
      the ExPASy ftp server in SWISS-PROT and Fasta format
      (ftp://ftp.expasy.org/databases/complete_proteomes/), and can also
      be used for similarity searches on the SIB Blast server ("microbial
      proteomes").

    o A Genomic Proximity Viewer is available for those microbial genomes
      where an ORF numbering system exists. For those organisms, it is
      possible to click on the ORF name in the SWISS-PROT/TrEMBL GN lines
      to obtain a list of proteins encoded by genes in proximity. The tool
      is also accessible from the HAMAP complete proteome pages of those
      organisms. Example: Borrelia burgdorferi,
      http://www.expasy.org/cgi-bin/genomeview.pl?bn=BORBU.

10. A year ago we launched Protein Spotlight
    (http://www.expasy.org/spotlight/), a periodical review centered on a
    specific protein or group of proteins. It is published on a monthly
    basis. You can subscribe to receive each issue, free of charge, in
    HTML or PDF format.

6 TrEMBL - a supplement to SWISS-PROT

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
SWISS-PROT. Since we do not want to dilute the quality standards of
SWISS-PROT by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation of
new incoming data indefinitely. But as we also want to make the sequences
available as fast as possible, we have introduced a computer-annotated
supplement to SWISS-PROT. This supplement consists of entries in
SWISS-PROT-like format derived from the translation of all coding
sequences (CDS) in the EMBL nucleotide sequence database, except those
already included in SWISS-PROT.

This supplement is named TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT release
is supplemented by TrEMBL release 18.

TrEMBL is available by FTP from the EBI and ExPASy servers in the
directory 'databases/trembl'. It can be queried on WWW by the EBI and
ExPASy SRS servers. It is distributed with its own set of release notes.

7 FTP access to SWISS-PROT and TrEMBL

7.1 Generalities

SWISS-PROT is available for download on the following anonymous FTP
servers:

Organization  Swiss Institute of Bioinformatics (SIB)
Address       ftp.expasy.org, au.expasy.org/ftp/, ca.expasy.org/ftp/,
              cn.expasy.org/ftp/, kr.expasy.org/ftp/, tw.expasy.org/ftp/
Directory     /databases/swiss-prot/

Organization  European Bioinformatics Institute (EBI)
Address       ftp.ebi.ac.uk
Directory     /pub/databases/swissprot/

7.2 Non-redundant database

On the ExPASy and EBI FTP servers we distribute files that make up a
non-redundant (see further) and complete protein sequence database
consisting of three components:

 1) SWISS-PROT
 2) TrEMBL
 3) New entries to be later integrated into TrEMBL (hereafter known as
    TrEMBL_New)

Every week three files are completely rebuilt. These files are named
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when decompressed,
will produce ASCII files in SWISS-PROT format.

Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz). These are compressed 'fasta' format sequence files
useful for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this very 'primitive'
format.

The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in
the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server
(ftp.ebi.ac.uk).

Additional notes:

 * The SWISS-PROT file continuously grows as new annotated sequences are
   added.

 * The TrEMBL file decreases in size as sequences are moved out of that
   section after being annotated and moved into SWISS-PROT. Four times a
   year a new release of TrEMBL is built at EBI. At this point the TrEMBL
   file increases in size, as it then includes all of the new data (see
   next section) that has accumulated since the last release.

 * The TrEMBL_New file starts as a very small file and grows in size
   until a new release of TrEMBL is available.

 * SWISS-PROT and TrEMBL share the same system of accession numbers.
   Therefore you will not find any primary accession number duplicated
   between the two sections. A TrEMBL entry (and its associated accession
   number(s)) can either move to SWISS-PROT as a new entry or be merged
   with an existing SWISS-PROT entry. In the latter case, the accession
   number(s) of that TrEMBL entry are added to those of the SWISS-PROT
   entry.

 * TrEMBL_New does not have real accession numbers. However, it was
   necessary to have an 'AC' line so as to be able to use it with
   different software products. This AC line contains a temporary
   identifier which consists of the protein_ID (protein sequence
   identifier) of the coding sequence in the parent nucleotide sequence.

 * TrEMBL_New is quite messy! You will of course find new sequence
   entries, but you will also encounter sequences that are going to be
   used to update existing TrEMBL or SWISS-PROT entries. None of the
   "cleaning" steps that are applied to produce a TrEMBL release are run
   on TrEMBL_New, nor are any of the computer-annotation software tools
   that are used to enhance the information content of TrEMBL. TrEMBL_New
   is provided only so that users can be sure not to miss any important
   new sequences when they run similarity searches.

 * While these three files allow you to build what we call a
   'non-redundant' database, it must be noted that this description is
   not strictly true. Without going into a long explanation, we can say
   that this is currently the best attempt at providing a complete
   selection of protein sequence entries while trying to eliminate
   redundancies. Although SWISS-PROT is completely (well 99.994% !)
   non-redundant, TrEMBL is far from being non-redundant and the
   combination of SWISS-PROT + TrEMBL is even less so.

 * To describe to your users the version of the non-redundant database
   that you are providing them with, you should use a statement of the
   form:

      SWISS-PROT release 40.0 of 17-Oct-2001; TrEMBL release 18.0 of
      22-Oct-2001; TrEMBL_New of 22-Oct-2001.

7.3 Weekly updates of SWISS-PROT documents

Until now, the ExPASy FTP server only gave FTP access to the SWISS-PROT
documents and indexes as they stood at the time of the last full release.
All documents are now updated with every weekly release of SWISS-PROT.
They are available for FTP download from the directory
/databases/swiss-prot/updated_doc/.

7.4 Weekly updates of SWISS-PROT

Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are generated at each update:

new_seq.dat  Contains all the new entries since the last full release;
upd_seq.dat  Contains the entries for which the sequence data has been
             updated since the last release;
upd_ann.dat  Contains the entries for which one or more annotation fields
             have been updated since the last release.

Important notes

 * Although we try to follow a regular schedule, we do not promise to
   update these files every week. In most cases two weeks may elapse
   between two updates.

 * Instead of using the above files, you can, every week, download an
   updated copy of the SWISS-PROT database. This file is available in the
   directory containing the non-redundant database (see section 7.2).

8 ENZYME and PROSITE

8.1 The ENZYME nomenclature database

Release 27.0 of the ENZYME nomenclature database is distributed with
release 40 of SWISS-PROT. ENZYME release 27.0 contains information
relative to 3'870 enzymes. In this release, we have added a significant
number of new entries and we have also updated many entries.

8.2 The PROSITE database

Release 17.0 of the PROSITE database will be available in a few weeks.
PROSITE will now come with its own set of release notes.
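
As an illustration of how the gzip-compressed flat files described in section 7.2 (for example sprot.dat.gz) might be read, here is a minimal Python sketch. It assumes only what the entry layout suggests: each line starts with a two-letter line code, sequence data lines start with blanks, and an entry ends with a '//' line. The function names and the toy entries are hypothetical and are not part of any SWISS-PROT distribution.

```python
import gzip
import io
from typing import Dict, Iterator, List, TextIO


def read_entries(handle: TextIO) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping a two-letter line code to its lines."""
    entry: Dict[str, List[str]] = {}
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("//"):        # entry terminator
            if entry:
                yield entry
            entry = {}
        elif line.startswith("  "):      # sequence data lines have a blank code
            entry.setdefault("  ", []).append(line.strip())
        elif line:
            entry.setdefault(line[:2], []).append(line[2:].strip())
    if entry:                            # tolerate a missing final terminator
        yield entry


def sequence_of(entry: Dict[str, List[str]]) -> str:
    """Join the sequence data lines of an entry and drop the spacing blanks."""
    return "".join("".join(entry.get("  ", [])).split())


def open_flatfile(path: str) -> TextIO:
    """Open a gzip-compressed flat file (e.g. a local copy of sprot.dat.gz)."""
    return gzip.open(path, "rt", encoding="ascii")


# Two made-up toy entries, used only to exercise the reader.
SAMPLE = """\
ID   TOY1_EXAMP     STANDARD;      PRT;    10 AA.
AC   X00001;
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
ID   TOY2_EXAMP     STANDARD;      PRT;    12 AA.
AC   X00002;
SQ   SEQUENCE   12 AA;
     MSKSENLYSA AR
//
"""

for toy in read_entries(io.StringIO(SAMPLE)):
    print(toy["AC"][0], len(sequence_of(toy)))
```

Because the reader keeps every line code it meets, the same loop could be pointed at a TrEMBL or TrEMBL_New file, which use the same format; only the path passed to open_flatfile would change.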

9 We need your help!

We welcome feedback from our users. We would especially appreciate it if
you would notify us when you find that sequences belonging to your field
of expertise are missing from the database. We would also like to be
notified about annotations to be updated if, for example, the function of
a protein has been clarified or if new information about
post-translational modifications has become available.

To facilitate this feedback we offer, on the ExPASy WWW server, a form
that allows the submission of updates and/or corrections to SWISS-PROT:

   http://www.expasy.org/sprot/sp_update_form.html

It is also possible, from any entry in SWISS-PROT displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the
address:

   email@example.com

Note that all update requests are assigned a unique identifier of the form
UR-Xnnnn (example: UR-A0123). This identifier is used internally by the
SWISS-PROT staff at SIB and EBI to track the fate of requests and is also
used in email exchanges with the persons who submitted a request.

APPENDIX A: Some statistics

A.1 Amino acid composition

A.1.1 Composition in percent for the complete database

   Ala (A) 7.61   Gln (Q) 3.93   Leu (L) 9.53   Ser (S) 7.08
   Arg (R) 5.19   Glu (E) 6.47   Lys (K) 5.97   Thr (T) 5.58
   Asn (N) 4.36   Gly (G) 6.85   Met (M) 2.37   Trp (W) 1.21
   Asp (D) 5.25   His (H) 2.24   Phe (F) 4.10   Tyr (Y) 3.16
   Cys (C) 1.63   Ile (I) 5.85   Pro (P) 4.89   Val (V) 6.61

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01

A.1.2 Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp

A.2 Taxonomic origin

Total number of species represented in this release of SWISS-PROT: 7'188

The first twenty species represent 45'181 sequences: 44.5 % of the total
number of entries.
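
The 44.5 % share quoted above can be reproduced from figures given in these notes. The sketch below sums the twenty largest per-species counts listed in table A.2.2 and divides by the 101'602 entries reported for release 40.0 in the introduction; the variable names are illustrative only.

```python
# Entry counts of the twenty most represented species (table A.2.2).
TOP20_COUNTS = [
    7471, 4859, 4816, 4741, 3091, 2260, 2184, 1782, 1769, 1514,
    1472, 1409, 1321, 1295, 1004, 883, 872, 846, 798, 794,
]
TOTAL_ENTRIES = 101_602  # sequence entries in release 40.0 (section 1)

top20_total = sum(TOP20_COUNTS)
share = 100 * top20_total / TOTAL_ENTRIES

print(top20_total)        # 45181
print(f"{share:.1f} %")   # 44.5 %
```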

A.2.1 Table of the frequency of occurrence of species

   Species represented   1x:  3396
                         2x:  1086
                         3x:   589
                         4x:   366
                         5x:   267
                         6x:   251
                         7x:   169
                         8x:   137
                         9x:   125
                        10x:    61
                    11- 20x:   308
                    21- 50x:   231
                    51-100x:    78
                      >100x:   124

A.2.2 Table of the most represented species

------ --------- --------------------------------------------
Number Frequency Species
------ --------- --------------------------------------------
     1      7471 Homo sapiens (Human)
     2      4859 Saccharomyces cerevisiae (Baker's yeast)
     3      4816 Mus musculus (Mouse)
     4      4741 Escherichia coli
     5      3091 Rattus norvegicus (Rat)
     6      2260 Bacillus subtilis
     7      2184 Caenorhabditis elegans
     8      1782 Schizosaccharomyces pombe (Fission yeast)
     9      1769 Haemophilus influenzae
    10      1514 Drosophila melanogaster (Fruit fly)
    11      1472 Methanococcus jannaschii
    12      1409 Arabidopsis thaliana (Mouse-ear cress)
    13      1321 Mycobacterium tuberculosis
    14      1295 Bos taurus (Bovine)
    15      1004 Gallus gallus (Chicken)
    16       883 Synechocystis sp. (strain PCC 6803)
    17       872 Escherichia coli O157:H7
    18       846 Salmonella typhimurium
    19       798 Archaeoglobus fulgidus
    20       794 Xenopus laevis (African clawed frog)
    21       765 Sus scrofa (Pig)
    22       680 Aquifex aeolicus
    23       671 Oryctolagus cuniculus (Rabbit)
    24       662 Mycoplasma pneumoniae
    25       594 Pseudomonas aeruginosa
    26       588 Treponema pallidum
    27       557 Buchnera aphidicola (subsp. Acyrthosiphon pisum)
    28       523 Rickettsia prowazekii
    29       522 Helicobacter pylori (Campylobacter pylori)
    30       505 Helicobacter pylori J99 (Campylobacter pylori J99)
    31       503 Mycobacterium leprae
    32       486 Mycoplasma genitalium
    33       481 Zea mays (Maize)
    34       450 Methanobacterium thermoautotrophicum
    35       403 Rhizobium sp. (strain NGR234)
    36       395 Borrelia burgdorferi (Lyme disease spirochete)
    37       390 Oryza sativa (Rice)
    38       387 Chlamydia trachomatis
    39       375 Thermotoga maritima
    40       374 Streptomyces coelicolor
    41       371 Chlamydia pneumoniae (Chlamydophila pneumoniae)
    42       368 Canis familiaris (Dog)
    43       364 Chlamydia muridarum
    44       356 Rhizobium meliloti (Sinorhizobium meliloti)
    45       353 Vibrio cholerae
    46       333 Nicotiana tabacum (Common tobacco)
    47       323 Pasteurella multocida
    48       322 Ovis aries (Sheep)
    49       320 Pyrococcus horikoshii
    50       311 Dictyostelium discoideum (Slime mold)
    51       301 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
    52       284 Pyrococcus abyssi
    53       276 Pisum sativum (Garden pea)
    54       272 Bacteriophage T4
    55       260 Staphylococcus aureus
    56       256 Candida albicans (Yeast)
    57       255 Neurospora crassa
    58       254 Vaccinia virus (strain Copenhagen)
    59       247 Triticum aestivum (Wheat)
    60       247 Bacillus halodurans
    61       244 Glycine max (Soybean)
    62       243 Hordeum vulgare (Barley)
    63       242 Aeropyrum pernix
    64       241 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
    65       231 Pseudomonas putida
    66       227 Lycopersicon esculentum (Tomato)
    67       221 Cavia porcellus (Guinea pig)
    68       220 Porphyra purpurea
    69       219 Solanum tuberosum (Potato)
    70       214 Spinacia oleracea (Spinach)
    71       214 Klebsiella pneumoniae
    72       213 Bacillus stearothermophilus
    73       210 Neisseria meningitidis (serogroup B)
    74       204 Neisseria meningitidis (serogroup A)
    75       193 Human cytomegalovirus (strain AD169)
    76       188 Campylobacter jejuni
    77       187 Vaccinia virus (strain WR)
    78       183 Deinococcus radiodurans
    79       180 Agrobacterium tumefaciens
    80       179 Sulfolobus solfataricus
    81       179 Brachydanio rerio (Zebrafish) (Zebra danio)
    82       173 Equus caballus (Horse)
    83       171 Mesocricetus auratus (Golden hamster)
    84       171 Chlamydomonas reinhardtii
    85       170 Thermoplasma acidophilum
    86       168 Emericella nidulans (Aspergillus nidulans)
    87       158 Halobacterium sp. (strain NRC-1)
    88       154 Autographa californica nuclear polyhedrosis virus (AcMNPV)
    89       153 Cyanidium caldarium
    90       152 Thermus aquaticus (subsp. thermophilus)
    91       151 Marchantia polymorpha (Liverwort)
    92       151 Cyanophora paradoxa
    93       149 Xylella fastidiosa
    94       148 Fowlpox virus (FPV)
    95       148 Guillardia theta (Cryptomonas phi)
    96       147 Synechococcus sp. (strain PCC 7942) (Anacystis nidulans R2)
    97       147 Variola virus
    98       143 Caulobacter crescentus
    99       142 Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
   100       142 Kluyveromyces lactis (Yeast)

A.2.3 Taxonomic distribution of the sequences

   Kingdom     Sequences (% of the database)
   Archaea          5032 (  5%)
   Bacteria        34782 ( 34%)
   Eukaryota       53357 ( 53%)
   Viruses          8431 (  8%)

A.3 Sequence size

A.3.1 Repartition of the sequences by size (excluding fragments)

      From   To  Number        From   To  Number
        1-  50    1950       1001-1100     915
       51- 100    7099       1101-1200     708
      101- 150   10484       1201-1300     471
      151- 200    9010       1301-1400     318
      201- 250    8978       1401-1500     268
      251- 300    8130       1501-1600     172
      301- 350    7894       1601-1700     150
      351- 400    7945       1701-1800     105
      401- 450    5869       1801-1900     116
      451- 500    5485       1901-2000      87
      501- 550    4190       2001-2100      47
      551- 600    2852       2101-2200      87
      601- 650    2249       2201-2300      89
      651- 700    1651       2301-2400      50
      701- 750    1457       2401-2500      48
      751- 800    1240       >2500         273
      801- 850     985
      851- 900     965
      901- 950     700
      951-1000     593

A.3.2 Longest and shortest sequences

The shortest sequence is GRWM_HUMAN (P24272): 3 amino acids.
The longest sequence is NEBU_HUMAN (P20929): 6669 amino acids.

A.4 Journal citations

Note: the following citation statistics reflect the number of distinct
journal citations.

Total number of journals cited in this release of SWISS-PROT: 1'190

A.4.1 Table of the frequency of journal citations

   Journals cited   1x:   443
                    2x:   157
                    3x:    87
                    4x:    58
                    5x:    51
                    6x:    27
                    7x:    24
                    8x:    19
                    9x:    21
                   10x:    11
               11- 20x:    83
               21- 50x:    88
               51-100x:    31
                 >100x:    90

A.4.2 List of the most cited journals in SWISS-PROT

 Nb Citations Journal name
 -- --------- -------------------------------------------------------------
  1      8033 Journal of Biological Chemistry
  2      4615 Proceedings of the National Academy of Sciences of the U.S.A.
  3      3554 Nucleic Acids Research
  4      3295 Journal of Bacteriology
  5      3144 Gene
  6      2492 FEBS Letters
  7      2293 Biochemical and Biophysical Research Communications
  8      2255 European Journal of Biochemistry
  9      2144 Biochemistry
 10      1998 The EMBO Journal
 11      1894 Nature
 12      1833 Biochimica et Biophysica Acta
 13      1682 Journal of Molecular Biology
 14      1503 Genomics
 15      1477 Cell
 16      1434 Molecular and Cellular Biology
 17      1096 Biochemical Journal
 18      1085 Molecular and General Genetics
 19      1078 Plant Molecular Biology
 20      1024 Science
 21       982 Molecular Microbiology
 22       814 Virology
 23       808 Journal of Biochemistry
 24       637 Human Molecular Genetics
 25       592 Journal of Cell Biology
 26       573 Journal of Virology
 27       525 Human Mutation
 28       520 Plant Physiology
 29       518 Genes and Development
 30       510 Yeast
 31       505 Nature Genetics
 32       494 Oncogene
 33       486 Journal of General Virology
 34       477 Infection and Immunity
 35       461 Journal of Immunology
 36       441 The American Journal of Human Genetics
 37       424 Structure
 38       420 Archives of Biochemistry and Biophysics
 39       391 FEMS Microbiology Letters
 40       366 Microbiology
 41       358 Current Genetics
 42       346 Development
 43       333 Nature Structural Biology
 44       331 Molecular and Biochemical Parasitology
 45       320 Human Genetics
 46       293 Genetics
 47       280 Molecular Endocrinology
 48       277 Journal of Clinical Investigation
 49       270 Biological Chemistry Hoppe-Seyler
 50       267 Applied and Environmental Microbiology
 51       265 Blood
 52       263 Journal of Molecular Evolution
 53       253 Protein Science
 54       249 DNA and Cell Biology
 55       243 Developmental Biology
 56       229 Journal of General Microbiology
 57       224 Journal of Experimental Medicine
 58       213 Neuron
 59       213 Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
 60       211 Cancer Research
 61       210 Immunogenetics
 62       208 Mammalian Genome
 63       197 Endocrinology
 64       182 Mechanisms of Development
 65       180 DNA Sequence
 66       170 Acta Crystallographica, Section D
 67       164 The Plant Cell
 68       161 Brain Research. Molecular Brain Research
 69       159 Journal of Neurochemistry
 70       158 Molecular Biology and Evolution
 71       156 DNA
 72       155 Molecular Biology of the Cell
 73       147 The Plant Journal
 74       146 Journal of Cell Science
 75       145 Journal of Neuroscience
 76       135 Comparative Biochemistry and Physiology
 77       133 Bioscience, Biotechnology, and Biochemistry
 78       130 Antimicrobial Agents and Chemotherapy
 79       125 Biochimie
 80       123 Virus Research
 81       122 Bioorganicheskaia Khimiia
 82       120 Molecular Pharmacology
 83       117 Hemoglobin
 84       116 The Journal of Clinical Endocrinology and Metabolism
 85       113 Agricultural and Biological Chemistry
 86       112 Cytogenetics and Cell Genetics
 87       112 American Journal of Physiology
 88       110 Molecular Plant-Microbe Interactions
 89       105 Proteins
 90       102 Peptides
 91       100 DNA Research

A.5 Statistics for some line types

The following table summarizes the total number of some SWISS-PROT lines,
as well as the number of entries with at least one such line, and the
frequency of the lines.

                                     Total    Number of  Average
Line type / subtype                  number   entries    per entry
---------------------------------    -------- ---------  ---------
References (RL)                        182326               1.79
   Journal                             152419     89829     1.50
   Submitted to EMBL/GenBank/DDBJ       27607     24142     0.27
   Unpublished observations               500       496    <0.01
   Book citation                          438       428    <0.01
   Submitted to SWISS-PROT                437       435    <0.01
   Plant Gene Register                    385       378    <0.01
   Submitted to other databases           185       183    <0.01
   Thesis                                 160       159    <0.01
   Unpublished results                    114       112    <0.01
   Patent                                  79        77    <0.01
   Worm Breeder's Gazette                   2         2    <0.01

Comments (CC)                          309232               3.04
   SIMILARITY                           91246     81758     0.90
   FUNCTION                             61984     61049     0.61
   SUBCELLULAR LOCATION                 42010     42010     0.41
   CATALYTIC ACTIVITY                   27896     26508     0.27
   SUBUNIT                              25865     25864     0.25
   PATHWAY                              11464     11431     0.11
   TISSUE SPECIFICITY                   10070     10070     0.10
   COFACTOR                              7811      7811     0.08
   MISCELLANEOUS                         6942      6352     0.07
   PTM                                   5829      5447     0.06
   INDUCTION                             2971      2971     0.03
   DEVELOPMENTAL STAGE                   2811      2811     0.03
   ALTERNATIVE PRODUCTS                  2755      2754     0.03
   DOMAIN                                2658      2471     0.03
   CAUTION                               2169      2099     0.02
   DISEASE                               1865      1620     0.02
   ENZYME REGULATION                     1473      1473     0.01
   MASS SPECTROMETRY                      548       506     0.01
   DATABASE                               503       465    <0.01
   POLYMORPHISM                           295       287    <0.01
   PHARMACEUTICAL                          38        38    <0.01
   BIOTECHNOLOGY                           29        29    <0.01

Features (FT)                          471213               4.64
   DOMAIN                               76115     22381     0.75
   TRANSMEM                             64913     14473     0.64
   CARBOHYD                             40298      9840     0.40
   CONFLICT                             36638     12924     0.36
   DISULFID                             34856      9355     0.34
   METAL                                27931      6801     0.27
   CHAIN                                20956     16975     0.21
   VARIANT                              18980      3544     0.19
   ACT_SITE                             18495     11839     0.18
   REPEAT                               17543      3013     0.17
   SIGNAL                               12976     12975     0.13
   NP_BIND                              12514      8916     0.12
   MOD_RES                              11665      6503     0.11
   NON_TER                              10234      7849     0.10
   BINDING                               7710      6160     0.08
   TURN                                  7330       633     0.07
   STRAND                                7077       562     0.07
   ZN_FING                               5911      2061     0.06
   INIT_MET                              4892      4868     0.05
   HELIX                                 4644       587     0.05
   VARSPLIC                              4211      2068     0.04
   SITE                                  4151      3019     0.04
   PROPEP                                3842      3488     0.04
   DNA_BIND                              3796      3589     0.04
   MUTAGEN                               2797       963     0.03
   LIPID                                 2684      2174     0.03
   TRANSIT                               2300      2284     0.02
   PEPTIDE                               2202       830     0.02
   CA_BIND                               2106       840     0.02
   NON_CONS                               732       387     0.01
   UNSURE                                 255       117    <0.01
   SIMILAR                                242       203    <0.01
   SE_CYS                                 104        64    <0.01
   THIOETH                                 90        31    <0.01
   THIOLEST                                23        23    <0.01

Cross-references (DR)                  718458               7.07
   EMBL                                179318     95610     1.76
   InterPro                            128566     81051     1.27
   Pfam                                101086     77741     0.99
   PROSITE                              83189     53484     0.82
   PIR                                  47057     35789     0.46
   HSSP                                 33548     33548     0.33
   PRINTS                               30494     27899     0.30
   SMART                                30434     22855     0.30
   ProDom                               16772     16337     0.17
   PDB                                  10380      3124     0.10
   TIGR                                  9378      9343     0.09
   MIM                                   6755      6024     0.07
   SGD                                   4903      4849     0.05
   MGD                                   4408      4397     0.04
   EcoGene                               4134      4132     0.04
   Mendel                                3041      2942     0.03
   MEROPS                                2348      2260     0.02
   SubtiList                             2234      2233     0.02
   WormPep                               2071      2034     0.02
   FlyBase                               1936      1883     0.02
   GCRDb                                 1661       972     0.02
   TRANSFAC                              1612      1494     0.02
   TubercuList                           1350      1313     0.01
   StyGene                                799       798     0.01
   SWISS-2DPAGE                           746       745     0.01
   Leproma                                501       497    <0.01
   MaizeDB                                402       398    <0.01
   HIV                                    370       354    <0.01
   REBASE                                 352       347    <0.01
   ECO2DBASE                              351       299    <0.01
   DictyDb                                313       310    <0.01
   GlycoSuiteDB                           249       249    <0.01
   ZFIN                                   154       154    <0.01
   YEPD                                   129       120    <0.01
   Aarhus/Ghent-2DPAGE                    128        98    <0.01
   PHCI-2DPAGE                            128       128    <0.01
   Siena-2DPAGE                           104       104    <0.01
   HSC-2DPAGE                              85        85    <0.01
   COMPLUYEAST-2DPAGE                      50        50    <0.01
   CarbBank                                41        21    <0.01
   Maize-2DPAGE                            39        39    <0.01
   PMMA-2DPAGE                             26        26    <0.01
   MypuList                                21        21    <0.01
   ANU-2DPAGE                              13        13    <0.01

A.6 Miscellaneous statistics

Total number of distinct authors cited in SWISS-PROT      : 146'936
Total number of entries encoded on a chloroplast          :   2'609
Total number of entries encoded on a mitochondrion        :   2'262
Total number of entries encoded on a cyanelle             :     145
Total number of entries encoded on a plasmid              :   2'344
Number of additional sequences encoded on splice variants :   3'505

--End of document--