Swiss-Prot release 38.0
Published July 1, 1999
SWISS-PROT RELEASE 38.0 RELEASE NOTES

1. INTRODUCTION

Release 38.0 of SWISS-PROT contains 80'000 sequence entries, comprising 29'085'965 amino acids abstracted from 64'965 references. This represents an increase of 3% over release 37. The growth of the data bank is summarized below.

   Release   Date    Number of   Number of amino
                     entries     acids

    2.0      09/86        3939        900 163
    3.0      11/86        4160        969 641
    4.0      04/87        4387      1 036 010
    5.0      09/87        5205      1 327 683
    6.0      01/88        6102      1 653 982
    7.0      04/88        6821      1 885 771
    8.0      08/88        7724      2 224 465
    9.0      11/88        8702      2 498 140
   10.0      03/89       10008      2 952 613
   11.0      07/89       10856      3 265 966
   12.0      10/89       12305      3 797 482
   13.0      01/90       13837      4 347 336
   14.0      04/90       15409      4 914 264
   15.0      08/90       16941      5 486 399
   16.0      11/90       18364      5 986 949
   17.0      02/91       20024      6 524 504
   18.0      05/91       20772      6 792 034
   19.0      08/91       21795      7 173 785
   20.0      11/91       22654      7 500 130
   21.0      03/92       23742      7 866 596
   22.0      05/92       25044      8 375 696
   23.0      08/92       26706      9 011 391
   24.0      12/92       28154      9 545 427
   25.0      04/93       29955     10 214 020
   26.0      07/93       31808     10 875 091
   27.0      10/93       33329     11 484 420
   28.0      02/94       36000     12 496 420
   29.0      06/94       38303     13 464 008
   30.0      10/94       40292     14 147 368
   31.0      02/95       43470     15 335 248
   32.0      11/95       49340     17 385 503
   33.0      02/96       52205     18 531 384
   34.0      10/96       59021     21 210 389
   35.0      11/97       69113     25 083 768
   36.0      07/98       74019     26 840 295
   37.0      12/98       77977     28 268 293
   38.0      07/99       80000     29 085 965

2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 37

2.1 Sequences and annotations

2'106 sequences have been added since release 37, the sequence data of 400 existing entries has been updated and the annotations of 12'576 entries have been revised.

2.2 What's happening with the model organisms

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

   o  Be as complete as possible. All sequences available at a given time should be immediately included in SWISS-PROT.
      This also includes sequence corrections and updates;
   o  Provide a higher level of annotation;
   o  Provide cross-references to specialized database(s) that contain, among other data, some genetic information about the genes that code for these proteins;
   o  Provide specific indices or documents.

Here is the current status of the model organisms in SWISS-PROT:

   Organism         Database           Index file       Number of
                    cross-referenced                    sequences
   --------------   ----------------   --------------   ---------
   A.thaliana       None yet           In preparation         821
   B.subtilis       SubtiList          SUBTILIS.TXT          2069
   C.albicans       None yet           CALBICAN.TXT           221
   C.elegans        Wormpep            CELEGANS.TXT          2202
   D.discoideum     DictyDb            DICTY.TXT              292
   D.melanogaster   FlyBase            FLY.TXT               1088
   E.coli           EcoGene            ECOLI.TXT             4516
   H.influenzae     HiDB (TIGR)        HAEINFLU.TXT          1698
   H.sapiens        MIM                MIMTOSP.TXT           5406
   H.pylori         HpDB (TIGR)        HPYLORI.TXT            382
   M.genitalium     MgDB (TIGR)        MGENITAL.TXT           469
   M.musculus       MGD                MGDTOSP.TXT           3549
   M.jannaschii     MjDB (TIGR)        MJANNASC.TXT          1312
   M.tuberculosis   None yet           None yet               928
   S.cerevisiae     SGD                YEAST.TXT             4811
   S.typhimurium    StyGene            SALTY.TXT              727
   S.pombe          None yet           POMBE.TXT             1438
   S.solfataricus   None yet           None yet                86
   --------------   ----------------   --------------   ---------

Collectively the entries from the above model organisms (32'015 in total) represent 40% of all SWISS-PROT entries.

We plan to finish as quickly as possible the annotation of the Escherichia coli, Haemophilus influenzae, Methanococcus jannaschii and yeast (S.cerevisiae) sequence entries which are not yet part of SWISS-PROT.

Please also see the description of the Human Proteomics Initiative in section 10 of these release notes.

2.3 First steps in the conversion of SWISS-PROT to mixed-case characters

We are gradually converting SWISS-PROT entries from all UPPER CASE to MiXeD CaSe. The line types that have been converted between releases 37 and 38 are: DT (DaTe), OS (Organism Species), OC (Organism Classification), OG (OrGanelle), RL (Reference Location) and KW (KeyWord).
The RT (Reference Title) lines were already introduced in mixed case at release 37. As described in section 3.1, the process of converting all of SWISS-PROT to mixed case is continuing.

2.4 Small change in the format of RL lines for submissions to the DNA databases

Along with the conversion of the RL lines to mixed case (see 2.3) we have also made a small change to the format of RL lines for submissions to the DNA databases. What used to be:

   RL   SUBMITTED (MMM-YEAR) TO EMBL/GENBANK/DDBJ DATA BANKS.

is now:

   RL   Submitted (MMM-YEAR) to the EMBL/GenBank/DDBJ databases.

This change was made to follow more closely the format used by the EMBL nucleotide sequence database.

2.5 Introduction of a new CC line-type topic: MISCELLANEOUS

We have introduced in this release a new 'topic' for the comments (CC) line type: MISCELLANEOUS. This topic is used for all comments which do not belong to any other already defined topic. This means that, starting with the current release, all comments are assigned to a topic. Example, what was previously:

   CC   -!- BINDS TO BACITRACIN.

is now:

   CC   -!- MISCELLANEOUS: BINDS TO BACITRACIN.

2.6 Cleaning up of the SIMILARITY comment line (CC) topic

We are continuing a major overhaul of the SIMILARITY topic. We would like the majority of the information stored in this topic to be usable by computer programs (while remaining human-readable). We are therefore standardizing the format of this topic using two different subformats. One to describe to which family a protein belongs:

   CC   -!- SIMILARITY: BELONGS TO THE <Name1> FAMILY [OF <Name2>].
   CC       [<Name3> SUBFAMILY.]

Examples:

   CC   -!- SIMILARITY: BELONGS TO THE 14-3-3 FAMILY.
   CC   -!- SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
   CC       FAMILY.
   CC   -!- SIMILARITY: BELONGS TO THE AAA FAMILY OF ATPASES.
   CC   -!- SIMILARITY: BELONGS TO THE IRON/ASCORBATE-DEPENDENT FAMILY OF
   CC       OXIDOREDUCTASES.
   CC   -!- SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
   CC       "DEFORMED" SUBFAMILY.
   CC   -!- SIMILARITY: BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY.
   CC       KINESIN SUBFAMILY.

And one to describe which domains are found in a given protein:

   CC   -!- SIMILARITY: CONTAINS n <Name> [DOMAIN|REPEAT][S].

Examples:

   CC   -!- SIMILARITY: CONTAINS 1 FHA DOMAIN.
   CC   -!- SIMILARITY: CONTAINS 45 EGF-LIKE DOMAINS.
   CC   -!- SIMILARITY: CONTAINS 2 SH3 DOMAINS.
   CC   -!- SIMILARITY: CONTAINS 2 SUSHI (SCR) REPEATS.

We have already updated many entries in this and the previous releases and plan to complete this change for the next release.

2.7 Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the Zebrafish Information Network (ZFIN) database available at http://zfish.uoregon.edu/ZFIN/ (see: Westerfield M., Doerry E., Kirkpatrick A.E. and Douglas S.A.; Meth. Cell Biol. 60:339-355(1999)). These cross-references are present in the DR lines:

   Data bank identifier: ZFIN
   Primary identifier  : The ZFIN identifier for a given gene.
   Secondary identifier: The gene designation.
   Example             : DR   ZFIN; ZDB-GENE-980526-290; hoxa1.

We have started to add cross-references from SWISS-PROT to the CarbBank Complex Carbohydrate Structure Database (CCSD) (http://22.214.171.124/carbbank/). These cross-references are present in the DR lines:

   Data bank identifier: CARBBANK
   Primary identifier  : The CarbBank identifier for a given carbohydrate structure.
   Secondary identifier: A dash (-).
   Example             : DR   CARBBANK; CCSD:27494; -.

In this release, we have also updated all the DR lines pointing to the MIM and Pfam databases.

2.8 Switching from pID to protein_ID in cross-references to the DNA sequence databases

The DNA sequence databases (EMBL/GenBank/DDBJ) recently changed their referencing system for CDS (CoDing Sequence). They used to associate every CDS in the database with what was called a pID. The pID was a string of variable length composed of a letter (D, E or G) followed by a number (example: E345673).
Whenever the protein sequence coded by a CDS changed due to a sequence or annotation revision, a new pID was attributed to that CDS. This system made it difficult to track changes. pIDs have therefore been replaced by what is now called the 'protein_ID' (protein sequence IDentifier). The protein_ID consists of a stable ID portion (8 characters: 3 letters followed by 5 digits) plus a version number after a decimal point (example: AAA03208.1). The version number changes only when the protein sequence coded by the CDS changes, while the stable part remains unchanged.

In release 38, we have converted the cross-references to EMBL/GenBank/DDBJ to use the protein_ID instead of the pID as the secondary identifier in these DR lines. Example, what was previously:

   DR   EMBL; Z75208; E1165324; -.

is now:

   DR   EMBL; Z75208; CAA99603.1; -.

For a number of technical reasons, there are still 732 pIDs referenced in release 38. They will gradually be replaced by the corresponding protein_IDs for release 39.

2.9 Introduction of a unique identifier in the VARIANT feature description of human sequence entries

We have introduced in release 38 a unique identifier for all VARIANT feature keys in human sequence entries. This change is the first step toward providing a unique identifier for all SWISS-PROT features. Human sequence variants were chosen as a prototype for this improvement. It is now possible to link specific sequence variants directly to the relevant entries in disease mutation databases, and to give these databases a way to implement reciprocal links.

The unique identifier has the form /FTId=VAR_nnnnnn and is added as the last part of the description field of 'VARIANT' feature keys. Example, what was previously:

   FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA).
   FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
   FT                                UNSTABLE).

is now:

   FT   VARIANT       6      6       E -> V (IN S; SICKLE CELL ANEMIA).
   FT                                /FTId=VAR_002863.
   FT   VARIANT      11     11       V -> D (IN WINDSOR; O2 AFFINITY UP;
   FT                                UNSTABLE).
   FT                                /FTId=VAR_002873.

3. FORTHCOMING CHANGES

3.1 Continuation of the conversion of SWISS-PROT to mixed-case characters

We will continue to convert SWISS-PROT entries from all UPPER CASE to MiXeD CaSe. In release 39 we are planning to convert the RA (Reference Author) and RC (Reference Comment) line types. We will also convert the gene designations in the DR (Database cross-Reference) lines for MGD, EcoGene, StyGene, SubtiList and DictyDb to mixed case. Further lines will be converted in release 40. Here is an example of what a SWISS-PROT entry will look like in release 39:

   ID   HXC4_MOUSE     STANDARD;      PRT;   264 AA.
   AC   Q08624;
   DT   01-OCT-1994 (Rel. 30, Created)
   DT   01-OCT-1994 (Rel. 30, Last sequence update)
   DT   15-DEC-1999 (Rel. 39, Last annotation update)
   DE   HOMEOBOX PROTEIN HOX-C4 (HOX-3.5).
   GN   HOXC4 OR HOXC-4 OR HOX-3.5.
   OS   Mus musculus (Mouse).
   OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
   OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
   RN   [1]
   RP   SEQUENCE FROM N.A.
   RC   STRAIN=Balb/C; TISSUE=Liver;
   RX   MEDLINE; 93288004.
   RA   Goto J., Miyabayashi T., Wakamatsu Y., Takahashi N., Muramatsu M.;
   RT   "Organization and expression of mouse Hox3 cluster genes.";
   RL   Mol. Gen. Genet. 239:41-48(1993).
   RN   [2]
   RP   SEQUENCE FROM N.A.
   RC   TISSUE=Embryo;
   RX   MEDLINE; 93161956.
   RA   Geada A.M.C., Gaunt S.J., Azzawi M., Shimeld S.M., Pearce J.,
   RA   Sharpe P.T.;
   RT   "Sequence and embryonic expression of the murine Hox-3.5 gene.";
   RL   Development 116:497-506(1992).
   RN   [3]
   RP   SEQUENCE OF 177-201 FROM N.A.
   RC   STRAIN=C57BL/6; TISSUE=Spleen;
   RX   MEDLINE; 92073357.
   RA   Murtha M.T., Leckman J.F., Ruddle F.H.;
   RT   "Detection of homeobox genes in development and evolution.";
   RL   Proc. Natl. Acad. Sci. U.S.A. 88:10711-10715(1991).
   CC   -!- FUNCTION: SEQUENCE-SPECIFIC TRANSCRIPTION FACTOR WHICH IS PART OF
   CC       A DEVELOPMENTAL REGULATORY SYSTEM THAT PROVIDES CELLS WITH
   CC       SPECIFIC POSITIONAL IDENTITIES ON THE ANTERIOR-POSTERIOR AXIS.
   CC   -!- SUBCELLULAR LOCATION: NUCLEAR.
   CC   -!- SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
   CC       "DEFORMED" SUBFAMILY.
   DR   EMBL; D11328; BAA01947.1; -.
   DR   EMBL; S62287; AAB27153.1; -.
   DR   EMBL; X69019; CAA48784.1; -.
   DR   EMBL; M81660; AAA63313.1; -.
   DR   PIR; S35219; S35219.
   DR   HSSP; P02833; 1SAN.
   DR   MGD; MGI:96195; Hoxc4.
   DR   PFAM; PF00046; homeobox; 1.
   DR   PROSITE; PS00027; HOMEOBOX_1; 1.
   DR   PROSITE; PS00032; ANTENNAPEDIA; 1.
   DR   PROSITE; PS50071; HOMEOBOX_2; 1.
   KW   Homeobox; DNA-binding; Developmental protein; Nuclear protein;
   KW   Transcription regulation.
   FT   DOMAIN       54     60       POLY-PRO.
   FT   DOMAIN      135    140       ANTP-TYPE HEXAPEPTIDE (BY SIMILARITY).
   FT   DNA_BIND    156    215       HOMEOBOX (BY SIMILARITY).
   FT   DOMAIN      183    186       POLY-ARG.
   FT   CONFLICT     80     80       A -> G (IN REF. 2).
   FT   CONFLICT     96     96       P -> S (IN REF. 2).
   SQ   SEQUENCE   264 AA;  29865 MW;  611C069F CRC32;
        MIMSSYLMDS NYIDPKFPPC EEYSQNSYIP EHSPEYYGRT RESGFQHHHQ ELYPPPPPRP
        SYPERQYSCT SLQGPGNSRA HGPAQAGHHH PEKSQPLCEP APLSGTSASP SPAPPACSQP
        APDHPSSAAS KQPIVYPWMK KIHVSTVNPN YNGGEPKRSR TAYTRQQVLE LEKEFHYNRY
        LTRRRRIEIA HSLCLSERQI KIWFQNRRMK WKKDHRLPNT KVRSAPPAGA APSTLSAATP
        GTSEDHSQSA TPPEQQRAED ITRL
   //

3.2 Extension of the accession number system

With the creation of the TrEMBL database (see section 6) and the rapid increase in the amount of sequence data, we are faced with a shortage of available accession numbers. Currently we use a system based on a one-letter prefix followed by 5 digits. This system was also used by the nucleotide sequence databases, which had originally reserved the prefix letters 'O', 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run out of space (due mainly to ESTs), have been forced to start using a new format based on a two-letter prefix followed by 6 digits. We have now used up all possible numbers with 'O', 'P' and 'Q'.
As we believe that changing the format of the accession numbers to that now used by the nucleotide databases would create havoc in the numerous software packages using SWISS-PROT, we have decided to keep a system of accession numbers based on a six-character code, but with the following format extension:

      1        2         3           4           5         6
   [O,P,Q]   [0-9]   [A-Z,0-9]   [A-Z,0-9]   [A-Z,0-9]   [0-9]

What the above means is that we will keep a six-character code, but that in positions 3, 4 and 5 of this code any combination of letters and digits can be present. This format allows a total of 14 million accession numbers (3 x 10 x 36 x 36 x 36 x 10 = 13'996'800, up from 300'000 with the current system). We only allow digits in positions 2 and 6 so that SWISS-PROT accession numbers cannot be mistaken for gene names, acronyms, other types of accession numbers or ordinary words.

Examples: P0A3S2, Q2ASD4, O13YX2, P9B123

3.3 Introduction of a new FT key: SE_CYS

Selenocysteine is the 21st natural amino acid. It is now known to occur in several dozen proteins. Its mRNA codon is UGA, which usually serves as a stop codon but, with a specific downstream sequence forming a loop and a specific translational elongation factor, is recognized as the site of selenocysteine incorporation into proteins.

Very recently the joint nomenclature committee of the IUPAC/IUBMB (see http://www.chem.qmw.ac.uk/iupac/jcbn/) officially recommended (http://www.chem.qmw.ac.uk/iubmb/newsletter/1999/item3.html) a three-letter and a one-letter symbol for selenocysteine, namely Sec and U.

We recognize that introducing a new one-letter code in the sequence records would disrupt most, if not all, sequence analysis software. We have therefore decided to change, in SWISS-PROT, the rules used to annotate the presence of selenocysteine residues in sequence entries in the manner described below.
Currently selenocysteines are stored, in the sequence records, using the one-letter symbol C for cysteine and are indicated in the feature table (FT) by a line of the type:

   FT   BINDING       x      x       SELENIUM.

The one-letter code will not be changed (for the reason explained above), but we will introduce a specific feature key (SE_CYS) to indicate the presence of a selenocysteine at a given sequence position. The above example will therefore be changed to:

   FT   SE_CYS        x      x

We also want to remind users that the keyword 'Selenocysteine' is and will continue to be used to tag sequence entries that contain at least one such residue.

3.4 Introduction of a new CC line-type topic: PHARMACEUTICAL

We will introduce in the next release a new 'topic' for the comments (CC) line type: PHARMACEUTICAL. This topic will describe the use of a specific protein as a pharmaceutical drug. The information provided by such a topic will include the brand name(s) under which a protein is available, the name(s) of the company(ies) that produce it, and a short description of the therapeutic usage of the protein. Examples:

   CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
   CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment
   CC       of multiple sclerosis (MS). Betaseron is a slightly modified
   CC       form of IFNB1 with two residue substitutions.
   CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron).
   CC       Used in patients with renal cell carcinoma or metastatic
   CC       melanoma.

It should be noted that any entry containing such a comment field will also be tagged with the keyword 'Pharmaceutical'.

3.5 Multiple AC lines

Starting with release 39, there can be more than one AC (ACcession) line per SWISS-PROT entry. Strictly speaking this is not a format change: the SWISS-PROT user manual has always indicated that there could be more than one AC line per entry. Until recently, a single line was sufficient and the majority of entries contained only a single accession number.
However, in the process of providing an optimally non-redundant database, we are merging information from TrEMBL entries into SWISS-PROT entries. When we merge a TrEMBL entry into a SWISS-PROT one, we add to that SWISS-PROT entry the accession number(s) of the TrEMBL entry. The repetition of such a process sometimes produces an accession number list which can no longer fit in a single AC line. Therefore there will now be some entries with two, three (as shown below) or more AC lines.

   AC   P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;
   AC   Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;
   AC   Q16208; Q16522;

3.6 Change in the syntax of the SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content. The format of the SQ line is currently:

   SQ   SEQUENCE XXXX AA; XXXXXX MW; XXXXXXXX CRC32;

The last information item in the SQ line is a 32-bit CRC (Cyclic Redundancy Check) value which is computed from the sequence. As the number of available sequences is increasing rapidly, there are now a few cases where two sequences share the same CRC32 (although none of these also share the same molecular weight (MW) or number of amino acids (AA)). To address this issue we will, starting with the next release, replace the 32-bit CRC value by a 64-bit CRC. The format of the SQ line will therefore be changed to:

   SQ   SEQUENCE XXXX AA; XXXXXX MW; XXXXXXXXXXXXXXXX CRC64;

Example:

   SQ   SEQUENCE 233 AA; 25630 MW; 146A1B48A1475C86 CRC64;

4. STATUS OF THE DOCUMENTATION FILES

SWISS-PROT is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indices for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files. The following table lists all the documents that are currently available.
   USERMAN.TXT    User manual
   RELNOTES.TXT   Release notes for current release (38)
   OLDRLNOT.TXT   Release notes for previous release (37)
   SHORTDES.TXT   Short description of entries in SWISS-PROT
   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   TISSLIST.TXT   List of tissues [See 1]
   EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
   SUBMIT.TXT     Submission of sequence data to SWISS-PROT

   ACINDEX.TXT    Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index
   DELETEAC.TXT   Deleted accession number index

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries
   AATRNASY.TXT   List of aminoacyl-tRNA synthetases
   ALLERGEN.TXT   Nomenclature and index of allergen sequences
   ANNBIOCH.TXT   SWISS-PROT annotation: how is biochemical information assigned to sequence entries [See 2]
   BLOODGRP.TXT   List of blood group antigen proteins
   CALBICAN.TXT   Index of Candida albicans entries and their corresponding gene designations
   CDLIST.TXT     CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index of Caenorhabditis elegans entries and their corresponding gene Wormpep cross-references
   DICTY.TXT      Index of Dictyostelium discoideum entries and their corresponding gene designations and DictyDb cross-references
   EC2DTOSP.TXT   Index of Escherichia coli Gene-protein database entries referenced in SWISS-PROT
   ECOLI.TXT      Index of Escherichia coli K12 chromosomal entries and their corresponding EcoGene cross-references
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT
   EXTRADOM.TXT   Nomenclature of extracellular domains
   FLY.TXT        Index of Drosophila entries and FlyBase cross-references
   GLYCOSID.TXT   Classification of glycosyl hydrolase families and index of glycosyl hydrolase entries
   HAEINFLU.TXT   Index of Haemophilus influenzae RD chromosomal entries
   HOXLIST.TXT    Vertebrate homeotic Hox proteins: nomenclature and index
   HPYLORI.TXT    Index of Helicobacter pylori strain 26695 chromosomal entries
   HUMCHR16.TXT   Index of protein sequence entries encoded on human chromosome 16 [See 2]
   HUMCHR17.TXT   Index of protein sequence entries encoded on human chromosome 17
   HUMCHR18.TXT   Index of protein sequence entries encoded on human chromosome 18
   HUMCHR19.TXT   Index of protein sequence entries encoded on human chromosome 19
   HUMCHR20.TXT   Index of protein sequence entries encoded on human chromosome 20
   HUMCHR21.TXT   Index of protein sequence entries encoded on human chromosome 21
   HUMCHR22.TXT   Index of protein sequence entries encoded on human chromosome 22
   HUMCHRX.TXT    Index of protein sequence entries encoded on human chromosome X
   HUMCHRY.TXT    Index of protein sequence entries encoded on human chromosome Y
   HUMPVAR.TXT    Index of human proteins with sequence variants
   INITFACT.TXT   List and index of translation initiation factors
   MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
   METALLO.TXT    Classification of metallothioneins and index of entries in SWISS-PROT
   MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
   MGENITAL.TXT   Index of Mycoplasma genitalium chromosomal entries
   MJANNASC.TXT   Index of Methanococcus jannaschii entries
   NGR234.TXT     Table of putative genes in Rhizobium plasmid pNGR234a
   NOMLIST.TXT    List of nomenclature related references for proteins
   PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
   PDBTOSP.TXT    Index of X-ray crystallography Protein Data Bank (PDB) entries referenced in SWISS-PROT
   PEPTIDAS.TXT   Classification of peptidase families and index of peptidase entries
   PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
   POMBE.TXT      Index of Schizosaccharomyces pombe entries in SWISS-PROT and their corresponding gene designations
   RESTRIC.TXT    List of restriction enzyme and methylase entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the basis of sequence similarities
   SALTY.TXT      Index of Salmonella typhimurium LT2 chromosomal entries and their corresponding StyGene cross-references
   SUBTILIS.TXT   Index of Bacillus subtilis 168 chromosomal entries and their corresponding SubtiList cross-references
   UPFLIST.TXT    UPF (Uncharacterized Protein Families) list and index of members
   YEAST.TXT      Index of Saccharomyces cerevisiae entries and their corresponding gene designations
   YEAST1.TXT     Yeast Chromosome I entries
   YEAST2.TXT     Yeast Chromosome II entries
   YEAST3.TXT     Yeast Chromosome III entries
   YEAST5.TXT     Yeast Chromosome V entries
   YEAST6.TXT     Yeast Chromosome VI entries
   YEAST7.TXT     Yeast Chromosome VII entries
   YEAST8.TXT     Yeast Chromosome VIII entries
   YEAST9.TXT     Yeast Chromosome IX entries
   YEAST10.TXT    Yeast Chromosome X entries
   YEAST11.TXT    Yeast Chromosome XI entries
   YEAST13.TXT    Yeast Chromosome XIII entries
   YEAST14.TXT    Yeast Chromosome XIV entries

   1. The tissue list (tisslist.txt) has been converted to mixed-case characters;
   2. The annbioch.txt and humchr16.txt files are new documents introduced in this release.

We have continued to include in some SWISS-PROT documentation files the references of Web sites relevant to the subject under consideration. There are now 42 documents that include such links.

5. THE EXPASY WORLD-WIDE WEB SERVER

5.1 Background information

The most efficient and user-friendly way to browse interactively in SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use the World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy server was made available to the public in September 1993 and is reachable at the following address:

   http://www.expasy.ch/

The ExPASy WWW server allows access, using the user-friendly hypertext model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE and CD40Lbase databases and, through any SWISS-PROT protein sequence entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, EcoGene, FlyBase, GCRDb, MaizeDB, Mendel, OMIM, PDB, HSSP, Pfam, ProDom, REBASE, SGD, SubtiList/NRSub, TRANSFAC, YPD, ZFIN and Medline.
ExPASy also offers many tools for the analysis of protein sequences and 2D gels.

5.2 Swiss-Shop

We provide, on ExPASy, a service called Swiss-Shop (http://www.expasy.ch/swiss-shop/). Swiss-Shop is an automated sequence alerting system which allows users to obtain, by email, new sequence entries relevant to their field(s) of interest. Various criteria can be combined:

   - By entering one or more words that should be present in the description line;
   - By entering one or more species name(s) or taxonomic division(s);
   - By entering one or more keywords;
   - By entering one or more author names;
   - By entering the accession number (or entry name) of a PROSITE pattern or a user-defined sequence pattern;
   - By entering the accession number (or entry name) of an existing SWISS-PROT entry or by entering a private sequence.

Every week, the new sequences entered in SWISS-PROT are automatically compared with all the criteria that have been defined by the users. If a sequence corresponds to the selection criteria defined by a user, that sequence is sent by electronic mail.

5.3 What is new on ExPASy

ExPASy is constantly modified and improved. If you wish to be informed of the changes made to the server you can either:

   - Read the document 'History of changes, improvements and new features', which is available at the address: http://www.expasy.ch/history.html
   - Subscribe to Swiss-Flash, a service that reports news of databases, software and service developments. By subscribing to this service, you will automatically get Swiss-Flash bulletins by electronic mail. To subscribe use the address: http://www.expasy.ch/swiss-flash/

Among all the improvements and new features introduced during the last three months, here are those that we believe are specifically useful to SWISS-PROT users:

   1. We have switched our default view of a SWISS-PROT entry to that provided by the NiceProt tool. NiceProt offers a user-friendly tabular view of SWISS-PROT entries.
      Access to the original SWISS-PROT format is maintained and is directly available from the NiceProt view. Tools with similar functionalities have been developed to display the ENZYME and PROSITE databases (see sections 8.1 and 8.2).

   2. We have revised the ExPASy file and directory structure, in order to make the vast amount of data that has accumulated on the server since September 1993 available in a more structured manner, and to facilitate replication on our mirror sites. This has caused certain changes in html links, and you should update your bookmarks and links accordingly. If in doubt, please refer to the document 'How to create html links to ExPASy' (http://www.expasy.ch/expasy_urls.html).

      At the same time we wish to reiterate our announcement of the ExPASy mirror sites in Australia (http://expasy.proteome.org.au/) and Taiwan (http://expasy.nhri.org.tw/). For your own convenience, please use the mirror site closest to you.

      Please also make sure to update all bookmarks or links that use the old domain 'expasy.hcuge.ch', which was replaced by 'www.expasy.ch' in March 1997. The 'expasy.hcuge.ch' address might be disabled in the near future.

   3. WWW links have been implemented between SWISS-PROT and CarbBank, EcoGene and ZFIN.

6. TREMBL - A SUPPLEMENT TO SWISS-PROT

The ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences into the database without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. But as we also want to make the sequences available as fast as possible, we have introduced with SWISS-PROT a computer-annotated supplement.
This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except those already included in SWISS-PROT.

This supplement is named TrEMBL (Translation from EMBL). It can be considered as a preliminary section of SWISS-PROT. This SWISS-PROT release is supplemented by TrEMBL release 11.

TrEMBL is split in two main sections, SP-TrEMBL and REM-TrEMBL:

SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (199'794 in release 11) which should be incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned to all SP-TrEMBL entries.

REM-TrEMBL (REMaining TrEMBL) contains the entries (45'967 in release 11) that we do not want to include in SWISS-PROT for a variety of reasons (synthetic sequences, pseudogenes, translations of incorrect open reading frames, fragments with fewer than eight amino acids, patent-derived sequences, immunoglobulins and T-cell receptors, etc.).

TrEMBL is available by FTP from the EBI and ExPASy servers in the directory 'databases/trembl'. It can be queried on WWW by the EBI and ExPASy SRS servers. It is also searchable on the FASTA, BIC_SW and BLAST servers of the EBI.

7. FTP ACCESS TO SWISS-PROT AND TREMBL

7.1 Generalities

SWISS-PROT is available for download on the following anonymous FTP servers:

   Organization   Swiss Institute of Bioinformatics (SIB)
   Address        ftp.expasy.ch
   Directory      /databases/swiss-prot/

   Organization   European Bioinformatics Institute (EBI)
   Address        ftp.ebi.ac.uk
   Directory      /pub/databases/swissprot/

7.2 Weekly updates of SWISS-PROT

Weekly updates of SWISS-PROT are available by anonymous FTP. Three files are generated at each update:

   new_seq.dat   Contains all the new entries since the last full release;
   upd_seq.dat   Contains the entries for which the sequence data has been updated since the last release;
   upd_ann.dat   Contains the entries for which one or more annotation fields have been updated since the last release.
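
As an illustration only, here is a minimal Python sketch of how a program might read one of these files. It relies solely on conventions shown in these notes: entries end with a '//' line, accession numbers may be spread over several AC lines (section 3.5), accession numbers follow the six-character layout of section 3.2, and the SQ line carries either a CRC32 or a CRC64 value (section 3.6). The function names and the sample text are hypothetical, the second sample entry is made up for the example, and the sketch is not an official SWISS-PROT parser.

```python
import io
import re
from typing import Dict, Iterator, List, Optional, TextIO

# Accession number layout from section 3.2:
# [O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9]
ACCESSION_RE = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")

# SQ line carrying either the current CRC32 or the announced CRC64 value.
SQ_RE = re.compile(
    r"SQ\s+SEQUENCE\s+(\d+) AA;\s+(\d+) MW;\s+([0-9A-F]+) (CRC32|CRC64);"
)


def iter_entries(handle: TextIO) -> Iterator[List[str]]:
    """Yield each entry as a list of lines, split on the '//' terminator."""
    entry: List[str] = []
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        elif line.strip():
            entry.append(line)
    if entry:  # tolerate a last entry that lacks its terminator
        yield entry


def accession_numbers(entry: List[str]) -> List[str]:
    """Collect the accession numbers from every AC line of one entry."""
    accessions: List[str] = []
    for line in entry:
        if line.startswith("AC "):
            accessions.extend(
                token.strip() for token in line[2:].split(";") if token.strip()
            )
    return accessions


def sequence_header(entry: List[str]) -> Optional[Dict[str, object]]:
    """Return length, molecular weight and checksum from the SQ line, if any."""
    for line in entry:
        match = SQ_RE.fullmatch(line.strip())
        if match:
            length, weight, checksum, kind = match.groups()
            return {
                "length": int(length),
                "mw": int(weight),
                "checksum": checksum,
                "kind": kind,
            }
    return None


# Hypothetical sample: the first entry abridges the HXC4_MOUSE example of
# section 3.1; the second is invented to show several AC lines and a CRC64.
SAMPLE = """\
ID   HXC4_MOUSE     STANDARD;      PRT;   264 AA.
AC   Q08624;
SQ   SEQUENCE   264 AA;  29865 MW;  611C069F CRC32;
//
ID   EXAMPLE_ENTRY  STANDARD;      PRT;   233 AA.
AC   P16070; P22511; Q04858;
AC   Q13419; P0A3S2;
SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;
//
"""

if __name__ == "__main__":
    for entry in iter_entries(io.StringIO(SAMPLE)):
        acs = accession_numbers(entry)
        well_formed = all(ACCESSION_RE.fullmatch(ac) for ac in acs)
        print(acs, well_formed, sequence_header(entry))
```

To apply the same functions to a downloaded file, the io.StringIO(SAMPLE) argument would be replaced by an open file handle. Accepting both checksum labels in one pattern is a design choice that lets the same code read files produced before and after the change announced in section 3.6.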
Important notes

   o  Although we try to follow a regular schedule, we do not promise to update these files every week. In some cases two weeks may elapse between two updates.
   o  Instead of using the above files, you can, every week, download an updated copy of the SWISS-PROT database. This file is available in the directory containing the non-redundant database (see next section).

7.3 Non-redundant database

More than a year ago, we started to distribute, on the ExPASy and EBI FTP servers, files that make up a non-redundant (see further) and complete protein sequence database consisting of three components:

   1) SWISS-PROT
   2) TrEMBL
   3) New entries to be later integrated into TrEMBL (hereafter known as TrEMBL_New)

Every week three files are completely rebuilt. These files are named: sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z. As indicated by their .Z extension these are Unix compress format files which, when decompressed, will produce ASCII files in SWISS-PROT format.

Three other files are also available (sprot.fas.Z, trembl.fas.Z and trembl_new.fas.Z). These are compressed FASTA format sequence files, useful for building the databases used by FASTA, BLAST and other sequence similarity search programs. Please do not use these files for any other purpose, as you will lose all annotations by using this very primitive format.

The files for the non-redundant database are stored in the directory /databases/sp_tr_nrdb on the ExPASy FTP server (ftp.expasy.ch) and in the directory /pub/databases/sp_tr_nrdb on the EBI FTP server (ftp.ebi.ac.uk).

Additional notes

   o  The SWISS-PROT file continuously grows as new annotated sequences are added.
   o  The TrEMBL file decreases in size as sequences are moved out of that section after being annotated and moved into SWISS-PROT. Four times a year a new release of TrEMBL is built at EBI. At this point the TrEMBL file increases in size, as it then includes all of the new data (see next note) that has accumulated since the last release.
 o The TrEMBL_New file starts as a very small file and grows in size until a new release of TrEMBL is available.

 o SWISS-PROT and TrEMBL share the same system of accession numbers. Therefore you will not find any primary accession number duplicated between the two sections. A TrEMBL entry (and its associated accession number(s)) can either move to SWISS-PROT as a new entry or be merged with an existing SWISS-PROT entry. In the latter case, the accession number(s) of that TrEMBL entry are added to those of the SWISS-PROT entry.

 o TrEMBL_New does not have real accession numbers. However, it was necessary to have an AC line so as to be able to use it with different software products. This AC line contains a temporary identifier which consists of the protein_ID (protein sequence identifier) of the coding sequence in the parent nucleotide sequence.

 o TrEMBL_New is quite messy! You will of course find new sequence entries, but you will also encounter sequences that are going to be used to update existing TrEMBL or SWISS-PROT entries. None of the "cleaning" steps that are applied to produce a TrEMBL release are run on TrEMBL_New, nor are any of the computer-annotation software tools that are used to enhance the information content of TrEMBL. TrEMBL_New is provided only so that users can be sure not to miss any important new sequences when they run similarity searches.

 o While these three files allow you to build what we call a non-redundant database, it must be noted that this is not a completely true statement. Without going into a long explanation, we can say that this is currently the best attempt at providing a complete selection of protein sequence entries while trying to eliminate redundancies. While SWISS-PROT is completely (well, 99.994%!) non-redundant, TrEMBL is far from being non-redundant and the combination of SWISS-PROT + TrEMBL is even less so.
 o To describe to your users the version of the non-redundant database that you are providing them with, you should use a statement of the form:

      SWISS-PROT release 38 and updates until <current_date>;
      TrEMBL release 11 minus data integrated into SWISS-PROT as of <current_date>;
      New preliminary TrEMBL entries created since release 11 of TrEMBL

8. ENZYME AND PROSITE

8.1 The ENZYME nomenclature database

Release 25.0 of the ENZYME nomenclature database is distributed with release 38 of SWISS-PROT. ENZYME release 25.0 contains information relative to 3704 enzymes. In this release, we have added a significant number of synonyms (AN lines) to a number of entries.

The WWW version of ENZYME on ExPASy now provides a more user-friendly tabular view of enzyme entries through a new tool called NiceZyme. NiceZyme also provides direct links, through Medline, to literature references relevant to a specific enzyme. You can use this tool to link to any ENZYME entry by using the following type of URL:

   http://www.expasy.ch/cgi-bin/nicezyme.pl?a.b.c.d

(where a.b.c.d is any valid enzyme EC number; example: 126.96.36.199).

Please also note that the URL of the top page of ENZYME has moved to:

   http://www.expasy.ch/enzyme/

8.2 The PROSITE database

Release 16.0 of the PROSITE database is distributed with release 38 of SWISS-PROT. This release of PROSITE contains 1034 documentation entries that describe 1'374 different patterns, rules and profiles/matrices. Since release 15.0, 20 entries have been added and 180 entries have been updated.

The WWW version of PROSITE on ExPASy now provides a more user-friendly tabular view of PROSITE entries through a new tool called NiceSite. You can use this tool to link to any PROSITE entry by using the following types of URL:

   http://www.expasy.ch/cgi-bin/nicesite.pl?PSxxxxx

(where PSxxxxx is any valid PROSITE pattern or matrix entry) and

   http://www.expasy.ch/cgi-bin/nicedoc.pl?PDOCxxxxx

(where PDOCxxxxx is any valid PROSITE document entry).
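The NiceZyme, NiceSite and NiceDoc link forms described in sections 8.1 and 8.2 could be assembled programmatically along the following lines. This is only a sketch: the function names are hypothetical, and the identifier checks (four dot-separated integers for an EC number, five digits after PS or PDOC) are assumptions made here for illustration rather than rules stated in these notes. The identifiers 1.1.1.1, PS00001 and PDOC00001 in the usage note are illustrative inputs.

```python
import re

# Base address of the ExPASy CGI scripts as given in these release notes.
EXPASY_CGI = "http://www.expasy.ch/cgi-bin"

# Identifier shapes assumed for this sketch (not formally stated in the notes).
EC_NUMBER = re.compile(r"\d+\.\d+\.\d+\.\d+")  # a.b.c.d
PROSITE_ENTRY = re.compile(r"PS\d{5}")         # PSxxxxx
PROSITE_DOC = re.compile(r"PDOC\d{5}")         # PDOCxxxxx


def _build(script: str, identifier: str, shape) -> str:
    """Check the identifier against its assumed shape and build the URL."""
    if not shape.fullmatch(identifier):
        raise ValueError(f"unexpected identifier: {identifier!r}")
    return f"{EXPASY_CGI}/{script}?{identifier}"


def nicezyme_url(ec_number: str) -> str:
    """Link to the NiceZyme view of an ENZYME entry."""
    return _build("nicezyme.pl", ec_number, EC_NUMBER)


def nicesite_url(entry: str) -> str:
    """Link to the NiceSite view of a PROSITE pattern or matrix entry."""
    return _build("nicesite.pl", entry, PROSITE_ENTRY)


def nicedoc_url(doc: str) -> str:
    """Link to the NiceDoc view of a PROSITE documentation entry."""
    return _build("nicedoc.pl", doc, PROSITE_DOC)
```

For instance, nicezyme_url("1.1.1.1") produces a URL ending in nicezyme.pl?1.1.1.1, while a malformed identifier such as "PS1" raises ValueError instead of yielding a broken link.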
Please also note that the URL of the top page of PROSITE has moved to:

   http://www.expasy.ch/prosite/

9. WE NEED YOUR HELP!

We welcome feedback from our users. We would especially appreciate being notified if you find that sequences belonging to your field of expertise are missing from the database. We would also like to be notified about annotations to be updated if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available. To facilitate this feedback we offer, on the ExPASy WWW server, a form that allows the submission of updates and/or corrections to SWISS-PROT:

   http://www.expasy.ch/sprot/sp_update_form.html

It is also possible, from any entry in SWISS-PROT displayed by the ExPASy server, to submit updates and/or corrections for that particular entry. Finally, you can also send your comments by electronic mail to the address:

   email@example.com

Note that since January 1999, all update requests are assigned a unique identifier of the form UR-Xnnnn (example: UR-A0123). This identifier is used internally by the SWISS-PROT staff at SIB and EBI to track the fate of requests and is also used in email exchanges with the persons who submitted a request.

10. JULY 1999 ANNOUNCEMENT: THE HUMAN PROTEOMICS INITIATIVE

In a few months the combined efforts of a number of sequencing centers and companies will produce a first draft of the human genome sequence. Such an endeavor is only a very preliminary step in the understanding of human biological processes. The first pitfall to overcome is the detection of all coding regions on the genomic sequence. Current algorithms, while being very powerful, are not capable of detecting all exons with certainty, are not well equipped to distinguish different splice variants and are unable to detect small proteins (which are numerous and crucial to many biological processes).
Even when all potential coding regions have been predicted, the user community will have at its disposal the sequences of 80000 to 100000 naked proteins. We call these proteins naked because genomic information does not allow the efficient prediction of all the post-translational modifications (PTM) of which the majority of proteins are the target. Proteins, once synthesized on the ribosomes, are subject to a multitude of modification steps. The complexity due to all these modifications is compounded by the high level of diversity that alternative splicing can produce at the level of sequence. Thus the number of different protein molecules expressed by the human genome is probably closer to a million than to the hundred thousand generally considered by genome scientists. Another factor of complexity to take into account is the amount of polymorphism at the protein sequence level. While some of these polymorphisms are linked to disease states, most are not, yet in many cases they have a direct or indirect effect on the activities of the proteins.

We are therefore initiating a major project to annotate all known human sequences according to the quality standards of SWISS-PROT. This means providing, for each known protein, a wealth of information that includes the description of its function, its domain structure, subcellular location, post-translational modifications, variants, similarities to other proteins, etc.

There are currently slightly more than 5400 annotated human sequences in SWISS-PROT. These entries are associated with about 14500 literature references, 16000 experimental or predicted PTMs, 800 splice variants and 8000 polymorphisms (most of which are linked with disease states). We will use the current information as the basis for what we call the Human Proteomics Initiative (HPI).

The HPI project contains a number of sub-components, which are briefly described below:

 - Annotation of all known human proteins.
   In the course of the next nine months (from July 1999 to end of March 2000) the human protein sequences that are not yet in SWISS-PROT will be fully annotated. We will also review and complete the annotation of the human sequences currently in SWISS-PROT. At the end of this nine-month period we expect to be complete and up-to-date, and thereafter to keep up with the appearance of new data relevant to human proteins.

 - Annotation of mammalian orthologs of human proteins.

   We will make sure that, for any human protein, existing orthologs in other mammalian species will also be annotated at a level equivalent to that of the cognate human sequence.

 - Annotation of all known human polymorphisms at the protein sequence level.

   As mentioned above, SWISS-PROT already holds information on a sizeable number of such polymorphisms, and it will significantly expand its effort to store and annotate all small variations at the protein level.

 - Annotation of all known post-translational modifications in human proteins.

   During the next nine months a major effort will be made to supplement the already quite comprehensive description of known post-translational modifications in human proteins currently provided in SWISS-PROT.

 - Tight links to structural information.

   SWISS-PROT is tightly linked to the PDB/RCSB 3D-structure database and already includes many features useful to structural biologists. These tight links will be further expanded by providing homology-derived models for all human proteins for which such an approach is scientifically relevant.

For all aspects of the HPI project, we would appreciate the help and collaboration of the scientific community. Information concerning the human proteome is highly critical to a large section of the life science community. We therefore appeal to the user community to fully participate in this initiative by providing all the information necessary to help and to speed up the comprehensive annotation of the human proteome.
The HPI project has two different time-related aspects: one is a nine-month "marathon" to catch up with the current state of research, the other is a long-term commitment to keep such a project alive as long as it is necessary.

For a detailed description of the HPI project and its current status please consult:

   http://www.expasy.ch/sprot/hpi/

11. JULY 1998 ANNOUNCEMENT: NEW SWISS-PROT FUNDING SCHEME

It has become obvious in recent years that the tremendous increase in data flow has created a requirement for resources which cannot be addressed in full by public funding. This is causing databases to fall behind the research. We believe that the only solution to the resource shortfall is to ask commercial users to participate by paying a license fee. No fee is or will be charged to academic users, nor will any restriction be imposed on their use or reuse of the data. These changes apply to both SWISS-PROT and PROSITE, but not to ENZYME.

A document fully describing the impact of this change for SWISS-PROT is available with the SWISS-PROT distribution files on FTP (sp_info.txt). You can also access the document, as well as other relevant ones, from:

   http://www.expasy.ch/announce/
   http://www.ebi.ac.uk/swissprot/Information/Announcement/announcement.html

If you do not have the time to read this document, the most important take-home message is that these changes do not have any impact on the way SWISS-PROT or PROSITE are accessed or redistributed. Academic users are not affected by these changes. Industrial end-users are also not directly affected as long as their employer pays the license fee. The same holds true for bioinformatics companies. Academic software or database developers as well as providers of database distribution services are only minimally affected by these changes. We hope to be able to keep the spirit of SWISS-PROT and PROSITE alive and at the same time ensure their long-term financial survival.
We sincerely hope and believe that in the next two years the only change that will matter will be the increase in scope and timeliness of the databases.

========================================================================

APPENDIX A: SOME STATISTICS

A.1 Amino acid composition

A.1.1 Composition in percent for the complete data bank

   Ala (A) 7.58   Gln (Q) 3.97   Leu (L) 9.43   Ser (S) 7.13
   Arg (R) 5.16   Glu (E) 6.36   Lys (K) 5.94   Thr (T) 5.67
   Asn (N) 4.44   Gly (G) 6.84   Met (M) 2.37   Trp (W) 1.24
   Asp (D) 5.27   His (H) 2.24   Phe (F) 4.10   Tyr (Y) 3.19
   Cys (C) 1.66   Ile (I) 5.81   Pro (P) 4.92   Val (V) 6.58

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01

A.1.2 Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe, Gln, Tyr, Met, His, Cys, Trp

A.2 Repartition of the sequences by their organism of origin

Total number of species represented in this release of SWISS-PROT: 6580

The first twenty species represent 37741 sequences: 47.2 % of the total number of entries.
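The ranking in A.1.2 and the 47.2 % figure in A.2 both follow directly from numbers given in these notes. The short Python sketch below reproduces them; the variable names are arbitrary, the percentages are copied from A.1.1 (the ambiguity codes B, Z and X are left out), and the total of 80000 entries is the figure given in the introduction of these notes.

```python
# Percent composition of the 20 standard amino acids, copied from A.1.1.
composition = {
    "Ala": 7.58, "Gln": 3.97, "Leu": 9.43, "Ser": 7.13,
    "Arg": 5.16, "Glu": 6.36, "Lys": 5.94, "Thr": 5.67,
    "Asn": 4.44, "Gly": 6.84, "Met": 2.37, "Trp": 1.24,
    "Asp": 5.27, "His": 2.24, "Phe": 4.10, "Tyr": 3.19,
    "Cys": 1.66, "Ile": 5.81, "Pro": 4.92, "Val": 6.58,
}

# A.1.2: amino acids ordered from most to least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)

# A.2: share of all entries contributed by the twenty best-represented species.
top20_sequences = 37741
total_entries = 80000
top20_share = round(100 * top20_sequences / total_entries, 1)

print(", ".join(ranking))
print(f"{top20_share} %")  # prints "47.2 %"
```

The printed ranking starts with Leu, Ala, Ser and ends with His, Cys, Trp, matching the order listed in A.1.2.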
A.2.1 Table of the frequency of occurrence of species

   Species represented   1x: 3122
                         2x: 1013
                         3x:  509
                         4x:  363
                         5x:  243
                         6x:  225
                         7x:  154
                         8x:  127
                         9x:  105
                        10x:   62
                    11- 20x:  304
                    21- 50x:  191
                    51-100x:   73
                      >100x:   89

A.2.2 Table of the most represented species

   ------ --------- --------------------------------------------
   Number Frequency Species
   ------ --------- --------------------------------------------
        1      5406 Homo sapiens (Human)
        2      4811 Saccharomyces cerevisiae (Baker's yeast)
        3      4516 Escherichia coli
        4      3549 Mus musculus (Mouse)
        5      2630 Rattus norvegicus (Rat)
        6      2069 Bacillus subtilis
        7      2002 Caenorhabditis elegans
        8      1698 Haemophilus influenzae
        9      1438 Schizosaccharomyces pombe (Fission yeast)
       10      1313 Methanococcus jannaschii
       11      1149 Bos taurus (Bovine)
       12      1088 Drosophila melanogaster (Fruit fly)
       13       928 Mycobacterium tuberculosis
       14       894 Gallus gallus (Chicken)
       15       821 Arabidopsis thaliana (Mouse-ear cress)
       16       729 Xenopus laevis (African clawed frog)
       17       727 Salmonella typhimurium
       18       699 Synechocystis sp. (strain PCC 6803)
       19       670 Sus scrofa (Pig)
       20       604 Oryctolagus cuniculus (Rabbit)
       21       490 Mycoplasma pneumoniae
       22       469 Mycoplasma genitalium
       23       446 Zea mays (Maize)
       24       403 Rhizobium sp. (strain NGR234)
       25       382 Helicobacter pylori (Campylobacter pylori)
       26       368 Pseudomonas aeruginosa
       27       337 Oryza sativa (Rice)
       28       308 Canis familiaris (Dog)
       29       296 Nicotiana tabacum (Common tobacco)
       30       292 Dictyostelium discoideum (Slime mold)
       31       277 Treponema pallidum
       32       272 Bacteriophage T4
       33       269 Ovis aries (Sheep)
                269 Mycobacterium leprae
       35       266 Borrelia burgdorferi (Lyme disease spirochete)
       36       263 Pisum sativum (Garden pea)
       37       255 Methanobacterium thermoautotrophicum
       38       253 Vaccinia virus (strain Copenhagen)
       39       239 Glycine max (Soybean)
       40       228 Staphylococcus aureus
       41       227 Neurospora crassa
       42       226 Hordeum vulgare (Barley)
       43       221 Candida albicans (Yeast)
       44       219 Porphyra purpurea
       45       216 Archaeoglobus fulgidus
       46       211 Lycopersicon esculentum (Tomato)
       47       209 Triticum aestivum (Wheat)
       48       205 Solanum tuberosum (Potato)
       49       204 Rhodobacter capsulatus (Rhodopseudomonas capsulata)
       50       199 Klebsiella pneumoniae
       51       196 Pseudomonas putida
       52       193 Human cytomegalovirus (strain AD169)
       53       192 Bacillus stearothermophilus
       54       186 Vaccinia virus (strain WR)
       55       172 Cavia porcellus (Guinea pig)
       56       170 Agrobacterium tumefaciens
       57       169 Spinacia oleracea (Spinach)
       58       159 Chlamydomonas reinhardtii
       59       158 Rhizobium meliloti
       60       154 Autographa californica nuclear polyhedrosis virus
       61       153 Emericella nidulans (Aspergillus nidulans)
       62       152 Mesocricetus auratus (Golden hamster)
       63       151 Marchantia polymorpha (Liverwort)
       64       150 Streptomyces coelicolor
                150 Equus caballus (Horse)
       66       148 Guillardia theta (Cryptomonas phi)
       67       147 Cyanophora paradoxa
       68       146 Variola virus
       69       142 Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
       70       139 Odontella sinensis
       71       134 Orgyia pseudotsugata multicapsid polyhedrosis virus
       72       133 Kluyveromyces lactis (Yeast)
       73       128 Brachydanio rerio (Zebrafish) (Zebra danio)
       74       127 Trypanosoma brucei brucei
                127 Synechococcus sp. (strain PCC 7942)
       76       126 Thermus aquaticus (subsp. thermophilus)
       77       120 Alcaligenes eutrophus
                118 Anabaena sp. (strain PCC 7120)
       79       116 Bombyx mori (Silk moth)
       80       115 Bradyrhizobium japonicum
       81       113 Yersinia enterocolitica
       82       112 Oncorhynchus mykiss (Rainbow trout) (Salmo gairdneri)
       83       111 Aquifex aeolicus
                108 Streptococcus pneumoniae
       85       107 Brassica napus (Rape)
       86       104 Neisseria gonorrhoeae
       87       103 Macaca mulatta (Rhesus macaque)
                103 Felis silvestris catus (Cat)
       89       102 Rhodobacter sphaeroides (Rhodopseudomonas sphaeroides)

A.3 Repartition of the sequences by size

   From   To  Number         From   To  Number
      1-  50    3213         1001-1100     722
     51- 100    6704         1101-1200     553
    101- 150    9719         1201-1300     377
    151- 200    7640         1301-1400     251
    201- 250    7202         1401-1500     210
    251- 300    6703         1501-1600     133
    301- 350    6294         1601-1700     117
    351- 400    6438         1701-1800      89
    401- 450    4831         1801-1900      94
    451- 500    4566         1901-2000      65
    501- 550    3444         2001-2100      37
    551- 600    2308         2101-2200      80
    601- 650    1801         2201-2300      75
    651- 700    1326         2301-2400      40
    701- 750    1159         2401-2500      42
    751- 800     956         >2500         232
    801- 850     762
    851- 900     798
    901- 950     552
    951-1000     467

A.4 Longest sequences

The longest sequences (>=4000 residues) are listed here:

   BACA_BACLI  5255     HTS1_COCCA  5217     MUC2_HUMAN  5179
   FAT_DROME   5147     RYNR_RABIT  5037     RYNR_PIG    5035
   RYNR_HUMAN  5032     RYNC_RABIT  4969     LRP_CAEEL   4753
   DYHC_DICDI  4725     PLEC_RAT    4687     LRP2_RAT    4660
   LRP2_HUMAN  4655     DYHC_RAT    4644     DYHC_DROME  4639
   DYHC_CAEEL  4568     DYHB_CHLRE  4568     APB_HUMAN   4563
   APOA_HUMAN  4548     LRP1_HUMAN  4544     LRP1_CHICK  4543
   DYHC_PARTE  4540     RRPA_CVMJH  4488     DYHG_CHLRE  4485
   DYHC_ANTCR  4466     DYHC_TRIGR  4466     GRSB_BACBR  4451
   PKSK_BACSU  4447     PKSL_BACSU  4427     PGBM_HUMAN  4393
   YP73_CAEEL  4385     DYHC_NEUCR  4367     DYHC_FUSSO  4349
   DYHC_EMENI  4344     PKD1_HUMAN  4303     DYHC_SCHPO  4196
   DYHC_YEAST  4092     RRPA_CVH22  4085     RRPL_DUGBV  4036

A.5 Statistics for journal citations

Total number of journals cited in this release of SWISS-PROT: 1011

A.5.1 Table of the frequency of journal citations

   Journals cited   1x:  381
                    2x:  130
                    3x:   84
                    4x:   46
                    5x:   39
                    6x:   23
                    7x:   15
                    8x:   15
                    9x:   14
                   10x:   14
               11- 20x:   75
               21- 50x:   71
               51-100x:   24
                 >100x:   80

A.5.2 List of the most cited journals in SWISS-PROT

   Nb Citations Journal abbreviation
   -- --------- ----------------------------------
    1      6683 J. Biol. Chem.
    2      4031 Proc. Natl. Acad. Sci. U.S.A.
    3      3434 Nucleic Acids Res.
    4      2868 J. Bacteriol.
    5      2714 Gene
    6      2162 FEBS Lett.
    7      2046 Eur. J. Biochem.
    8      1915 Biochem. Biophys. Res. Commun.
    9      1888 Biochemistry
   10      1788 EMBO J.
   11      1684 Nature
   12      1542 Biochim. Biophys. Acta
   13      1462 J. Mol. Biol.
   14      1321 Cell
   15      1240 Mol. Cell. Biol.
   16      1042 Genomics
   17       999 Mol. Gen. Genet.
   18       987 Plant Mol. Biol.
   19       956 Biochem. J.
   20       867 Science
   21       828 Mol. Microbiol.
   22       786 Virology
   23       714 J. Biochem.
   24       534 J. Virol.
   25       487 Yeast
   26       485 J. Cell Biol.
   27       465 Plant Physiol.
   28       465 J. Gen. Virol.
   29       437 Hum. Mol. Genet.
   30       427 Genes Dev.
   31       398 Hum. Mutat.
   32       371 J. Immunol.
   33       367 Arch. Biochem. Biophys.
   34       348 Infect. Immun.
   35       346 Oncogene
   36       336 Structure
   37       329 Curr. Genet.
   38       311 Mol. Biochem. Parasitol.
   39       307 FEMS Microbiol. Lett.
   40       307 Am. J. Hum. Genet.
   41       301 Nat. Genet.
   42       267 Development
   43       265 Biol. Chem. Hoppe-Seyler
   44       256 Microbiology
   45       252 J. Clin. Invest.
   46       250 Mol. Endocrinol.
   47       249 Nat. Struct. Biol.
   48       234 J. Mol. Evol.
   49       233 Hum. Genet.
   50       231 Genetics
   51       222 J. Gen. Microbiol.
   52       213 Hoppe-Seyler's Z. Physiol. Chem.
   53       206 DNA Cell Biol.
   54       204 Appl. Environ. Microbiol.
   55       196 Protein Sci.
   56       193 J. Exp. Med.
   57       193 Blood
   58       189 Dev. Biol.
   59       184 Neuron
   60       164 Immunogenetics
   61       152 DNA Seq.
   62       152 DNA
   63       151 Endocrinology
   64       140 Plant Cell
   65       132 Cancer Res.
   66       125 Plant J.
   67       119 Mol. Biol. Evol.
   68       118 Brain Res. Mol. Brain Res.
   69       117 Mech. Dev.
   70       117 J. Neurochem.
   71       117 Biochimie
   72       116 Hemoglobin
   73       116 Bioorg. Khim.
   74       115 Acta Crystallogr. D
   75       113 Comp. Biochem. Physiol.
   76       111 Virus Res.
   77       110 Agric. Biol. Chem.
   78       106 Mamm. Genome
   79       106 J. Neurosci.
   80       103 Biosci. Biotechnol. Biochem.
========================================================================

APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR DATABASES

The current status of the relationships (cross-references) between SWISS-PROT and some biomolecular databases is shown in the following schematic:

[Schematic: the SWISS-PROT Protein Sequence Data Bank at the center, connected by cross-reference arrows to the databases listed below. HSSP is also shown pointing to PDB.]

   EMBL Nucleotide Sequence Database [EBI]
   FlyBase
   MGD [Mouse]
   SubtiList [B.subtilis]
   GCRDb [7TM recep.]
   EcoGene [E.coli]
   Mendel [Plant]
   SGD [Yeast]
   MaizeDb [Zea mays]
   DictyDB [D.disco.]
   WormPep [C.elegans]
   ENZYME [Nomencl.]
   REBASE [Restriction enzymes]
   OMIM [Human]
   ECO2DBASE [2D]
   StyGene [S.typhimurium]
   Maize-2DPAGE [2D]
   TRANSFAC
   SWISS-2DPAGE [2D]
   Harefield [2D]
   Aarhus/Ghent [2D]
   PROSITE [Patterns and profiles]
   YEPD [Yeast] [2D]
   PDB [3D structures]
   HSSP [3D similar.]

=End=of=SWISS-PROT=release=38=notes=====================================