Swiss-Prot release 40.0

Published October 1, 2001

                                           SWISS-PROT Protein Knowledgebase
                                                              Release Notes
                                                   Release 40, October 2001

                             Table of contents

 1   Introduction
 2   Description of the changes made to SWISS-PROT since release 38
 3   Forthcoming changes
 4   Status of the documentation files
 5   The ExPASy World-Wide Web server
 6   TrEMBL - a supplement to SWISS-PROT
 7   FTP access to SWISS-PROT and TrEMBL
 9   We need your help!
 A   Appendix A

                             1   Introduction

Release 40.0 of SWISS-PROT contains 101'602 sequence entries, comprising
37'315'215 amino acids abstracted from 91'880 references. This represents
an increase of 18% over release 39. The growth of the data bank is
summarized below.

      Release     Date    Number of   Number of
                           entries   amino acids
        2.0      09/86      3'939       900'163
        3.0      11/86      4'160       969'641
        4.0      04/87      4'387     1'036'010
        5.0      09/87      5'205     1'327'683
        6.0      01/88      6'102     1'653'982
        7.0      04/88      6'821     1'885'771
        8.0      08/88      7'724     2'224'465
        9.0      11/88      8'702     2'498'140
        10.0     03/89     10'008     2'952'613
        11.0     07/89     10'856     3'265'966
        12.0     10/89     12'305     3'797'482
        13.0     01/90     13'837     4'347'336
        14.0     04/90     15'409     4'914'264
        15.0     08/90     16'941     5'486'399
        16.0     11/90     18'364     5'986'949
        17.0     02/91     20'024     6'524'504
        18.0     05/91     20'772     6'792'034
        19.0     08/91     21'795     7'173'785
        20.0     11/91     22'654     7'500'130
        21.0     03/92     23'742     7'866'596
        22.0     05/92     25'044     8'375'696
        23.0     08/92     26'706     9'011'391
        24.0     12/92     28'154     9'545'427
        25.0     04/93     29'955    10'214'020
        26.0     07/93     31'808    10'875'091
        27.0     10/93     33'329    11'484'420
        28.0     02/94     36'000    12'496'420
        29.0     06/94     38'303    13'464'008
        30.0     10/94     40'292    14'147'368
        31.0     02/95     43'470    15'335'248
        32.0     11/95     49'340    17'385'503
        33.0     02/96     52'205    18'531'384
        34.0     10/96     59'021    21'210'389
        35.0     11/97     69'113    25'083'768
        36.0     07/98     74'019    26'840'295
        37.0     12/98     77'977    28'268'293
        38.0     07/99     80'000    29'085'965
        39.0     05/00     86'593    31'411'114
        40.0     10/01    101'602    37'315'215

    2   Description of the changes made to SWISS-PROT since release 38

The name of the database changed from 'SWISS-PROT protein sequence
database' to 'SWISS-PROT knowledgebase' to emphasize the fact that
SWISS-PROT collects far more than just information on protein sequences
and that it is a central linking and linked database which connects the
various findings in the diverse fields of proteomics research.

We apologize that, due to technical problems, we never posted the release
notes of release 39. Therefore this document describes not only the changes
that took place since release 39 but also those between releases 38 and 39.

     2.1   Sequences and annotations

15'184 sequences have been added since release 39, the sequence data of
2'908 existing entries has been updated and the annotations of 44'684
entries have been revised. With this release SWISS-PROT has passed the
symbolic mark of 100 thousand entries.

     2.2   The HPI project

The Human Proteomics Initiative (HPI) has been introduced to put a major
effort into the annotation of all known human sequences according to the
quality standards of SWISS-PROT. This means that, for each known protein, a
wealth of information is provided, which includes the description of its
function, its domain structure, subcellular location, posttranslational
modifications, variants, similarities to other proteins, etc. This implies
not only the annotation of newly detected proteins, but also the
integration of new research data into the existing entries by specialized
biologists, who are in close contact with experts all over the world.

There are currently 7'471 annotated human sequences in SWISS-PROT. These
entries are associated with 19'922 literature references, 18'974
experimental or predicted PTMs, 1'697 splice variants and 12'061
polymorphisms (most of which are linked with disease states).

Simultaneously, two further efforts have been stepped up: human diseases
associated with deficiency(ies) in a protein are described, and mammalian
orthologs of human proteins are annotated at a level equivalent to that of
the cognate human sequences.

For all aspects of the HPI projects, we would appreciate the help and
collaboration of the scientific community. Information concerning the human
proteome is highly critical to a large section of the life science
community. We therefore appeal to the user community to fully participate
in this initiative by providing all the necessary information to help and
to speed up the comprehensive annotation of the human proteome.

For a detailed description of the HPI project and its current status please

     2.3   The HAMAP project

The first complete microbial genomic sequence was that of the bacterium
Haemophilus influenzae, which became available in 1995. Since then more
than 50 bacterial and archaeal genomes have been sequenced and many more
sequencing projects of pathogenic as well as nonpathogenic microbes are in
progress. To date, the publicly available microbial genomes collectively
encode more than 100'000 different proteins.

In order to handle the large amount of "raw" data coming from the microbial
genomic sequencing, the High quality Automated Microbial Annotation of
Proteomes (HAMAP) project was initiated. The latter aims to automatically
annotate a significant percentage of proteins which originate from
microbial genome sequencing projects.

To maintain a high quality of annotation, specific tools are
developed to deal with two completely separate subsets of bacterial and
archaeal proteins: proteins that have no recognizable similarity to any
other microbial or non-microbial proteins ("ORFans") and proteins that are
part of well-defined families or subfamilies. This is done by using a rule
system that describes the level and extent of annotations that can be
assigned by similarity with a prototype manually-annotated entry. The
result is a curated entry whose quality is identical to that produced
manually by an expert annotator.

The programs in development are designed to recognize protein
peculiarities, and only proteins which match the defined criteria will be
processed automatically. Protein sequences which fail to fit into that rule
system will be further analyzed by SWISS-PROT expert annotators.

For a detailed description of the HAMAP project and its current status
please consult:

     2.4   What's happening with the model organisms?

We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:

   * be as complete as possible. All sequences available at a given time
     should be immediately included in SWISS-PROT. This also includes
     sequence corrections and updates;
   * provide a higher level of annotation;
   * provide cross-references to specialized database(s) that contain,
     among other data, some genetic information about the genes that code
     for these proteins;
   * provide specific indices or documents.

Our efforts to annotate human sequence entries as completely as possible
gave rise to the HPI project (see 2.2), and the bacterial model organisms
became part of the HAMAP project (see 2.3). Here is the current status of
the model organisms which are not covered by these two projects:

      Organism        Database          Index file     Number of
                      cross-references                 sequences
      ------------    ----------------  -------------- ---------
      A.thaliana      None yet          In preparation     1'409
      C.albicans      None yet          CALBICAN.TXT         256
      C.elegans       Wormpep           CELEGANS.TXT       2'184
      D.discoideum    DictyDB           DICTY.TXT            311
      D.melanogaster  FlyBase           FLY.TXT            1'514
      M.musculus      MGD               MGDTOSP.TXT        4'816
      S.cerevisiae    SGD               YEAST.TXT          4'859
      S.pombe         None yet          POMBE.TXT          1'782

     2.5   Progress in the conversion of SWISS-PROT to mixed-case

We are gradually converting SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'. The line-types that have been converted between release 38
and 40 are: DE (DEscription), most RC (Reference Comment) topics (SPECIES,
TISSUE, PLASMID and TRANSPOSON) and DR (Database cross-Reference). The new
OX line (Organism cross-reference; see section 2.8) and the new CC topics
PHARMACEUTICAL and BIOTECHNOLOGY (described in section 2.11) have been
introduced in mixed case. The CC topic MASS SPECTROMETRY has been converted
to mixed case. As described in section 3.5, the process of converting all
of SWISS-PROT to mixed case continues.

     2.6   Extension of the accession number system

With the creation of the TrEMBL database and the rapid increase in the
amount of sequence data, we were faced with a problem of availability of
accession numbers. We used a system based on a one-letter prefix followed
by 5 digits. This system was also used by the nucleotide sequence databases
which had originally reserved for SWISS-PROT the prefix letters 'O', 'P'
and 'Q'. Having run out of space (due mainly to EST's), the nucleotide
sequence databases have been forced to choose a new format, which became a
two-letter prefix followed by 6 digits.

We have now used up all possible numbers with 'O', 'P' and 'Q'. As we
believe that changing the format of the accession numbers to that used now
by the nucleotide database would have created havoc on the numerous
software packages using SWISS-PROT, we decided to keep a system of
accession numbers based on a 6-character code, but with the following
format extension:

  1       2     3          4          5          6
  [O,P,Q] [0-9] [A-Z, 0-9] [A-Z, 0-9] [A-Z, 0-9] [0-9]

What the above means is that we kept a 6-character code, but that in
positions 3, 4 and 5 of this code any combination of letters and numbers
can be present. This format allows a total of 14 million accession numbers
(compared with only 300'000 with the former system).

We only allow numbers in positions 2 and 6 so that SWISS-PROT accession
numbers cannot be mistaken for gene names, acronyms, other types of
accession numbers or any kind of word!

Examples: P0A3S2, Q2ASD4, O13YX2, P9B123.
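
The format extension described above can be checked mechanically. The
following Python sketch is an illustration only and is not part of the
SWISS-PROT distribution; the names AC_PATTERN and is_valid_accession are
hypothetical. It encodes the six positions as a regular expression and
recomputes the size of the accession number space for both schemes.

```python
import re

# Position 1: O, P or Q; position 2: digit; positions 3-5: letter or digit;
# position 6: digit.
AC_PATTERN = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(ac: str) -> bool:
    """Return True if ac matches the extended 6-character format."""
    return AC_PATTERN.fullmatch(ac) is not None


# Number of possible accession numbers under each scheme.
old_space = 3 * 10 ** 5            # O, P or Q followed by 5 digits
new_space = 3 * 10 * 36 ** 3 * 10  # 36 = 26 letters + 10 digits

for ac in ["P0A3S2", "Q2ASD4", "O13YX2", "P9B123", "P16070"]:
    print(ac, is_valid_accession(ac))
print(old_space, new_space)
```

Accession numbers in the former format (such as P16070) remain valid under
the extended pattern, since digits are still allowed in positions 3 to 5.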

     2.7   Multiple AC lines

Starting from release 39, there can be more than one AC (ACcession) line
per SWISS-PROT entry. Strictly speaking this was not a format change and
the SWISS-PROT user's manual always indicated that there could be more than
one AC line per entry. Until recently, a single line was sufficient and the
majority of entries contained only a single accession number. But, in the
process of providing an optimally non-redundant database, we are merging
information from TrEMBL entries into SWISS-PROT entries. When we merge a
TrEMBL entry to a SWISS-PROT one, we add to the latter the accession
number(s) of the TrEMBL entry. The repetition of such a process sometimes
produces an accession number list which can no longer fit in a single AC
line. Therefore there are now some entries with two, three (as shown below)
or more AC lines.

AC   P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;
AC   Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;
AC   Q16208; Q16522;
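
Software that reads only the first AC line will miss accession numbers in
such entries. As a minimal sketch (the function name parse_ac_lines is
hypothetical, and the entry is assumed to be available as a list of lines),
all AC lines of an entry can be collected as follows:

```python
def parse_ac_lines(lines):
    """Collect the accession numbers from all AC lines of one entry, in order."""
    accessions = []
    for line in lines:
        if line.startswith("AC   "):
            # Accession numbers are separated and terminated by semicolons.
            for item in line[5:].split(";"):
                item = item.strip()
                if item:
                    accessions.append(item)
    return accessions


entry_lines = [
    "AC   P16070; P22511; Q04858; Q13419; Q13957; Q13958; Q13959; Q13960;",
    "AC   Q13961; Q13967; Q13968; Q13980; Q15861; Q16064; Q16065; Q16066;",
    "AC   Q16208; Q16522;",
]
accessions = parse_ac_lines(entry_lines)
print(len(accessions), accessions[0], accessions[-1])
```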

     2.8   Introduction of the new line type OX: Organism taxonomy

The OX (Organism taXonomy cross-reference) line has been introduced to
indicate the identifier to a specific organism in a taxonomic database. The
number of taxonomic codes is identical to the number of species given in
the OS line. There can be more than one OX line in an entry and its format
is:

OX   Taxonomy-database_Qualifier=Taxonomic code[, Taxonomic code...];

There are cross-references to the taxonomic database of NCBI, which is
associated with the qualifier 'TaxID' and a one- to six-digit taxonomic
code.

Examples of its usage:

OX   NCBI_TaxID=10116;

OX   NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,
OX   9031, 8355, 7227, 7213, 7108, 7130;
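
Because the list of taxonomic codes may continue over several OX lines, a
parser has to join the lines before splitting. The sketch below is one
possible way to do this; parse_ox_lines is a hypothetical helper, and it
assumes a single qualifier per entry, as in the examples above.

```python
def parse_ox_lines(lines):
    """Return (qualifier, [taxonomic codes]) from the OX lines of one entry."""
    # Join the content of all OX lines, dropping the 5-character line prefix.
    text = " ".join(line[5:].strip() for line in lines if line.startswith("OX   "))
    text = text.rstrip(";")
    qualifier, _, codes = text.partition("=")
    return qualifier, [int(code) for code in codes.split(",") if code.strip()]


ox_lines = [
    "OX   NCBI_TaxID=9606, 10090, 9913, 9823, 10141, 10029, 10030, 10116, 9986,",
    "OX   9031, 8355, 7227, 7213, 7108, 7130;",
]
qualifier, tax_ids = parse_ox_lines(ox_lines)
print(qualifier, len(tax_ids))
```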

     2.9   Changes concerning the RC line

We are gradually implementing controlled vocabularies for the different
type of RC tokens. To complement the tissue list (TISSLIST.TXT), we have
now added a plasmid list (PLASMID.TXT) and are in the process of creating a
strain list. Controlled vocabularies are part of the SWISS-PROT
documentation files that are all described in section 4.

     2.10   Changes concerning the RX line

The RX line format changed, and it now provides identifiers also to the
bibliographic database PubMed.

The old format was:

RX   MEDLINE; unique_identifier.

The new format is:


Example of RX lines:

RX   PubMed=9145897;
RX   MEDLINE=79012484; PubMed=358200;
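
Programs that must read entries from before and after this change can
accept both layouts. The following sketch is illustrative only;
parse_rx_line is a hypothetical name, and the old-format input used in the
usage lines is constructed from the template above rather than taken from a
real entry.

```python
def parse_rx_line(line):
    """Return {database: identifier} for an RX line in either format."""
    body = line[5:].strip()
    if "=" in body:
        # New layout, as in the examples: 'Database=identifier;' items.
        refs = {}
        for item in body.split(";"):
            item = item.strip()
            if item:
                database, _, identifier = item.partition("=")
                refs[database] = identifier
        return refs
    # Old layout: 'MEDLINE; unique_identifier.'
    database, _, identifier = body.partition(";")
    return {database.strip(): identifier.strip().rstrip(".")}


print(parse_rx_line("RX   MEDLINE=79012484; PubMed=358200;"))
print(parse_rx_line("RX   MEDLINE; 79012484."))  # constructed old-format line
```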

     2.11   Introduction of two new CC line topics: BIOTECHNOLOGY and
            PHARMACEUTICAL

We have introduced two new 'topics' for the comments (CC) line type.

The topic 'BIOTECHNOLOGY' has been introduced to describe the use of a
specific protein in the biotechnological industry. This topic contains the
name(s) of the company(ies) that produce the protein or the genetically
manipulated organism as well as a short description of the biotechnological
function of the protein. The brand name(s) under which a protein is
available are added, if applicable.

Examples of the usage:

CC   -!- BIOTECHNOLOGY: Introduced by genetic manipulation and
CC       expressed in improved ripening tomato by Monsanto. ACC is the
CC       immediate precursor of the phytohormone ethylene who is
CC       involved in the control of ripening. ACC deaminase reduces
CC       ethylene biosynthesis and thus extend the shelf life of fruits
CC       and vegetables.

CC   -!- BIOTECHNOLOGY: Used in the food industry for high temperature
CC       liquefaction of starch-containing mashes and in the detergent
CC       industry to remove starch. Sold under the name Termamyl by
CC       Novozymes.

The topic 'PHARMACEUTICAL' has been introduced to describe the use of a
specific protein as a pharmaceutical drug. The information provided by such
a topic will include the brand name(s) under which a protein is available,
the name(s) of the company(ies) that produce it as well as a short
description of the therapeutic usage of the protein. It should be noted
that any entries containing such a comment field will also be tagged with
the keyword 'Pharmaceutical'.

Examples of the usage:

CC   -!- PHARMACEUTICAL: Available under the names Avonex (Biogen),
CC       Betaseron (Berlex) and Rebif (Serono). Used in the treatment
CC       of multiple sclerosis (MS). Betaseron is a slightly modified
CC       form of IFNB1 with two residue substitutions.

CC   -!- PHARMACEUTICAL: Available under the name Proleukin (Chiron).
CC       Used in patients with renal cell carcinoma or metastatic
CC       melanoma.

     2.12   Cleaning up of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.

The two sub-formats of the topic ALTERNATIVE PRODUCTS:

CC   -!- ALTERNATIVE PRODUCTS:  isoforms;  (shown here),
CC       ,  and ; are produced by alternative splicing.
CC       [Comment.]

CC   -!- ALTERNATIVE PRODUCTS:  isoforms;  (shown here),
CC        and ; are produced by alternative
CC       initiation. [Comment.]


CC   -!- ALTERNATIVE PRODUCTS: At least 5 isoforms; 1 (shown here), 2, 3, 4
CC       and 5; are produced by alternative splicing. They differ in their
CC       acetylcholine receptor clustering activity.

CC   -!- ALTERNATIVE PRODUCTS: 3 isoforms; TRAC-2 (shown here), TRAC-3 and
CC       TRAC-4; are produced by alternative initiation.

We are gradually cleaning up the comment line topic SIMILARITY. To describe
the similarity of the protein to a protein family, we use the following
format:

CC   -!- SIMILARITY: Belongs to the <family_name>[. <sub-family_name>].


CC   -!- SIMILARITY: Belongs to the 14-3-3 family.

CC   -!- SIMILARITY: Belongs to the glucosamine/galactosamine-6-phosphate
CC       isomerase family. 6-phosphogluconolactonase subfamily.

To describe conserved domains within a protein sequence, we use the
following format:

CC   -!- SIMILARITY: Contains n <domain_name>.


CC   -!- SIMILARITY: Contains 10 HEAT repeats.
CC   -!- SIMILARITY: Contains 1 FKBP-type PPIase domain.
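
The standardized SIMILARITY wording lends itself to pattern matching. The
sketch below is an illustration under the assumption that a comment follows
one of the two formats shown above exactly; the names BELONGS, CONTAINS and
parse_similarity are hypothetical, and comments that have not yet been
converted are returned unparsed.

```python
import re

BELONGS = re.compile(
    r"Belongs to the (?P<family>.+?) family\.(?: (?P<subfamily>.+?) subfamily\.)?"
)
CONTAINS = re.compile(r"Contains (?P<count>\d+) (?P<domain>.+?)\.")


def parse_similarity(cc_lines):
    """Classify one SIMILARITY comment given as a list of its CC lines."""
    # Drop the 'CC   ' prefix and join continuation lines with single spaces.
    text = " ".join(line[5:].strip() for line in cc_lines)
    prefix = "-!- SIMILARITY: "
    if not text.startswith(prefix):
        return None
    body = text[len(prefix):]
    match = BELONGS.fullmatch(body)
    if match:
        return ("family", match.group("family"), match.group("subfamily"))
    match = CONTAINS.fullmatch(body)
    if match:
        return ("domain", int(match.group("count")), match.group("domain"))
    return ("other", body)


print(parse_similarity(["CC   -!- SIMILARITY: Belongs to the 14-3-3 family."]))
print(parse_similarity(["CC   -!- SIMILARITY: Contains 10 HEAT repeats."]))
```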

     2.13   Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the following databases:

     2.13.1   GlycoSuiteDB

GlycoSuiteDB is a database of glycan structures (see Cooper C.A.,
Harrison M.J., Wilkins M.R. and Packer N.H.; Nucleic Acids Res.
29:332-335(2001)). The identifiers of the appropriate DR line are:

 Data bank identifier: GlycoSuiteDB
 Primary identifier:   GlycoSuiteDB unique identifier for a glycoprotein,
                       which is identical to the SWISS-PROT primary AC
                       number of that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   GlycoSuiteDB; P05067; -.

     2.13.2   SMART

The Simple Modular Architecture Research Tool (SMART) is a database of
functional sites (see Schultz J., Copley R.R., Doerks T., Ponting C.P. and
Bork P.; Nucleic Acids Res. 28:231-234(2000)). The cross-references for
this database are composed of the following items:

 Data bank identifier: SMART
 Primary identifier:   SMART unique identifier for a domain.
 Secondary identifier: Abbreviation for the name of a domain or module.
 Fourth item:          Number of hits of the domain in the entry.
 Example:              DR   SMART; SM00370; LRR; 6.

     2.13.3   Leproma

For the Mycobacterium leprae genome database Leproma, the following
information is available in the DR line:

 Data bank identifier: Leproma
 Primary identifier:   Leproma unique identifier for an ORF.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   Leproma; ML0485; -.

     2.13.4   MEROPS

MEROPS is the protease database (see Rawlings N.D. and Barrett A.J.;
Nucleic Acids Res. 28:323-325(2000)). The following information is
available in the two qualifiers of the DR line:

 Data bank identifier: MEROPS
 Primary identifier:   The MEROPS unique identifier for a peptidase.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   MEROPS; M41.001; -.

     2.13.5   MypuList

For the Mycoplasma pulmonis genome database MypuList, the following
information is available in the two identifiers of the DR line:

 Data bank identifier: MypuList
 Primary identifier:   The MypuList unique identifier for an ORF.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   MypuList; MYPU_4900; -.

     2.13.6   ProDom

Cross-references to the ProDom protein domain database used to be provided
as implicit links; links are now also available as explicit links:

 Data bank identifier:  ProDom
 Primary identifier:    The ProDom unique identifier for a domain.
 Secondary identifier:  The ProDom entry name.
 Fourth item:           Number of hits of the domain in the entry.
 Example for an         DR   ProDom; PD000600; 14-3-3; 1.
 explicit link:

     2.13.7   ANU-2DPAGE

For the Australian National University Two-Dimensional Polyacrylamide Gel
Electrophoresis Database (ANU-2DPAGE; see Imin N., Kerim T., Weinman J.J.
and Rolfe B.G.; Proteomics 1:1149-1161(2001)), the following information is
available in the DR line:

 Data bank identifier: ANU-2DPAGE
 Primary identifier:   ANU-2DPAGE unique identifier, which is identical to
                       the SWISS-PROT primary AC number of that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   ANU-2DPAGE; Q9XEA8; -.

     2.13.8   COMPLUYEAST-2DPAGE

For the two-dimensional polyacrylamide gel electrophoresis database at
Universidad Complutense de Madrid (COMPLUYEAST-2DPAGE), the following
information is available in the DR line:

 Data bank identifier: COMPLUYEAST-2DPAGE
 Primary identifier:   COMPLUYEAST-2DPAGE unique identifier, which is
                       identical to the SWISS-PROT primary AC number of
                       that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   COMPLUYEAST-2DPAGE; P43067; -.

     2.13.9   PHCI-2DPAGE

The cross-references for the Parasite Host Cell Interaction 2D-PAGE
database (PHCI-2DPAGE) are composed of the following items:

 Data bank identifier: PHCI-2DPAGE
 Primary identifier:   PHCI-2DPAGE unique identifier, which is identical to
                       the SWISS-PROT primary AC number of that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   PHCI-2DPAGE; Q9Z6V3; -.

     2.13.10   PMMA-2DPAGE

For the Purkyne Military Medical Academy 2D-PAGE database (PMMA-2DPAGE),
the identifiers of the appropriate DR line are:

 Data bank identifier: PMMA-2DPAGE
 Primary identifier:   PMMA-2DPAGE unique identifier, which is identical to
                       the SWISS-PROT primary AC number of that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   PMMA-2DPAGE; Q01995; -.

     2.13.11   Siena-2DPAGE

For the 2D-PAGE database from the Department of Molecular Biology,
University of Siena, Italy, the components of the corresponding DR line
are:

 Data bank identifier: Siena-2DPAGE
 Primary identifier:   Siena-2DPAGE unique identifier, which is identical
                       to the SWISS-PROT primary AC number of that protein.
 Secondary identifier: None; a dash '-' is stored in that field.
 Example:              DR   Siena-2DPAGE; P01591; -.
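
All of the DR lines listed in section 2.13 share one layout: a data bank
identifier, a primary identifier, a secondary identifier (or a dash), and
in some cases a fourth item. As an illustrative sketch only (parse_dr_line
is a hypothetical name, and only the layouts shown above are assumed), such
lines can be split as follows:

```python
def parse_dr_line(line):
    """Split a DR line into (database, primary, secondary, extra items)."""
    body = line[5:].strip()
    # Remove only the terminating period; identifiers such as 'M41.001'
    # contain periods of their own.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    database, primary, secondary = fields[0], fields[1], fields[2]
    if secondary == "-":
        secondary = None  # a dash marks an empty secondary identifier
    return database, primary, secondary, fields[3:]


print(parse_dr_line("DR   SMART; SM00370; LRR; 6."))
print(parse_dr_line("DR   MEROPS; M41.001; -."))
```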

     2.14   Introduction of a new FT key: SE_CYS

Selenocysteine is the 21st 'natural' amino acid. It is now known to occur
in several prokaryotic and eukaryotic proteins. Its mRNA codon is UGA,
which usually serves as a stop codon; in the presence of a specific
downstream sequence forming a loop and of a specific translational
elongation factor, however, it is recognized as the site of selenocysteine
incorporation into proteins.

The joint nomenclature committee of the IUPAC/IUBMB officially recommended
a three-letter and a one-letter symbol for selenocysteine, namely 'Sec' and
'U'.

Introducing a new one-letter code in the sequence records would have
disrupted most, if not all, sequence analysis software. We therefore decided
to change, in SWISS-PROT, the rules used to annotate the presence of
selenocysteine residues in sequence entries in the manner described below.

Previously, selenocysteines were stored in the sequence records using the
one-letter symbol 'C' for cysteine and were indicated in the feature table
(FT) by a
line of the type:

FT   BINDING       x      x       SELENIUM.

The one-letter code has not been changed (for the reason explained above),
but we introduced a specific feature key (SE_CYS) to indicate the presence
of a selenocysteine at a given sequence position. The above example has
therefore been changed to:

FT   SE_CYS       x       x

We also want to remind users that the keyword 'Selenocysteine' continues
to be used to tag sequence entries that contain at least one such residue.

     2.15   Introduction of feature identifiers to the feature keys

We have introduced unique and stable feature identifiers (FTId) which allow
links to be constructed directly from position-specific annotation in the
feature table to specialized protein-related databases. Examples are
databases specialized in certain types of posttranslational modifications
of proteins, or in mutations. The FTId is always the last component in the
feature description.

     2.15.1   Feature identifiers in FT VARIANT lines of human sequence
              entries

The feature identifiers in the FT VARIANT lines of human sequence entries
make it possible to refer to a sequence variation and serve as anchors for
specifically directed links. A federated single human mutation database
(HmutDB) has been proposed, and the complete set of all FT VARIANT lines
has been indexed for SRS at EBI under the name SWISSCHANGE. The database
SWISSCHANGE can be queried by SWISS-PROT FTIds.

The format of FT VARIANT lines with feature identifiers is:

FT   VARIANT       x      x        Description.
FT                                 /FTId=VAR_number.


FT   VARIANT       3      3        A -> L.
FT                                 /FTId=VAR_000001.

     2.15.2   Feature identifiers in FT CARBOHYD lines

The same principle is used to further enhance the links to GlycoSuiteDB, an
annotated database of glycan structures (see section 2.13.1). In addition
to the explicit global link in the DR line, we create unique feature
identifiers for each of the FT CARBOHYD lines, which will allow direct
access to the glycan structure.

The format of FT CARBOHYD lines with feature identifiers is:

FT   CARBOHYD      x        x       Description.
FT                                  /FTId=CAR_number.


FT   CARBOHYD    251      251       N-LINKED (GLCNAC...).
FT                                  /FTId=CAR_000070.
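
Since the FTId is carried on a continuation line, a reader of the feature
table has to attach it to the preceding feature. The sketch below shows one
way to do this; it is an illustration only, the names FTID and
parse_features are hypothetical, and it handles only plain numeric
positions as in the examples above.

```python
import re

FTID = re.compile(r"/FTId=((?:VAR|CAR)_\d+)\.")


def parse_features(ft_lines):
    """Return one dict per feature, with its FTId if a /FTId= line follows."""
    features = []
    for line in ft_lines:
        parts = line[2:].split(None, 3)
        if len(parts) >= 3 and parts[1].isdigit() and parts[2].isdigit():
            # A new feature: key, start position, end position, description.
            features.append({
                "key": parts[0],
                "start": int(parts[1]),
                "end": int(parts[2]),
                "description": parts[3].strip() if len(parts) > 3 else "",
                "ftid": None,
            })
        elif features:
            # A continuation line: look for the feature identifier.
            match = FTID.search(line)
            if match:
                features[-1]["ftid"] = match.group(1)
    return features


ft_lines = [
    "FT   VARIANT       3      3        A -> L.",
    "FT                                 /FTId=VAR_000001.",
    "FT   CARBOHYD    251      251       N-LINKED (GLCNAC...).",
    "FT                                  /FTId=CAR_000070.",
]
features = parse_features(ft_lines)
for feature in features:
    print(feature["key"], feature["start"], feature["ftid"])
```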

     2.16   Change in the syntax of the SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and
gives a quick summary of its content. The format of the SQ line was:


The last information item in the SQ line was a 32-bit CRC (Cyclic
Redundancy Check) value which is computed from the sequence. As the number
of available sequences is increasing rapidly, there are now a few cases
where two sequences share the same CRC32 (but none where they also share
the same molecular weight 'MW' or number of amino acids 'AA'). To address
this issue we replaced the 32-bit CRC value by a 64-bit CRC. The format of
the SQ line therefore changed to:



SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;
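
Programs that read the SQ line need to accept the longer checksum. The
following sketch is illustrative only: SQ_PATTERN and parse_sq_line are
hypothetical names, the pattern is derived from the example line above, and
the CRC64 algorithm itself is not reproduced here.

```python
import re

SQ_PATTERN = re.compile(
    r"SQ\s+SEQUENCE\s+(?P<length>\d+) AA;\s+(?P<mw>\d+) MW;\s+"
    r"(?P<crc>[0-9A-F]+) CRC64;"
)


def parse_sq_line(line):
    """Return the length, molecular weight and CRC64 value of an SQ line."""
    match = SQ_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError("not an SQ line with a CRC64 value: %r" % line)
    return {
        "length": int(match.group("length")),
        "mw": int(match.group("mw")),
        "crc64": match.group("crc"),
    }


sq = parse_sq_line("SQ   SEQUENCE   233 AA;  25630 MW;  146A1B48A1475C86 CRC64;")
print(sq["length"], sq["mw"], sq["crc64"])
```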

                          3   Forthcoming changes

     3.1   Version of SP in XML format

A distribution version of SWISS-PROT and TrEMBL in XML format is being
developed. The specifications of this new format will be described when it
is first implemented in TrEMBL.

     3.2   Extension of the entry name format

We endeavor to assign meaningful entry names that facilitate the
identification of the proteins and the species of origin concerning an
entry. Currently the entry name consists of up to ten uppercase
alphanumeric characters. SWISS-PROT uses a general purpose naming
convention that can be symbolized as X_Y, where X is a mnemonic code of at
most 4 alphanumeric characters representing the protein name, the '_' sign
serves as a separator, and the Y is a mnemonic species identification code
of at most 5 alphanumeric characters representing the biological source of
the protein.

We are planning to elongate the mnemonic code for the protein name from up
to 4 characters to up to 5 characters. E.g. the mnemonic code for the
meiotic recombination protein rec10 is currently 'RE10'. After the
introduction of extended entry names it could be modified to the 5-letter
code 'REC10'.
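
Software that validates entry names against the current X_Y convention will
need to allow for the longer protein mnemonic. As a sketch only (the
pattern names and split_entry_name are hypothetical, and 'SPECI' is a
placeholder species code, not a real one), the current and planned formats
can be expressed as:

```python
import re

# X_Y with X of at most 4 (current) or 5 (planned) alphanumeric characters
# and Y of at most 5 alphanumeric characters.
CURRENT_NAME = re.compile(r"[A-Z0-9]{1,4}_[A-Z0-9]{1,5}")
PLANNED_NAME = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")


def split_entry_name(name, pattern=CURRENT_NAME):
    """Return (protein mnemonic, species code), or raise ValueError."""
    if pattern.fullmatch(name) is None:
        raise ValueError("invalid entry name: %r" % name)
    protein, species = name.split("_")
    return protein, species


print(split_entry_name("GSA_ECOLI"))
print(split_entry_name("REC10_SPECI", PLANNED_NAME))  # placeholder species code
```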

     3.3   Multiple RP lines

Starting with release 41, there can be more than one RP (Reference
Position) line per reference in a SWISS-PROT entry. The RP line describes
the extent of the work carried out by the authors of the reference, e.g.
molecule type that has been sequenced, the characterization of the protein,
characterization of PTMs, analysis of the protein structure, detection of
variants, etc.

As the number of experimental results per publication has increased over
the years, a single RP line per reference has more and more often been
insufficient to hold all the information in a consistent format. We
therefore decided to allow multiple RP lines.



could become


     3.4   Cleaning up of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would
like the majority of the information stored to be usable by computer
programs (while being human-readable). We are therefore standardizing the
format of the topics.

We are gradually cleaning up the comment line topic PATHWAY. To describe
the biochemical pathway in which the protein is involved, we use the
following format:

CC   -!- PATHWAY: biochemical pathway; nth step[. Comment].


CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.

The comment line topic COFACTOR will be modified gradually to the following
format:

CC   -!- COFACTOR: cofactor1[, cofactor2 and cofactor3][. Comment].


CC   -!- COFACTOR: Magnesium.
CC   -!- COFACTOR: Copper, Manganese, and Nickel.
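
Once these topics follow a fixed format, their parts can be extracted
programmatically. The sketch below is illustrative only; parse_pathway and
parse_cofactor are hypothetical names, each takes the text that follows the
topic name, and comments that do not yet follow the new formats are not
handled.

```python
import re

PATHWAY = re.compile(
    r"(?P<pathway>[^;]+); (?P<step>\w+) step(?:\. (?P<comment>.+))?\."
)


def parse_pathway(text):
    """Split 'pathway; nth step[. Comment].' into (pathway, step, comment)."""
    match = PATHWAY.fullmatch(text)
    if match is None:
        return None
    return match.group("pathway"), match.group("step"), match.group("comment")


def parse_cofactor(text):
    """Split 'cofactor1[, cofactor2 and cofactor3][. Comment].' into parts."""
    names, _, comment = text.rstrip(".").partition(". ")
    # Names are separated by commas and/or the word 'and'.
    cofactors = [c for c in re.split(r",\s*(?:and\s+)?|\s+and\s+", names) if c]
    return cofactors, comment or None


print(parse_pathway("Coenzyme A (CoA) biosynthesis; first step."))
print(parse_cofactor("Copper, Manganese, and Nickel."))
```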

     3.5   Continuation of the conversion of SWISS-PROT to mixed-case

We will continue to convert SWISS-PROT entries from all 'UPPER CASE' to
'MiXeD CaSe'. In release 41 we are planning to convert the GN (Gene Name)
line, the RC (Reference Comment) line topic STRAIN, and the CC (Comment)
line.

Here is an example of what a SWISS-PROT entry will look like in release 41:

ID   GSA_ECOLI      STANDARD;      PRT;   426 AA.
AC   P23893; P78277;
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1997 (Rel. 35, Last sequence update)
DT   01-MAR-2002 (Rel. 41, Last annotation update)
DE   Glutamate-1-semialdehyde 2,1-aminomutase (EC (GSA)
DE   (Glutamate-1-semialdehyde aminotransferase) (GSA-AT).
GN   hemL or gsa or popC or B0154.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Escherichia.
OX   NCBI_TaxID=562;
RN   [1]
RX   MEDLINE=91155920; PubMed=1900346;
RA   Grimm B., Bull A., Breu V.;
RT   "Structural genes of glutamate 1-semialdehyde aminotransferase for
RT   porphyrin synthesis in a cyanobacterium and Escherichia coli.";
RL   Mol. Gen. Genet. 225:1-10(1991).
RN   [2]
RC   STRAIN=K12 / W3110;
RX   MEDLINE=94261430; PubMed=8202364;
RA   Fujita N., Mori H., Yura T., Ishihama A.;
RT   "Systematic sequencing of the Escherichia coli genome: analysis of
RT   the 2.4-4.1 min (110,917-193,643 bp) region.";
RL   Nucleic Acids Res. 22:1637-1639(1994).
RN   [3]
RC   STRAIN=K12 / MG1655;
RX   MEDLINE=97426617; PubMed=9278503;
RA   Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA   Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA   Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA   Mau B., Shao Y.;
RT   "The complete genome sequence of Escherichia coli K-12.";
RL   Science 277:1453-1474(1997).
RN   [4]
RA   Schramm S., Duncan M., Allen E., Araujo R., Aparicio A., Chung E.,
RA   Davis K., Federspiel N., Hyman R., Kalman S., Komp C., Kurdi O.,
RA   Lashkari D., Lew H., Lin D., Namath A., Oefner P., Roberts D.,
RA   Davis R.W.;
RL   Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases.
RN   [5]
RX   MEDLINE=91258321; PubMed=2045363;
RA   Ilag L.L., Jahn D., Eggertsson G., Soell D.;
RT   "The Escherichia coli hemL gene encodes glutamate 1-semialdehyde
RT   aminotransferase.";
RL   J. Bacteriol. 173:3408-3413(1991).
RN   [6]
RX   MEDLINE=92353044; PubMed=1643048;
RA   Ilag L.L., Jahn D.;
RT   "Activity and spectroscopic properties of the Escherichia coli
RT   glutamate 1-semialdehyde aminotransferase and the putative active
RT   site mutant K265R.";
RL   Biochemistry 31:7143-7151(1992).
CC   -!- CATALYTIC ACTIVITY: (S)-4-amino-5-oxopentanoate =
CC       5-aminolevulinate.
CC   -!- PATHWAY: Porphyrin biosynthesis by the C5 pathway; second step.
DR   EMBL; X53696; CAA37734.1; -.
DR   EMBL; D26562; CAB20274.1; -.
DR   EMBL; AE000125; AAC73265.1; -.
DR   EMBL; U70214; AAB08584.1; -.
DR   PIR; S13327; S13327.
DR   PIR; S45223; S45223.
DR   HSSP; P24630; 2GSA.
DR   EcoGene; EG10432; hemL.
DR   InterPro; IPR000954; Aminotran_3.
DR   Pfam; PF00202; aminotran_3; 1.
KW   Porphyrin biosynthesis; Isomerase; Pyridoxal phosphate;
KW   Complete proteome.
FT   MUTAGEN     265    265       K->R: 2% OF WILD-TYPE ACTIVITY.
FT   CONFLICT      2      2       S -> R (IN REF. 1 AND 2).
FT   CONFLICT      9      9       S -> Q (IN REF. 1 AND 2).
SQ   SEQUENCE   426 AA;  45366 MW;  BED817E100468CF2 CRC64;
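
The entry above can be processed mechanically because every line starts
with a two-letter line code. The following is a minimal sketch, not an
official parser: it assumes that the code occupies the first two columns
and that the content starts at column 6, as in the example shown. The
function name 'group_by_line_code' and the shortened sample are
introduced here for illustration only.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue  # skip blank lines and any terminator line
        code, content = line[:2], line[5:].rstrip()
        grouped[code].append(content)
    return dict(grouped)


# A shortened copy of the GSA_ECOLI example, keeping only a few line types.
sample = """\
ID   GSA_ECOLI      STANDARD;      PRT;   426 AA.
AC   P23893; P78277;
OS   Escherichia coli.
OX   NCBI_TaxID=562;
RC   STRAIN=K12 / W3110;
RC   STRAIN=K12 / MG1655;
KW   Porphyrin biosynthesis; Isomerase; Pyridoxal phosphate;
KW   Complete proteome.
"""

fields = group_by_line_code(sample)

# Accession numbers are separated by semicolons on the AC line.
accessions = [ac.strip() for ac in " ".join(fields["AC"]).split(";") if ac.strip()]

# Strain names follow the STRAIN= topic on RC lines.
strains = [
    rc.split("=", 1)[1].rstrip(";")
    for rc in fields["RC"]
    if rc.startswith("STRAIN=")
]

print(accessions)
print(strains)
```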

                   4   Status of the documentation files

SWISS-PROT is distributed with a large number of documentation files. Some
of these files have been available for a long time (the user manual,
release notes, the various indices for authors, citations, keywords, etc.),
but many have been created recently and we are continuously adding new
files, and updating and modifying existing files. Please note that the
header in many documentation files has changed. The following table lists all
the documents that are currently available.

See also section 7.3 for information on how to access updated versions of
all documents in-between major releases.

 USERMAN.TXT    User manual
 RELNOTES.TXT   Release notes for the current release (40)
 SHORTDES.TXT   Short description of entries in SWISS-PROT [see 1]

 JOURLIST.TXT   List of cited journals
 KEYWLIST.TXT   List of keywords
 PLASMID.TXT    List of plasmids [see 2]
 SPECLIST.TXT   List of organism (species) identification codes
 TISSLIST.TXT   List of tissues
 EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
 DBXREF.TXT     List of databases cross-referenced in SWISS-PROT [see 2]
 SUBMIT.TXT     Submission of sequence data to SWISS-PROT

 ACINDEX.TXT    Accession number index
 AUTINDEX.TXT   Authors index
 CITINDEX.TXT   Citation index
 KEYINDEX.TXT   Keywords index
 SPEINDEX.TXT   Species index
 DELETEAC.TXT   Deleted accession number index

 7TMRLIST.TXT   List of 7-transmembrane G-linked receptor entries [see 1]
 AATRNASY.TXT   List of aminoacyl-tRNA synthetases
 ALLERGEN.TXT   Nomenclature and index of allergen sequences
 ANNBIOCH.TXT   SWISS-PROT annotation: how is biochemical information
                assigned to sequence entries
 BLOODGRP.TXT   Blood group antigen proteins
 CALBICAN.TXT   Index of Candida albicans entries and their corresponding
                gene designations
 CDLIST.TXT     CD nomenclature for surface proteins of human leucocytes
 CELEGANS.TXT   Index of Caenorhabditis elegans entries and their
                corresponding gene designations and WormPep
 DICTY.TXT      Index of Dictyostelium discoideum entries and their
                corresponding gene designations and DictyDB
 EC2DTOSP.TXT   Index of Escherichia coli Gene-protein database
                (ECO2DBASE) entries referenced in SWISS-PROT
 ECOLI.TXT      Index of Escherichia coli strain K12 chromosomal entries
                and their corresponding EcoGene cross-references
 EMBLTOSP.TXT   Index of EMBL Nucleotide Sequence Database entries
                referenced in SWISS-PROT
 EXTRADOM.TXT   Nomenclature of extracellular domains
 FLY.TXT        Index of Drosophila entries and their corresponding
                FlyBase cross-references
 GLYCOSID.TXT   Classification of glycosyl hydrolase families and index of
                glycosyl hydrolase entries in SWISS-PROT
 HAEINFLU.TXT   Index of Haemophilus influenzae strain Rd chromosomal
                entries
 HOXLIST.TXT    Vertebrate homeotic Hox proteins: nomenclature and index
 HPYLORI.TXT    Index of Helicobacter pylori strain 26695 chromosomal
                entries
 HUMCHR01.TXT   Index of proteins encoded on human chromosome 1 [see 2]
 HUMCHR02.TXT   Index of proteins encoded on human chromosome 2 [see 2]
 HUMCHR03.TXT   Index of proteins encoded on human chromosome 3 [see 2]
 HUMCHR04.TXT   Index of proteins encoded on human chromosome 4 [see 2]
 HUMCHR05.TXT   Index of proteins encoded on human chromosome 5 [see 2]
 HUMCHR06.TXT   Index of proteins encoded on human chromosome 6 [see 2]
 HUMCHR07.TXT   Index of proteins encoded on human chromosome 7 [see 2]
 HUMCHR08.TXT   Index of proteins encoded on human chromosome 8 [see 2]
 HUMCHR09.TXT   Index of proteins encoded on human chromosome 9 [see 2]
 HUMCHR10.TXT   Index of proteins encoded on human chromosome 10 [see 2]
 HUMCHR11.TXT   Index of proteins encoded on human chromosome 11 [see 2]
 HUMCHR12.TXT   Index of proteins encoded on human chromosome 12 [see 2]
 HUMCHR13.TXT   Index of proteins encoded on human chromosome 13
 HUMCHR14.TXT   Index of proteins encoded on human chromosome 14 [see 2]
 HUMCHR15.TXT   Index of proteins encoded on human chromosome 15 [see 2]
 HUMCHR16.TXT   Index of proteins encoded on human chromosome 16
 HUMCHR17.TXT   Index of proteins encoded on human chromosome 17
 HUMCHR18.TXT   Index of proteins encoded on human chromosome 18
 HUMCHR19.TXT   Index of proteins encoded on human chromosome 19
 HUMCHR20.TXT   Index of proteins encoded on human chromosome 20
 HUMCHR21.TXT   Index of proteins encoded on human chromosome 21
 HUMCHR22.TXT   Index of proteins encoded on human chromosome 22
 HUMCHRX.TXT    Index of proteins encoded on human chromosome X
 HUMCHRY.TXT    Index of proteins encoded on human chromosome Y
 HUMPVAR.TXT    Index of human proteins with sequence variants
 INITFACT.TXT   List and index of translation initiation factors
 INTEIN.TXT     Index of intein-containing entries referenced in
                SWISS-PROT [see 2]
 METALLO.TXT    Classification of metallothioneins and index of the
                entries in SWISS-PROT
 MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT
 MGENITAL.TXT   Index of Mycoplasma genitalium strain G-37 chromosomal
                entries
 MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
 MJANNASC.TXT   Index of Methanococcus jannaschii entries
 NGR234.TXT     Table of predicted proteins in Rhizobium plasmid pNGR234a
 NOMLIST.TXT    List of nomenclature related references for proteins
 PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries
 PDBTOSP.TXT    Index of Protein Data Bank (PDB) entries referenced in
                SWISS-PROT
 PEPTIDAS.TXT   Classification of peptidase families and index of
                peptidase entries in SWISS-PROT
 PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
 POMBE.TXT      Index of Schizosaccharomyces pombe entries and their
                corresponding gene designations
 RESTRIC.TXT    List of restriction enzyme and methylase entries
 RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                basis of sequence similarities
 RPROWAZE.TXT   Index of Rickettsia prowazekii strain Madrid E entries
                [see 2]
 SALTY.TXT      Index of Salmonella typhimurium strain LT2 chromosomal
                entries and their corresponding StyGene cross-references
 SUBTILIS.TXT   Index of Bacillus subtilis strain 168 chromosomal entries
                and their corresponding SubtiList cross-references
 UPFLIST.TXT    UPF (Uncharacterized Protein Families) list and index of
 YEAST.TXT      Index of Saccharomyces cerevisiae entries in SWISS-PROT
                and their corresponding gene designations
 YEAST1.TXT     Yeast Chromosome I entries
 YEAST2.TXT     Yeast Chromosome II entries
 YEAST3.TXT     Yeast Chromosome III entries
 YEAST5.TXT     Yeast Chromosome V entries
 YEAST6.TXT     Yeast Chromosome VI entries
 YEAST7.TXT     Yeast Chromosome VII entries
 YEAST8.TXT     Yeast Chromosome VIII entries
 YEAST9.TXT     Yeast Chromosome IX entries
 YEAST10.TXT    Yeast Chromosome X entries
 YEAST11.TXT    Yeast Chromosome XI entries
 YEAST13.TXT    Yeast Chromosome XIII entries
 YEAST14.TXT    Yeast Chromosome XIV entries


 1   The '7TMRLIST.TXT' and 'SHORTDES.TXT' files have been converted to
     mixed-case characters.
 2   The files marked [see 2] in the table above, such as 'PLASMID.TXT'
     and 'RPROWAZE.TXT', are new documents introduced since release 38.

We have continued to include in some SWISS-PROT documentation files the
references of Web sites relevant to the subject under consideration. There
are now 89 documents that include such links.

                   5   The ExPASy World-Wide Web server

     5.1   Background information

The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use the
World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy server was
made available to the public in September 1993 and is reachable at the
following address:

The ExPASy WWW server allows access, using the user-friendly hypertext
SWISS-3DIMAGE and CD40Lbase databases. And, through any SWISS-PROT protein
sequence entry, to other databases such as EMBL, Eco2DBASE, EcoCyc,
EcoGene, FlyBase, GCRDb, GlycoSuiteDB, MaizeDB, OMIM, PDB, HSSP, Pfam,
ProDom, REBASE, SGD, SubtiList, TRANSFAC, YPD, ZFIN and Medline. ExPASy
also offers many tools for the analysis of protein sequences and 2D gels.

There are currently five mirror sites of ExPASy, i.e. exact copies of the
server. The ExPASy mirror sites are located in:

           at the Australian Proteome Analysis Facility (APAF), Sydney
           at the Canadian Bioinformatics Resource (CBR), Halifax
           at the Center of Bioinformatics, Peking University, Beijing
           at the Yonsei Proteome Research Center
           at the National Health Research Institutes (NHRI), Taipei

Explicit general and continuously updated documentation about the ExPASy
server is available at

     5.2   Swiss-Shop

We provide, on ExPASy, a service called Swiss-Shop. Swiss-Shop is an
automated sequence alerting system which allows users to obtain, by email,
new sequence entries relevant to their field(s) of interest. Every week,
the new sequences entered in SWISS-PROT are automatically compared with all
the criteria that have been defined by the users. If a sequence corresponds
to the selection criteria defined by a user, that sequence is sent by
electronic mail. Various criteria can be combined:

   * By entering one or more words that should be present in the
     description line;
   * By entering one or more species name(s) or taxonomic division(s);
   * By entering one or more keywords;
   * By entering one or more author names;
   * By entering the accession number (or entry name) of a PROSITE pattern
     or a user-defined sequence pattern. In this case, all new SWISS-PROT
     entries matching this pattern will be reported;
   * By entering the accession number (or entry name) of an existing
     SWISS-PROT entry or by entering a 'private' sequence. In this case,
     all new SWISS-PROT entries similar to that sequence will be reported.
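
To illustrate how several of the criteria listed above might be combined,
here is a small sketch. It is not the Swiss-Shop implementation: the
'Criteria' class, the 'matches' function and the dictionary layout of an
entry are hypothetical, and we assume that combined criteria must all be
satisfied (a logical AND), which the description above does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Criteria:
    """Hypothetical container for three of the criteria listed above."""
    description_words: set = field(default_factory=set)
    species: set = field(default_factory=set)
    keywords: set = field(default_factory=set)


def matches(entry, criteria):
    """Return True if the entry satisfies every non-empty criterion.

    `entry` is a plain dict with 'description', 'species' and 'keywords'
    keys; this layout is an assumption made for the example.
    """
    description = entry["description"].lower()
    # Every requested word must occur in the description line.
    if not all(word.lower() in description for word in criteria.description_words):
        return False
    # The species must be one of the requested species, if any were given.
    if criteria.species and entry["species"] not in criteria.species:
        return False
    # At least one requested keyword must be present, if any were given.
    if criteria.keywords and not (criteria.keywords & set(entry["keywords"])):
        return False
    return True


entry = {
    "description": "Glutamate-1-semialdehyde 2,1-aminomutase",
    "species": "Escherichia coli",
    "keywords": ["Porphyrin biosynthesis", "Isomerase", "Pyridoxal phosphate"],
}
wanted = Criteria(description_words={"aminomutase"}, species={"Escherichia coli"})
print(matches(entry, wanted))
```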

     5.3   What is new on ExPASy

ExPASy is constantly modified and improved. If you wish to be informed of
the changes made to the server, you can either:

   * Read the document 'History of changes, improvements and new features'
     which is available at the address:
   * Subscribe to Swiss-Flash, a service that reports news of databases,
     software and service developments. By subscribing to this service, you
     will automatically get Swiss-Flash bulletins by electronic mail. To
     subscribe, use the address:

Among all the improvements and the new features introduced since the last
SWISS-PROT release, here are those that we believe are specifically useful
to SWISS-PROT users:

     1. A new and improved version of the NiceProt view of SWISS-PROT is
     available and offers the following new features: a link to a
     printer-friendly view of a SWISS-PROT entry, display of the length of
     certain features in the FT lines, and access to a new tool, the
     'Feature aligner', which allows users to select features for
     submission to the ClustalW multiple alignment program.

     2. SWISS-PROT release statistics are now available for every update of
     the database.
     Among other parameters, statistics about database growth, average
     sequence lengths and amino acid composition, taxonomic origin, journal
     citations and database cross-references are presented, including some

     3. A new view is available within the SRS Sequence Retrieval System.
     It displays, for each protein corresponding to a user query, gene
     name(s) and organism (in addition to the parameters ID, AC,
     description and sequence length which are displayed by the default
     view "Short description"). This new view is entitled "Long
     description" and is available from the menu "Use view ..." in the SRS
     query form.

     4. The SIB Blast interface (accessible also via "Quick BLAST" or from
     the bottom of every SWISS-PROT/TrEMBL entry) now offers the
     possibility to restrict the similarity search by using taxonomic
     criteria. A "Taxonomic View" of the results can also be obtained via
     the BLAST result page. The user can also select a number of matching
     sequences and directly submit them to a ClustalW search, or retrieve
     and download the corresponding SWISS-PROT/TrEMBL entries. An
     alternative view of the results, NiceBlast, is available, which
     consists of an html table, detailing complete descriptions of all
     matching proteins, including the full protein name, gene name,
     sequence length and organism.

     5. Explicit cross-references have been implemented between SWISS-PROT
     and BLOCKS, GlycoSuiteDB, InterPro, Leproma, MEROPS, MypuList, SMART,
     and Siena-2DPAGE. Implicit links have been added to the resources DIP,
     GeneCensus, GeneLynx, HUGE and NucleaRDB.

     6. A new tool has been added to the ExPASy suite of proteomics tools:
     FindPept. It can identify
     peptides that result from unspecific cleavage of proteins from their
     experimental masses, taking into account artefactual chemical
     modifications, post-translational modifications (PTM) and protease
     autolytic cleavage. This new tool has been closely integrated with the
     other proteomics tools on ExPASy, such as PeptIdent and FindMod.

     7. The Sulfinator is a newly
     developed tool to predict tyrosine sulfation sites for a protein
     sequence, using four different Hidden Markov Models (HMM).

     8. Sequences of alternatively spliced isoforms of the same protein are
     documented in the feature table of that protein sequence record. In
     collaboration with the SWISS-PROT group at EBI, a program
     has been written to generate additional records from SWISS-PROT and
     TrEMBL, one for each splice isoform of each protein. The resulting
     data sets for SWISS-PROT and TrEMBL are available on the ExPASy ftp
     server, along with a more
     detailed description of the project and information on how to obtain a
     local copy of the program.

     The additional isoform entries have been added to the
     SWISS-PROT/TrEMBL databases underlying the BLAST server at SIB
     Switzerland, ScanProsite, and PeptIdent. Gradually, all other tools on
     ExPASy will be modified to handle splice isoforms. The NiceProt view
     of SWISS-PROT/TrEMBL provides links from the isoform name in the
     feature table (example: Q01432) to a page displaying the sequence of
     the corresponding isoform.

     9. In the framework of the HAMAP project (see section 2.3), several
     new features and tools have been implemented on ExPASy:
        o The keyword "Complete Proteome" has been introduced to all
          SWISS-PROT/TrEMBL entries describing a protein which is thought
          to be expressed by an organism whose genome has been completely
          sequenced. This keyword is so far only used for microbial
          (bacterial and archaeal) proteins. A complete set of proteins
          from a microbial genome can therefore be obtained using this
          keyword across SWISS-PROT and TrEMBL.
        o We provide clean non-redundant SWISS-PROT/TrEMBL data sets for
          all completely sequenced microbial genomes. These files are
          available on the ExPASy ftp server in SWISS-PROT and Fasta
          format, and can also be used for similarity searches on the SIB
          Blast server ("microbial proteomes").
        o A Genomic Proximity Viewer is available for those microbial
          genomes where an ORF numbering system exists. For those
          organisms, it is possible to click on the ORF name in the
          SWISS-PROT/TrEMBL GN lines to obtain a list of proteins encoded
          by genes in proximity. The tool is also accessible from the HAMAP
          complete proteome pages of those organisms. Example: Borrelia

     10. A year ago we launched Protein Spotlight, a periodical review
     centered on a specific protein or group of proteins. It is published
     on a monthly basis. You can subscribe to receive each issue, free of
     charge, in HTML or PDF format.
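
The keyword-based selection described in item 9 can be sketched as
follows. This is only an illustration: it assumes that keywords appear on
KW lines separated by semicolons, with a final period, as in the
GSA_ECOLI example of section 3. The comparison ignores case because the
keyword is written both "Complete Proteome" and "Complete proteome" in
this document. The function names and the entry 'EXAMPLE_ENTRY' are
hypothetical.

```python
def keywords_of(entry_text):
    """Collect the keywords found on the KW lines of one entry, lower-cased."""
    kw_text = " ".join(
        line[5:].strip()
        for line in entry_text.splitlines()
        if line.startswith("KW   ")
    )
    return {kw.strip().rstrip(".").lower() for kw in kw_text.split(";") if kw.strip()}


def is_complete_proteome(entry_text):
    """Return True if the entry carries the 'Complete proteome' keyword."""
    return "complete proteome" in keywords_of(entry_text)


with_keyword = (
    "ID   GSA_ECOLI\n"
    "KW   Porphyrin biosynthesis; Isomerase; Pyridoxal phosphate;\n"
    "KW   Complete proteome.\n"
)
without_keyword = (
    "ID   EXAMPLE_ENTRY\n"  # hypothetical entry without the keyword
    "KW   Isomerase.\n"
)

print(is_complete_proteome(with_keyword))
print(is_complete_proteome(without_keyword))
```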

                   6   TrEMBL - a supplement to SWISS-PROT

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
SWISS-PROT. Since we do not want to dilute the quality standards of
SWISS-PROT by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation of
new incoming data indefinitely. But as we also want to make the sequences
available as fast as possible, we have introduced with SWISS-PROT a
computer annotated supplement. This supplement consists of entries in
SWISS-PROT-like format derived from the translation of all coding sequences
(CDS) in the EMBL nucleotide sequence database, except those already
included in SWISS-PROT.

This supplement is named TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT release
is supplemented by TrEMBL release 18.

TrEMBL is available by FTP from the EBI and ExPASy servers in the directory
'databases/trembl'. It can be queried on WWW by the EBI and ExPASy SRS
servers. It is distributed with its own set of release notes.

                  7   FTP access to SWISS-PROT and TrEMBL

     7.1   Generalities

SWISS-PROT is available for download on the following anonymous FTP
servers:

 Organization Swiss Institute of Bioinformatics (SIB)
 Directory    /databases/swiss-prot/

 Organization European Bioinformatics Institute (EBI)
 Directory    /pub/databases/swissprot/

     7.2   Non-redundant database

We distribute, on the ExPASy and EBI FTP servers, files that make up a
non-redundant (see further) and complete protein sequence database
consisting of three components:

1) SWISS-PROT
2) TrEMBL
3) New entries to be later integrated into TrEMBL (hereafter known as
   TrEMBL_New)

Every week three files are completely rebuilt. These files are named:
sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their
'.gz' extension, these are gzip-compressed files which, when decompressed,
will produce ASCII files in SWISS-PROT format.
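
These files do not need to be decompressed on disk before use. The sketch
below reads a gzip-compressed file entry by entry with the Python
standard library. It assumes that each entry ends with a line starting
with '//', which is the usual SWISS-PROT convention but is not shown in
these notes. The function name 'iter_entries' is introduced for
illustration.

```python
import gzip


def iter_entries(path):
    """Yield each entry of a gzip-compressed flat file as a list of lines."""
    current = []
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("//"):
                # Assumed end-of-entry marker: emit what was collected so far.
                if current:
                    yield current
                current = []
            else:
                current.append(line)
    # Emit a trailing entry that lacks a terminator line, if any.
    if current:
        yield current


# Typical use, assuming sprot.dat.gz has been downloaded to the working
# directory:
#
#     for entry in iter_entries("sprot.dat.gz"):
#         print(entry[0])   # the ID line of each entry
```

Reading the file as a stream keeps memory use low, since only one entry
is held at a time.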

Three other files are also available (sprot.fas.gz, trembl.fas.gz and
trembl_new.fas.gz) which are compressed 'fasta' format sequence files
useful for building the databases used by FASTA, BLAST and other sequence
similarity search programs. Please do not use these files for any other
purpose, as you will lose all annotations by using this very 'primitive'
format.
The files for the non-redundant database are stored in the directory
'/databases/sp_tr_nrdb' on the ExPASy FTP server and in the directory
'/pub/databases/sp_tr_nrdb' on the EBI FTP server.

Additional notes:

   * The SWISS-PROT file continuously grows as new annotated sequences are
     added.

   * The TrEMBL file decreases in size as sequences are moved out of that
     section after being annotated and moved into SWISS-PROT. Four times a
     year a new release of TrEMBL is built at EBI; at this point the TrEMBL
     file increases in size as it then includes all of the new data (see
     next section) that has accumulated since the last release.

   * The TrEMBL_New file starts as a very small file and grows in size
     until a new release of TrEMBL is available.

   * SWISS-PROT and TrEMBL share the same system of accession numbers.
     Therefore you will not find any primary accession number duplicated
     between the two sections. A TrEMBL entry (and its associated accession
     number(s)) can either move to SWISS-PROT as a new entry or be merged
     with an existing SWISS-PROT entry. In the latter case, the accession
     number(s) of that TrEMBL entry are added to those of the SWISS-PROT
     entry.

   * TrEMBL_New does not have real accession numbers. However it was
     necessary to have an 'AC' line so as to be able to use it with
     different software products. This AC line contains a temporary
     identifier which consists of the protein_ID (protein sequence
     identifier) of the coding sequence in the parent nucleotide sequence.

   * TrEMBL_New is quite messy! You will of course find new sequence
     entries but you will also encounter sequences that are going to be
     used to update existing TrEMBL or SWISS-PROT entries. None of the
     "cleaning" steps that are applied to produce a TrEMBL release are run
     on TrEMBL_New nor are any of the computer-annotation software tools
     that are used to enhance the information content of TrEMBL. TrEMBL_New
     is provided only so that users can be sure not to miss any important
     new sequences when they run similarity searches.

   * While these three files allow you to build what we call a
     'non-redundant' database, it must be noted that this description is
     not entirely accurate. Without going into a long explanation, we can
     say that this is currently the best attempt at providing a complete
     selection of protein sequence entries while trying to eliminate
     redundancies. While SWISS-PROT is completely (well 99.994% !)
     non-redundant, TrEMBL is far from being non-redundant and the
     combination of SWISS-PROT + TrEMBL is even less so.

   * To describe to your users the version of the non-redundant database
     that you are providing them with, you should use a statement of the
     form:

          SWISS-PROT release 40.0 of 17-Oct-2001;
          TrEMBL release 18.0 of 22-Oct-2001;
          TrEMBL_New of 22-Oct-2001.

     7.3   Weekly updates of SWISS-PROT documents

Until now, the ExPASy FTP server only offered FTP access to the SWISS-PROT
documents and indexes in their versions at the time of the last full
release. All documents are now updated with every weekly release of
SWISS-PROT. They are available for FTP download from the directory

     7.4   Weekly updates of SWISS-PROT

Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are generated at each update:

 new_seq.dat Contains all the new entries since the last full
             release;

 upd_seq.dat Contains the entries for which the sequence data has
             been updated since the last release;

 upd_ann.dat Contains the entries for which one or more annotation
             fields have been updated since the last release.
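
One possible way to bring a local copy of the last full release up to
date with these three files is sketched below. It assumes that each
update file holds complete entries which simply replace (or add to) the
local ones, keyed by the first accession number on the AC line. The
function names and the placeholder entry texts are hypothetical.

```python
def primary_accession(entry_text):
    """Return the first accession number on the entry's first AC line."""
    for line in entry_text.splitlines():
        if line.startswith("AC   "):
            return line[5:].split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_update(local_entries, *update_batches):
    """Return a copy of local_entries with every updated entry swapped in.

    local_entries maps a primary accession number to the entry text; each
    batch is an iterable of entry texts read from one update file.
    """
    merged = dict(local_entries)
    for batch in update_batches:
        for entry_text in batch:
            merged[primary_accession(entry_text)] = entry_text
    return merged


# Placeholder texts standing in for real entries.
local = {"P23893": "ID   GSA_ECOLI\nAC   P23893; P78277;\nCC   (old annotation)\n"}
new_seq = ["ID   NEW_ENTRY\nAC   X00001;\n"]  # hypothetical new entry
upd_seq = []  # no sequence updates in this example
upd_ann = ["ID   GSA_ECOLI\nAC   P23893; P78277;\nCC   (new annotation)\n"]

current = apply_update(local, new_seq, upd_seq, upd_ann)
print(sorted(current))
```

Because the files described above are relative to the last full release,
the sketch always starts from that release rather than from a previously
updated copy.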

Important notes

   * Although we try to follow a regular schedule, we do not promise to
     update these files every week. In most cases two weeks may elapse
     between two updates.
   * Instead of using the above files, you can, every week, download an
     updated copy of the SWISS-PROT database. This file is available in the
     directory containing the non-redundant database (see section 7.2).

                          8   ENZYME and PROSITE

     8.1   The ENZYME nomenclature database

Release 27.0 of the ENZYME nomenclature database is distributed with
release 40 of SWISS-PROT. ENZYME release 27.0 contains information relative
to 3'870 enzymes. In this release, we have added a significant number of
new entries and we also updated many entries.

     8.2   The PROSITE database

Release 17.0 of the PROSITE database will be available in a few weeks.
PROSITE will now come with its own set of release notes.

                          9   We need your help!

We welcome feedback from our users. We would especially appreciate it if
you notified us when sequences belonging to your field of expertise are
missing from the database. We would also like to be notified about
annotations to be updated, if, for example, the function of a protein has
been clarified or if new information about post-translational modifications
has become available. To facilitate this feedback we offer, on the ExPASy
WWW server, a form that allows the submission of updates and/or corrections

It is also possible, from any entry in SWISS-PROT displayed by the ExPASy
server, to submit updates and/or corrections for that particular entry.
Finally, you can also send your comments by electronic mail to the address:

Note that all update requests are assigned a unique identifier of the
form UR-Xnnnn (example: UR-A0123). This identifier is used internally by
the SWISS-PROT staff at SIB and EBI to track the fate of requests and is
also used in email exchanges with the persons who submitted a request.
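
If you need to recognize such identifiers in your own correspondence
tools, a pattern along the following lines may help. It is a guess based
on the single example given: we assume that 'X' stands for one uppercase
letter and 'nnnn' for exactly four digits. The function name is
hypothetical.

```python
import re

# Assumed shape: "UR-", one uppercase letter, four digits (e.g. UR-A0123).
UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text):
    """Return True if text has the assumed UR-Xnnnn shape."""
    return UPDATE_REQUEST_ID.fullmatch(text) is not None


print(is_update_request_id("UR-A0123"))
print(is_update_request_id("UR-0123"))
```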

                       APPENDIX A:   Some statistics

     A.1   Amino acid composition

     A.1.1   Composition in percent for the complete database

   Ala (A) 7.61   Gln (Q) 3.93   Leu (L) 9.53   Ser (S) 7.08
   Arg (R) 5.19   Glu (E) 6.47   Lys (K) 5.97   Thr (T) 5.58
   Asn (N) 4.36   Gly (G) 6.85   Met (M) 2.37   Trp (W) 1.21
   Asp (D) 5.25   His (H) 2.24   Phe (F) 4.10   Tyr (Y) 3.16
   Cys (C) 1.63   Ile (I) 5.85   Pro (P) 4.89   Val (V) 6.61

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01

     A.1.2   Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
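
The ranking above follows directly from the percentages in A.1.1. As a
small check, the sketch below sorts the twenty standard amino acids by
decreasing percentage; the values are copied from the table and the
variable names are ours.

```python
# Percentages copied from table A.1.1 (B, Z and X are left out).
composition = {
    "Ala": 7.61, "Arg": 5.19, "Asn": 4.36, "Asp": 5.25, "Cys": 1.63,
    "Gln": 3.93, "Glu": 6.47, "Gly": 6.85, "His": 2.24, "Ile": 5.85,
    "Leu": 9.53, "Lys": 5.97, "Met": 2.37, "Phe": 4.10, "Pro": 4.89,
    "Ser": 7.08, "Thr": 5.58, "Trp": 1.21, "Tyr": 3.16, "Val": 6.61,
}

# Sort residue names from the most to the least frequent.
ranked = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranked))
```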

     A.2   Taxonomic origin

Total number of species represented in this release of SWISS-PROT: 7'188
The first twenty species represent 45'181 sequences: 44.5 % of the total
number of entries.

     A.2.1   Table of the frequency of occurrence of species

        Species represented 1x: 3396
                            2x: 1086
                            3x:  589
                            4x:  366
                            5x:  267
                            6x:  251
                            7x:  169
                            8x:  137
                            9x:  125
                           10x:   61
                       11- 20x:  308
                       21- 50x:  231
                       51-100x:   78
                         >100x:  124

     A.2.2   Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1       7471  Homo sapiens (Human)
       2       4859  Saccharomyces cerevisiae (Baker's yeast)
       3       4816  Mus musculus (Mouse)
       4       4741  Escherichia coli
       5       3091  Rattus norvegicus (Rat)
       6       2260  Bacillus subtilis
       7       2184  Caenorhabditis elegans
       8       1782  Schizosaccharomyces pombe (Fission yeast)
       9       1769  Haemophilus influenzae
      10       1514  Drosophila melanogaster (Fruit fly)
      11       1472  Methanococcus jannaschii
      12       1409  Arabidopsis thaliana (Mouse-ear cress)
      13       1321  Mycobacterium tuberculosis
      14       1295  Bos taurus (Bovine)
      15       1004  Gallus gallus (Chicken)
      16        883  Synechocystis sp. (strain PCC 6803)
      17        872  Escherichia coli O157:H7
      18        846  Salmonella typhimurium
      19        798  Archaeoglobus fulgidus
      20        794  Xenopus laevis (African clawed frog)
      21        765  Sus scrofa (Pig)
      22        680  Aquifex aeolicus
      23        671  Oryctolagus cuniculus (Rabbit)
      24        662  Mycoplasma pneumoniae
      25        594  Pseudomonas aeruginosa
      26        588  Treponema pallidum
      27        557  Buchnera aphidicola (subsp. Acyrthosiphon pisum)
      28        523  Rickettsia prowazekii
      29        522  Helicobacter pylori (Campylobacter pylori)
      30        505  Helicobacter pylori J99 (Campylobacter pylori J99)
      31        503  Mycobacterium leprae
      32        486  Mycoplasma genitalium
      33        481  Zea mays (Maize)
      34        450  Methanobacterium thermoautotrophicum
      35        403  Rhizobium sp. (strain NGR234)
      36        395  Borrelia burgdorferi (Lyme disease spirochete)
      37        390  Oryza sativa (Rice)
      38        387  Chlamydia trachomatis
      39        375  Thermotoga maritima
      40        374  Streptomyces coelicolor
      41        371  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      42        368  Canis familiaris (Dog)
      43        364  Chlamydia muridarum
      44        356  Rhizobium meliloti (Sinorhizobium meliloti)
      45        353  Vibrio cholerae
      46        333  Nicotiana tabacum (Common tobacco)
      47        323  Pasteurella multocida
      48        322  Ovis aries (Sheep)
      49        320  Pyrococcus horikoshii
      50        311  Dictyostelium discoideum (Slime mold)
      51        301  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      52        284  Pyrococcus abyssi
      53        276  Pisum sativum (Garden pea)
      54        272  Bacteriophage T4
      55        260  Staphylococcus aureus
      56        256  Candida albicans (Yeast)
      57        255  Neurospora crassa
      58        254  Vaccinia virus (strain Copenhagen)
      59        247  Triticum aestivum (Wheat)
      60        247  Bacillus halodurans
      61        244  Glycine max (Soybean)
      62        243  Hordeum vulgare (Barley)
      63        242  Aeropyrum pernix
      64        241  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
      65        231  Pseudomonas putida
      66        227  Lycopersicon esculentum (Tomato)
      67        221  Cavia porcellus (Guinea pig)
      68        220  Porphyra purpurea
      69        219  Solanum tuberosum (Potato)
      70        214  Spinacia oleracea (Spinach)
      71        214  Klebsiella pneumoniae
      72        213  Bacillus stearothermophilus
      73        210  Neisseria meningitidis (serogroup B)
      74        204  Neisseria meningitidis (serogroup A)
      75        193  Human cytomegalovirus (strain AD169)
      76        188  Campylobacter jejuni
      77        187  Vaccinia virus (strain WR)
      78        183  Deinococcus radiodurans
      79        180  Agrobacterium tumefaciens
      80        179  Sulfolobus solfataricus
      81        179  Brachydanio rerio (Zebrafish) (Zebra danio)
      82        173  Equus caballus (Horse)
      83        171  Mesocricetus auratus (Golden hamster)
      84        171  Chlamydomonas reinhardtii
      85        170  Thermoplasma acidophilum
      86        168  Emericella nidulans (Aspergillus nidulans)
      87        158  Halobacterium sp. (strain NRC-1)
      88        154  Autographa californica nuclear polyhedrosis virus (AcMNPV)
      89        153  Cyanidium caldarium
      90        152  Thermus aquaticus (subsp. thermophilus)
      91        151  Marchantia polymorpha (Liverwort)
      92        151  Cyanophora paradoxa
      93        149  Xylella fastidiosa
      94        148  Fowlpox virus (FPV)
      95        148  Guillardia theta (Cryptomonas phi)
      96        147  Synechococcus sp. (strain PCC 7942) (Anacystis nidulans R2)
      97        147  Variola virus
      98        143  Caulobacter crescentus
      99        142  Ureaplasma parvum (Ureaplasma urealyticum biotype 1)
     100        142  Kluyveromyces lactis (Yeast)

     A.2.3   Taxonomic distribution of the sequences

   Kingdom       Sequences (% of the database)
   Archaea            5032 (  5%)
   Bacteria          34782 ( 34%)
   Eukaryota         53357 ( 53%)
   Viruses            8431 (  8%)
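
The percentages above can be reproduced from the sequence counts. The
following is a minimal Python sketch of that check; the names
kingdom_counts and percentages are illustrative, and rounding to a whole
percent is an assumption inferred from the table.

```python
# Sequence counts copied from the table above.
kingdom_counts = {
    "Archaea": 5032,
    "Bacteria": 34782,
    "Eukaryota": 53357,
    "Viruses": 8431,
}

# The four counts add up to the number of entries in the release.
total = sum(kingdom_counts.values())


def percentages(counts):
    """Return each kingdom's share of the total, rounded to a whole percent.

    Rounding to whole percents is an assumption made for this sketch.
    """
    grand_total = sum(counts.values())
    return {name: round(100 * n / grand_total) for name, n in counts.items()}


for name, pct in percentages(kingdom_counts).items():
    print(f"{name:<10} {kingdom_counts[name]:>6} ({pct:>3}%)")
```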

     A.3   Sequence size

     A.3.1   Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50    1950             1001-1100      915
                 51- 100    7099             1101-1200      708
                101- 150   10484             1201-1300      471
                151- 200    9010             1301-1400      318
                201- 250    8978             1401-1500      268
                251- 300    8130             1501-1600      172
                301- 350    7894             1601-1700      150
                351- 400    7945             1701-1800      105
                401- 450    5869             1801-1900      116
                451- 500    5485             1901-2000       87
                501- 550    4190             2001-2100       47
                551- 600    2852             2101-2200       87
                601- 650    2249             2201-2300       89
                651- 700    1651             2301-2400       50
                701- 750    1457             2401-2500       48
                751- 800    1240             >2500          273
                801- 850     985
                851- 900     965
                901- 950     700
                951-1000     593
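
The size classes above are 50 residues wide up to 1000, 100 residues wide
from 1001 to 2500, and open-ended beyond that. The sketch below shows one
way to assign a sequence length to its class; the function name size_bin
and the sample lengths are hypothetical and are not taken from the tools
used to build this table.

```python
from collections import Counter


def size_bin(length):
    """Return the label of the size class that a sequence length falls into."""
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return ">2500"
    # Classes are 50 wide up to 1000 residues and 100 wide from 1001 to 2500.
    width = 50 if length <= 1000 else 100
    # Round the length up to the next multiple of the class width.
    upper = -(-length // width) * width
    lower = upper - width + 1
    return f"{lower}-{upper}"


# Illustrative lengths only; a real run would take the lengths of all
# non-fragment entries.
sample_lengths = [3, 50, 51, 1000, 1001, 2500, 6669]
histogram = Counter(size_bin(n) for n in sample_lengths)
print(histogram)
```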

     A.3.2   Longest and shortest sequences

   The shortest sequence is  GRWM_HUMAN (P24272) :     3 amino acids.
   The longest sequence is   NEBU_HUMAN (P20929) :  6669 amino acids.

     A.4   Journal citations

Note: the following citation statistics reflect the number of distinct
journal citations.

Total number of journals cited in this release of SWISS-PROT: 1'190

     A.4.1   Table of the frequency of journal citations

        Journals cited 1x:  443
                       2x:  157
                       3x:   87
                       4x:   58
                       5x:   51
                       6x:   27
                       7x:   24
                       8x:   19
                       9x:   21
                      10x:   11
                  11- 20x:   83
                  21- 50x:   88
                  51-100x:   31
                    >100x:   90
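
The frequency classes above should add up to the total number of journals
cited. A small Python sketch of that check follows; parse_count is a
hypothetical helper that handles the apostrophe used as a thousands
separator in these notes.

```python
def parse_count(text):
    """Convert a count written with apostrophe separators, e.g. "1'190", to an int."""
    return int(text.replace("'", ""))


# Number of journals per citation-frequency class, copied from the table above.
journals_by_frequency = {
    "1x": 443, "2x": 157, "3x": 87, "4x": 58, "5x": 51,
    "6x": 27, "7x": 24, "8x": 19, "9x": 21, "10x": 11,
    "11-20x": 83, "21-50x": 88, "51-100x": 31, ">100x": 90,
}

total_journals = parse_count("1'190")
classes_sum = sum(journals_by_frequency.values())
print(classes_sum == total_journals)
```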

     A.4.2   List of the most cited journals in SWISS-PROT

   Nb    Citations   Journal name
   --    ---------   -------------------------------------------------------------
    1         8033   Journal of Biological Chemistry
    2         4615   Proceedings of the National Academy of Sciences of the U.S.A.
    3         3554   Nucleic Acids Research
    4         3295   Journal of Bacteriology
    5         3144   Gene
    6         2492   FEBS Letters
    7         2293   Biochemical and Biophysical Research Communications
    8         2255   European Journal of Biochemistry
    9         2144   Biochemistry
   10         1998   The EMBO Journal
   11         1894   Nature
   12         1833   Biochimica et Biophysica Acta
   13         1682   Journal of Molecular Biology
   14         1503   Genomics
   15         1477   Cell
   16         1434   Molecular and Cellular Biology
   17         1096   Biochemical Journal
   18         1085   Molecular and General Genetics
   19         1078   Plant Molecular Biology
   20         1024   Science
   21          982   Molecular Microbiology
   22          814   Virology
   23          808   Journal of Biochemistry
   24          637   Human Molecular Genetics
   25          592   Journal of Cell Biology
   26          573   Journal of Virology
   27          525   Human Mutation
   28          520   Plant Physiology
   29          518   Genes and Development
   30          510   Yeast
   31          505   Nature Genetics
   32          494   Oncogene
   33          486   Journal of General Virology
   34          477   Infection and Immunity
   35          461   Journal of Immunology
   36          441   The American Journal of Human Genetics
   37          424   Structure
   38          420   Archives of Biochemistry and Biophysics
   39          391   FEMS Microbiology Letters
   40          366   Microbiology
   41          358   Current Genetics
   42          346   Development
   43          333   Nature Structural Biology
   44          331   Molecular and Biochemical Parasitology
   45          320   Human Genetics
   46          293   Genetics
   47          280   Molecular Endocrinology
   48          277   Journal of Clinical Investigation
   49          270   Biological Chemistry Hoppe-Seyler
   50          267   Applied and Environmental Microbiology
   51          265   Blood
   52          263   Journal of Molecular Evolution
   53          253   Protein Science
   54          249   DNA and Cell Biology
   55          243   Developmental Biology
   56          229   Journal of General Microbiology
   57          224   Journal of Experimental Medicine
   58          213   Neuron
   59          213   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
   60          211   Cancer Research
   61          210   Immunogenetics
   62          208   Mammalian Genome
   63          197   Endocrinology
   64          182   Mechanisms of Development
   65          180   DNA Sequence
   66          170   Acta Crystallographica, Section D
   67          164   The Plant Cell
   68          161   Brain Research. Molecular Brain Research
   69          159   Journal of Neurochemistry
   70          158   Molecular Biology and Evolution
   71          156   DNA
   72          155   Molecular Biology of the Cell
   73          147   The Plant Journal
   74          146   Journal of Cell Science
   75          145   Journal of Neuroscience
   76          135   Comparative Biochemistry and Physiology
   77          133   Bioscience, Biotechnology, and Biochemistry
   78          130   Antimicrobial Agents and Chemotherapy
   79          125   Biochimie
   80          123   Virus Research
   81          122   Bioorganicheskaia Khimiia
   82          120   Molecular Pharmacology
   83          117   Hemoglobin
   84          116   The Journal of Clinical Endocrinology and Metabolism
   85          113   Agricultural and Biological Chemistry
   86          112   Cytogenetics and Cell Genetics
   87          112   American Journal of Physiology
   88          110   Molecular Plant-Microbe Interactions
   89          105   Proteins
   90          102   Peptides
   91          100   DNA Research

     A.5   Statistics for some line types

For selected SWISS-PROT line types, the following table gives the total
number of lines, the number of entries with at least one such line, and
the average number of such lines per entry.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                     182326              1.79
   Journal                          152419     89829    1.50
   Submitted to EMBL/GenBank/DDBJ    27607     24142    0.27
   Unpublished observations            500       496   <0.01
   Book citation                       438       428   <0.01
   Submitted to SWISS-PROT             437       435   <0.01
   Plant Gene Register                 385       378   <0.01
   Submitted to other databases        185       183   <0.01
   Thesis                              160       159   <0.01
   Unpublished results                 114       112   <0.01
   Patent                               79        77   <0.01
   Worm Breeder's Gazette                2         2   <0.01

Comments (CC)                       309232              3.04
   SIMILARITY                        91246     81758    0.90
   FUNCTION                          61984     61049    0.61
   SUBCELLULAR LOCATION              42010     42010    0.41
   CATALYTIC ACTIVITY                27896     26508    0.27
   SUBUNIT                           25865     25864    0.25
   PATHWAY                           11464     11431    0.11
   TISSUE SPECIFICITY                10070     10070    0.10
   COFACTOR                           7811      7811    0.08
   MISCELLANEOUS                      6942      6352    0.07
   PTM                                5829      5447    0.06
   INDUCTION                          2971      2971    0.03
   DEVELOPMENTAL STAGE                2811      2811    0.03
   ALTERNATIVE PRODUCTS               2755      2754    0.03
   DOMAIN                             2658      2471    0.03
   CAUTION                            2169      2099    0.02
   DISEASE                            1865      1620    0.02
   ENZYME REGULATION                  1473      1473    0.01
   MASS SPECTROMETRY                   548       506    0.01
   DATABASE                            503       465   <0.01
   POLYMORPHISM                        295       287   <0.01
   PHARMACEUTICAL                       38        38   <0.01
   BIOTECHNOLOGY                        29        29   <0.01

Features (FT)                       471213              4.64
   DOMAIN                            76115     22381    0.75
   TRANSMEM                          64913     14473    0.64
   CARBOHYD                          40298      9840    0.40
   CONFLICT                          36638     12924    0.36
   DISULFID                          34856      9355    0.34
   METAL                             27931      6801    0.27
   CHAIN                             20956     16975    0.21
   VARIANT                           18980      3544    0.19
   ACT_SITE                          18495     11839    0.18
   REPEAT                            17543      3013    0.17
   SIGNAL                            12976     12975    0.13
   NP_BIND                           12514      8916    0.12
   MOD_RES                           11665      6503    0.11
   NON_TER                           10234      7849    0.10
   BINDING                            7710      6160    0.08
   TURN                               7330       633    0.07
   STRAND                             7077       562    0.07
   ZN_FING                            5911      2061    0.06
   INIT_MET                           4892      4868    0.05
   HELIX                              4644       587    0.05
   VARSPLIC                           4211      2068    0.04
   SITE                               4151      3019    0.04
   PROPEP                             3842      3488    0.04
   DNA_BIND                           3796      3589    0.04
   MUTAGEN                            2797       963    0.03
   LIPID                              2684      2174    0.03
   TRANSIT                            2300      2284    0.02
   PEPTIDE                            2202       830    0.02
   CA_BIND                            2106       840    0.02
   NON_CONS                            732       387    0.01
   UNSURE                              255       117   <0.01
   SIMILAR                             242       203   <0.01
   SE_CYS                              104        64   <0.01
   THIOETH                              90        31   <0.01
   THIOLEST                             23        23   <0.01

Cross-references (DR)               718458              7.07
   EMBL                             179318     95610    1.76
   InterPro                         128566     81051    1.27
   Pfam                             101086     77741    0.99
   PROSITE                           83189     53484    0.82
   PIR                               47057     35789    0.46
   HSSP                              33548     33548    0.33
   PRINTS                            30494     27899    0.30
   SMART                             30434     22855    0.30
   ProDom                            16772     16337    0.17
   PDB                               10380      3124    0.10
   TIGR                               9378      9343    0.09
   MIM                                6755      6024    0.07
   SGD                                4903      4849    0.05
   MGD                                4408      4397    0.04
   EcoGene                            4134      4132    0.04
   Mendel                             3041      2942    0.03
   MEROPS                             2348      2260    0.02
   SubtiList                          2234      2233    0.02
   WormPep                            2071      2034    0.02
   FlyBase                            1936      1883    0.02
   GCRDb                              1661       972    0.02
   TRANSFAC                           1612      1494    0.02
   TubercuList                        1350      1313    0.01
   StyGene                             799       798    0.01
   SWISS-2DPAGE                        746       745    0.01
   Leproma                             501       497   <0.01
   MaizeDB                             402       398   <0.01
   HIV                                 370       354   <0.01
   REBASE                              352       347   <0.01
   ECO2DBASE                           351       299   <0.01
   DictyDb                             313       310   <0.01
   GlycoSuiteDB                        249       249   <0.01
   ZFIN                                154       154   <0.01
   YEPD                                129       120   <0.01
   Aarhus/Ghent-2DPAGE                 128        98   <0.01
   PHCI-2DPAGE                         128       128   <0.01
   Siena-2DPAGE                        104       104   <0.01
   HSC-2DPAGE                           85        85   <0.01
   COMPLUYEAST-2DPAGE                   50        50   <0.01
   CarbBank                             41        21   <0.01
   Maize-2DPAGE                         39        39   <0.01
   PMMA-2DPAGE                          26        26   <0.01
   MypuList                             21        21   <0.01
   ANU-2DPAGE                           13        13   <0.01
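
The "Average per entry" column is consistent with dividing each total by
101'602, the number of entries given in the Introduction. The sketch
below reproduces that column under this assumption; the function name
and the 0.005 cut-off for printing "<0.01" are inferred from the table
and are not a documented rule.

```python
TOTAL_ENTRIES = 101602  # entries in release 40.0, from the Introduction


def average_per_entry(total_lines, total_entries=TOTAL_ENTRIES):
    """Format the average number of lines per entry in the style of the table."""
    average = total_lines / total_entries
    # Assumed convention: values that would round to 0.00 are shown as "<0.01".
    if average < 0.005:
        return "<0.01"
    return f"{average:.2f}"


# Totals copied from the table above.
for label, total in [
    ("References (RL)", 182326),
    ("Comments (CC)", 309232),
    ("Features (FT)", 471213),
    ("Cross-references (DR)", 718458),
]:
    print(f"{label:<24}{total:>8}  {average_per_entry(total):>6}")
```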

     A.6   Miscellaneous statistics

Total number of distinct authors cited in SWISS-PROT: 146'936

Total number of entries encoded on a chloroplast: 2'609
Total number of entries encoded on a mitochondrion: 2'262
Total number of entries encoded on a cyanelle: 145
Total number of entries encoded on a plasmid: 2'344

Number of additional sequences encoded on splice variants: 3'505

--End of document--