Swiss-Prot release 21.0

Published March 1, 1992

                    SWISS-PROT RELEASE 21.0 RELEASE NOTES

                               1. INTRODUCTION

   1.1  Evolution

   Release 21.0  of SWISS-PROT  contains 23742 sequence entries, comprising
   7'866'596 amino  acids abstracted from 23919 references. This represents
   an increase of 5% over release 20. The recent growth of the data bank is
   summarized below.

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
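
   The quoted increase of about 5% can be checked against the last two
   rows of the table. The following is a minimal sketch; the function and
   variable names are illustrative and not part of SWISS-PROT.

```python
# Figures copied from the table above (releases 20.0 and 21.0).
entries = {"20.0": 22654, "21.0": 23742}
residues = {"20.0": 7_500_130, "21.0": 7_866_596}


def percent_growth(old: int, new: int) -> float:
    """Relative growth from old to new, expressed in percent."""
    return (new - old) / old * 100


entry_growth = percent_growth(entries["20.0"], entries["21.0"])
residue_growth = percent_growth(residues["20.0"], residues["21.0"])
print(f"entries: {entry_growth:.1f}%, amino acids: {residue_growth:.1f}%")
# prints "entries: 4.8%, amino acids: 4.9%"
```

   Both figures round to the 5% stated in the text.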

   1.2  Source of data

   Release 21.0  has been  updated using protein sequence data from release
   31.0 of  the PIR (Protein Identification Resource) protein data bank, as
   well as translation of nucleotide sequence data from release 29.0 of the
   EMBL Nucleotide Sequence Database.


   As an indication of the source of the sequence data in the SWISS-PROT
   data bank, we list here the statistics concerning the DR (Database
   cross-references) pointer lines:

   Entries with pointer(s) to only PIR entry(ies):           4198
   Entries with pointer(s) to only EMBL entry(ies):          3031
   Entries with pointer(s) to both EMBL and PIR entry(ies): 16003
   Entries with no pointer lines:                             510
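
   As a consistency check, the four categories above are mutually
   exclusive and should add up to the total number of entries given in
   section 1.1. A short sketch (the dictionary keys are illustrative
   labels only):

```python
# Counts copied from the DR pointer statistics above.
dr_pointer_counts = {
    "PIR only": 4198,
    "EMBL only": 3031,
    "EMBL and PIR": 16003,
    "no pointer lines": 510,
}

total_entries = sum(dr_pointer_counts.values())
print(total_entries)  # prints 23742, the entry count of release 21.0
```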


   2.1  Sequences and annotations

   About 1100 sequences have been added since release 20, the sequence data
   of 150  existing entries  has been  updated and  the annotations of 2860
   entries have been revised. In particular we have used review articles
   to update the annotations of the following groups or families of
   proteins:

   -  Acid phosphatases
   -  Acylphosphatases
   -  Bacterial regulatory proteins, luxR family
   -  Cyclins
   -  Cytochromes P450
   -  C-type lectin domain proteins
   -  Histidinol dehydrogenases
   -  Indole-3-glycerol phosphate synthases
   -  Microviridae sequences
   -  Myoglobins
   -  Osteonectin domain proteins
   -  Snakes venom phospholipases A2
   -  RecF proteins
   -  Scorpions venom beta-toxins
   -  PTN/MK heparin-binding protein family
   -  Tissue factor

   2.2  Change in the format of the entry names

   The dollar sign `$' in entry names has been replaced by the underscore
   character `_'. This change is made on behalf of users of sequence
   analysis software running under the Unix operating system, where the
   dollar sign is a reserved symbol. Example: the entry name `CYC$HUMAN'
   has been changed to `CYC_HUMAN'.
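
   Software that stores entry names from earlier releases can apply the
   same substitution. The following is a minimal sketch; the function name
   is hypothetical, and it assumes `$' never occurs in an entry name other
   than as the separator.

```python
def convert_entry_name(old_name: str) -> str:
    """Convert a pre-21.0 entry name (with `$') to the new form (with `_')."""
    return old_name.replace("$", "_")


print(convert_entry_name("CYC$HUMAN"))  # prints CYC_HUMAN
```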

   2.3  New line type GN

   The GN (Gene Name) line is a new line type that is used to indicate the
   name(s) of the gene(s) that encode the protein being described.


   Previously this information was found in the DE line, as shown in the
   following example.

   In previous releases:


   In the current release:

        GN   ALB.

   The format of the GN line is:

        GN   NAME1[ AND|OR NAME2...].

   Examples:

        GN   ALB.
        GN   REX-1.

   It often  occurs that  more than  one gene  name has been assigned to an
   individual locus.  In that  case all  the synonyms  are listed. The word
   `OR' separates the different designations. The first name in the list is
   assumed to be the most correct (or most current) designation. Example:


   In a few cases, multiple genes encode an identical protein sequence.
   In that  case all  the different  gene names  are listed. The word `AND'
   separates the designations. Example:

        GN   CECA1 AND CECA2.

   In very rare cases (only one occurrence has been found in the current
   release) `AND' and `OR' may both be present. In that case parentheses
   are used, as shown in the following example:

        GN   GVPA AND (GVPB OR GVPA2).
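
   The GN format described above can be read with a few lines of code.
   The following is a sketch only, not an official parser: the function
   name is hypothetical, and it assumes a single GN line, no nested
   parentheses, and that no gene name itself contains ` AND ' or ` OR '.
   It returns one list per gene, each holding that gene's synonyms.

```python
def parse_gn_line(line: str) -> list[list[str]]:
    """Parse a GN line into a list of genes, each a list of synonyms."""
    if not line.startswith("GN"):
        raise ValueError("not a GN line")
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    genes = []
    for group in body.split(" AND "):  # `AND' separates distinct genes
        group = group.strip()
        if group.startswith("(") and group.endswith(")"):
            group = group[1:-1]  # parentheses only group synonyms
        # `OR' separates synonyms of one gene; the first is preferred
        genes.append([name.strip() for name in group.split(" OR ")])
    return genes


print(parse_gn_line("GN   GVPA AND (GVPB OR GVPA2)."))
# prints [['GVPA'], ['GVPB', 'GVPA2']]
```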

   2.4  New line type RM

   The RM (Reference Medline) line is used to indicate the Medline Unique
   Identifier (UID) of a reference. Previously this information was listed
   in the RC line using the `MEDLINE' token, as shown in the following
   example.

   In previous releases:

        RC   MEDLINE=90205618;


   In the current release:

        RM   90205618

   The format of the RM line is:

        RM   nnnnnnnn

   where `nnnnnnnn' is the eight-digit Medline Unique Identifier (UID).
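
   Programs that read the Medline UID from the RC line need to be adapted.
   The following sketch shows one way to derive the new RM line from an
   old-style RC line; the function name is hypothetical and the pattern
   assumes the token is written exactly as in the example above.

```python
import re

# Old-style token: MEDLINE= followed by eight digits and a semicolon.
_MEDLINE_TOKEN = re.compile(r"MEDLINE=(\d{8});")


def rc_to_rm(rc_line: str):
    """Return the RM line for an old RC line, or None if it has no MEDLINE token."""
    match = _MEDLINE_TOKEN.search(rc_line)
    if match is None:
        return None
    return f"RM   {match.group(1)}"


print(rc_to_rm("RC   MEDLINE=90205618;"))  # prints "RM   90205618"
```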

   2.5  Secondary structure information

   Thanks to the help of Chris Sander and Reinhard Schneider of the
   Biocomputing group at EMBL, we have added secondary structure
   information to the feature table of entries for proteins whose tertiary
   structure is known experimentally. The secondary structure assignment
   is made according to DSSP (see Kabsch W., Sander C.; Biopolymers,
   22:2577-2637(1983)) and the information is extracted from the
   coordinate data sets of the Protein Data Bank (PDB).

   In the feature table only three types of secondary structure are
   specified: helices (HELIX), beta-strands (STRAND) and turns (TURN).
   Residues not assigned to one of these classes are in a `loop' or
   `random-coil' structure. Because the DSSP assignment has more than the
   three common secondary structure classes, we have converted the
   DSSP assignments to HELIX, STRAND and TURN as follows:

   DSSP   DSSP definition                                 SWISS-PROT
   code                                                   assignment
   ----   ---------------------------------------------   --------------
   H      Alpha-helix                                     HELIX
   G      3(10) helix                                     HELIX
   I      Pi-helix                                        HELIX
   E      Hydrogen bonded beta-strand (extended strand)   STRAND
   B      Residue in an isolated beta-bridge              STRAND
   T      H-bonded turn (3-turn, 4-turn or 5-turn)        TURN
   S      Bend (five-residue bend centered at residue i)  Not specified
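
   The conversion table can be expressed as a simple lookup. The sketch
   below turns a per-residue string of DSSP codes into feature ranges. It
   is an illustration, not the procedure actually used to build the
   entries: the function name is hypothetical, and merging adjacent runs
   that map to the same key (e.g. `H' followed by `G') into one segment is
   an assumption.

```python
from itertools import groupby

# Mapping copied from the table above; `S' and blank are left unassigned.
DSSP_TO_SWISSPROT = {
    "H": "HELIX", "G": "HELIX", "I": "HELIX",
    "E": "STRAND", "B": "STRAND",
    "T": "TURN",
}


def dssp_to_features(dssp: str) -> list[tuple[str, int, int]]:
    """Collapse per-residue DSSP codes into (key, first, last) ranges, 1-based."""
    features = []
    position = 1
    for key, run in groupby(dssp, key=lambda code: DSSP_TO_SWISSPROT.get(code)):
        length = len(list(run))
        if key is not None:
            features.append((key, position, position + length - 1))
        position += length
    return features


print(dssp_to_features("  HHHHGGG SS EEEE TT B"))
# prints [('HELIX', 3, 9), ('STRAND', 14, 17), ('TURN', 19, 20), ('STRAND', 22, 22)]
```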

   One should be aware of the following facts:

   a) Segment Length. For helices (alpha and 3-10), the residue just before
      and just after the helix as given by DSSP participates in the helical
      hydrogen bonding  pattern with  a single  H-bond. For  some practical
      purposes, one  can therefore extend the HELIX range by one residue on
      each side. E.g. HELIX 25-35 instead of HELIX 26-34. Also, the ends of
      secondary  structure   segments  are  less  well  defined  for  lower
      resolution structures. A fluctuation of +/- one residue is common.

   b) Missing segments.  In low resolution structures, badly formed helices
      or strands may be omitted in the DSSP definition.


   c) Special helices  and  strands.  Helices  of  length  three  are  3-10
      helices, those  of length four and longer are either alpha-helices or
      3-10 helices  (pi helices are extremely rare). A strand of length one
      corresponds to a residue in an isolated beta-bridge. Such bridges can
      be structurally important.

   d) Missing secondary  structure. No  secondary  structure  is  currently
      given in the feature table in the following cases:

      - No sequence data in the PDB entry.
      - Structure for which only C-alpha coordinates are in PDB.
      - NMR structure with more than one coordinate data set.
      - Model (i.e. theoretical) structure.
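
   The extension mentioned in point a) can be sketched as follows. The
   function name and the `sequence_length' parameter are hypothetical;
   the clamping to the ends of the sequence is an added safeguard, not
   something prescribed by DSSP or SWISS-PROT.

```python
def extend_helix(start: int, end: int, sequence_length: int) -> tuple[int, int]:
    """Widen a HELIX range by one residue on each side, kept within 1..sequence_length."""
    if not 1 <= start <= end <= sequence_length:
        raise ValueError("invalid helix range")
    return max(1, start - 1), min(sequence_length, end + 1)


print(extend_helix(26, 34, 100))  # prints (25, 35)
```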

   2.6  Feature key name change

   The secondary  structure description feature key `BETA' has been renamed
   `STRAND' (see the section above for its current definition).

   2.7  Alu-derived warning entries

   Following the advice of, and in collaboration with, Jean-Michel Claverie
   of the National Center for Biotechnology Information (NCBI, Washington
   D.C.), we have added to SWISS-PROT Alu-derived "warning" entries. These
   entries are provided in order to avoid the further 'pollution' of
   protein sequence databases with Alu-derived amino acid sequences.

   Alu repetitive  sequences are  interspersed in human and primate genomes
   with an  average spacing  of 3 Kb. Some of them are actively transcribed
   by pol  III. Normal  transcripts may contain Alu-derived sequences in 5'
   or 3' untranslated regions. However, cDNA libraries also contain partial
   and/or rearranged cDNAs ligated with Alu-derived sequence in any
   orientation. This has been overlooked on several occasions, with the
   consequence  of   erroneous  Alu-derived   amino  acid  sequences  being

   Various analyses indicate that Alu repeats fall into six classes (A to
   F). Therefore six "warning" entries have been constituted, each holding
   the conceptual translations, in all six reading frames, of one randomly
   chosen member of one of these classes of Alu repeats. Any significant
   similarity of a putative protein sequence with an Alu-translated entry
   must be taken as a warning that part of an Alu repeat may have been
   artifactually included in the coding nucleotide sequence.
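
   For readers unfamiliar with the term, a six-frame conceptual
   translation can be sketched as below. This is a generic illustration
   using the standard genetic code, not the program used to build the
   warning entries; all names are hypothetical, the input sequence is a
   made-up example rather than an Alu repeat, and codons containing
   anything other than A, C, G or T are rendered as `X'.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; `*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict[str, str]:
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand_name, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand_name}{offset + 1}"] = translate(strand[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
# prints {'+1': 'MA*', '+2': 'WP', '+3': 'GL', '-1': 'LGH', '-2': '*A', '-3': 'RP'}
```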

   These sequences have been assigned accession numbers P23959 (ALUA_HUMAN)
   to P23964 (ALUF_HUMAN).


   2.8  Feature lines `spring cleaning'

   We are  in the  process of  `cleaning' up  the comments  part of feature
   lines to homogenize the description of specific domains and sites.

   For example, regions enriched in one or more types of amino acid are now
   described using the general format:

        FT   DOMAIN      xxx    xxx       AA1[/AA2/.../AAN]-RICH.

   Where AA1,  AA2, ...  AAN are  valid amino-acid  three letter codes (the
   twenty standard  codes with  the addition  of `GLA' for gamma-carboxylic


        FT   DOMAIN       12     45       PRO-RICH.
        FT   DOMAIN      123    456       ASP/GLU-RICH (ACIDIC).
        FT   DOMAIN      246    678       SER/THR-RICH (LINKER REGION).
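
   A program can recognise the homogenized `-RICH' descriptions with a
   regular expression. The sketch below is an assumption-laden example:
   the function name is hypothetical, it only looks at the description
   field of the FT line, and the set of allowed non-standard codes
   contains only `GLA', the one code named in this document.

```python
import re

STANDARD_CODES = (
    "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL"
).split()
ALLOWED_CODES = set(STANDARD_CODES) | {"GLA"}

# One or more three-letter codes separated by `/', followed by -RICH.
RICH_PATTERN = re.compile(r"^([A-Z]{3}(?:/[A-Z]{3})*)-RICH\b")


def parse_rich_description(description: str):
    """Return the list of enriched amino acids, or None if the text does not fit."""
    match = RICH_PATTERN.match(description)
    if match is None:
        return None
    codes = match.group(1).split("/")
    if any(code not in ALLOWED_CODES for code in codes):
        return None
    return codes


print(parse_rich_description("ASP/GLU-RICH (ACIDIC)."))  # prints ['ASP', 'GLU']
```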

   Many other  changes of  this nature  have either  been completed in this
   release or are in the process of being carried out.

   Also  note   that  `non-experimental'  derived  features  are  now  only
   indicated by the qualifiers `PROBABLE', `POTENTIAL', or `BY SIMILARITY';
   the use  of qualifiers such as `PUTATIVE', `POSSIBLE', `TENTATIVE', etc.
   has been discontinued.

   This cleaning process will continue in the next two or three releases.

                            3. FORTHCOMING CHANGES

   The following changes will be implemented starting with release 22.

   3.1  A new feature table key: UNSURE

   The UNSURE  key will  be used  to describe  region(s) of  a sequence for
   which the authors are unsure about the sequence assignment.

   3.2. Others

   Other changes  are planned,  but we  are already  past our  deadline  to
   prepare this release so here are some very brief notes!

   -  An ASN.1  version of  SWISS-PROT will  soon be  officially  available
      (thanks to Mark Cavanaugh of the NCBI). Software developers who are
      interested in such a version can already obtain a beta-test release
      of  SWISS-PROT  21  in  ASN.1  format  (For  details  contact  me  at
   -  We are thinking of some new topics for the CC lines.
   -  New developments concerning the integration of SWISS-PROT with other
      data banks are in the 'pipeline'.


                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 8.0 of the ENZYME data bank is distributed along with release 21
   of SWISS-PROT.  ENZYME release 8.0 contains information relative to 3073
   enzymes. The data bank is complete and up to date. Until new enzyme
   nomenclature data is published we only plan to update the SWISS-PROT
   pointers at each release of the protein sequence data bank, correct
   any errors, and complete the information concerning synonyms and
   cofactors using the literature.

   4.2  The PROSITE data bank

   Release 8.10  of the PROSITE data bank is distributed along with release
   21 of SWISS-PROT. Release 8.10 contains 530 documentation chapters that
   describe 605 different patterns. Release 8.10 does not really represent
   a new release; the only change between releases 8.0 and 8.10 is the
   updating of the pointers to the SWISS-PROT entries whose names have been
   modified between releases 20 and 21. The next release of PROSITE (9.0)
   will be distributed with release 22 of SWISS-PROT.

                            5. WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate
   being notified if you find that sequences belonging to your field of
   expertise are missing from the data bank. We would also like to be
   notified about annotations that need updating, for example if the
   function of a protein has been clarified or if new post-translational
   information has become available.


                         APPENDIX A: SOME STATISTICS

   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.65   Gln (Q) 4.07   Leu (L) 9.15   Ser (S) 7.08
   Arg (R) 5.23   Glu (E) 6.26   Lys (K) 5.83   Thr (T) 5.84
   Asn (N) 4.45   Gly (G) 7.10   Met (M) 2.33   Trp (W) 1.30
   Asp (D) 5.24   His (H) 2.27   Phe (F) 3.97   Tyr (Y) 3.22
   Cys (C) 1.81   Ile (I) 5.46   Pro (P) 5.08   Val (V) 6.49

   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03

        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Asp, Arg, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
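
   The ranking in A.1.2 follows directly from the percentages in A.1.1.
   A short sketch that reproduces it (the variable names are illustrative;
   the ambiguity codes B, Z and X are left out, as in the list above):

```python
# Percentages copied from table A.1.1.
composition = {
    "Ala": 7.65, "Gln": 4.07, "Leu": 9.15, "Ser": 7.08,
    "Arg": 5.23, "Glu": 6.26, "Lys": 5.83, "Thr": 5.84,
    "Asn": 4.45, "Gly": 7.10, "Met": 2.33, "Trp": 1.30,
    "Asp": 5.24, "His": 2.27, "Phe": 3.97, "Tyr": 3.22,
    "Cys": 1.81, "Ile": 5.46, "Pro": 5.08, "Val": 6.49,
}

# Sort residue names by decreasing frequency.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```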

   A.2  Distribution of the sequences by organism of origin

   Total number of species represented in this release of SWISS-PROT: 3159

        A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 1373
                            2x:  572
                            3x:  319
                            4x:  191
                            5x:  135
                            6x:  101
                            7x:   72
                            8x:   49
                            9x:   64
                           10x:   33
                       11- 20x:  128
                       21-100x:   95
                         >100x:   27


        A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        1891          Human
         2        1697          Escherichia coli
         3        1116          Mouse
         4        1057          Rat
         5         803          Baker's yeast (Saccharomyces cerevisiae)
         6         519          Bovine
         7         443          Fruit fly (Drosophila melanogaster)
         8         392          Chicken
         9         347          Bacillus subtilis
        10         279          African clawed frog (Xenopus laevis)
        11         271          Rabbit
        12         253          Pig
        13         251          Vaccinia virus (strain Copenhagen)
        14         218          Salmonella typhimurium
        15         193          Human cytomegalovirus (strain AD169)
        16         170          Maize
        17         167          Bacteriophage T4
        18         151          Vaccinia virus (strain WR)
        19         135          Rice
        20         125          Tobacco
        21         121          Wheat
        22         113          Pea
        23         112          Staphylococcus aureus
        24         104          Pseudomonas aeruginosa
        25         103          Slime mold (Dictyostelium discoideum)
                   103          Barley
        27         101          Sheep
        28         100          Fission yeast (Schizosaccharomyces pombe)
        29          95          Spinach
                    95          Dog
                    95          Caenorhabditis elegans
        32          94          Soybean
        33          92          Neurospora crassa
        34          90          Pseudomonas putida
        35          89          Agrobacterium tumefaciens


   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    1539             1001-1100      220
                 51- 100    2602             1101-1200      138
                101- 150    3672             1201-1300      115
                151- 200    2281             1301-1400       68
                201- 250    1932             1401-1500       57
                251- 300    1745             1501-1600       32
                301- 350    1566             1601-1700       26
                351- 400    1537             1701-1800       27
                401- 450    1177             1801-1900       31
                451- 500    1265             1901-2000       26
                501- 550     926             2001-2100       10
                551- 600     644             2101-2200       26
                601- 650     463             2201-2300       31
                651- 700     329             2301-2400       11
                701- 750     310             2401-2500       12
                751- 800     236             >2500           54
                801- 850     194
                851- 900     201
                901- 950     122
                951-1000     117

   Currently the ten largest sequences are:

                            RYNR_RABIT  5037 a.a.
                            RYNR_HUMAN  5032 a.a.
                            APB_HUMAN   4563 a.a.
                            APOA_HUMAN  4548 a.a.
                            DYHC_TRIGR  4466 a.a.
                            POLG_BVDV   3988 a.a.
                            POLG_HCVA   3898 a.a.
                            POLG_HCVB   3898 a.a.
                            TRX_DROME   3759 a.a.
                            ACVA_PENCH  3746 a.a.


                         APPENDIX B: ON-LINE EXPERTS

   B.1  List of on-line experts for PROSITE and SWISS-PROT

Field of expertise            Name               Email address
---------------------------   ------------------ ----------------------------
African swine fever virus     Yanez R.J.
Alcohol dehydrogenases        Joernvall H.
                              Persson B.
Aldehyde dehydrogenases       Joernvall H.
                              Persson B.
Alpha-crystallins/HSP-20      Leunissen J.A.M.
                              de Jong W.         u629000@hnykun11.bitnet
Alpha-2-macroglobulins        Van Leuven F.      fred@blekul13.bitnet
Apolipoproteins               Boguski M.S.
Arrestins                     Kolakowski L.F.Jr.
Bacteriophage P4 proteins     Halling C.
Beta-lactamases               Brannigan J.
Chitinases                    Henrissat B.       cermav@frgren81.bitnet
Clusterin                     Peitsch M.C.
CTF/NF-I                      Mermod N.
Cytochromes P450              Holsztynska E.J.   ela@netcom.uucp
DEAD-box helicases            Linder P.
EF-hand calcium-binding       Cox J.A. 
                              Kretsinger R.H.    rhk5i@virginia.bitnet
Enoyl-CoA hydratase           Hofmann K.O.
fruR/lacI family HTH proteins Reizer J.
GATA-type zinc-fingers        Boguski M.S.
Glucanases                    Henrissat B.       cermav@frgren81.bitnet
                              Beguin P.          phycel@pasteur.bitnet
G-protein coupled receptors   Chollet A.
                              Attwood T.K.
GTPase-activating proteins    Boguski M.S.
HMG1/2 and HMG-14/17          Landsman D.
Inorganic pyrophosphatases    Kolakowski L.F.Jr.
Integrases                    Roy P.H.           2020000@lavalvx1.bitnet
Lipocalins                    Boguski M.S.
                              Peitsch M.C.
MAC components / perforin     Peitsch M.C.
Myelin proteolipid protein    Hofmann K.O.
PEP requiring enzymes         Reizer J.
Phytochromes                  Partis M.D.
Prokaryotic carbohydrate      Reizer J.
Protein kinases               Hanks S.           hanks@vuctrvax.bitnet
                              Hunter T.          hunter@salk.bitnet
PTS proteins                  Reizer J.
Restriction-modification      Bickle T.
            enzymes           Roberts R.J.


Ribosomal protein S3          Hallick R.
Ribosomal protein S15         Ellis S.R.         srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases    Harayama S.
Sodium symporters             Reizer J.
Subtilases                    Brannigan J.
Thiol proteases               Turk B.  
Thiol proteases inhibitors    Turk B.  
TPR repeats                   Boguski M.S.
Transit peptides              von Heijne G.
Type-II membrane antigens     Levy S.  
Uracil-DNA glycosylase        Aasland R.
Xylose isomerase              Jenkins J.
WAP-type domain               Claverie J.-M.

   B.2  Requirements to fulfill to become an on-line expert

   An expert should be a scientist working with specific family(ies) of
   proteins (or specific domains) who would:

   a) Review the  protein sequences in SWISS-PROT and the patterns/matrices
      in PROSITE relevant to their field of research.
   b) Agree to  be contacted  by people  that have obtained new sequence(s)
      which seem to belong to "their" family(ies) of proteins.
   c) Have access  to electronic  mail and be willing to use it to send and
      receive data.

   If you are willing to be part of this scheme please contact Amos Bairoch
   at one of the following electronic mail addresses:




   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

                        *********************** <----- * EPD [Euk. Promot.] *
                        *  EMBL Nucleotide    * -----> **********************
                        *  Sequence Data      *
***************** ----> *  Library            *        **********************
* FLYBASE       * <---- *********************** <----- * ECD [E. coli map]  *
* [Drosophila   *                ^  |       ^          **********************
* genetic maps] * --------+      |  |       |
***************** <-----+ |      |  |       +--------- **********************
                        | |      |  |       +--------- * TFD [Trans. fact.] *
                        | |      |  |       | +------> **********************
                        | |      |  |       | |
*****************       | v      |  v       v |        **********************
* REBASE        *       ***********************        * ENZYME [Nomencl.]  *
* [Restriction  * <---- *  SWISS-PROT         * <----- **********************
*  enzymes]     *       *  Protein Sequence   *            |
*****************       *  Data Bank          *            v
                        ***********************        **********************
*****************         | ^  |  ^ |  ^ |  |          * OMIM   [Diseases]  *
* PROSITE       * <-------+ |  |  | |  | |  +--------> **********************
* [Patterns]    * ----------+  |  | |  | |
*****************              |  | |  | +-----------> **********************
             |                 |  | |  +-------------- * E. coli 2D gels    *
             |                 |  | |                  **********************
             |                 |  | |
             |                 |  | +----------------> **********************
             |                 |  +------------------- * EcoGene/EcoSeq     *
             |                 v                       **********************
             |          ***********************
             +--------> * PDB [3D structures] *