Skip Header

You are using a version of browser that may not display all the features of this website. Please consider upgrading your browser.

Swiss-Prot release 15.0

Published August 1, 1990



                    SWISS-PROT RELEASE 15.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

Release 15.0 of SWISS-PROT contains 16941 sequence entries, comprising 5'486'399
amino  acids  abstracted  from 16223 references.  This represents an increase of
12% over release 14.  The recent growth of the data bank is summarized below:

   Release    Date   Number of entries     Nb of amino acids
 
   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399


Almost 1600 sequences have been added since release 14, the sequence data of 159
existing  entries has been updated and the annotations of 2660 entries have been
revised.  In particular we have used reviews articles to update the  annotations
of the following groups or families of proteins:

   -  43 Kd postsynaptic proteins
   -  Alanine racemases
   -  Antifreeze proteins
   -  Aplysia neuropeptides
   -  Biopterin-dependent aromatic amino acid hydroxylases
   -  Cell shape bacterial proteins
   -  Connexins
   -  Cytomegaloviruses proteins
   -  Disintegrins
   -  Eryf1-type proteins
   -  Flagellin and other flagellar proteins
   -  Fucosidases
   -  HMG proteins
   -  Inorganic pyrophosphatases
   -  Kinesin heavy chain family proteins
   -  Methylated-DNA--protein-cysteine methyltransferases
   -  Neuromodulins (GAP-43)
   -  Pectate lyases
   -  Peroxisomal proteins
   -  Potexviruses sequences
   -  PTS proteins
   -  Pyruvoyl-dependent enzymes
   -  Ribonucleotide reductases small subunit
   -  SRF-type transcription factors
   -  Synapsins
   -  TonB-dependent receptor proteins
   -  TPR repeat proteins
   -  Transcription factor TFIID
   -  Tyrosine specific protein phosphatases
   -  Uricases



2  DATA SOURCES

Release 15.0 has been updated using protein sequence data from release  24.0  of
the  PIR  (Protein  Identification  Resource)  protein  data  bank,  as  well as
translation of nucleotide sequence data from release 23.0 of the EMBL Nucleotide
Sequence Data Library.

As an indication to the source of the sequence data in the SWISS-PROT data  bank
we  list  here  the  statistics  concerning  the DR (Databank Reference) pointer
lines:

   Entries with pointer(s) to only PIR entri(es):            2868
   Entries with pointer(s) to only EMBL entri(es):           8049
   Entries with pointer(s) to both EMBL and PIR entri(es):   5083
   Entries with no pointers lines (entered in house):         941



3  CHANGES AT THIS RELEASE

3.1  DR Line Format

The DR line format has been extended  to  accept  cross-references  to  MIM  and
REBASE.   MIM  (Mendelian Inheritance in Man) is a data bank that holds clinical
data on a range of human genetic disease (1); REBASE is a data bank  that  holds
data concerning type 2 restriction enzymes (2).

For a MIM reference the primary identifier is the catalog number of the  disease
(or  phenotype)  and  the  secondary identifier is the last revision of the data
bank that we have used to derive the cross- reference.

For a REBASE reference  the  primary  identifier  is  the  entry  name  and  the
secondary  identifier  is the latest release number of the REBASE data bank that
we have used to derive the cross-reference.

Examples:
 
      DR   MIM; 24990; EIGHTH EDITION.
      DR   REBASE; BSURI; RELEASE 9007.


(1)  McKusick Victor A., Mendelian Inheritance in Man, Catalogs of autosomal
     dominant, autosomal recessive, and X-linked phenotypes, Eighth edition,
     Johns Hopkins University Press, Baltimore, (1988).
 
(2)  Roberts Rich, Cold Spring Harbor Laboratory, Box 100, Cold Spring Harbor,
     NY 11724, USA.


4  FORTHCOMING CHANGES

We plan to implement the following changes in release 16:


4.1  New Linetypes:  CA and CF

As we announced in the last release notes, the enzyme  entries  in  SWISS-  PROT
will have two new line-types:

      CA   Description_of_catalytic_activity.
      CF   Description_of_cofactor.

These lines will be automatically generated at each release of SWISS- PROT  from
the  information  stored  in  the  ENZYME  data  bank.   They  will  replace the
'CATALYTIC ACTIVITY` and 'COFACTORS` comment lines (CC) topics.  For example:

      CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =
               OXALOACETATE + L-GLUTAMATE.
      CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.

will be changed to:
 
      CA   L-ASPARTATE + 2-OXOGLUTARATE = OXALOACETATE + L-GLUTAMATE.
      CF   PYRIDOXAL PHOSPHATE.



4.2  OS Line Format

We will invert the order of the information in the OS line.  Currently  we  have
"English  common  name  (Latin  name)";  we  will switch to "Latin name (English
common name)".  For example:

     OS   HUMAN (HOMO SAPIENS).

will be changed to:

     OS   HOMO SAPIENS (HUMAN).



5  ENZYME AND PROASITE DATABASES

Release 2.0 of the ENZYME data bank is distributed  along  with  release  15  of
SWISS-PROT.   ENZYME  release  2.0  contains information relative to 3071 enzyme
entries.  As it is the case for SWISS-PROT, cross-references were added to  MIM.
See  the  User's  manual  of  ENZYME for a complete description of the syntax of
these cross-references.

Release 5.1 of the PROSITE data bank is distributed along  with  release  15  of
SWISS-PROT.   PROSITE  release  5.1 does not really represent a new release; the
only changes between release 5.0 and 5.1 are corrections of  two  format  errors
and  updating  of  the  pointers  to the SWISS-PROT entries whose name have been
modified between release 14 and 15.  The next release of PROSITE (6.0)  will  be
distributed with release 16.0 of SWISS- PROT.



6  DISTRIBUTION MEDIA

Data is available on magnetic tape, TK50 cassette and CD-ROM.  This  section  of
the  release  notes  applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file  organisation  used
on CD.



6.1  Tape Formats

The distribution tapes are 9-track industry standard magnetic tapes.  Each  file
consists  of  fixed-length  80  byte  records,  padded  with  trailing blanks as
appropriate (except for VMS Backup  format  tapes  which  have  variable  length
records).   Tape  format details (density, blocksize, label type, character set)
are attached to each tape.

In many formats, a release requires more than one  tape  volume.   In  order  to
support  sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.

VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS15.BCK.



6.2  Documentation

The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable.  As with all other tape files they have a fixed
record length of 80 bytes.  The page length of 63 lines per page was  chosen  so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.

Page throws are indicated by lines with  the  six  character  string  <PAGE>  in
positions  1-6,  and  nothing  else.  If you wish to print any of these files we
suggest you copy them down onto disk, use your local  editor  to  replace  every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.



6.3  Release 15 Files

The distribution tape(s) contain the files shown below,  in  the  order  listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.


   File Number    File Name       Description                      #Records
   -----------    ------------    -----------------------------    --------
             1    CONTENTS.DOC    Tape Contents (this table)             63
             2    SWISSPRT.USR    User Manual                          1861
             3    RELNOTES.DOC    Release Notes (this document)         934
             4    SPECIES.NDX     Species Index                        9588
             5    KEYWORD.NDX     Keyword Index                       18482
             6    AUTHOR.NDX      Author Index                        81319
             7    SHORTDIR.NDX    Short Directory Index               34794
             8    ACNUMBER.NDX    Accession Number Index              17088
             9    CITATION.NDX    Citation Index                      32937
            10    SPIDCODE.NDX    Species ID Code Index                3091
            11    EMBLTOSP.DOC    EMBL/SWISS-PROT Xreferences         15057
            12    ORGCODES.DOC    Organism Code List                   3223
            13    PDBTOSP.DOC     PDB/SWISS-PROT Xreferences            567
            14    JOURLIST.DOC    Journal Abbreviation List            1383
            15    DATASUB.TXT     Data Submission Form                  312
            16    SEQ.DAT         SWISS-PROT Sequence Entries        542448
            17    PROSITE.USR     PROSITE Database User Manual          915
            18    PROSITE.LIS     PROSITE Entry List                    424
            19    PROSITE.DOC     PROSITE Entry Documentation         12313
            20    PROSITE.DAT     PROSITE Database Entries             6399
            21    ENZYME.USR      ENZYME Database User Manual           487
            22    ENZYME.DAT      ENZYME Database                     17561



7  INDEX FILE FORMATS

The index key of each index file (keywords, authors, citations, etc.) is  sorted
alphabetically;  the  names  of  all entries containing the index key are listed
alphabetically after the key.  Each entry name is  accompanied  by  its  primary
accession number.

Except for the short directory, accession number and species  id  code  indices,
all  index  files have the same layout:  each value of the index key begins on a
new line in column 1, and the associated entry names begin  on  the  next  line.
Lines containing entry names are in fixed-format, layed out as follows:

                     Columns   Description
                     -------   ---------------------------
                     14-23     entry name (left-justified)
                     29-34     primary accession number
                
                     36-45     entry name (left-justified)
                     51-56     primary accession number

                     58-67     entry name (left-justified)
                     73-78     primary accession number


Up to three entry names fit on each such line; if a given  index  key  has  more
than  three  entries associated with it, additional lines are used (with exactly
the same layout).  This index file format is  identical  to  that  of  the  EMBL
Nucleotide Sequence database.


7.1  Species Index

This file lists all  species  which  appear  in  the  database.   It  is  sorted
alphabetically  on  (english)  common name.  The latin genus and species will be
listed, if present in  the  database  entries.   Mitochondrion  and  chloroplast
sequences  appear  under  separate  index  keys,  immediately  after the related
nuclear sequences.  An excerpt from the species index file is given  below  (the
ruler is presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM) 
             PLAS$SOLCR     P00297 
CHIMPANZEE IMMUNODEFICIENCY VIRUS (CIV) (SIV(CPZ)) 
             ENV$SIVCZ      P17281 GAG$SIVCZ      P17282 NEF$SIVCZ      P17664 
             POL$SIVCZ      P17283 REV$SIVCZ      P17280 TAT$SIVCZ      P17285 
             VIF$SIVCZ      P17284 VPR$SIVCZ      P17287 VPU$SIVCZ      P17286 
CHIMPANZEE (PAN SATYRUS) 
             HBA3$PANSA     P01935 
CHIMPANZEE (PAN TROGLODYTES) 
             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749 
             HA1C$PANTR     P16209 HA1D$PANTR     P16210 HA1E$PANTR     P16215 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.2  Keyword Index

This file lists all keywords which appear in the database (on the KW lines).  It
is  sorted alphabetically on keyword.  An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
             CXA1$CONGE     P01519 CXA1$CONMA     P01521 CXA1$CONST     P15471 
             CXA2$CONGE     P01520 
ACIDIC PROTEIN
             143E$BOVIN     P11576 B23$RAT        P13084 B231$HUMAN     P08693 
             B232$HUMAN     P06748 BAT$HALHA      P13260 CALQ$CANFA     P12637 
             CALQ$RABIT     P07221 CENB$HUMAN     P07199 CMGA$HUMAN     P10645 
             CMGA$RAT       P10354 GRPE$ECOLI     P09372 MK16$YEAST     P10962 
             NFL$BOVIN      P02548 NO38$CHICK     P16039 NU38$XENLA     P07222 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.3  Author Index

This file lists all author names which appear in citations (on the RA lines)  in
the database.  It is sorted alphabetically on name.  Names are presented as they
appear in the database entries (i.e.  as cited in  publications)  -  we  do  not
attempt  to  handle  multiple  surname spellings, or different initials, for the
same author.  An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
             PT3G$SALTY     P02908 
AARDEN L.A.
             IL6$HUMAN      P05231 
AARONSON R.P.
             HEMA$INCCA     P03465 
AARONSON S.
             PGDS$HUMAN     P16234 
AARONSON S.A.
             3ORF$EIAV1     P11305 DBL$HUMAN      P10911 ENV$EIAV1      P11306 
             ENV$SMSAV      P03384 GAG$AVISN      P03342 GAG$MSVMO      P03334 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.4  Short Directory Index

This file contains summary  information  about  all  entries  in  the  database,
including  a  brief description of the entry, its sequence length, molecule type
and data management class.  The file is sorted  alphabetically  on  entry  name.
The lines are fixed-format, layed out as follows:

     Columns   Field Name          Description
     -------   ---------------     -------------------------------------------
     01-10     entry name          left-justified
     14-14     data class          s = standard
                                   u = unannotated
                                   p = preliminary
                                   r = unreviewed
     16-18     molecule type       PRT (protein)
     20-25     sequence length     right-justified
     27-80     description         left-justified


If an entry's description occupies more than 54 characters (cols 27-80), it will
be  continued  onto  one or more continuation lines.  Continuation lines contain
description text (left-justified) in cols  27-80;  cols  01-26  are  blank.   An
excerpt  from  the  short  description  index  file is given below (the ruler is
presented for your convenience - it does not appear in the index file):


1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA   s PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA 
10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) - 
                          MYCOBACTERIUM TUBERCULOSIS 
10KS$HUMAN   s PRT     91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO 
                          SAPIENS) 
10KS$RAT     s PRT     18 10 KD SECRETORY PROTEIN (CC10) (FRAGMENT) - RAT 
                          (RATTUS NORVEGICUS) 
110K$PLAKN   s PRT    296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM 
                          KNOWLESI 
11KD$ADE02   s PRT     79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
                          (PROTEIN X) - ADENOVIRUS TYPE 2 
11SB$CUCMA   s PRT    480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN 
                          (CUCURBITA MAXIMA) 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.5  Accession Number Index

This  file  lists  all  accession  numbers  in  the  database.   It  is   sorted
alphabetically  on  accession  number.  Each accession number is followed by the
name and primary accession number of every entry in which it occurs.

The lines are fixed-format, layed out exactly the same as the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key.  In  the  other
index  files,  the  index  key  appears  on  a  line by itself.  This index key,
however, is short enough to fit on the same line as the  entries,  and  we  have
done this to save space.

Accession numbers which have been deleted from the database also appear in  this
index, containing the word DELETED (left-justified) in the entry name field.

An excerpt from the accession number index file is given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00836       ATPE$HORVU     P00836 
P00837       ATPG$ECOLI     P00837 
P00838       ATPG$ECOLI     P00837 
P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926 
P00840       ATP9$MAIZE     P00840 
P00841       ATP9$YEAST     P00841 
P00842       ATPL$NEUCR     P00842 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80


7.6  Citation Index

This file lists all journal citations which  appear  in  the  database.   It  is
sorted  alphabetically  on citation.  An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
             CASK$BOVIN     P02668 
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
             KV3F$HUMAN     P01624 
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
             HV41$MOUSE     P01811 
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
             HBB$ATEGE      P02034 
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
             HBB$RABIT      P02057 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80


7.7  Species ID Code Index

This file lists all species id codes which appear in the database.  It is sorted
alphabetically  on species id code.  Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the  species  id  code.   The
sequence  entry  code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry  name  (after  the
"$").   For  example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.

Each species id code starts on a  new  line,  occupying  columns  1-5,  and  the
associated  sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ...  70, 75.  If there are more than 14  sequence  entry
codes  for  a  given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.

An excerpt from the species id code index file is  given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACHKL    CALM 
ACHLY    API  
ACIBA    KKA  
ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPB 
         TRPC TRPD TRPF TRPG 
ACIFE    HGDA HGDB YHGD 
ACIGL    ASPQ 
ACIGU    PRT1 
ACISP    CYMO 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80


8  WE NEED YOUR HELP !

We welcome  any  feedback  from  our  users.   We  would  especially  appreciate
information  about  any sequences belonging to your field of expertise which are
missing from the database.  We also would like to be notified about  annotations
which  should be updated (e.g.  if the function of a protein has been clarified,
or if new post-translational information has become available).


                                   APPENDIX A

                                SOME STATISTICS



A.1  AMINO ACID COMPOSITION

Composition in percent for the complete database:

   Ala (A) 7.70   Gln (Q) 4.09   Leu (L) 9.09   Ser (S) 7.07
   Arg (R) 5.25   Glu (E) 6.28   Lys (K) 5.84   Thr (T) 5.86
   Asn (N) 4.42   Gly (G) 7.15   Met (M) 2.30   Trp (W) 1.31
   Asp (D) 5.23   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.20
   Cys (C) 1.83   Ile (I) 5.39   Pro (P) 5.11   Val (V) 6.49
   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.04

Classification of the amino acids by their frequency:

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Arg, Asp, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp



A.2  DISTRIBUTION OF SEQUENCES BY SPECIES

Total number of species represented in this release of the database:  2363.

        Species represented 1x: 1051
                            2x:  435
                            3x:  237
                            4x:  135
                            5x:  108
                            6x:   84
                            7x:   46
                            8x:   31
                            9x:   48
                           10x:   20
                       11- 20x:   84
                       21-100x:   66
                         >100x:   18

Table of the most common species:

    Number   Frequency          Species

         1        1462          Human
         2        1246          Escherichia coli
         3         829          Mouse
         4         712          Rat
         5         549          Baker's yeast (Saccharomyces cerevisiae)
         6         404          Bovine
         7         297          Fruit fly (Drosophila melanogaster)
         8         267          Chicken
         9         220          Rabbit
        10         200          Bacillus subtilis
        11         191          Pig
        12         189          Human cytomegalovirus (strain AD169)
        13         159          African clawed frog (Xenopus laevis)
        14         157          Salmonella typhimurium
        15         142          Bacteriophage T4
        16         126          Maize
        17         113          Rice
        18         106          Tobacco
        19          98          Vaccinia virus
        20          92          Wheat
        21          86          Pea
        22          84          Liverwort (Marchantia polymorpha)
        23          80          Staphylococcus aureus
        24          78          Slime mold (Dictyostelium discoideum)
                    78          Spinach
        26          77          Soybean
        27          76          Barley
                    76          Sheep
        29          70          Herpes simplex virus type 1 (strain 17)
        30          67          Varicella-Zoster virus (strain Dumas)
 

A.3  DISTRIBUTION OF SEQUENCES BY LENGTH


          From   To  Number             From   To   Number

             1-  50    1079             1001-1100      145
            51- 100    1962             1101-1200       98
           101- 150    2833             1201-1300       79
           151- 200    1690             1301-1400       50
           201- 250    1345             1401-1500       41
           251- 300    1205             1501-1600       18
           301- 350    1039             1601-1700       21
           351- 400    1056             1701-1800       18
           401- 450     785             1801-1900       16
           451- 500     860             1901-2000       19
           501- 550     665             2001-2100        8
           551- 600     435             2101-2200       22
           601- 650     315             2201-2300       22
           651- 700     236             2301-2400       10
           701- 750     225             2401-2500       10
           751- 800     158             >2500           32
           801- 850     131
           851- 900     144
           901- 950      85
           951-1000      84

Currently the five largest sequences are:

     RYNR$RABIT  5037 a.a.
     APB$HUMAN   4563 a.a.
     APOA$HUMAN  4548 a.a.
     DMD$HUMAN   3685 a.a.
     DMD$CHICK   3660 a.a.