Swiss-Prot release 16.0

Published November 1, 1990



                    SWISS-PROT RELEASE 16.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

Release 16.0 of SWISS-PROT contains 18364 sequence entries, comprising 5'986'949
amino acids abstracted from 17763 references.  This represents an increase of 9%
over release 15.  The recent growth of the data bank is summarized below:

   Release    Date   Number of entries     Nb of amino acids

   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
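
As a rough cross-check of the growth quoted above, the increase from release
15.0 to release 16.0 can be recomputed from the last two rows of the table.
The Python sketch below is purely illustrative; the names `releases` and
`percent_increase` are ours and are not part of any SWISS-PROT software.

```python
# Figures copied from the last two rows of the growth table above.
releases = {
    "15.0": {"entries": 16941, "amino_acids": 5486399},
    "16.0": {"entries": 18364, "amino_acids": 5986949},
}


def percent_increase(old, new):
    """Return the relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


entry_growth = percent_increase(
    releases["15.0"]["entries"], releases["16.0"]["entries"]
)
residue_growth = percent_increase(
    releases["15.0"]["amino_acids"], releases["16.0"]["amino_acids"]
)
print(f"entries: +{entry_growth:.1f}%  amino acids: +{residue_growth:.1f}%")
```

With these figures the growth is about 8.4% in entries and 9.1% in amino
acids, so the 9% quoted above appears to refer to the amino acid count.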


More than 1400 sequences have been added since release 15, the sequence data of
271 existing entries has been updated, and the annotations of 3500 entries have
been revised.  In particular, we have used review articles to update the
annotations of the following groups or families of proteins:

   -  Alpha and beta adrenergic receptors
   -  Arrestins
   -  Chromogranins / secretogranins
   -  CTF/NF-I family
   -  ClpP proteases
   -  ets family
   -  GABA(A) receptors
   -  Gram-positive cocci surface proteins
   -  Hexokinases
   -  Integrins alpha and beta chains
   -  NMePhe pili proteins
   -  p53 proteins
   -  Poly(ADP-ribose) polymerase
   -  Profilins
   -  S-Adenosylmethionine synthetases
   -  Site-specific recombinases
   -  Synaptobrevins
   -  Type-II membrane antigens
   -  UDP-glucuronosyl transferases
   -  Uteroglobin family
   -  LBP / BPI / CETP family



2  DATA SOURCES

Release 16.0 has been updated using protein sequence data from release  25.0  of
the  PIR  (Protein  Identification  Resource)  protein  data  bank,  as  well as
translation of nucleotide sequence data from release 24.0 of the EMBL Nucleotide
Sequence Data Library.

As an indication of the source of the sequence data in the SWISS-PROT data bank,
we list here the statistics concerning the DR (Databank Reference) pointer
lines:

   Entries with pointer(s) to PIR entries only:              3335
   Entries with pointer(s) to EMBL entries only:             5468
   Entries with pointer(s) to both EMBL and PIR entries:     8908
   Entries with no pointer lines:                             653



3  CHANGES AT THIS RELEASE

3.1  Cross-References To MIM

We have finished adding cross-references to all human protein  sequence  entries
which are represented in the latest edition of the MIM (Mendelian Inheritance in
Man) book [1].

There are currently 842 SWISS-PROT entries that have cross-references to one or
more MIM catalog numbers.

A new document file, called MIMTOSP.DOC, is provided with SWISS-PROT.  It is a
sorted list of the MIM catalog entries cross-referenced in SWISS-PROT and the
corresponding protein sequence entry names.



4  FORTHCOMING CHANGES

We plan to implement the following changes in release 18.  These changes were
announced for release 16, but we are postponing their application to give
developers of sequence analysis software more time to update their packages.



4.1  New Linetypes:  CA and CF

As we announced in the last release notes, the enzyme entries in SWISS-PROT
will have two new line-types:

      CA   Description_of_catalytic_activity.
      CF   Description_of_cofactor.

These lines will be automatically generated at each release of SWISS-PROT from
the information stored in the ENZYME data bank.  They will replace the
'CATALYTIC ACTIVITY' and 'COFACTORS' topics of the comment (CC) lines.  For
example:

      CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =
               OXALOACETATE + L-GLUTAMATE.
      CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.

will be changed to:

      CA   L-ASPARTATE + 2-OXOGLUTARATE = OXALOACETATE + L-GLUTAMATE.
      CF   PYRIDOXAL PHOSPHATE.

--------------------------------------------------------------------------------
[1]  McKusick Victor A., Mendelian Inheritance in Man, Catalogs of autosomal
     dominant, autosomal recessive, and X-linked phenotypes, Ninth edition,
     Johns Hopkins University Press, Baltimore, (1990).
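
Software developers preparing for the CC-to-CA/CF change described in this
section may find the following Python sketch useful.  It is only an
illustration under stated assumptions: the function name `convert_cc_topics`
is hypothetical, continuation lines are assumed to carry either a CC line code
or leading blanks, and both COFACTOR and COFACTORS are accepted as topic names.

```python
# Hypothetical mapping from CC topic names to the planned line-types.
TOPIC_TO_LINETYPE = {
    "CATALYTIC ACTIVITY": "CA",
    "COFACTOR": "CF",
    "COFACTORS": "CF",
}


def convert_cc_topics(cc_lines):
    """Return CA/CF lines built from the matching CC comment topics."""
    topics = []  # [topic, text] pairs, in input order
    for line in cc_lines:
        # Drop the two-letter line code if present, then surrounding blanks.
        body = line[2:].strip() if line.startswith("CC") else line.strip()
        if body.startswith("-!-"):
            topic, _, text = body[3:].partition(":")
            topics.append([topic.strip(), text.strip()])
        elif topics and body:
            # Continuation line: append to the topic currently being read.
            topics[-1][1] += " " + body
    # Topics other than the two listed above are not part of the result.
    return [
        f"{TOPIC_TO_LINETYPE[topic]}   {text}"
        for topic, text in topics
        if topic in TOPIC_TO_LINETYPE
    ]


example = [
    "CC   -!- CATALYTIC ACTIVITY: L-ASPARTATE + 2-OXOGLUTARATE =",
    "CC       OXALOACETATE + L-GLUTAMATE.",
    "CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.",
]
converted = convert_cc_topics(example)
for new_line in converted:
    print(new_line)
```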



4.2  OS Line Format

We will invert the order of the information in the OS line.  Currently  we  have
"English  common  name  (Latin  name)";  we  will switch to "Latin name (English
common name)".  For example:

     OS   HUMAN (HOMO SAPIENS).

will be changed to:

     OS   HOMO SAPIENS (HUMAN).
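
A program that needs to accept both OS line orders during the transition could
normalise old-style lines along the following lines.  This Python sketch is an
assumption-laden illustration: `invert_os_line` is a hypothetical name, and
the pattern only handles a single line with exactly one parenthesised name.

```python
import re

# "OS   COMMON NAME (LATIN NAME)." with exactly one parenthesised part.
OS_PATTERN = re.compile(r"^OS   (?P<common>[^()]+?) \((?P<latin>[^()]+)\)\.$")


def invert_os_line(line):
    """Swap the common and Latin names of a simple OS line."""
    match = OS_PATTERN.match(line)
    if match is None:
        # Leave anything that does not fit the simple pattern untouched.
        return line
    return f"OS   {match.group('latin')} ({match.group('common')})."


print(invert_os_line("OS   HUMAN (HOMO SAPIENS)."))
```

Lines with nested or multiple parentheses are returned unchanged rather than
guessed at, which seems the safer choice for a conversion tool.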



5  ENZYME AND PROSITE DATABASES

Release 3.0 of the ENZYME data bank is distributed along with release 16 of
SWISS-PROT.  ENZYME release 3.0 contains information on 3071 enzyme entries.
The data bank is complete and up to date.  Until new enzyme nomenclature data
is published, we plan only to update the SWISS-PROT pointers at each release
of the protein sequence data bank, correct any errors, and complete the
information concerning synonyms and cofactors using the literature.

Release 6.0 of the PROSITE data bank is distributed along with release 16 of
SWISS-PROT.  PROSITE release 6 contains 375 documentation chapters that describe
433 different patterns.  Since release 5.1, 77 new chapters have been added and
131 have been updated.



6  DISTRIBUTION MEDIA

Data is available on magnetic tape, TK50 cassette and CD-ROM.  This  section  of
the  release  notes  applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file  organisation  used
on CD.



6.1  Tape Formats

The distribution tapes are 9-track industry standard magnetic tapes.  Each  file
consists  of  fixed-length  80  byte  records,  padded  with  trailing blanks as
appropriate (except for VMS Backup  format  tapes  which  have  variable  length
records).   Tape  format details (density, blocksize, label type, character set)
are attached to each tape.

In many formats, a release requires more than one  tape  volume.   In  order  to
support  sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.

VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS15.BCK.



6.2  Documentation

The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable.  As with all other tape files they have a fixed
record length of 80 bytes.  The page length of 63 lines per page was  chosen  so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.

Page throws are indicated by lines with  the  six  character  string  <PAGE>  in
positions  1-6,  and  nothing  else.  If you wish to print any of these files we
suggest you copy them down onto disk, use your local  editor  to  replace  every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.
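
For users who prefer a script to an editor, the replacement described above
can be sketched in Python as follows.  The function name
`page_markers_to_formfeeds` is ours, and the sketch assumes the file has
already been copied to disk and read into a string.

```python
def page_markers_to_formfeeds(text):
    """Replace each line consisting solely of <PAGE> by a form feed."""
    out = []
    for line in text.split("\n"):
        # Tape records are blank-padded to 80 bytes; ignore trailing blanks.
        stripped = line.rstrip(" ")
        if stripped == "<PAGE>":
            out.append("\f")
        else:
            out.append(stripped)
    return "\n".join(out)


sample = "Page one text\n<PAGE>" + " " * 74 + "\nPage two text\n"
printable = page_markers_to_formfeeds(sample)
print(repr(printable))
```

Only lines that contain <PAGE> in columns 1-6 and nothing else are replaced;
the string <PAGE> appearing inside running text is left alone.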



6.3  Release 16 Files

The distribution tape(s) contain the files shown below,  in  the  order  listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.

   File Number    File Name       Description                      #Records
   -----------    ------------    -----------------------------    --------
             1    CONTENTS.DOC    Tape Contents (this table)             64
             2    SWISSPRT.USR    User Manual                          1830
             3    RELNOTES.DOC    Release Notes (this document)         895
             4    SPECIES.NDX     Species Index                       10297
             5    KEYWORD.NDX     Keyword Index                       20291
             6    AUTHOR.NDX      Author Index                        88407
             7    SHORTDIR.NDX    Short Directory Index               38209
             8    ACNUMBER.NDX    Accession Number Index              18543
             9    CITATION.NDX    Citation Index                      36092
            10    SPIDCODE.NDX    Species ID Code Index                3285
            11    EMBLTOSP.DOC    EMBL/SWISS-PROT Xreferences         17022
            12    ORGCODES.DOC    Organism Code List                   3518
            13    MIMTOSP.DOC     MIM/SWISS-PROT Xreferences           1022
            14    PDBTOSP.DOC     PDB/SWISS-PROT Xreferences            574
            15    JOURLIST.DOC    Journal Abbreviation List            1470
            16    DATASUB.TXT     Data Submission Form                  315
            17    SEQ.DAT         SWISS-PROT Sequence Entries        602162
            18    PROSITE.USR     PROSITE Database User Manual          915
            19    PROSITE.LIS     PROSITE Entry List                    508
            20    PROSITE.DOC     PROSITE Entry Documentation         15669
            21    PROSITE.DAT     PROSITE Database Entries             8229
            22    ENZYME.USR      ENZYME Database User Manual           487
            23    ENZYME.DAT      ENZYME Database                     20871


7  INDEX FILE FORMATS

The index key of each index file (keywords, authors, citations, etc.) is  sorted
alphabetically;  the  names  of  all entries containing the index key are listed
alphabetically after the key.  Each entry name is  accompanied  by  its  primary
accession number.

Except for the short directory, accession number and species  id  code  indices,
all  index  files have the same layout:  each value of the index key begins on a
new line in column 1, and the associated entry names begin  on  the  next  line.
Lines containing entry names are in fixed format, laid out as follows:

                     Columns   Description
                     -------   ---------------------------
                     14-23     entry name (left-justified)
                     29-34     primary accession number

                     36-45     entry name (left-justified)
                     51-56     primary accession number

                     58-67     entry name (left-justified)
                     73-78     primary accession number


Up to three entry names fit on each such line; if a given  index  key  has  more
than  three  entries associated with it, additional lines are used (with exactly
the same layout).  This index file format is  identical  to  that  of  the  EMBL
Nucleotide Sequence database.
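
The common layout described above can be read with a few lines of Python.
This is a minimal sketch, not a supported parser: `parse_index` is a
hypothetical name, and the sample lines are taken from the species index
excerpt in section 7.1.

```python
def parse_index(lines):
    """Map each index key to a list of (entry_name, accession) pairs."""
    index = {}
    key = None
    for line in lines:
        if not line.strip():
            continue
        if line[0] != " ":
            # A new index key starts in column 1.
            key = line.rstrip()
            index[key] = []
        elif key is not None:
            padded = line.ljust(80)
            # Entry names start in columns 14, 36 and 58 (0-based 13, 35, 57);
            # each accession number starts 15 columns after its entry name.
            for start in (13, 35, 57):
                name = padded[start:start + 10].strip()
                accession = padded[start + 15:start + 21].strip()
                if name:
                    index[key].append((name, accession))
    return index


sample = [
    "CHIMPANZEE (PAN SATYRUS)",
    " " * 13 + "HBA3$PANSA     P01935",
    "CHIMPANZEE (PAN TROGLODYTES)",
    " " * 13 + "CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749",
]
parsed = parse_index(sample)
for index_key, entries in parsed.items():
    print(index_key, entries)
```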



7.1  Species Index

This file lists all species which appear in the database.  It is sorted
alphabetically on (English) common name.  The Latin genus and species are
listed if present in the database entries.  Mitochondrion and chloroplast
sequences appear under separate index keys, immediately after the related
nuclear sequences.  An excerpt from the species index file is given below (the
ruler is presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM)
             PLAS$SOLCR     P00297
CHIMPANZEE IMMUNODEFICIENCY VIRUS (CIV) (SIV(CPZ))
             ENV$SIVCZ      P17281 GAG$SIVCZ      P17282 NEF$SIVCZ      P17664
             POL$SIVCZ      P17283 REV$SIVCZ      P17280 TAT$SIVCZ      P17285
             VIF$SIVCZ      P17284 VPR$SIVCZ      P17287 VPU$SIVCZ      P17286
CHIMPANZEE (PAN SATYRUS)
             HBA3$PANSA     P01935
CHIMPANZEE (PAN TROGLODYTES)
             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749
             HA1C$PANTR     P16209 HA1D$PANTR     P16210 HA1E$PANTR     P16215
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.2  Keyword Index

This file lists all keywords which appear in the database (on the KW lines).  It
is  sorted alphabetically on keyword.  An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
             CXA1$CONGE     P01519 CXA1$CONMA     P01521 CXA1$CONST     P15471
             CXA2$CONGE     P01520
ACIDIC PROTEIN
             143E$BOVIN     P11576 B23$RAT        P13084 B231$HUMAN     P08693
             B232$HUMAN     P06748 BAT$HALHA      P13260 CALQ$CANFA     P12637
             CALQ$RABIT     P07221 CENB$HUMAN     P07199 CMGA$HUMAN     P10645
             CMGA$RAT       P10354 GRPE$ECOLI     P09372 MK16$YEAST     P10962
             NFL$BOVIN      P02548 NO38$CHICK     P16039 NU38$XENLA     P07222
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.3  Author Index

This file lists all author names which appear in citations (on the RA lines)  in
the database.  It is sorted alphabetically on name.  Names are presented as they
appear in the database entries (i.e.  as cited in  publications)  -  we  do  not
attempt  to  handle  multiple  surname spellings, or different initials, for the
same author.  An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
             PT3G$SALTY     P02908
AARDEN L.A.
             IL6$HUMAN      P05231
AARONSON R.P.
             HEMA$INCCA     P03465
AARONSON S.
             PGDS$HUMAN     P16234
AARONSON S.A.
             3ORF$EIAV1     P11305 DBL$HUMAN      P10911 ENV$EIAV1      P11306
             ENV$SMSAV      P03384 GAG$AVISN      P03342 GAG$MSVMO      P03334
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.4  Short Directory Index

This file contains summary  information  about  all  entries  in  the  database,
including  a  brief description of the entry, its sequence length, molecule type
and data management class.  The file is sorted  alphabetically  on  entry  name.
The lines are in fixed format, laid out as follows:


     Columns   Field Name          Description
     -------   ---------------     -------------------------------------------
     01-10     entry name          left-justified
     14-14     data class          s = standard
                                   u = unannotated
                                   p = preliminary
                                   r = unreviewed
     16-18     molecule type       PRT (protein)
     20-25     sequence length     right-justified
     27-80     description         left-justified


If an entry's description occupies more than 54 characters (cols 27-80), it is
continued onto one or more continuation lines.  Continuation lines contain
description text (left-justified) in cols 27-80; cols 01-26 are blank.  An
excerpt from the short directory index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA   s PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA
10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -
                          MYCOBACTERIUM TUBERCULOSIS
10KS$HUMAN   s PRT     91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO
                          SAPIENS)
10KS$RAT     s PRT     18 10 KD SECRETORY PROTEIN (CC10) (FRAGMENT) - RAT
                          (RATTUS NORVEGICUS)
110K$PLAKN   s PRT    296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM
                          KNOWLESI
11KD$ADE02   s PRT     79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
                          (PROTEIN X) - ADENOVIRUS TYPE 2
11SB$CUCMA   s PRT    480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN
                          (CUCURBITA MAXIMA)
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
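
The short directory layout, including continuation lines, can be read as in
the Python sketch below.  The function name `parse_short_directory` and the
dictionary keys are our own choices for illustration; the sample lines come
from the excerpt above.

```python
def parse_short_directory(lines):
    """Return one dict per entry, joining continuation lines."""
    entries = []
    for line in lines:
        padded = line.ljust(80)
        description = padded[26:80].strip()        # columns 27-80
        if padded[:26].strip():
            entries.append({
                "name": padded[0:10].strip(),      # columns 01-10
                "data_class": padded[13],          # column 14
                "molecule_type": padded[15:18],    # columns 16-18
                "length": int(padded[19:25]),      # columns 20-25
                "description": description,
            })
        elif entries and description:
            # Columns 01-26 blank: continuation of the previous description.
            entries[-1]["description"] += " " + description
    return entries


sample = [
    "104K$THEPA   s PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA",
    "10KS$RAT     s PRT     18 10 KD SECRETORY PROTEIN (CC10) (FRAGMENT) - RAT",
    " " * 26 + "(RATTUS NORVEGICUS)",
]
directory = parse_short_directory(sample)
for entry in directory:
    print(entry["name"], entry["length"], entry["description"])
```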



7.5  Accession Number Index

This  file  lists  all  accession  numbers  in  the  database.   It  is   sorted
alphabetically  on  accession  number.  Each accession number is followed by the
name and primary accession number of every entry in which it occurs.

The lines are in fixed format, laid out exactly as in the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key.  In  the  other
index  files,  the  index  key  appears  on  a  line by itself.  This index key,
however, is short enough to fit on the same line as the  entries,  and  we  have
done this to save space.

Accession numbers which have been deleted from the database also appear in this
index, with the word DELETED (left-justified) in the entry name field.

An excerpt from the accession number index file is given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00836       ATPE$HORVU     P00836
P00837       ATPG$ECOLI     P00837
P00838       ATPG$ECOLI     P00837
P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926
P00840       ATP9$MAIZE     P00840
P00841       ATP9$YEAST     P00841
P00842       ATPL$NEUCR     P00842
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
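
Because the key shares a line with its entries, this index needs a slightly
different reader from the other index files.  The Python sketch below is one
possible approach; `parse_accession_index` is a hypothetical name, and we
assume that continuation lines, if any, leave columns 1-6 blank.

```python
def parse_accession_index(lines):
    """Map each accession number to (entry_name, primary_accession) pairs."""
    index = {}
    key = None
    for line in lines:
        padded = line.ljust(80)
        if padded[0:6].strip():
            key = padded[0:6]          # index key in columns 1-6
            index[key] = []
        if key is None:
            continue
        for start in (13, 35, 57):     # same slots as the other indices
            name = padded[start:start + 10].strip()
            primary = padded[start + 15:start + 21].strip()
            if name:
                index[key].append((name, primary))
    return index


sample = [
    "P00837       ATPG$ECOLI     P00837",
    "P00838       ATPG$ECOLI     P00837",
    "P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926",
]
accessions = parse_accession_index(sample)
# A key that differs from the primary number is a secondary accession number.
secondary = sorted(
    key
    for key, entries in accessions.items()
    if all(primary != key for _, primary in entries)
)
print(secondary)
```

In the excerpt, P00838 is listed only under ATPG$ECOLI, whose primary
accession number is P00837, so the sketch reports it as secondary.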



7.6  Citation Index

This file lists all journal citations which  appear  in  the  database.   It  is
sorted  alphabetically  on citation.  An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
             CASK$BOVIN     P02668
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
             KV3F$HUMAN     P01624
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
             HV41$MOUSE     P01811
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
             HBB$ATEGE      P02034
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
             HBB$RABIT      P02057
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.7  Species ID Code Index

This file lists all species id codes which appear in the database.  It is sorted
alphabetically  on species id code.  Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the  species  id  code.   The
sequence  entry  code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry  name  (after  the
"$").   For  example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.

Each species id code starts on a  new  line,  occupying  columns  1-5,  and  the
associated  sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ...  70, 75.  If there are more than 14  sequence  entry
codes  for  a  given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.

An excerpt from the species id code index file is  given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACHKL    CALM
ACHLY    API
ACIBA    KKA
ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPB
         TRPC TRPD TRPF TRPG
ACIFE    HGDA HGDB YHGD
ACIGL    ASPQ
ACIGU    PRT1
ACISP    CYMO
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
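
The species id code index can be expanded back into full entry names as in
the following Python sketch.  The function names are hypothetical; the column
positions follow the description above, and the sample lines come from the
excerpt.

```python
def parse_species_code_index(lines):
    """Map each species id code to its list of sequence entry codes."""
    index = {}
    species = None
    for line in lines:
        padded = line.ljust(80)
        if padded[0:5].strip():
            species = padded[0:5].strip()      # columns 1-5
            index[species] = []
        if species is None:
            continue
        # Entry codes start in columns 10, 15, ..., 75 (0-based 9, ..., 74).
        for start in range(9, 75, 5):
            entry_code = padded[start:start + 4].strip()
            if entry_code:
                index[species].append(entry_code)
    return index


def entry_names(index):
    """Rebuild entry names of the form ENTRYCODE$SPECIESCODE."""
    return [f"{code}${species}" for species, codes in index.items() for code in codes]


sample = [
    "ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPB",
    " " * 9 + "TRPC TRPD TRPF TRPG",
    "ACIFE    HGDA HGDB YHGD",
]
species_index = parse_species_code_index(sample)
print(entry_names(species_index)[:3])
```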



8  WE NEED YOUR HELP !

We welcome  any  feedback  from  our  users.   We  would  especially  appreciate
information  about  any sequences belonging to your field of expertise which are
missing from the database.  We also would like to be notified about  annotations
which  should be updated (e.g.  if the function of a protein has been clarified,
or if new post-translational information has become available).



                                   APPENDIX A

                                SOME STATISTICS



A.1  AMINO ACID COMPOSITION

Composition in percent for the complete database:

   Ala (A) 7.67   Gln (Q) 4.09   Leu (L) 9.09   Ser (S) 7.09
   Arg (R) 5.25   Glu (E) 6.29   Lys (K) 5.85   Thr (T) 5.86
   Asn (N) 4.42   Gly (G) 7.14   Met (M) 2.31   Trp (W) 1.31
   Asp (D) 5.23   His (H) 2.27   Phe (F) 3.95   Tyr (Y) 3.20
   Cys (C) 1.83   Ile (I) 5.40   Pro (P) 5.11   Val (V) 6.48
   Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.03

Classification of the amino acids by their frequency:

   Leu, Ala, Gly, Ser, Val, Glu, Thr, Lys, Ile, Arg, Asp, Pro, Asn, Gln,
   Phe, Tyr, Met, His, Cys, Trp
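
The frequency classification above follows directly from the composition
table, as the short Python sketch below illustrates.  The dictionary simply
restates the percentages given for the twenty standard amino acids; the
ambiguity codes B, Z and X are left out.

```python
# Percentages copied from the composition table above.
composition = {
    "Ala": 7.67, "Gln": 4.09, "Leu": 9.09, "Ser": 7.09,
    "Arg": 5.25, "Glu": 6.29, "Lys": 5.85, "Thr": 5.86,
    "Asn": 4.42, "Gly": 7.14, "Met": 2.31, "Trp": 1.31,
    "Asp": 5.23, "His": 2.27, "Phe": 3.95, "Tyr": 3.20,
    "Cys": 1.83, "Ile": 5.40, "Pro": 5.11, "Val": 6.48,
}

# Sort residue names from most to least frequent.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```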



A.2  DISTRIBUTION OF SEQUENCES BY SPECIES

Total number of species represented in this release of the database:  2492.

        Species represented 1x: 1097
                            2x:  459
                            3x:  238
                            4x:  153
                            5x:  114
                            6x:   86
                            7x:   52
                            8x:   37
                            9x:   51
                           10x:   17
                       11- 20x:   97
                       21-100x:   72
                         >100x:   19
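
As a consistency check, the classes above should add up to the total number
of species.  The Python sketch below restates the table (the labels are ours)
and confirms that the counts sum to 2492.

```python
# Number of species per representation class, copied from the table above.
species_by_count = {
    "1": 1097, "2": 459, "3": 238, "4": 153, "5": 114,
    "6": 86, "7": 52, "8": 37, "9": 51, "10": 17,
    "11-20": 97, "21-100": 72, ">100": 19,
}

total_species = sum(species_by_count.values())
single_share = 100.0 * species_by_count["1"] / total_species
print(total_species, f"{single_share:.1f}% represented by a single sequence")
```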


Table of the most common species:

    Number   Frequency          Species
         1        1550          Human
         2        1326          Escherichia coli
         3         886          Mouse
         4         791          Rat
         5         591          Baker's yeast (Saccharomyces cerevisiae)
         6         422          Bovine
         7         342          Fruit fly (Drosophila melanogaster)
         8         311          Chicken
         9         229          Rabbit
        10         226          Bacillus subtilis
        11         220          African clawed frog (Xenopus laevis)
        12         205          Pig
        13         189          Human cytomegalovirus (strain AD169)
        14         168          Salmonella typhimurium
        15         154          Bacteriophage T4
        16         133          Maize
        17         118          Rice
        18         108          Tobacco
        19         105          Vaccinia virus
        20          95          Wheat
        21          94          Pea
        22          88          Staphylococcus aureus
        23          86          Slime mold (Dictyostelium discoideum)
        24          84          Liverwort (Marchantia polymorpha)
        25          83          Sheep
        26          81          Spinach
        27          80          Barley
                    80          Soybean
        29          70          Herpes simplex virus type 1 (strain 17)
                    70          Fission yeast (Schizosaccharomyces pombe)



A.3  DISTRIBUTION OF SEQUENCES BY LENGTH


               From   To  Number             From   To   Number
                  1-  50    1174             1001-1100      162
                 51- 100    2099             1101-1200      105
                101- 150    3021             1201-1300       88
                151- 200    1820             1301-1400       52
                201- 250    1468             1401-1500       45
                251- 300    1316             1501-1600       20
                301- 350    1147             1601-1700       22
                351- 400    1131             1701-1800       20
                401- 450     867             1801-1900       22
                451- 500     959             1901-2000       21
                501- 550     718             2001-2100        9
                551- 600     475             2101-2200       22
                601- 650     336             2201-2300       24
                651- 700     258             2301-2400       11
                701- 750     243             2401-2500       11
                751- 800     183             >2500           37
                801- 850     144
                851- 900     150
                901- 950      95
                951-1000      89
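
The length classes above can be checked against the number of entries in this
release.  The Python sketch below restates the table as (from, to, number)
tuples, with `None` standing for the open-ended last class; the representation
is our own.

```python
# (from, to, number) copied from the table above; None marks ">2500".
length_bins = [
    (1, 50, 1174), (51, 100, 2099), (101, 150, 3021), (151, 200, 1820),
    (201, 250, 1468), (251, 300, 1316), (301, 350, 1147), (351, 400, 1131),
    (401, 450, 867), (451, 500, 959), (501, 550, 718), (551, 600, 475),
    (601, 650, 336), (651, 700, 258), (701, 750, 243), (751, 800, 183),
    (801, 850, 144), (851, 900, 150), (901, 950, 95), (951, 1000, 89),
    (1001, 1100, 162), (1101, 1200, 105), (1201, 1300, 88), (1301, 1400, 52),
    (1401, 1500, 45), (1501, 1600, 20), (1601, 1700, 22), (1701, 1800, 20),
    (1801, 1900, 22), (1901, 2000, 21), (2001, 2100, 9), (2101, 2200, 22),
    (2201, 2300, 24), (2301, 2400, 11), (2401, 2500, 11), (2501, None, 37),
]

total_sequences = sum(number for _, _, number in length_bins)
up_to_500 = sum(
    number for _, high, number in length_bins if high is not None and high <= 500
)
print(total_sequences, f"{100.0 * up_to_500 / total_sequences:.1f}% are <= 500 a.a.")
```

The classes sum to 18364, the number of entries in release 16.0.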


Currently the five largest sequences are:

                            RYNR$RABIT  5037 a.a.
                            APB$HUMAN   4563 a.a.
                            APOA$HUMAN  4548 a.a.
                            DMD$HUMAN   3685 a.a.
                            DMD$CHICK   3660 a.a.