Swiss-Prot release 14.0

Published April 1, 1990



                    SWISS-PROT RELEASE 14.0 RELEASE NOTES


                               1. INTRODUCTION

   1.1  Evolution

Release 14.0 of SWISS-PROT contains 15409 sequence entries,  comprising  4914264
amino  acids  abstracted  from 15054 references.  This represents an increase of
13% in amino acids (11% in entries) over release 13.0.  The recent growth of the
database is summarized below:

     Release    Date   Number of entries Number of amino acids

     3.0        11/86               4160               969 641
     4.0        04/87               4387             1 036 010
     5.0        09/87               5205             1 327 683
     6.0        01/88               6102             1 653 982
     7.0        04/88               6821             1 885 771
     8.0        09/88               7724             2 224 465
     9.0        12/88               8702             2 498 140
     10.0       03/89              10008             2 952 613
     11.0       06/89              10856             3 265 966
     12.0       10/89              12305             3 797 482
     13.0       01/90              13837             4 347 336
     14.0       04/90              15409             4 914 264
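
The growth figures for this release can be reproduced from the last two rows  of
the  table.   The  Python sketch below is only an illustration; the variable and
function names are chosen for this example and are not part of any  distributed
software.

```python
# Figures copied from the growth table (releases 13.0 and 14.0).
entries = {"13.0": 13837, "14.0": 15409}
amino_acids = {"13.0": 4347336, "14.0": 4914264}


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


added_entries = entries["14.0"] - entries["13.0"]
entry_growth = percent_increase(entries["13.0"], entries["14.0"])
residue_growth = percent_increase(amino_acids["13.0"], amino_acids["14.0"])

print(f"entries added:     {added_entries}")
print(f"entry growth:      {entry_growth:.1f}%")
print(f"amino acid growth: {residue_growth:.1f}%")
```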

Almost 1600 sequences have been added since release 13, the sequence data of 191
existing  entries has been updated and the annotations of 2650 entries have been
revised.  In particular we have used review articles to  update the  annotations
of the following groups or families of proteins:

     2-S seed storage proteins
     3-hydroxyacyl-CoA dehydrogenases
     Acyl-CoA dehydrogenases
     Acylphosphatases
     Albumin / AFP / VDBP family
     Aldo/keto reductases
     Alkaline phosphatases
     Amino acid permeases
     Arthropod hemocyanins / insect LSPs
     Bacterial activator proteins, gntR family
     Calcitonin / CGRP / IAPP family
     Cecropin family
     Copper type II, ascorbate-dependent monooxygenases
     Cytochromes b/b6
     DNA ligases
     Enoyl-CoA hydratases
     Glutamine synthetases
     Growth factor and cytokine receptors
     Isocitrate lyases
     Lamins
     Leguminous lectins
     Nerve growth factor family
     Nitrogenase component 1 subunits
     Phosphoglycerate mutases
     Phosphoribosyl pyrophosphate synthetases
     Phosphoribosylglycinamide synthetases
     Phytochromes
     Platelet-derived growth factor (PDGF) family
     Proliferating cell nuclear antigen (PCNA)
     Protamine P1
     Rieske iron-sulfur proteins
     Rotavirus proteins
     Sulfatases
     Thiolases
     Tryptophan synthetases
     Ubiquitin carboxyl-terminal hydrolases
     Ubiquitin-conjugating enzymes
     Ureases
     Urokinases and tissue plasminogen activators
     Zinc carboxypeptidases



2  DATA SOURCES

Release 14.0 has been updated using protein sequence data from release  23.0  of
the  PIR  (Protein  Identification  Resource)  protein  data  bank,  as  well as
translation of nucleotide sequence data from release 22.0 of the EMBL Nucleotide
Sequence Database.

As an indication of the source of the sequence data in the SWISS-PROT data  bank
we  list  here  the  statistics  concerning  the DR (Databank Reference) pointer
lines:

Entries with pointer(s) to only PIR entry(ies):                 3079
Entries with pointer(s) to only EMBL entry(ies):                6896
Entries with pointer(s) to both EMBL and PIR entry(ies):        4557
Entries with no pointer lines (entered in house):                877



3  CHANGES AT THIS RELEASE

3.1  OG Line Format

The OG (OrGanelle) line format has been extended to take  into  account  protein
sequences  from  genes  originating  from  the  cyanelle  of  algae   such   as
Cyanophora paradoxa.  The valid syntax is:

     OG   CYANELLE.



3.2  DR Line Format

The DR line format has been extended to accept cross-references to PROSITE,  the
data bank  of sites and patterns in proteins which is now being distributed with
SWISS-PROT.  The primary identifier is the PROSITE  accession  number,  and  the
secondary identifier is the PROSITE entry name.  Examples:

     DR   PROSITE; PS00088; SOD_MN.
     DR   PROSITE; PS00021; KRINGLE.
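
A DR line of this form can be split into its three  fields  with  very  little
code.   The following Python sketch assumes the layout shown in the two examples
above (line code, three blanks, fields separated by semicolons and a final
period); the function and type names are hypothetical.

```python
from typing import NamedTuple


class CrossReference(NamedTuple):
    databank: str       # e.g. PROSITE
    primary_id: str     # for PROSITE: the accession number
    secondary_id: str   # for PROSITE: the entry name


def parse_dr_line(line):
    """Split a DR line into data bank, primary and secondary identifier."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line")
    body = line[5:].strip()
    if body.endswith("."):
        body = body[:-1]            # drop the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, found {len(fields)}")
    return CrossReference(*fields)


ref = parse_dr_line("DR   PROSITE; PS00088; SOD_MN.")
print(ref.databank, ref.primary_id, ref.secondary_id)
```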


3.3  New CC Line Topics

As of release 14 we have added two new topics for the  comments  (CC)  linetype:
ENZYME REGULATION, and TISSUE SPECIFICITY.  Examples of their usage:

     CC   -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED
     CC       BY ADENYLATION. THE FULLY ADENYLATED ENZYME COMPLEX IS
     CC       INACTIVE.

     CC   -!- TISSUE SPECIFICITY: KIDNEY; SUBMAXILLARY GLAND; URINE.
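
Programs that read the comment lines may want to group them by  topic.   The
Python sketch below assumes, from the examples above, that each comment block
starts with "-!-" followed by the topic and a colon, and that any CC line
without "-!-" continues the preceding block.  The function name is hypothetical.

```python
def parse_cc_lines(lines):
    """Group CC lines into (topic, text) pairs, in file order."""
    comments = []                       # list of [topic, text]
    for line in lines:
        if not line.startswith("CC   "):
            continue                    # ignore other line types
        body = line[5:].rstrip()
        if body.startswith("-!- "):
            # New comment block: the topic ends at the first colon.
            topic, _, text = body[4:].partition(":")
            comments.append([topic.strip(), text.strip()])
        elif comments:
            # Continuation line: append to the current block.
            joined = comments[-1][1] + " " + body.strip()
            comments[-1][1] = joined.strip()
    return [tuple(pair) for pair in comments]


sample = [
    "CC   -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED",
    "CC       BY ADENYLATION. THE FULLY ADENYLATED ENZYME COMPLEX IS",
    "CC       INACTIVE.",
    "CC   -!- TISSUE SPECIFICITY: KIDNEY; SUBMAXILLARY GLAND; URINE.",
]
for topic, text in parse_cc_lines(sample):
    print(topic, "->", text)
```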



3.4  Documentation Files

The file ECNUMBER.DOC, which contained an index  of  SWISS-PROT  enzyme  entries
classified  by  EC  number,  is  no longer distributed.  This information can be
found in the new database ENZYME which is now distributed with SWISS-PROT.

The JOURLIST.DOC file now includes the ISSN numbers for all  journals  cited  in
SWISS-PROT and PROSITE.

For the sake of  consistency  with  the  newly  introduced  PROSITE  and  ENZYME
databases,  the  name  of  the SWISS-PROT User Manual file has been changed from
USRMAN.DOC to SWISSPRT.USR.  All three databases  now  have  .USR  as  the  file
extension for their User Manual files.



4  NEW PROSITE DATABASE

PROSITE is a compilation of sites and patterns found in protein sequences.  This
database consists of two files:  the first file contains the patterns as well as
the results of the scan of  SWISS-PROT  for  these  patterns,  the  second  file
contains  the documentation that fully describes each pattern.  A sample pattern
entry and its corresponding documentation entry are shown below.

The use of protein sequence patterns (or motifs) to determine the function(s) of
proteins is rapidly becoming one of the essential tools  of  sequence  analysis.
PROSITE, as a stand-alone database, is pertinent for such purposes.  We also
believe that PROSITE is an important addition to SWISS-PROT,  as  it  allows the
flexible classification of proteins into families.

PROSITE is distributed with  SWISS-PROT;  for  a  complete  description  of  the
content  and  format  of this database you should refer to the User Manual (file
PROSITE.USR).


------------------------ Start of Sample PROSITE Entry -------------------------

ID   CARBOXYPEPTIDASE_SER; PATTERN.
AC   PS00131;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
DE   Serine carboxypeptidases, serine active site.
PA   G-E-S-Y-A-G.
NR   /RELEASE=14,15409;
NR   /TOTAL=7(7); /POSITIVE=7(7); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0(0);
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,active_site;
DR   P10619, PRTP$HUMAN, T; P09620, KEX1$YEAST, T; P00729, CBPY$YEAST, T;
DR   P07519, CBP1$HORVU, T; P08818, CBP2$HORVU, T; P08819, CBP2$WHEAT, T;
DR   P11515, CBPG$WHEAT, T;
DO   PDOC00122;
//

{PDOC00122}
{PS00131; CARBOXYPEPTIDASE_SER}
{BEGIN}
************************************************
* Serine carboxypeptidases, serine active site *
************************************************

All  known  carboxypeptidases  are either  metallo carboxypeptidases or serine
carboxypeptidases (EC 3.4.16.-). The catalytic activity of the serine carboxy-
peptidases, like  that  of  the  serine  proteases  of  the trypsin family, is
provided by a charge relay system involving an aspartic acid residue hydrogen-
bonded to an histidine, which itself is hydrogen-bonded to a serine.  Proteins
known or proposed to be serine carboxypeptidases are:

   - Barley serine carboxypeptidases I and II [1,2].
   - Wheat serine carboxypeptidase II [3].
   - A probable wheat serine carboxypeptidase induced by gibberellin [4].
   - Yeast carboxypeptidase Y [5], a  vacuolar protease involved in the degra-
     dation of small peptides.
   - Yeast KEX1 protease [6], which  is  involved  in  killer toxin and alpha-
     factor precursor processing.
   - Human 'protective protein' [7], a  lysosomal  protein which appears to be
     essential for both the activity of beta-galactosidase and neuraminidase.

The sequence in the  vicinity of the  active site  serine residue is perfectly
conserved in all these serine carboxypeptidases.

-Consensus pattern: G-E-S-Y-A-G
                    [S is the active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: October 1989 / Text revised.

[ 1] Sorensen S.B., Breddam K., Svendsen I.
     Carlsberg Res. Commun. 51:475-485(1986).
[ 2] Sorensen S.B., Svendsen I., Breddam K.
     Carlsberg Res. Commun. 52:285-295(1987).
[ 3] Breddam K., Sorensen S.B., Svendsen I.
     Carlsberg Res. Commun. 52:297-311(1987).
[ 4] Baulcombe D.C., Barker R.F., Jarvis M.G.
     J. Biol. Chem. 262:13726-13735(1987).
[ 5] Valls L.A., Hunter C.P., Rothman J.H., Stevens T.H.
     Cell 48:887-897(1987).
[ 6] Dmochowska A., Dignard D., Henning D., Thomas D.Y., Bussey H.
     Cell 50:573-584(1987).
[ 7] Galjart N.J., Gillemans N., Harris A., van de Horst G.T.J.,
     Verheijen F.W., Galjaard H., D'Azzo A.
     Cell 54:755-764(1988).
{END}

------------------------- End of Sample PROSITE Entry --------------------------
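
To illustrate how a pattern such as the one in the sample entry can be used,  the
Python sketch below turns the PA line value into a regular expression and reports
the position of the residue named by /SITE=3.  It handles only patterns made of
single literal residues separated by hyphens, as in this sample; the full pattern
syntax is described in PROSITE.USR and is not covered here.  The function names
and the test sequence are invented for the example.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def simple_pattern_to_regex(pa_value):
    """Compile a PA pattern made only of literal residues (e.g. G-E-S-Y-A-G.)."""
    elements = pa_value.strip().rstrip(".").split("-")
    for element in elements:
        if len(element) != 1 or element not in AMINO_ACIDS:
            raise ValueError(f"unsupported pattern element: {element!r}")
    return re.compile("".join(elements))


def find_sites(regex, sequence, site_offset):
    """Return the 1-based sequence positions of pattern element site_offset."""
    # m.start() is 0-based, so element k of a match sits at m.start() + k
    # when positions are counted from 1.  finditer skips overlapping matches.
    return [m.start() + site_offset for m in regex.finditer(sequence)]


regex = simple_pattern_to_regex("G-E-S-Y-A-G.")
toy_sequence = "MKTLGESYAGHW"       # invented sequence, not a database entry
positions = find_sites(regex, toy_sequence, site_offset=3)
for pos in positions:
    print(pos, toy_sequence[pos - 1])
```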



5  NEW ENZYME DATABASE

A new secondary database called ENZYME has been established.   It  contains  the
following  data for each type of characterized enzyme for which an EC number has
been assigned:

      o  EC number

      o  Recommended name

      o  Alternative names (if any)

      o  Catalytic activity

      o  Cofactors (if any)

      o  Pointers to the SWISS-PROT entry(ies) that correspond to the enzyme
         (if any)


We believe that the ENZYME database will  be  useful  to  anybody  working  with
enzymes  and will allow programs to be developed that can help with the creation
of new metabolic pathways.

A sample entry is shown below:

     ID   1.14.17.3
     DE   PEPTIDYLGLYCINE MONOOXYGENASE.
     AN   PEPTIDYL ALPHA-AMIDATING ENZYME.
     CA   PEPTIDYLGLYCINE + ASCORBATE + O(2) = PEPTIDYL(2-HYDROXYGLYCINE) +
     CA   DEHYDROASCORBATE + H(2)O.
     CC   THE PRODUCT IS UNSTABLE AND DISMUTATES TO GLYOXYLATE AND THE
     CC   CORRESPONDING DESGLYCINE PEPTIDE AMIDE.
     CF   COPPER.
     DR   P10731, AMD$BOVIN ;  P14925, AMD$RAT   ;  P08478, AMD1$XENLA;
     DR   P12890, AMD2$XENLA;
     //
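
An entry of this kind is straightforward to read by program.  The Python  sketch
below assumes the layout of the sample entry above (a two-letter line code, three
blanks, then the data; repeated codes continue the same field; "//" ends the
entry) and that DR lines hold "accession, entry name" pairs separated by
semicolons.  The function name is hypothetical; the authoritative format
description is in ENZYME.USR.

```python
def parse_enzyme_entry(lines):
    """Collect one ENZYME entry into a dict keyed by line code."""
    parts = {}
    for line in lines:
        if line.startswith("//"):
            break                               # end of entry
        code, text = line[:2], line[5:].strip()
        parts.setdefault(code, []).append(text)
    # Join continuation lines of the same type with a single blank.
    record = {code: " ".join(texts) for code, texts in parts.items()}
    refs = []
    for item in record.get("DR", "").split(";"):
        if item.strip():
            accession, name = (part.strip() for part in item.split(","))
            refs.append((accession, name))
    record["DR"] = refs
    return record


sample = [
    "ID   1.14.17.3",
    "DE   PEPTIDYLGLYCINE MONOOXYGENASE.",
    "AN   PEPTIDYL ALPHA-AMIDATING ENZYME.",
    "CA   PEPTIDYLGLYCINE + ASCORBATE + O(2) = PEPTIDYL(2-HYDROXYGLYCINE) +",
    "CA   DEHYDROASCORBATE + H(2)O.",
    "CF   COPPER.",
    "DR   P10731, AMD$BOVIN ;  P14925, AMD$RAT   ;  P08478, AMD1$XENLA;",
    "DR   P12890, AMD2$XENLA;",
    "//",
]
record = parse_enzyme_entry(sample)
print(record["ID"], record["CF"], record["DR"])
```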

The impact of this new database on SWISS-PROT is the following:

      o  The ECNUMBER.DOC file is now obsolete and is no longer generated.

      o  Instead of having CC (comments) lines with the topics:

         CC   -!- CATALYTIC ACTIVITY:  description_of_catalytic_activity.
         CC   -!- COFACTOR:  description_of_cofactor.

         the enzyme entries in SWISS-PROT will, in future releases, have two new
         linetypes:

         CA   Description_of_catalytic_activity
         CF   Description_of_cofactor

         These  lines  will  be  automatically  generated  at  each  release  of
         SWISS-PROT  from  the  information  stored in the ENZYME database.  The
         introduction of these new  linetypes  is  planned  for  release  16  of
         SWISS-PROT.


ENZYME is distributed with SWISS-PROT; for a complete description of the content
and  format  of  this  database  you  should  refer  to  the  User  Manual (file
ENZYME.USR).



6  DISTRIBUTION MEDIA

Data is available on magnetic tape, TK50 cassette and CD-ROM.  This  section  of
the  release  notes  applies to tape and TK50 cassette only; CD-ROM releases are
accompanied by their own release notes which detail the file  organisation  used
on CD.



6.1  Tape Formats

The distribution tapes are 9-track industry standard magnetic tapes.  Each  file
consists  of  fixed-length  80  byte  records,  padded  with  trailing blanks as
appropriate (except for VMS Backup  format  tapes  which  have  variable  length
records).   Tape  format details (density, blocksize, label type, character set)
are attached to each tape.
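
Once a tape file has been copied to disk as a stream of  80  byte  records,  the
trailing  blanks  can  be  removed  as  in the Python sketch below.  It assumes
that the data has already been converted to ASCII and that no record separators
are present; the actual character set and blocking are given on the sheet
attached to each tape.  The function name is hypothetical.

```python
RECORD_LENGTH = 80


def split_records(data):
    """Split bytes made of fixed-length records into stripped text lines."""
    if len(data) % RECORD_LENGTH != 0:
        raise ValueError("data length is not a multiple of 80 bytes")
    lines = []
    for offset in range(0, len(data), RECORD_LENGTH):
        record = data[offset:offset + RECORD_LENGTH]
        # Remove only the blank padding added to fill the record.
        lines.append(record.decode("ascii").rstrip(" "))
    return lines


# Two padded records built for the example.
data = b"ID   1.14.17.3".ljust(80) + b"//".ljust(80)
print(split_records(data))
```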

In many formats, a release requires more than one  tape  volume.   In  order  to
support  sequential volume serial numbers for multi-volume tape sets, the volume
labels are EMBL01 for the first tape, EMBL02 for the second tape, and so on.

VMS Backup format tapes (and all TK50 cassettes) contain the files listed below,
in the order shown, as a single save set called SWISS14.BCK.



6.2  Documentation

The documentation files on tape (those ending with a file extension of .DOC) are
designed to be easily printable.  As with all other tape files they have a fixed
record length of 80 bytes.  The page length of 63 lines per page was  chosen  so
that the pages will fit both on DIN A4 paper and on American 8-1/2" x 11" paper.
Page throws are indicated by lines with  the  six  character  string  <PAGE>  in
positions  1-6,  and  nothing  else.  If you wish to print any of these files we
suggest you copy them down onto disk, use your local  editor  to  replace  every
occurrence of <PAGE> in columns 1-6 by a formfeed (or whatever is appropriate to
force a page throw on your printer), and then print them.
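
If no suitable editor is at hand, the replacement described above can  also  be
done  by  a  short  program.   The  Python sketch below is one possible way; it
assumes the lines have been read from disk and uses a formfeed as the page throw.
The function name is hypothetical.

```python
def insert_page_throws(lines, page_break="\f"):
    """Replace lines holding only <PAGE> in columns 1-6 by a page break."""
    result = []
    for line in lines:
        # Records are blank-padded, so ignore trailing blanks only.
        if line.rstrip() == "<PAGE>":
            result.append(page_break)
        else:
            result.append(line)
    return result


sample = ["last line of page 1", "<PAGE>" + " " * 74, "first line of page 2"]
print(insert_page_throws(sample))
```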



6.3  Release 14 Files

The distribution tape(s) contain the files shown below,  in  the  order  listed.
Where more than one tape is required, subsequent volumes will continue where the
preceding volume left off.

   File Number    File Name       Description                      #Records
   -----------    ------------    -----------------------------    --------
             1    CONTENTS.DOC    Tape Contents (this table)             63
             2    SWISSPRT.USR    User Manual                          1829
             3    RELNOTES.DOC    Release Notes (this document)        1063

             4    SPECIES.NDX     Species Index                        8779
             5    KEYWORD.NDX     Keyword Index                       16830
             6    AUTHOR.NDX      Author Index                        74268
             7    SHORTDIR.NDX    Short Directory Index               31363
             8    ACNUMBER.NDX    Accession Number Index              15535
             9    CITATION.NDX    Citation Index                      30411
            10    SPIDCODE.NDX    Species ID Code Index                2842

            11    EMBLTOSP.DOC    EMBL/SWISS-PROT Xreferences         13567
            12    ORGCODES.DOC    Organism Code List                   3199
            13    PDBTOSP.DOC     PDB/SWISS-PROT Xreferences            575
            14    JOURLIST.DOC    Journal Abbreviation List            1343
            15    DATASUB.TXT     Data Submission Form                  323

            16    SEQ.DAT         SWISS-PROT Sequence Entries        490281

            17    PROSITE.USR     PROSITE Database User Manual          915
            18    PROSITE.LIS     PROSITE Entry List                    424
            19    PROSITE.DOC     PROSITE Entry Documentation         12313
            20    PROSITE.DAT     PROSITE Database Entries             6401

            21    ENZYME.USR      ENZYME Database User Manual           427
            22    ENZYME.DAT      ENZYME Database                     12008



7  INDEX FILE FORMATS

The index key of each index file (keywords, authors, citations, etc.) is  sorted
alphabetically;  the  names  of  all entries containing the index key are listed
alphabetically after the key.  Each entry name is  accompanied  by  its  primary
accession number.

Except for the short directory, accession number and species  id  code  indices,
all  index  files have the same layout:  each value of the index key begins on a
new line in column 1, and the associated entry names begin  on  the  next  line.
Lines containing entry names are in fixed-format, laid out as follows:


                     Columns   Description
                     -------   ---------------------------
                     14-23     entry name (left-justified)
                     29-34     primary accession number
                
                     36-45     entry name (left-justified)
                     51-56     primary accession number

                     58-67     entry name (left-justified)
                     73-78     primary accession number


Up to three entry names fit on each such line; if a given  index  key  has  more
than  three  entries associated with it, additional lines are used (with exactly
the same layout).  This index file format is  identical  to  that  of  the  EMBL
Nucleotide Sequence database.
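
The common layout can be read with fixed column slices.  The Python sketch  below
uses the column numbers from the table above and assumes that a line starting in
column 1 is an index key and that a line starting with a blank lists entries.
The names are hypothetical, and the sample lines are taken from the keyword index.

```python
# (first, last) columns, 1-based and inclusive, from the layout table.
NAME_COLUMNS = [(14, 23), (36, 45), (58, 67)]
ACCESSION_COLUMNS = [(29, 34), (51, 56), (73, 78)]


def field(line, first, last):
    """Return the text in columns first..last with blanks stripped."""
    return line[first - 1:last].strip()


def parse_entry_line(line):
    """Return up to three (entry name, primary accession number) pairs."""
    pairs = []
    for (n0, n1), (a0, a1) in zip(NAME_COLUMNS, ACCESSION_COLUMNS):
        name = field(line, n0, n1)
        if name:
            pairs.append((name, field(line, a0, a1)))
    return pairs


def parse_index(lines):
    """Map each index key to the list of entries listed under it."""
    index = {}
    key = None
    for line in lines:
        if not line.strip():
            continue
        if line[0] != " ":
            key = line.rstrip()          # key line starts in column 1
            index[key] = []
        elif key is not None:
            index[key].extend(parse_entry_line(line))
    return index


sample = [
    "ACETYLCHOLINE RECEPTOR INHIBITOR",
    " " * 13 + "CXA1$CONGE     P01519 CXA1$CONMA     P01521 CXA1$CONST     P15471",
    " " * 13 + "CXA2$CONGE     P01520",
]
print(parse_index(sample))
```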



7.1  Species Index

This file lists all  species  which  appear  in  the  database.   It  is  sorted
alphabetically  on  (English)  common name.  The Latin genus and species will be
listed, if present in  the  database  entries.   Mitochondrion  and  chloroplast
sequences  appear  under  separate  index  keys,  immediately  after the related
nuclear sequences.  An excerpt from the species index file is given  below  (the
ruler is presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
CHILEAN POTATO-TREE (SOLANUM CRISPUM) 
             PLAS$SOLCR     P00297 
CHIMPANZEE (PAN SATYRUS) 
             HBA3$PANSA     P01935 
CHIMPANZEE (PAN TROGLODYTES) 
             CD4$PANTR      P16004 HA1A$PANTR     P13748 HA1B$PANTR     P13749 
             HA1C$PANTR     P16209 HA1D$PANTR     P16210 HA1E$PANTR     P16215 
             HA1M$PANTR     P13750 HA1N$PANTR     P13751 HBAZ$PANTR     P06347 
             MBP$PANTR      P06906 MYG$PANTR      P02145 NUO4$PANTR     P03906 
             NUO5$PANTR     P03916 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.2  Keyword Index

This file lists all keywords which appear in the database (on the KW lines).  It
is  sorted alphabetically on keyword.  An excerpt from the keyword index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):


1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACETYLCHOLINE RECEPTOR INHIBITOR
             CXA1$CONGE     P01519 CXA1$CONMA     P01521 CXA1$CONST     P15471 
             CXA2$CONGE     P01520 
ACIDIC PROTEIN
             143E$BOVIN     P11576 B23$RAT        P13084 B231$HUMAN     P08693 
             B232$HUMAN     P06748 BAT$HALHA      P13260 CALQ$CANFA     P12637 
             CALQ$RABIT     P07221 CENB$HUMAN     P07199 CMGA$HUMAN     P10645 
             CMGA$RAT       P10354 GRPE$ECOLI     P09372 MK16$YEAST     P10962 
             NFL$BOVIN      P02548 NO38$CHICK     P16039 NU38$XENLA     P07222 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.3  Author Index

This file lists all author names which appear in citations (on the RA lines)  in
the database.  It is sorted alphabetically on name.  Names are presented as they
appear in the database entries (i.e.  as cited in  publications)  -  we  do  not
attempt  to  handle  multiple  surname spellings, or different initials, for the
same author.  An excerpt from the author index file is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
AAN F.
             PT3G$SALTY     P02908 
AARDEN L.A.
             IL6$HUMAN      P05231 
AARONSON R.P.
             HEMA$INCCA     P03465 
AARONSON S.
             PGDS$HUMAN     P16234 
AARONSON S.A.
             3ORF$EIAV1     P11305 DBL$HUMAN      P10911 ENV$EIAV1      P11306 
             ENV$SMSAV      P03384 GAG$AVISN      P03342 GAG$MSVMO      P03334 
             GAG$SMSAV      P03330 KMOS$MSVMO     P00538 PDGB$HUMAN     P01127 
             POL$EIAV1      P11204 POL$MMTVB      P03365 POL$SMSAV      P03359 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.4  Short Directory Index

This file contains summary  information  about  all  entries  in  the  database,
including  a  brief description of the entry, its sequence length, molecule type
and data management class.  The file is sorted  alphabetically  on  entry  name.
The lines are fixed-format, laid out as follows:

     Columns   Field Name          Description
     -------   ---------------     -------------------------------------------
     01-10     entry name          left-justified
     14-14     data class          s = standard
                                   u = unannotated
                                   p = preliminary
                                   r = unreviewed
     16-18     molecule type       PRT (protein)
     20-25     sequence length     right-justified
     27-80     description         left-justified


If an entry's description occupies more than 54 characters (cols 27-80), it will
be  continued  onto  one or more continuation lines.  Continuation lines contain
description text (left-justified) in cols  27-80;  cols  01-26  are  blank.   An
excerpt  from  the  short  directory  index  file  is given below (the ruler is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
104K$THEPA   p PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA 
10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) - 
                          MYCOBACTERIUM TUBERCULOSIS 
10KS$HUMAN   s PRT     91 10 KD SECRETORY PROTEIN PRECURSOR - HUMAN (HOMO 
                          SAPIENS) 
110K$PLAKN   s PRT    296 110 KD ANTIGEN (PK110) (FRAGMENT) - PLASMODIUM 
                          KNOWLESI 
11KD$ADE02   s PRT     79 11 KD CORE PROTEIN PRECURSOR (LATE L2 MU CORE PROTEIN)
                          (PROTEIN X) - ADENOVIRUS TYPE 2 
11SB$CUCMA   s PRT    480 11-S GLOBULIN BETA SUBUNIT PRECURSOR - PUMPKIN 
                          (CUCURBITA MAXIMA) 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
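
The short directory can be read with the column positions given in  the  table
above.   The Python sketch below treats a line whose columns 1-26 are blank as a
continuation of the previous description.  The names are hypothetical; the two
sample entries are taken from the excerpt.

```python
DATA_CLASSES = {"s": "standard", "u": "unannotated",
                "p": "preliminary", "r": "unreviewed"}


def parse_short_directory(lines):
    """Return one dict per entry, joining description continuation lines."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n").ljust(80)
        description = line[26:80].strip()        # columns 27-80
        if line[:26].strip():
            entries.append({
                "name": line[0:10].strip(),      # columns 1-10
                "data_class": DATA_CLASSES[line[13]],   # column 14
                "molecule_type": line[15:18],    # columns 16-18
                "length": int(line[19:25]),      # columns 20-25
                "description": description,
            })
        elif entries and description:
            entries[-1]["description"] += " " + description
    return entries


sample = [
    "104K$THEPA   p PRT    924 104 KD MICRONEME-RHOPTRY ANTIGEN - THEILERIA PARVA",
    "10KA$MYCTU   s PRT    100 BCG-A HEATSHOCK PROTEIN (10 KD ANTIGEN) -",
    " " * 26 + "MYCOBACTERIUM TUBERCULOSIS",
]
for entry in parse_short_directory(sample):
    print(entry["name"], entry["length"], entry["description"])
```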



7.5  Accession Number Index

This  file  lists  all  accession  numbers  in  the  database.   It  is   sorted
alphabetically  on  accession  number.  Each accession number is followed by the
name and primary accession number of every entry in which it occurs.

The lines are fixed-format, laid out exactly the same as the other index files;
the only difference is that the index key (accession number) appears on the same
line (in cols 1-6) as the list of entries which contain the key.  In  the  other
index  files,  the  index  key  appears  on  a  line by itself.  This index key,
however, is short enough to fit on the same line as the  entries,  and  we  have
done this to save space.

Accession numbers which have been deleted from the database also appear in  this
index, containing the word DELETED (left-justified) in the entry name field.

An excerpt from the accession number index file is given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
P00835       ATPE$MAIZE     P00835 
P00836       ATPE$HORVU     P00836 
P00837       ATPG$ECOLI     P00837 
P00838       ATPG$ECOLI     P00837 
P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926 
P00840       ATP9$MAIZE     P00840 
P00841       ATP9$YEAST     P00841 
P00842       ATPL$NEUCR     P00842 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
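
One use of this index is to find accession numbers  that  occur  in  an  entry
without being its primary accession number (P00838 in the excerpt above).  The
Python sketch below reads the key from columns 1-6 and the entries from the same
columns as in the other index files.  The names are hypothetical.

```python
# ((name first, name last), (accession first, accession last)), 1-based.
ENTRY_COLUMNS = [((14, 23), (29, 34)), ((36, 45), (51, 56)), ((58, 67), (73, 78))]


def parse_accession_line(line):
    """Return the index key and its (entry name, primary accession) pairs."""
    key = line[0:6].strip()
    entries = []
    for (n0, n1), (a0, a1) in ENTRY_COLUMNS:
        name = line[n0 - 1:n1].strip()
        if name:
            entries.append((name, line[a0 - 1:a1].strip()))
    return key, entries


def secondary_hits(key, entries):
    """Entries that contain key although it is not their primary number."""
    return [name for name, primary in entries
            if name != "DELETED" and primary != key]


sample = [
    "P00837       ATPG$ECOLI     P00837",
    "P00838       ATPG$ECOLI     P00837",
    "P00839       ATPL$BOVIN     P00839 ATPM$BOVIN     P07926",
]
for line in sample:
    key, entries = parse_accession_line(line)
    print(key, secondary_hits(key, entries))
```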



7.6  Citation Index

This file lists all journal citations which  appear  in  the  database.   It  is
sorted  alphabetically  on citation.  An excerpt from the citation index file is
given below (the ruler is presented for your convenience - it does not appear in
the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ANN. GENET. SEL. ANIM. 4:515-521(1972)
             CASK$BOVIN     P02668 
ANN. INST. PASTEUR IMMUNOL. 127C:261-271(1976)
             KV3F$HUMAN     P01624 
ANN. INST. PASTEUR IMMUNOL. 132D:77-88(1981)
             HV41$MOUSE     P01811 
ANN. N.Y. ACAD. SCI. 165:360-377(1969)
             HBB$ATEGE      P02034 
ANN. N.Y. ACAD. SCI. 241:436-438(1974)
             HBB$RABIT      P02057 
ANN. N.Y. ACAD. SCI. 356:1-13(1980)
             CALM$METSE     P02596 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80



7.7  Species ID Code Index

This file lists all species id codes which appear in the database.  It is sorted
alphabetically  on species id code.  Each code is followed by a (sorted) list of
all sequence entry codes which are associated with the  species  id  code.   The
sequence  entry  code is the first component of each entry name (before the "$")
and the species id code is the second component of each entry  name  (after  the
"$").   For  example, the entry called COA3$ADEA2 has a species id code of ADEA2
and a sequence entry code of COA3.

Each species id code starts on a  new  line,  occupying  columns  1-5,  and  the
associated  sequence entry codes (each up to 4 characters long) start in columns
10, 15, 20, 25, 30, 35, ...  70, 75.  If there are more than 14  sequence  entry
codes  for  a  given species id, as many continuation lines as necessary will be
used, with columns 1-9 left blank.

An excerpt from the species id code index file is  given  below  (the  ruler  is
presented for your convenience - it does not appear in the index file):

1       10        20        30        40        50        60        70        80
+--------+---------+---------+---------+---------+---------+---------+---------+
ACEME    PER  RBS1 RBS2 RBS3 RBS4 RBS5 
ACENE    CYC  
ACHKL    CALM 
ACHLY    API  
ACIBA    KKA  
ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPC 
         TRPD TRPG 
ACIFE    HGDA HGDB YHGD 
ACIGL    ASPQ 
ACIGU    PRT1 
+--------+---------+---------+---------+---------+---------+---------+---------+
1       10        20        30        40        50        60        70        80
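
The layout described above can be read as in the following Python  sketch,  which
takes the species id code from columns 1-5 and the sequence entry codes from the
14 slots starting in columns 10, 15, ... 75, and then rebuilds the entry names.
The names are hypothetical; the sample lines are taken from the excerpt.

```python
CODE_COLUMNS = range(10, 80, 5)      # start columns 10, 15, ..., 75 (1-based)


def parse_species_index(lines):
    """Map each species id code to its list of sequence entry codes."""
    index = {}
    species = None
    for line in lines:
        if not line.strip():
            continue
        if line[:5].strip():
            species = line[:5].strip()       # new key in columns 1-5
            index[species] = []
        if species is None:
            continue                         # continuation without a key
        for col in CODE_COLUMNS:
            code = line[col - 1:col + 3].strip()     # up to 4 characters
            if code:
                index[species].append(code)
    return index


sample = [
    "ACENE    CYC",
    "ACICA    BEND CATA CATM DHGA DHGB ELH2 MURO PQQ1 PQQ2 PQQ3 PQQ5 PQQL PQQR TRPC",
    " " * 9 + "TRPD TRPG",
]
index = parse_species_index(sample)
entry_names = [f"{code}${species}" for species, codes in index.items()
               for code in codes]
print(len(index["ACICA"]), entry_names[:3])
```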



8  WE NEED YOUR HELP !

We welcome  any  feedback  from  our  users.   We  would  especially  appreciate
information  about  any sequences belonging to your field of expertise which are
missing from the database.  We also would like to be notified about  annotations
which  should be updated (e.g.  if the function of a protein has been clarified,
or if new post-translational information has become available).



                                   APPENDIX A

                                SOME STATISTICS



A.1  AMINO ACID COMPOSITION

Composition in percent for the complete database:

     Ala (A) 7.69   Gln (Q) 4.11   Leu (L) 9.10   Ser (S) 7.06
     Arg (R) 5.21   Glu (E) 6.30   Lys (K) 5.86   Thr (T) 5.84
     Asn (N) 4.43   Gly (G) 7.17   Met (M) 2.30   Trp (W) 1.32
     Asp (D) 5.24   His (H) 2.26   Phe (F) 3.95   Tyr (Y) 3.20
     Cys (C) 1.83   Ile (I) 5.39   Pro (P) 5.12   Val (V) 6.48
     Asx (B) 0.01   Glx (Z) 0.01   Xaa (X) 0.04


Classification of the amino acids by their frequency:

     Leu, Ala, Gly, Ser, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
     Phe, Tyr, Met, His, Cys, Trp
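
The classification above follows directly from the composition table.   As  an
illustration,  the  Python  sketch  below sorts the twenty standard amino acids
by the percentages listed (Asx, Glx and Xaa are left out); the variable names
are chosen for this example.

```python
# Percentages copied from the composition table.
composition = {
    "Ala": 7.69, "Gln": 4.11, "Leu": 9.10, "Ser": 7.06,
    "Arg": 5.21, "Glu": 6.30, "Lys": 5.86, "Thr": 5.84,
    "Asn": 4.43, "Gly": 7.17, "Met": 2.30, "Trp": 1.32,
    "Asp": 5.24, "His": 2.26, "Phe": 3.95, "Tyr": 3.20,
    "Cys": 1.83, "Ile": 5.39, "Pro": 5.12, "Val": 6.48,
}

# Most frequent first.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))
```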


A.2  DISTRIBUTION OF SEQUENCES BY SPECIES

Total number of species represented in this release of the database:  2192.

     Species represented 1x: 964
                         2x: 413
                         3x: 224
                         4x: 127
                         5x:  96
                         6x:  76
                         7x:  40
                         8x:  29
                         9x:  47
                        10x:  24
                    11- 20x:  75
                    21-100x:  60
                      >100x:  17

Table of the most common species:

     Number   Frequency          Species

          1        1336          Human
          2        1150          Escherichia coli
          3         763          Mouse
          4         652          Rat
          5         501          Baker's yeast (Saccharomyces cerevisiae)
          6         387          Bovine
          7         263          Fruit fly (Drosophila melanogaster)
          8         253          Chicken
          9         210          Rabbit
         10         176          Pig
         11         169          Bacillus subtilis
         12         146          African clawed frog (Xenopus laevis)
         13         134          Salmonella typhimurium
         14         132          Bacteriophage T4
         15         118          Maize
         16         104          Rice
                    104          Tobacco
         18          85          Wheat
         19          84          Liverwort (Marchantia polymorpha)
         20          77          Pea
         21          75          Slime mold (Dictyostelium discoideum)
                     75          Spinach
                     75          Staphylococcus aureus
         24          74          Soybean
         25          73          Vaccinia virus



A.3  DISTRIBUTION OF SEQUENCES BY LENGTH


     From   To  Number             From   To   Number

        1-  50    1012             1001-1100      131
       51- 100    1810             1101-1200       84
      101- 150    2648             1201-1300       68
      151- 200    1546             1301-1400       46
      201- 250    1233             1401-1500       35
      251- 300    1093             1501-1600       17
      301- 350     929             1601-1700       20
      351- 400     939             1701-1800       16
      401- 450     699             1801-1900       13
      451- 500     786             1901-2000       18
      501- 550     601             2001-2100        7
      551- 600     381             2101-2200       19
      601- 650     279             2201-2300       19
      651- 700     199             2301-2400       10
      701- 750     196             2401-2500        9
      751- 800     125             >2500           29
      801- 850     115
      851- 900     134
      901- 950      71
      951-1000      72

Currently the five largest sequences are:

     RYNR$RABIT  5037 a.a.
     APB$HUMAN   4563 a.a.
     APOA$HUMAN  4548 a.a.
     DMD$HUMAN   3685 a.a.
     DMD$CHICK   3660 a.a.