Swiss-Prot release 33.0

Published February 1, 1996

                    SWISS-PROT RELEASE 33.0 RELEASE NOTES

                               1. INTRODUCTION

   1.1  Evolution

   Release 33.0  of SWISS-PROT contains 52'205 sequence entries, comprising
   18'531'384  amino   acids  abstracted   from  45'351   references.  This
   represents an  increase of  6.5% over release 32. The growth of the data
   bank is summarized below.

   Release    Date   Number of entries     Nb of amino acids

   2.0        09/86               3939               900 163
   3.0        11/86               4160               969 641
   4.0        04/87               4387             1 036 010
   5.0        09/87               5205             1 327 683
   6.0        01/88               6102             1 653 982
   7.0        04/88               6821             1 885 771
   8.0        08/88               7724             2 224 465
   9.0        11/88               8702             2 498 140
   10.0       03/89              10008             2 952 613
   11.0       07/89              10856             3 265 966
   12.0       10/89              12305             3 797 482
   13.0       01/90              13837             4 347 336
   14.0       04/90              15409             4 914 264
   15.0       08/90              16941             5 486 399
   16.0       11/90              18364             5 986 949
   17.0       02/91              20024             6 524 504
   18.0       05/91              20772             6 792 034
   19.0       08/91              21795             7 173 785
   20.0       11/91              22654             7 500 130
   21.0       03/92              23742             7 866 596
   22.0       05/92              25044             8 375 696
   23.0       08/92              26706             9 011 391
   24.0       12/92              28154             9 545 427
   25.0       04/93              29955            10 214 020
   26.0       07/93              31808            10 875 091
   27.0       10/93              33329            11 484 420
   28.0       02/94              36000            12 496 420
   29.0       06/94              38303            13 464 008
   30.0       10/94              40292            14 147 368
   31.0       02/95              43470            15 335 248
   32.0       11/95              49340            17 385 503
   33.0       02/96              52205            18 531 384


   2.1  Sequences and annotations

   2'910 sequences  have been  added since release 32, the sequence data of
   1085 existing  entries has  been updated  and the  annotations of  6'340
   entries have been revised.

   Major annotation and sequence updates have been made in preparation for
   the changes that will take place in release 33 (see section 3.1 of these
   release notes).

   2.2  What's happening with the model organisms

   We have  selected a  number of  organisms that  are the target of genome
   sequencing and/or mapping projects and for which we intend to:

   -  Be as  complete as  possible. All sequences available at a given time
      should be  immediately included  in SWISS-PROT.  This  also  includes
      sequence corrections and updates;
   -  Provide a higher level of annotation;
   -  Provide cross-references  to specialized  database(s)  that  contain,
      among other  data, some genetic information about the genes that code
      for these proteins;
   -  Provide specific indices or documents.

   What was  done since  the last  release or  in preparation  for the next
   release concerning model organisms:

   -  We have  added Mycoplasma  genitalium to the list of model organisms.
      It is the second bacterial genome to be completely sequenced. We have
      already annotated  344 of  the 470  putative proteins encoded by this
      small genome.

   -  We have  started a  major effort  in catching  up with the backlog of
      sequences from eukaryotic model organisms. In particular we added 262
      entries from  yeast, 194  from  human,  180  from  S.pombe,  82  from
      C.elegans, 68 from A.thaliana and 50 from Drosophila.

   -  We have added to SWISS-PROT all the sequences from yeast chromosome
      X. We plan to integrate data from chromosome XIII very soon.

   Here is the current status of the model organisms:

   Organism         Database               Index file       Number of
                    cross-referenced                        sequences
   --------------   ---------------------  --------------   ---------
   A.thaliana       None yet               In preparation        500
   B.subtilis       SubtiList              SUBTILIS.TXT         1389
   C.albicans       None yet               CALBICAN.TXT          106
   C.elegans        WormPep                CELEGANS.TXT         1006
   D.discoideum     DictyDB                DICTY.TXT             213
   D.melanogaster   FlyBase                In preparation        818
   E.coli           EcoGene                ECOLI.TXT            3471
   H.influenzae     None yet               HAEINFLU.TXT         1577
   H.sapiens        MIM                    MIMTOSP.TXT          3475
   M.genitalium     None yet               In preparation        344
   S.cerevisiae     LISTA/SGD              YEAST.TXT            3653
   S.typhimurium    StyGene                SALTY.TXT             603
   S.pombe          None yet               POMBE.TXT             640
   S.solfataricus   None yet               None yet               61

   2.3  Major changes to the cross-references to EMBL

   In this release, the format of the DR (Database cross-Reference) lines
   pointing to EMBL Nucleotide Sequence Database entries has been changed.
   Each such line now cites the EMBL entry, a 'PID' as secondary identifier
   and a 'STATUS_IDENTIFIER', as in the examples given later in this
   section.

   'PID' stands for the "Protein IDentification" number. It is a number
   that you find in EMBL and GenBank in a qualifier called "/db_xref",
   which is attached to every CDS in the nucleotide database. Example:

   FT   CDS             54..1382
   FT                   /note="ribulose-1,5-bisphosphate carboxylase/
   FT                   oxygenase activase precursor"
   FT                   /db_xref="PID:g1006835"

   When an EMBL database CDS exists as a sequence report in SWISS-PROT, the
   DR lines of the corresponding SWISS-PROT entry have been updated by
   citing the PID as secondary identifier. In all cases where a
   PID has  been integrated  into SWISS-PROT, a "/db_xref" qualifier citing
   the corresponding  SWISS-PROT entry  has been added to the EMBL database
   CDS labeled with this PID. Example:

   FT   CDS             14556..15696
   FT                   /gene="cytochrome b"
   FT                   /codon_start=1
   FT                   /product="apoprotein"
   FT                   /db_xref="PID:g463170"
   FT                   /db_xref="SWISS-PROT:P12778"

   This approach  enables us  to point  precisely from  a given  SWISS-PROT
   entry to one of potentially many CDS in the corresponding EMBL entry and
   vice versa.  This change  also allows  the development of software tools
   that automatically retrieve the part of a nucleotide sequence entry that
   codes for a specific protein. This is especially useful in the context
   of the World-Wide Web as it will render obsolete the current situation
   where, for  example, one  needs to  retrieve the  complete sequence of a
   yeast chromosome  when one  wants the  nucleotide sequence  coding for a
   specific protein encoded on that chromosome.
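
   As an illustration of the kind of software tool mentioned above, the
   following minimal Python sketch collects the "/db_xref" qualifiers of
   one CDS feature. The function name db_xrefs and the regular expression
   are assumptions made for this example; they are not part of any
   SWISS-PROT or EMBL software.

```python
import re

# CDS feature based on the cytochrome b example in this section.
FEATURE = '''\
FT   CDS             14556..15696
FT                   /gene="cytochrome b"
FT                   /codon_start=1
FT                   /product="apoprotein"
FT                   /db_xref="PID:g463170"
FT                   /db_xref="SWISS-PROT:P12778"
'''

# Matches one /db_xref="DATABASE:IDENTIFIER" qualifier.
DB_XREF = re.compile(r'/db_xref="([^:"]+):([^"]+)"')


def db_xrefs(feature_text):
    """Return a dict mapping database name to identifier for one feature."""
    return {db: ident for db, ident in DB_XREF.findall(feature_text)}


xrefs = db_xrefs(FEATURE)
print(xrefs["PID"], xrefs["SWISS-PROT"])
```

   With both identifiers in hand, a tool could go from the SWISS-PROT
   entry to the exact CDS, or from the CDS back to the protein entry.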

   An additional important principle of the PID system is that whenever a
   change to the nucleotide entry, or to the annotations of that entry,
   modifies the translated protein sequence, the PID number corresponding
   to the modified CDS is replaced by a completely new number. The old
   number will be kept in a special field tagged to the CDS. The exact
   syntax of this field is under discussion at the international
   nucleotide databases.

   The  new   cross-referencing   system   will   allow   a   much   closer
   interconnection between  SWISS-PROT  and  the  international  nucleotide
   sequence databases.  For example, it will allow us to automatically take
   into account  sequence updates  made to  the nucleotide entry when these
   updates have an impact on the derived protein sequence(s).

   It should also be noted that the "PID" numbers in the context of GenBank
   replace the  "NCBI gi" numbering system which was present in the "/note"
   qualifier. The "gi" identifiers for the nucleic acid sequences have been
   replaced by "NID" (nucleic acid identifier) numbers.

   The 'STATUS_IDENTIFIER'  provides  information  about  the  relationship
   between the  sequence in  the  SWISS-PROT  entry  and  the  CDS  in  the
   corresponding EMBL entry.

   a) In  most cases  the translation  of the  EMBL nucleotide sequence CDS
   results in  the same  sequence as  shown in the corresponding SWISS-PROT
   entry or  the differences  are mentioned  in the SWISS-PROT feature (FT)
   lines as  CONFLICT, VARIANT  or VARSPLIC  and in  the RP lines. In these
   cases the status identifier shows a dash ("-").


   DR   EMBL; Y00312; G63880; -.

   b) In  some cases  the translation  of the  EMBL nucleotide sequence CDS
   results  in  a  sequence  different  from  the  sequence  shown  in  the
   corresponding SWISS-PROT  entry  and  the  differences  are  either  not
   mentioned in  the SWISS-PROT  feature (FT) lines as CONFLICT, VARIANT or
   VARSPLIC and in the RP lines, or simply do not meet the criteria for
   such situations.

   1) If the  difference is  due to a different start of the sequence (e.g.
      SWISS-PROT believes  that the  start of  the sequence  is upstream or
      downstream of  the site annotated as the start of the sequence in the
      EMBL database),  the status  identifier shows the comment "ALT_INIT".

        DR   EMBL; L29151; G466334; ALT_INIT.

   2) If the  difference is  due to a different termination of the sequence
      (e.g. SWISS-PROT  believes that  the termination  of the  sequence is
      upstream or  downstream of  the site  annotated as  the  end  of  the
      sequence in  the EMBL  database), the  status  identifier  shows  the
      comment "ALT_TERM". Example:

        DR   EMBL; L20562; G398099; ALT_TERM.

   3) If the  difference is  due to  frameshifts in  the EMBL sequence, the
      status identifier shows the comment "ALT_FRAME". Example:

        DR   EMBL; M95935; G146416; ALT_FRAME.

   4) If the difference is not due to the cases mentioned above (e.g. wrong
      intron-exon boundaries  given in  the EMBL  entry) or to a mixture of
      the cases  mentioned above,  the status  identifier shows the comment
      "ALT_SEQ". Example:

        DR   EMBL; X79206; G809602; ALT_SEQ.

   c) In some cases the nucleotide sequence of a complete CDS is divided
   into exons present in different EMBL entries. We point to the
   exon-containing EMBL entries by citing the PID as secondary identifier
   and giving "JOINED" as the status identifier. These EMBL entries do not
   contain a CDS feature; they contain exons joined to a CDS feature
   which is labeled with the given PID.


   DR   EMBL; M63397; G177196; -.
   DR   EMBL; M63395; G177196; JOINED.
   DR   EMBL; M63396; G177196; JOINED.

   In the  above example  the SWISS-PROT  sequence is  derived from the CDS
   labeled with  the PID G177196. This CDS feature can be found in the EMBL
   entry M63397.  Exons belonging  to this  CDS are  not only found in EMBL
   entry M63397, but also in the EMBL entries M63395 and M63396.

   d) In  some cases  there is  no CDS  feature key  annotating  a  protein
   translation in  an EMBL entry and thus no PID for that CDS. Therefore it
   is not  possible for  us to point to a PID as a secondary identifier. In
   these cases  we point  to the  relevant EMBL entries by including a dash
   ("-") in  the position  of the  missing PID and "NOT_ANNOTATED_CDS" into
   the status identifier.
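
   The cases a) to d) above can be handled mechanically. The following
   minimal Python sketch splits a DR line pointing to EMBL into its three
   fields and groups EMBL entries by PID, using the DR lines quoted in
   this section. The names parse_embl_dr and KNOWN_STATUS are assumptions
   made for this example, and the sketch assumes every such line has
   exactly the fields seen in those examples.

```python
from collections import defaultdict

# Status identifiers described in this section; "-" means that no
# unreported difference exists between the two sequences.
KNOWN_STATUS = {"-", "ALT_INIT", "ALT_TERM", "ALT_FRAME", "ALT_SEQ",
                "JOINED", "NOT_ANNOTATED_CDS"}


def parse_embl_dr(line):
    """Split a 'DR   EMBL; ...' line into (entry, pid, status).

    pid is None when the PID position holds a dash (case d).
    """
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "EMBL":
        raise ValueError("not an EMBL cross-reference: %r" % line)
    _, entry, pid, status = fields
    if status not in KNOWN_STATUS:
        raise ValueError("unknown status identifier: %r" % status)
    return entry, (None if pid == "-" else pid), status


# The JOINED example from case c).
lines = [
    "DR   EMBL; M63397; G177196; -.",
    "DR   EMBL; M63395; G177196; JOINED.",
    "DR   EMBL; M63396; G177196; JOINED.",
]

entries_by_pid = defaultdict(list)
for line in lines:
    entry, pid, status = parse_embl_dr(line)
    entries_by_pid[pid].append((entry, status))

for pid, entries in entries_by_pid.items():
    print(pid, entries)
```

   Grouping by PID shows at a glance which EMBL entry carries the CDS
   feature (status "-") and which entries only contribute exons.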



   2.4  New cross-references

   We have added cross-references from SWISS-PROT to the Harefield Hospital
   2D gel protein databases prepared under the supervision of Mike Dunn
   (see Corbett  J.M., Wheeler C.H., Baker C.S., Yacoub M.H. and Dunn M.J.;
   Electrophoresis 15:1459-1465(1994)).  These cross-references are present
   in the DR lines:

   Data bank identifier: HSC-2DPAGE
   Primary identifier:   The protein spot unique identifier [1]
   Secondary identifier: The species of origin [2]
   Example:              HSC-2DPAGE; P47985; HUMAN.

   [1] The Harefield 2D databases use SWISS-PROT primary accession numbers
       as the alphanumeric designation of spots that are linked to
       SWISS-PROT.
   [2] Currently only 'HUMAN' is used, but 'RAT' and 'DOG' will be added
       in the next release.

   2.5  Introduction of a new CC line-type topic (MASS SPECTROMETRY)

   We have  introduced a  new 'topic' for the comments (CC) line-type: MASS
   SPECTROMETRY. This topic is used to report the exact molecular weight of
   a protein  or part  of a  protein as  determined by  mass  spectrometric
   methods. The syntax of this new topic is:



   -  "MW=XX" is the determined molecular weight (MW);
   -  "MW_ERR=XX" (optional) is the accuracy or error range of the MW;
   -  "METHOD=XX" is the mass spectrometric method;
   -  "RANGE=XX-XX" (optional) is used to indicate what part of the protein
      sequence entry corresponds to the molecular weight. If this qualifier
      is not  present, the  MW value  corresponds to the full length of the
      protein sequence.

   Examples of its usage:

   CC       RANGE=40-119.

   It should  be noted  that the  syntax of this topic may evolve in future
   releases as  we  expect  feedback  from  groups  using  MS  for  protein
   identification on  2D gels,  MW determination  and  characterization  of
   post-translational modifications.
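
   The qualifiers described above lend themselves to simple parsing. The
   following Python sketch is only an illustration: it assumes that the
   qualifiers are separated by semicolons and that the comment ends with
   a period, and the MW, MW_ERR and METHOD values in the sample are
   invented for the example (only RANGE=40-119 comes from these notes).
   The function name parse_mass_spec is likewise hypothetical.

```python
def parse_mass_spec(text):
    """Parse 'KEY=VALUE; KEY=VALUE.' into a dict.

    MW and METHOD are required; MW_ERR and RANGE are optional.
    RANGE, when present, is converted to a (start, end) tuple.
    """
    qualifiers = {}
    for part in text.strip().rstrip(".").split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected KEY=VALUE, got %r" % part)
        qualifiers[key.strip()] = value.strip()
    for required in ("MW", "METHOD"):
        if required not in qualifiers:
            raise ValueError("missing %s qualifier" % required)
    if "RANGE" in qualifiers:
        start, end = qualifiers["RANGE"].split("-")
        qualifiers["RANGE"] = (int(start), int(end))
    return qualifiers


# Hypothetical sample; only the RANGE value is taken from these notes.
sample = "MW=8597.5; MW_ERR=0.5; METHOD=ELECTROSPRAY; RANGE=40-119."
parsed = parse_mass_spec(sample)
print(parsed)
```

   When RANGE is absent from the result, the MW value applies to the full
   length of the protein sequence, as stated above.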

   2.6  Change in the syntax of the SQ line

   The SQ  (SeQuence header)  line marks the beginning of the sequence data
   and gives a quick summary of its content. The format of the SQ line used
   to be:


   The line  contains the  length  of  the  sequence  in  amino-acids  (AA)
   followed by  the molecular weight (MW) rounded to the nearest gram and a
   checking number (CN) as shown in the example:

   SQ   SEQUENCE 104 AA; 11530 MW; 54319 CN;

   Starting with this release, we have replaced the checking number (CN) by
   a 32-bit CRC (Cyclic Redundancy Check) value. The new syntax is:



   SQ   SEQUENCE   104 AA;  11530 MW;  7A70363C CRC32;
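
   Software that reads the SQ line may need to accept both forms during
   the transition. The following Python sketch parses the two example
   lines shown in this section. The regular expression and the function
   name parse_sq are assumptions made for this example, and no attempt is
   made to recompute the CRC value itself.

```python
import re

# Accepts both the old (CN) and the new (CRC32) form of the SQ line.
SQ_LINE = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;"
    r"\s+([0-9A-Fa-f]+)\s+(CN|CRC32);\s*$"
)


def parse_sq(line):
    """Return the length, molecular weight and check value of an SQ line."""
    match = SQ_LINE.match(line)
    if match is None:
        raise ValueError("not a recognised SQ line: %r" % line)
    length, mw, check, kind = match.groups()
    return {"length": int(length), "mw": int(mw),
            "check": check, "kind": kind}


old = parse_sq("SQ   SEQUENCE 104 AA; 11530 MW; 54319 CN;")
new = parse_sq("SQ   SEQUENCE   104 AA;  11530 MW;  7A70363C CRC32;")
print(old["kind"], new["kind"])
```

   The check value is kept as a string because the CRC32 form is written
   in hexadecimal, whereas the old checking number is decimal.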

   2.7  Status of the documentation files

   SWISS-PROT is  distributed with  a large  number of documentation files.
   Some of  these files  have been  available for  a long  time  (the  user
   manual, release  notes, the  various  indices  for  authors,  citations,
   keywords, etc.),  but  many  have  been  created  recently  and  we  are
   continuously adding  new files.  Since release  32, we  have added 2 new
   document files. The following table lists all the documents that are
   either currently available or that we plan to add in the next few
   releases:

   USERMAN .TXT   User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT

   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   EXPERTS .TXT   List of on-line experts for PROSITE and SWISS-PROT
   SUBMIT  .TXT   Submission of sequence data to the SWISS-PROT data bank

   ACINDEX .TXT   Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index
   7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
   AATRNASY.TXT   List of aminoacyl-tRNA synthetases
   ALLERGEN.TXT   Nomenclature and index of allergen sequences
   CALBICAN.TXT   Index of Candida albicans entries and their corresponding
                  gene designations
   CDLIST  .TXT   CD nomenclature for surface proteins of human leucocytes
   CELEGANS.TXT   Index  of   Caenorhabditis  elegans   entries  and  their
                  corresponding gene designations and WormPep
                  cross-references
   DICTY   .TXT   Index  of  Dictyostelium  discoideum  entries  and  their
                  corresponding gene designations and DictyDB
                  cross-references
   EC2DTOSP.TXT   Index of  Escherichia coli  Gene-protein database entries
                  referenced in SWISS-PROT
   ECOLI   .TXT   Index of  Escherichia coli  K12 chromosomal  entries  and
                  their corresponding EcoGene cross-reference
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in SWISS-PROT [3]
   EXTRADOM.TXT   Nomenclature of extracellular domains
   GLYCOSYL.TXT   Classification of  glycosyl hydrolases families and index
                  of glycosyl hydrolase entries [1]
   HAEINFLU.TXT   Index of Haemophilus influenzae RD chromosomal entries
   HOXLIST .TXT   Vertebrate homeotic Hox proteins: nomenclature and index
   HUMCHR21.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 21
   HUMCHR22.TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome 22
   HUMCHRY .TXT   Index  of  protein  sequence  entries  encoded  on  human
                  chromosome Y
   MIMTOSP .TXT   Index of MIM entries referenced in SWISS-PROT
   MYGENIT .TXT   Index of Mycoplasma genitalium chromosomal entries [2]
   NOMLIST .TXT   List of nomenclature related references for proteins
   PDBTOSP .TXT   Index of Brookhaven PDB entries referenced in SWISS-PROT
   PEPTIDAS.TXT   Classification  of   peptidase  families   and  index  of
                  peptidases entries
   PLASTID .TXT   List of chloroplast and cyanelle encoded proteins
   POMBE   .TXT   Index of  Schizosaccharomyces pombe entries in SWISS-PROT
                  and their corresponding gene designations
   RESTRIC .TXT   List of restriction enzymes and methylases entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on the
                  basis of sequence similarities [2]
   SALTY   .TXT   Index of  Salmonella typhimurium  LT2 chromosomal entries
                  and their corresponding StyGene cross-references
   SUBTILIS.TXT   Index of  Bacillus subtilis  168 chromosomal  entries and
                  their corresponding SubtiList cross-references
   YEAST   .TXT   Index  of  Saccharomyces  cerevisiae  entries  and  their
                  corresponding gene designations
   YEAST1  .TXT   Yeast Chromosome I entries
   YEAST2  .TXT   Yeast Chromosome II entries
   YEAST3  .TXT   Yeast Chromosome III entries
   YEAST5  .TXT   Yeast Chromosome V entries
   YEAST6  .TXT   Yeast Chromosome VI entries
   YEAST8  .TXT   Yeast Chromosome VIII entries
   YEAST9  .TXT   Yeast Chromosome IX entries
   YEAST10 .TXT   Yeast Chromosome X entries [1]
   YEAST11 .TXT   Yeast Chromosome XI entries
   YEAST13 .TXT   Yeast Chromosome XIII entries [2]


   [1]  New in release 33.
   [2]  Will be available starting with release 34 of October 1996.
   [3]  The format of that file was completely changed to take into account
        the new format of cross-references to  EMBL that includes the "PID"
        (see section 2.3).

   We have  continued to  include in  some SWISS-PROT  document  files  the
   references of  World-Wide  Web  sites  relevant  to  the  subject  under
   consideration. There are now 11 documents that include such links.

   2.8  The ExPASy World-Wide Web server

        2.8.1  Background information

   The most  efficient and  user-friendly way  to browse  interactively  in
   SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
   the World-Wide  Web (WWW)  molecular biology  server ExPASy.  WWW  is  a
   global information  retrieval system  merging the  power  of  world-wide
   networks, hypertext  and multimedia.  Through hypertext  links, it gives
   access to  documents and  information available  on thousands of servers
   around the  world. To  access a  WWW server  one needs  a  WWW  browser.
   Currently, the most popular browser is Netscape Navigator(TM) from
   Netscape Communications Corp. Using a WWW browser, one has access to all
   the hypertext documents stored on the ExPASy server as well as many other
   WWW servers.

   The ExPASy server was made available to the public in September 1993. In
   February 1996, a cumulative total of 4 million connections was attained.
   It may  be accessed  through its  Uniform Resource  Locator (URL  -  the
   addressing system defined in WWW), which is:

   The ExPASy  WWW server  allows access, using the user-friendly hypertext
   model, to  the SWISS-PROT,  PROSITE,  ENZYME,  SWISS-2DPAGE  and  SWISS-
   3DIMAGE databases and, through any SWISS-PROT protein sequence entry, to
   other databases  such as  EMBL, EcoCyc,  FlyBase, GCRDb, LISTA, MaizeDB,
   SubtiList, OMIM, PDB, HSSP, ProDom, REBASE, SGD, YEPD and Medline. Using
   a browser which is able to display images one can also remotely access
   2D gel image data from SWISS-2DPAGE. ExPASy also offers many tools for
   the analysis of protein sequences and 2D gels.

   For more  information on  the  ExPASy  WWW  server,  you  can  read  the
   following article:

      Appel R.D., Bairoch A., Hochstrasser D.F.
      A new  generation of  information retrieval tools for biologists: the
      example of the ExPASy WWW server.
      Trends Biochem. Sci. 19:258-260(1994).

   Or you can contact Dr. Ron Appel:

      Fax: +41-22-372 61 98

        2.8.2  SWISS-SHOP

   Thanks to the work of Manuel Peitsch from the Geneva Glaxo Institute for
   Molecular Biology, we can provide, on ExPASy, a service called
   SWISS-SHOP. SWISS-SHOP allows any user of SWISS-PROT to indicate which
   proteins he/she is interested in. This can be done using various
   criteria that can be combined:

   -  By entering  one  or  more  words  that  should  be  present  in  the
      description line;
   -  By entering one or more species name(s) or taxonomic division(s);
   -  By entering one or more keywords;
   -  By entering one or more author names;
   -  By entering the accession number (or entry name) of a PROSITE pattern
      or a user-defined sequence pattern;
   -  By entering  the accession  number (or  entry name)  of  an  existing
      SWISS-PROT entry or by entering a "private" sequence.

   Every week,  the new  sequences entered  in SWISS-PROT are automatically
   compared with all the criteria that have been defined by the users. If a
   sequence corresponds  to the  selection criteria defined by a user, that
   sequence is sent by electronic mail.

        2.8.3  What is new on ExPASy

   Since the last release, there have been a large number of new
   developments on the ExPASy WWW server. Here are some highlights:

   -  ProtScale is a new tool which we have implemented and that allows one
      to compute and represent the profile produced by an amino acid scale
      on a selected protein in SWISS-PROT or entered by the user. 50 scales
      are provided,  including 'classics'  such as  the Kyte  and Doolittle
      hydrophobicity scale.

   -  We have added a new tool, SIM, which computes a user-defined number of
      best non-intersecting alignments between two sequences. The results
      of the alignment can be viewed graphically using the LALNVIEW program
      developed by Laurent Duret, which is available (it can be downloaded
      directly from ExPASy) for PCs under MS-Windows, Macs and UNIX.

   -  We have recently started to create a list of Biomolecular servers for
      our own usage. This list is available on the ExPASy top page or
      directly from:


   -  WWW links  have been  implemented between some SWISS-PROT entries and
      HSC-2DPAGE (see section 2.4).

   -  Many other changes have been made to all parts of the server.

   2.9  Weekly updates of SWISS-PROT

   Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
   are updated at each update:

   new_seq.dat    Contains all the new entries since the last full release;
   upd_seq.dat    Contains the entries for which the sequence data has been
                  updated since the last release;
   upd_ann.dat    Contains the  entries for  which one  or more  annotation
                  fields have been updated since the last release.

   Currently these files are available on the following anonymous ftp
   servers:

   Organization   ExPASy (Geneva University Expert Protein Analysis System)
   Address  (or
   Directory      /databases/swiss-prot/updates

   Organization   National Center for Biotechnology Information (NCBI)
   Address (or
   Directory      /repository/swiss-prot/updates

   Organization   European Bioinformatics Institute (EBI)
   Address (or
   Directory      /pub/databases/swissprot/new

   Organization   Bioinformatics Unit, Weizmann Institute of Science (WIS)
   Address (or
   Directory      /pub/databases/swiss-prot/updates

   !!! Important notes !!!

   Although we  try to  follow a  regular schedule,  we do  not promise  to
   update these files every week. In some cases two weeks will elapse
   between two updates.

   Due to the current mechanism used to build a release, the entries that
   are provided in these updates are not guaranteed to be error-free.

                       3. IMPORTANT FORTHCOMING CHANGE

   3.1  TREMBL - a supplement to SWISS-PROT

   The ongoing  genome sequencing  and mapping  projects have  dramatically
   increased the number of protein sequences to be incorporated into SWISS-
   PROT. Since we do not want to dilute the quality standards of SWISS-PROT
   by incorporating  sequences  into  SWISS-PROT  without  proper  sequence
   analysis and  annotation, we  cannot speed  up the  incorporation of new
   incoming data  indefinitely. But  as we  also want to make the sequences
   available as fast as possible, we will introduce with SWISS-PROT a
   computer-annotated supplement to SWISS-PROT. This supplement consists of
   entries in  SWISS-PROT-like format  derived from  the translation of all
   coding sequences  (CDS) in the EMBL nucleotide sequence database, except
   the CDS already included in SWISS-PROT.

   We name  this supplement  TREMBL  (TRanslation  from  EMBL),  since  the
   translation tools  used to  create the translations of the CDS are based
   on the program 'trembl' written by Thure Etzold at the EMBL.

   We will translate all CDSs in the EMBL Nucleotide Sequence Database
   into TREMBL preentries. The preentries already present as sequence
   reports in SWISS-PROT will be excluded from TREMBL. Then the remaining
   entries will be automatically merged whenever possible to reduce
   redundancy in TREMBL.

   We will split TREMBL in two main sections; SP-TREMBL and REM-TREMBL:

   SP-TREMBL (SWISS-PROT  TREMBL) will  contain the entries which should be
   incorporated into  SWISS-PROT. SP-TREMBL  will  be  partially  redundant
   against SWISS-PROT,  since approximately half of these SP-TREMBL entries
   will be  only additional  sequence reports of proteins already in SWISS-
   PROT. We  will try  to merge  these sequence reports as fast as possible
   with the  already existing  SWISS-PROT entries for these proteins, so as
   to make SWISS-PROT and TREMBL completely nonredundant.

   REM-TREMBL (REMaining  TREMBL) will  contain the  entries that we do not
   want to include in SWISS-PROT. This section will be organized in four
   subsections:

   1) Most REM-TREMBL entries will be immunoglobulins and T-cell receptors.
      We stopped  entering immunoglobulins and T-cell receptors into SWISS-
      PROT, because  we only  want to  keep  the  germ  line  gene  derived
      translations of  these proteins  in  SWISS-PROT  and  not  all  known
      somatically recombined variants of these proteins. We are expecting
      more than  10'000 immunoglobulins  and T-cell receptors in TREMBL. We
      would like  to create  a  specialized  database  dealing  with  these
      sequences as  a further  supplement to  SWISS-PROT and  keep  only  a
      representative cross-section of these proteins in SWISS-PROT.

   2) Another category of data which will not be included in SWISS-PROT are
      synthetic sequences.  Again, we do not want to leave these entries in
      TREMBL.  Ideally   one  should   build  a  specialized  database  for
      artificial sequences as a further supplement to SWISS-PROT.

   3) A third subsection consists of fragments with fewer than seven amino
      acids.

   4) The last subsection consists of CDS translations where we have strong
      evidence to believe that these CDS are not coding for real proteins.

   The first  full release of TREMBL will be distributed with release 34 of
   SWISS-PROT. However  we are  making available,  with release  33, a beta
   release so that users and software developers can send us feedback about
   this new supplement to SWISS-PROT.

                            4. ENZYME AND PROSITE

   4.1  The ENZYME data bank

   Release 20.0  of the  ENZYME data bank is distributed with release 33 of
   SWISS-PROT. ENZYME release 20.0 contains information relative to 3601
   enzymes.

   4.2  The PROSITE data bank

   Release 13.1  of the PROSITE data bank is distributed with release 33 of
   SWISS-PROT. This  release of  PROSITE contains 889 documentation entries
   that describe  1'167 different  patterns, rules  and  profiles/matrices.
   Release 13.1 does not really represent a new release; the only changes
   between releases 13.0 and 13.1 are updates of the pointers to the
   SWISS-PROT entries whose names have been modified between releases 32
   and 33. The next release of PROSITE (14.0) will be distributed with release
   35 of SWISS-PROT.

                             WE NEED YOUR HELP !

   We welcome feedback from our users. We would especially appreciate it if
   you notified us when you find that sequences belonging to your field of
   expertise are missing from the data bank. We also would like to be
   notified about  annotations to be updated, if, for example, the function
   of a protein has been clarified or if new post-translational information
   has become available.


                         APPENDIX A: SOME STATISTICS

   A.1  Amino acid composition

        A.1.1  Composition in percent for the complete data bank

   Ala (A) 7.54   Gln (Q) 4.02   Leu (L) 9.31   Ser (S) 7.19
   Arg (R) 5.15   Glu (E) 6.31   Lys (K) 5.94   Thr (T) 5.76
   Asn (N) 4.54   Gly (G) 6.86   Met (M) 2.36   Trp (W) 1.26
   Asp (D) 5.29   His (H) 2.23   Phe (F) 4.06   Tyr (Y) 3.21
   Cys (C) 1.70   Ile (I) 5.72   Pro (P) 4.91   Val (V) 6.52

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.02

        A.1.2  Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
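
   The ranking above follows directly from the percentages in A.1.1. As a
   small worked check, the following Python sketch sorts the twenty
   standard amino acids by their frequency. The variable names are chosen
   for this example only.

```python
# Composition in percent, copied from the table in A.1.1.
COMPOSITION = {
    "Ala": 7.54, "Gln": 4.02, "Leu": 9.31, "Ser": 7.19,
    "Arg": 5.15, "Glu": 6.31, "Lys": 5.94, "Thr": 5.76,
    "Asn": 4.54, "Gly": 6.86, "Met": 2.36, "Trp": 1.26,
    "Asp": 5.29, "His": 2.23, "Phe": 4.06, "Tyr": 3.21,
    "Cys": 1.70, "Ile": 5.72, "Pro": 4.91, "Val": 6.52,
}

# Most frequent amino acid first.
ranked = sorted(COMPOSITION, key=COMPOSITION.get, reverse=True)
print(", ".join(ranked))
```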

   A.2  Distribution of the sequences by their organism of origin

   Total number of species represented in this release of SWISS-PROT: 5020

   A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 2250
                            2x:  808
                            3x:  446
                            4x:  285
                            5x:  209
                            6x:  189
                            7x:  129
                            8x:   96
                            9x:  105
                           10x:   44
                       11- 20x:  204
                       21- 50x:  154
                       51-100x:   42
                         >100x:   59
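
   A table of this kind can be derived by counting the entries per
   species and then counting how many species fall into each row. The
   Python sketch below shows one way to do this, assuming the input is
   simply one species name per entry. The function names and the input
   format are assumptions made for illustration, not a description of
   how the table above was actually produced. Row labels are written
   without the padding space used in the table (e.g. "11-20x").

```python
from collections import Counter

# Grouped rows of the table; bounds are inclusive.
RANGES = [(11, 20), (21, 50), (51, 100)]


def bucket_label(n):
    """Map an entry count n (>= 1) to the label of its table row."""
    if n < 1:
        raise ValueError("a represented species has at least one entry")
    if n <= 10:
        return f"{n}x"
    for low, high in RANGES:
        if low <= n <= high:
            return f"{low}-{high}x"
    return ">100x"


def occurrence_table(species_per_entry):
    """Count how many species fall into each row of the table.

    species_per_entry: iterable yielding one species name per entry
    (a hypothetical input format).
    """
    entries_per_species = Counter(species_per_entry)
    return Counter(bucket_label(n) for n in entries_per_species.values())


if __name__ == "__main__":
    # Made-up data: species "A" has two entries, species "B" has one.
    print(occurrence_table(["A", "A", "B"]))
```

   The counts in the rows of such a table add up to the number of
   species; the rows above add up to 5020, as stated.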

   A.2.2  Table of the most represented species

    Number   Frequency          Species
         1        3653          Baker's yeast (Saccharomyces cerevisiae)
         2        3475          Human
         3        3471          Escherichia coli
         4        2137          Mouse
         5        1866          Rat
         6        1577          Haemophilus influenzae
         7        1389          Bacillus subtilis
         8        1006          Caenorhabditis elegans
         9         833          Bovine
        10         818          Fruit fly (Drosophila melanogaster)
        11         642          Chicken
        12         640          Fission yeast (Schizosaccharomyces pombe)
        13         603          Salmonella typhimurium
        14         508          African clawed frog (Xenopus laevis)
        15         500          Arabidopsis thaliana (Mouse-ear cress)
        16         469          Rabbit
        17         397          Pig
        18         344          Mycoplasma genitalium
        19         326          Maize
        20         275          Bacteriophage T4
        21         256          Rice
        22         253          Vaccinia virus (strain Copenhagen)
        23         240          Pseudomonas aeruginosa
        24         214          Slime mold (Dictyostelium discoideum)
        25         213          Tobacco
        26         203          Pea
        27         193          Human cytomegalovirus (strain AD169)
        28         187          Wheat
        29         184          Vaccinia virus (strain WR)
        30         176          Soybean
        31         175          Barley
        32         171          Staphylococcus aureus
                   171          Dog
        34         165          Pseudomonas putida
                   165          Neurospora crassa
        36         159          Sheep
        37         158          Rhodobacter capsulatus
        38         154          Autographa californica nuclear polyhedrosis virus
        39         150          Marchantia polymorpha (Liverwort)
                   150          Klebsiella pneumoniae
        41         146          Variola virus
                   146          Bacillus stearothermophilus
        43         142          Spinach
                   142          Cyanophora paradoxa
        45         141          Potato
        46         139          Tomato
        47         130          Rhizobium meliloti
        48         123          Odontella sinensis
        49         122          Mycobacterium leprae
        50         119          Lactococcus lactis (subsp. lactis)
        51         117          Agrobacterium tumefaciens
        52         112          Synechocystis sp. (strain PCC 6803)
        53         108          Chlamydomonas reinhardtii
        54         106          Candida albicans
        55         105          Guinea pig
        56         104          Streptomyces coelicolor
                   104          Horse
        58         101          Trypanosoma brucei brucei
                   101          Aspergillus nidulans

   A.3  Distribution of the sequences by size

               From   To  Number             From   To   Number
                  1-  50    2706             1001-1100      471
                 51- 100    4851             1101-1200      340
                101- 150    6660             1201-1300      258
                151- 200    5047             1301-1400      169
                201- 250    4552             1401-1500      146
                251- 300    4075             1501-1600       88
                301- 350    3857             1601-1700       68
                351- 400    3897             1701-1800       63
                401- 450    2963             1801-1900       69
                451- 500    2974             1901-2000       41
                501- 550    2141             2001-2100       24
                551- 600    1521             2101-2200       53
                601- 650    1120             2201-2300       56
                651- 700     824             2301-2400       24
                701- 750     761             2401-2500       31
                751- 800     607             >2500          156
                801- 850     477
                851- 900     481
                901- 950     345
                951-1000     289
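
   The size classes above are 50 residues wide up to 1000 residues and
   100 residues wide from 1001 to 2500, with a single open class beyond
   that. The Python sketch below assigns a sequence length to its class;
   the function name size_bin and the (low, high) return convention are
   assumptions chosen for illustration.

```python
def size_bin(length):
    """Return the (low, high) size class of table A.3 for a sequence length.

    high is None for the open-ended class above 2500 residues.
    """
    if length < 1:
        raise ValueError("sequence length must be at least 1")
    if length > 2500:
        return (2501, None)
    # 50-residue classes up to 1000, 100-residue classes from 1001 to 2500.
    width = 50 if length <= 1000 else 100
    low = (length - 1) // width * width + 1
    return (low, low + width - 1)


if __name__ == "__main__":
    # Lengths chosen to hit both class widths and the open-ended class.
    for n in (50, 51, 1000, 1001, 2500, 5217):
        print(n, size_bin(n))
```

   The counts in the table add up to 52'205, the number of entries in
   this release.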

   A.4  Longest sequences

   The longest sequences (>=4000 residues) are listed here:

                               HTS1_COCCA  5217
                               FAT_DROME   5147
                               RYNR_RABIT  5037
                               RYNR_PIG    5035
                               RYNR_HUMAN  5032
                               RYNC_RABIT  4969
                               DYHC_DICDI  4725
                               DYHC_RAT    4644
                               DYHC_DROME  4639
                               APB_HUMAN   4563
                               APOA_HUMAN  4548
                               RRPA_CVMJH  4488
                               DYHC_ANTCR  4466
                               DYHC_TRIGR  4466
                               GRSB_BACBR  4451
                               PKSK_BACSU  4447
                               PKSL_BACSU  4427
                               YP73_CAEEL  4385
                               DYHC_NEUCR  4367
                               DYHC_EMENI  4344
                               PLEC_RAT    4140
                               DYHC_YEAST  4092
                               RRPA_CVH22  4085

   A.5  Statistics for journal citations

   Total number of journals cited in this release of SWISS-PROT: 710

   A.5.1 Table of the frequency of journal citations

        Journals cited 1x: 275 
                       2x:  99 
                       3x:  43 
                       4x:  28 
                       5x:  28 
                       6x:  14 
                       7x:  10 
                       8x:  13 
                       9x:  13 
                      10x:  10 
                  11- 20x:  54 
                  21- 50x:  45 
                  51-100x:  21 
                    >100x:  57 

   A.5.2  List of the most cited journals in SWISS-PROT

   Citations          Journal abbreviation
   ---------          ----------------------------------
   5010               J. BIOL. CHEM.
   3191               NUCLEIC ACIDS RES.
   3152               PROC. NATL. ACAD. SCI. U.S.A.
   2136               J. BACTERIOL.
   1828               GENE
   1706               FEBS LETT.
   1584               EUR. J. BIOCHEM.
   1436               EMBO J.
   1392               BIOCHEM. BIOPHYS. RES. COMMUN.
   1359               NATURE
   1300               BIOCHEMISTRY
   1092               BIOCHIM. BIOPHYS. ACTA
   1023               J. MOL. BIOL.
    996               CELL
    956               MOL. CELL. BIOL.
    811               MOL. GEN. GENET.
    756               PLANT MOL. BIOL.
    713               VIROLOGY
    708               BIOCHEM. J.
    636               SCIENCE
    585               MOL. MICROBIOL.
    575               J. BIOCHEM.
    458               J. VIROL.
    407               J. GEN. VIROL.
    367               GENOMICS
    335               J. CELL BIOL.
    299               GENES DEV.
    291               PLANT PHYSIOL.
    286               YEAST
    266               CURR. GENET.
    255               J. IMMUNOL.
    255               BIOL. CHEM. HOPPE-SEYLER
    240               ARCH. BIOCHEM. BIOPHYS.
    233               INFECT. IMMUN.
    221               MOL. BIOCHEM. PARASITOL.
    213               HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
    204               HUM. MOL. GENET.
    202               J. GEN. MICROBIOL.
    193               MOL. ENDOCRINOL.
    182               ONCOGENE
    177               J. CLIN. INVEST.
    169               FEMS MICROBIOL. LETT.
    167               AM. J. HUM. GENET.
    149               DNA
    140               J. EXP. MED.
    140               GENETICS
    137               J. MOL. EVOL.
    134               DEVELOPMENT
    123               BLOOD
    120               HUM. MUTAT.
    117               HUM. GENET.
    116               NEURON
    114               DNA CELL BIOL.
    110               NAT. GENET.
    110               APPL. ENVIRON. MICROBIOL.
    109               HEMOGLOBIN
    104               AGRIC. BIOL. CHEM.



   The current  status of the relationships (cross-references) between some
   biomolecular databases is shown in the following schematic:

******************       *  EMBL Nucleotide    *       **********************
* EPD [Euk.Prom] * <---> *  Sequence Database  * <---- * ECDC [E.coli map]  *
******************       *       [EBI]         *       **********************
                          ^  ^ ^  ^  ^ ^ ^  ^
******************        |  | |  I  | | |  |
* FlyBase        * <------+  | |  I  | | |  |          **********************
* [D.melanogas.] *        |  | |  I  | | |  +--------> * GCRDb [7TM recep.] *
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
* SubtiList      * <---------+ |  I  | | +-----------> * EcoGene [E.coli]   *
* [B.subtilis]   *        |  | |  I  | | |  |          **********************
******************        |  | |  I  | | |  |
                          |  | |  I  | | |  |          **********************
******************        |  | |  I  +---------------> * LISTA [Yeast]      *
* MaizeDb        * <-----------+  I  | | |  |          **********************
* [Zea mays]     *        |  | |  I  | | |  |
******************        |  | |  I  | | |  |          **********************
                          |  | |  I  | +-------------> * SGD [Yeast]        *
******************        |  | |  I  | | |  |          **********************
* WormPep        *        |  | |  I  | | |  |
* [C.elegans]    * <----+ |  | |  I  | | |  |          **********************
******************      | |  | |  I  | | |  | +------> * DictyDB [D.disco.] *
                        | |  | |  I  | | |  | |        **********************
******************      | v  v v  v  v v v  v v
* REBASE         *      ***********************        **********************
* [Restriction   * <--- *  SWISS-PROT         * <----- * ENZYME [Nomencl.]  *
*  enzymes]      *      *  Protein Sequence   *        **********************
******************      *  Data Bank          *            v
                        ***********************        **********************
******************      ^ ^ ^ ^ ^ ^ ^ | ^ ^ |          * OMIM [Human]       *
* StyGene        *      | | | | | | | | | | +--------> **********************
* [S.Typhimurium]* <----+ | | | | | | | | |
******************        | | | | | | | | |            **********************
                          | | | | | | | | +----------> * ECO2DBASE     [2D] *
******************        | | | | | | | |              **********************
* Transfac       * <------+ | | | | | | |
******************          | | | | | | |              **********************
                            | | | | | | +------------> * SWISS-2DPAGE  [2D] *
******************          | | | | | |                **********************
* Harefield [2D] * <--------+ | | | | |
******************            | | | | |                **********************
                              | | | | +--------------> * Aarhus/Ghent  [2D] *
******************            | | | |                  **********************
* PROSITE        *            | | | |
* [Patterns and  * <----------+ | | +----------------> **********************
* profiles]      *              | |                    * YEPD [Yeast]  [2D] *
******************              | +----------------+   **********************
             |                  v                  |
             |          ***********************    +-> **********************
             +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] *
                        ***********************        **********************