TrEMBL release 27.0
Published July 6, 2004
UniProt/TrEMBL Release Notes
Release 27, 5th July 2004
EMBL Outstation
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Telephone: (+44 1223) 494 444
Fax: (+44 1223) 494 468
Electronic mail address: datalib@ebi.ac.uk/swissprot@ebi.ac.uk
WWW server: http://www.ebi.ac.uk/
Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire
1, rue Michel Servet
1211 Geneva 4
Switzerland
Telephone: (+41 22) 702 50 50
Fax: (+41 22) 702 58 58
Electronic mail address: swiss-prot@expasy.org
WWW server: http://www.expasy.org/
Protein Information Resource (PIR)
Georgetown University Medical Center
3900 Reservoir Road, NW
Box 571455
Washington, DC 20057-1455
United States of America
Telephone: (+1 202) 687 1039
Fax: (+1 202) 687 0057)
Electronic mail address: pirmail@georgetown.edu
WWW server: http://pir.georgetown.edu
Acknowledgements
UniProt/TrEMBL has been prepared by:
o Claire O'Donovan, Maria Jesus Martin, Yasmin Alam-Faruque,
Nicola Althorpe, Daniel Barrell, Wei mun Chan, Paul Browne,
Kirill Degtyarenko, Ruth Eberhardt, Gill Fraser, Alexander
Fedetov, Rodrigo Fernandez, John Garavelli, Andre Hackmann,
Alan Horne, Julius Jacobsen, Alexander Kanapin, Youla
Karavidopoulou, Paul Kersey, Ernst Kretschmann, Kati Laiho,
Minna Lehvaslaiho, Michele Magrane, Virginie Mittard, Nicola
Mulder, John F. O'Rourke, Sandra Orchard, Astrid Rakow,
Mark Rynbeek, Sandra van den Broek, Eleanor Whitfield, Allyson
Williams and Rolf Apweiler at the EMBL Outstation - European
Bioinformatics Institute (EBI) in Hinxton, UK.
o Amos Bairoch, Alexandre Gattiker, Karine Michoud, Catherine
Rivoire, Nicole Redaschi and Sandrine Pilbout at the Swiss
Institute of Bioinformatics in Geneva, Switzerland.
Copyright Notice
UniProt/TrEMBL copyright (c) 2004 EMBL-EBI
This manual and the database it accompanies may be copied and
redistributed freely, without advance permission, provided
that this copyright statement is reproduced with each copy.
Citation
If you want to cite UniProt/TrEMBL in a publication please use
the following reference:
Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro
S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J.,
Natale D.A., O'Donovan C., Redaschi N. and Yeh L.L.
UniProt: the Universal Protein knowledgebase
Nucleic Acids Res. 32: D115-D119 (2004)
1. Introduction
UniProt/TrEMBL is a computer-annotated protein sequence database
complementing the UniProt/Swiss-Prot database. Together they constitute
the UniProt Knowledgebase. The DDBJ/EMBL/GenBank nucleotide sequence
databases� CDS translations, the sequences of PDB structures, and
directly sequenced peptides extracted from the literature or submitted
directly to UniProt are used by default as the raw material for the
UniProt Knowledgebase. However, some data from DDBJ/EMBL/GenBank including
most of the Whole Genome Shotgun (WGS) data, CDS translations leading to
small fragments or not coding for real proteins, synthetic sequences,
non-germline Immunoglobulins and T-cell receptors, and most patent
application sequences are actively excluded from the Knowledgebase.
Having this data into the Knowledgebase would pollute the database
with highly unstable and low-quality data. However, we do provide all
publicly available protein sequences in the UniProt archive (UniParc)
(http://www.uniprot.org/). UniParc sequences from other UniParc source
records identified by the UniProt curators as important sequences
missing in the Knowledgebase are also used to create new UniProt
Knowledgebase records. This process ensures that the UniProt Knowledgebase
is not missing any important sequences available in the protein sequence
repositories, but minimises the amount of unstable and low quality data in
the Knowledgebase.
2. Why a complement to UniProt/Swiss-Prot?
The ongoing gene sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
UniProt/Swiss-Prot. We do not want to dilute the quality standards of
UniProt/Swiss-Prot by incorporating sequences without proper sequence
analysis and annotation but we do want to make the sequences available
as quickly as possible. UniProt/TrEMBL achieves this goal and is a major
step in the process of speeding up subsequent upgrading of annotation to
the standard UniProt/Swiss-Prot quality.
3. The Release
This UniProt/TrEMBL release has been produced in synch with
UniProt/Swiss-Prot release 44 and together they comprise the UniProt
Knowledgebase release 2.0. It was created from the EMBL Nucleotide
Sequence Database release 79 and updates until the 18-June-2004 and
contains 1'333'917 entries and 413'323'560 amino acids.
UniProt/TrEMBL is organized in subsections:
arc.dat (Archaea): 4245 entries
arp.dat (Complete Archaeal proteomes): 33050 entries
fun.dat (Fungi): 41959 entries
hum.dat (Human): 49176 entries
inv.dat (Invertebrates): 147306 entries
mam.dat (Other Mammals): 18352 entries
mhc.dat (MHC proteins): 10528 entries
org.dat (Organelles): 112691 entries
phg.dat (Bacteriophages): 13750 entries
pln.dat (Plants): 116371 entries
pro.dat (Prokaryotes): 169966 entries
prp.dat (Complete Prokaryotic Proteomes): 330392 entries
rod.dat (Rodents): 47097 entries
unc.dat (Unclassified): 963 entries
vrl.dat (Viruses): 124972 entries
vrt.dat (Other Vertebrates): 30294 entries
vrv.dat (Retroviruses): 116571 entries
275'585 new entries have been integrated in UniProt/TrEMBL.
More statistics for the UniProt/TrEMBL release are available at
http://www.ebi.ac.uk/trembl/
In the document delac_tr.txt, you will find a list of all accession
numbers which were previously present in UniProt/TrEMBL, but which
have now been deleted from the database.
4. Format differences between UniProt/Swiss-Prot and UniProt/TrEMBL
The format and conventions used by UniProt/TrEMBL follow as closely
as possible that of UniProt/Swiss-Prot. Hence, it is not necessary
to produce an additional user manual and extensive release notes
for UniProt/TrEMBL. The information given in the UniProt/Swiss-Prot
release notes and user manual are in general valid for UniProt/TrEMBL.
The differences are mentioned below.
The general structure of an entry is identical in both databases.
The data class used in UniProt/TrEMBL (in the ID line) is always
'PRELIMINARY',whereas in UniProt/Swiss-Prot it is always 'STANDARD'.
Differences in line types:
The ID line (IDentification):
The entry name used in UniProt/TrEMBL is the same as the Accession
Number of the entry.
The DT line (DaTe)
The format of the DT lines that serve to indicate when an entry was
created and updated are identical to that defined in UniProt/Swiss-Prot;
but the DT lines in UniProt/TrEMBL refer to the UniProt/TrEMBL release.
The difference is shown in the example below.
DT lines in a UniProt/Swiss-Prot entry:
DT 01-JAN-1988 (Rel. 06, Created)
DT 01-JUL-1989 (Rel. 11, Last sequence update)
DT 01-AUG-1992 (Rel. 23, Last annotation update)
DT lines in a UniProt/TrEMBL entry:
DT 01-NOV-1996 (TrEMBLrel. 01, Created)
DT 01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
DT 01-MAR-2004 (TrEMBLrel. 26, Last annotation update)
5. Bi-Weekly incremental UniProt Knowledgebase releases
5.1 UniProt Knowledgebase
In addition to full releases, we also provide biweekly two compressed
files: uniprot.sprot.dat.gz and uniprot.trembl.dat.gz at
http://www.uniprot.org/database/download.shtml allowing users access
to the latest data.
5.2 XML
A version of the UniProt Knowledgebase in XML format has been developed
and is provided with this release. More information is available at
http://www.uniprot.org/support/documents.shtml and the data can be
downloaded from http://www.uniprot.org/database/download.shtml
We would welcome any feedback from the user community.
5.3 Varsplic Expand
We also provide Varsplic Expand which is a program to generate
"expanded" sequences from UniProt Knowledgebase records i.e. sequences
including the variants specified by the varsplic, variant and conflict
annotations. New records are produced in either pseudo-UniProt/Swiss-Prot
or FASTA format for each specified variant. More information and the data
is available at http://www.uniprot.org/database/download.shtml
6. Access/Data Distribution
The UniProt/TrEMBL release 27 is available at:
ftp.ebi.ac.uk/pub/databases/trembl
The biweekly UniProt Knowledgebase release is available for searches and
download from http://www.uniprot.org/database/download.shtml
The UniProt Knowledgebase release is also available on CD-ROM from the EBI.
7. General announcements and Forthcoming changes
7.1 Recent and Forthcoming changes documentation for users
We have introduced two new resources for users to enable us to communicate
effectively between releases about what is new in the UniProt Knowledgebase
and what is planned for the future.
These are available at:
http://www.uniprot.org/support/documents.shtml
7.2 TrEMBL enhancements
This release of TrEMBL has been produced from a new relational
database system. This new system enables the biweekly synchronization of
UniProt/TrEMBL with it's source EMBL/DDBJ/GenBank nucleotide sequence
databases. It has also facilitated the integration of various bioinformatic
tools to enhance the UniProt/TrEMBL annotation. As a result, this release of
the database has significant annotation differences with regards to previous
releases and we are committed to further raising the annotation standards.
We welcome feedback from the user community.
