TrEMBL release 27.0
Published July 6, 2004
UniProt/TrEMBL Release Notes
Release 27, 5th July 2004

EMBL Outstation
European Bioinformatics Institute (EBI)
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom

Telephone: (+44 1223) 494 444
Fax: (+44 1223) 494 468
Electronic mail address: firstname.lastname@example.org, email@example.com
WWW server: http://www.ebi.ac.uk/

Swiss Institute of Bioinformatics (SIB)
Centre Medical Universitaire
1, rue Michel Servet
1211 Geneva 4
Switzerland

Telephone: (+41 22) 702 50 50
Fax: (+41 22) 702 58 58
Electronic mail address: firstname.lastname@example.org
WWW server: http://www.expasy.org/

Protein Information Resource (PIR)
Georgetown University Medical Center
3900 Reservoir Road, NW
Box 571455
Washington, DC 20057-1455
United States of America

Telephone: (+1 202) 687 1039
Fax: (+1 202) 687 0057
Electronic mail address: email@example.com
WWW server: http://pir.georgetown.edu

Acknowledgements

UniProt/TrEMBL has been prepared by:

o Claire O'Donovan, Maria Jesus Martin, Yasmin Alam-Faruque, Nicola Althorpe, Daniel Barrell, Wei mun Chan, Paul Browne, Kirill Degtyarenko, Ruth Eberhardt, Gill Fraser, Alexander Fedetov, Rodrigo Fernandez, John Garavelli, Andre Hackmann, Alan Horne, Julius Jacobsen, Alexander Kanapin, Youla Karavidopoulou, Paul Kersey, Ernst Kretschmann, Kati Laiho, Minna Lehvaslaiho, Michele Magrane, Virginie Mittard, Nicola Mulder, John F. O'Rourke, Sandra Orchard, Astrid Rakow, Mark Rynbeek, Sandra van den Broek, Eleanor Whitfield, Allyson Williams and Rolf Apweiler at the EMBL Outstation - European Bioinformatics Institute (EBI) in Hinxton, UK.

o Amos Bairoch, Alexandre Gattiker, Karine Michoud, Catherine Rivoire, Nicole Redaschi and Sandrine Pilbout at the Swiss Institute of Bioinformatics in Geneva, Switzerland.

Copyright Notice

UniProt/TrEMBL copyright (c) 2004 EMBL-EBI

This manual and the database it accompanies may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.

Citation

If you want to cite UniProt/TrEMBL in a publication, please use the following reference:

   Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O'Donovan C., Redaschi N. and Yeh L.L.
   UniProt: the Universal Protein knowledgebase
   Nucleic Acids Res. 32: D115-D119 (2004)

1. Introduction

UniProt/TrEMBL is a computer-annotated protein sequence database complementing the UniProt/Swiss-Prot database. Together they constitute the UniProt Knowledgebase.

By default, the raw material for the UniProt Knowledgebase consists of the CDS translations from the DDBJ/EMBL/GenBank nucleotide sequence databases, the sequences of PDB structures, and directly sequenced peptides extracted from the literature or submitted directly to UniProt. However, some data from DDBJ/EMBL/GenBank are actively excluded from the Knowledgebase: most of the Whole Genome Shotgun (WGS) data, CDS translations leading to small fragments or not coding for real proteins, synthetic sequences, non-germline immunoglobulins and T-cell receptors, and most patent application sequences. Including this data would pollute the Knowledgebase with highly unstable and low-quality data.

We do, however, provide all publicly available protein sequences in the UniProt archive (UniParc) (http://www.uniprot.org/). Sequences from other UniParc source records that the UniProt curators identify as important but missing from the Knowledgebase are also used to create new UniProt Knowledgebase records.
This process ensures that the UniProt Knowledgebase is not missing any important sequences available in the protein sequence repositories, while minimising the amount of unstable and low-quality data in the Knowledgebase.

2. Why a complement to UniProt/Swiss-Prot?

The ongoing gene sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into UniProt/Swiss-Prot. We do not want to dilute the quality standards of UniProt/Swiss-Prot by incorporating sequences without proper sequence analysis and annotation, but we do want to make the sequences available as quickly as possible. UniProt/TrEMBL achieves this goal and is a major step in speeding up the subsequent upgrading of annotation to the standard UniProt/Swiss-Prot quality.

3. The Release

This UniProt/TrEMBL release has been produced in sync with UniProt/Swiss-Prot release 44; together they comprise the UniProt Knowledgebase release 2.0. It was created from the EMBL Nucleotide Sequence Database release 79 and updates until 18-June-2004, and contains 1'333'917 entries and 413'323'560 amino acids.

UniProt/TrEMBL is organized in subsections:

   arc.dat (Archaea):                           4245 entries
   arp.dat (Complete Archaeal proteomes):      33050 entries
   fun.dat (Fungi):                            41959 entries
   hum.dat (Human):                            49176 entries
   inv.dat (Invertebrates):                   147306 entries
   mam.dat (Other Mammals):                    18352 entries
   mhc.dat (MHC proteins):                     10528 entries
   org.dat (Organelles):                      112691 entries
   phg.dat (Bacteriophages):                   13750 entries
   pln.dat (Plants):                          116371 entries
   pro.dat (Prokaryotes):                     169966 entries
   prp.dat (Complete Prokaryotic Proteomes):  330392 entries
   rod.dat (Rodents):                          47097 entries
   unc.dat (Unclassified):                       963 entries
   vrl.dat (Viruses):                         124972 entries
   vrt.dat (Other Vertebrates):                30294 entries
   vrv.dat (Retroviruses):                    116571 entries

275'585 new entries have been integrated in UniProt/TrEMBL.
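
The headline figures in this section are written with an apostrophe as the thousands separator. As a rough illustration only, the following Python sketch parses the three totals quoted above and derives two summary values from them: the mean sequence length and the share of new entries. The helper name parse_count is hypothetical and is not part of any UniProt tool.

```python
def parse_count(text: str) -> int:
    """Convert a count written with apostrophe separators, e.g. "1'333'917"."""
    return int(text.replace("'", ""))


# Totals quoted in the release notes for UniProt/TrEMBL release 27.
entries = parse_count("1'333'917")
amino_acids = parse_count("413'323'560")
new_entries = parse_count("275'585")

# Derived values: average residues per entry and percentage of new entries.
mean_length = amino_acids / entries
new_share = 100 * new_entries / entries

print(f"mean sequence length: {mean_length:.1f} amino acids")
print(f"new entries: {new_share:.1f}% of the release")
```

With the totals above, this gives a mean length of roughly 310 amino acids per entry, with about one entry in five being new in this release.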

More statistics for the UniProt/TrEMBL release are available at http://www.ebi.ac.uk/trembl/

In the document delac_tr.txt, you will find a list of all accession numbers which were previously present in UniProt/TrEMBL, but which have now been deleted from the database.

4. Format differences between UniProt/Swiss-Prot and UniProt/TrEMBL

The format and conventions used by UniProt/TrEMBL follow as closely as possible those of UniProt/Swiss-Prot. Hence, it is not necessary to produce an additional user manual and extensive release notes for UniProt/TrEMBL. The information given in the UniProt/Swiss-Prot release notes and user manual is in general valid for UniProt/TrEMBL. The differences are mentioned below.

The general structure of an entry is identical in both databases. The data class used in UniProt/TrEMBL (in the ID line) is always 'PRELIMINARY', whereas in UniProt/Swiss-Prot it is always 'STANDARD'.

Differences in line types:

The ID line (IDentification):

   The entry name used in UniProt/TrEMBL is the same as the Accession Number of the entry.

The DT line (DaTe):

   The format of the DT lines, which indicate when an entry was created and updated, is identical to that defined in UniProt/Swiss-Prot, but the DT lines in UniProt/TrEMBL refer to the UniProt/TrEMBL release. The difference is shown in the example below.

   DT lines in a UniProt/Swiss-Prot entry:

   DT   01-JAN-1988 (Rel. 06, Created)
   DT   01-JUL-1989 (Rel. 11, Last sequence update)
   DT   01-AUG-1992 (Rel. 23, Last annotation update)

   DT lines in a UniProt/TrEMBL entry:

   DT   01-NOV-1996 (TrEMBLrel. 01, Created)
   DT   01-NOV-1999 (TrEMBLrel. 12, Last sequence update)
   DT   01-MAR-2004 (TrEMBLrel. 26, Last annotation update)

5. Bi-Weekly incremental UniProt Knowledgebase releases

5.1 UniProt Knowledgebase

In addition to full releases, we also provide two compressed files every two weeks, uniprot.sprot.dat.gz and uniprot.trembl.dat.gz, at http://www.uniprot.org/database/download.shtml, giving users access to the latest data.
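
As an illustration of how the DT conventions from section 4 might be handled when reading one of these compressed files, here is a minimal Python sketch. It assumes the file has already been downloaded to a local path and that DT lines look like the examples in section 4, with the line code followed by whitespace. The names DateEvent, parse_dt_line and iter_dt_events are hypothetical and are not part of any UniProt software.

```python
import gzip
import re
from datetime import date
from typing import Iterator, NamedTuple

MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# "Rel." marks a UniProt/Swiss-Prot release, "TrEMBLrel." a UniProt/TrEMBL one.
DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4}) \((TrEMBLrel|Rel)\. (\d+), (.+)\)$")


class DateEvent(NamedTuple):
    when: date
    database: str  # "Swiss-Prot" or "TrEMBL"
    release: int
    event: str     # e.g. "Created"


def parse_dt_line(line: str) -> DateEvent:
    """Parse one DT line in either the Swiss-Prot or the TrEMBL style."""
    match = DT_PATTERN.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a DT line: {line!r}")
    day, month, year, rel_kind, release, event = match.groups()
    database = "TrEMBL" if rel_kind == "TrEMBLrel" else "Swiss-Prot"
    when = date(int(year), MONTHS[month], int(day))
    return DateEvent(when, database, int(release), event)


def iter_dt_events(path: str) -> Iterator[DateEvent]:
    """Yield a DateEvent for every DT line in a gzip-compressed flat file."""
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith("DT "):
                yield parse_dt_line(line)


if __name__ == "__main__":
    # Sample lines taken from the examples in section 4.
    for sample in ("DT   01-JAN-1988 (Rel. 06, Created)",
                   "DT   01-NOV-1996 (TrEMBLrel. 01, Created)"):
        print(parse_dt_line(sample))
```

Calling iter_dt_events("uniprot.trembl.dat.gz") on a locally saved copy would stream the file line by line, so the whole release never has to be held in memory.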

5.2 XML

A version of the UniProt Knowledgebase in XML format has been developed and is provided with this release. More information is available at http://www.uniprot.org/support/documents.shtml and the data can be downloaded from http://www.uniprot.org/database/download.shtml

We would welcome any feedback from the user community.

5.3 Varsplic Expand

We also provide Varsplic Expand, a program that generates "expanded" sequences from UniProt Knowledgebase records, i.e. sequences including the variants specified by the varsplic, variant and conflict annotations. For each specified variant, a new record is produced in either pseudo-UniProt/Swiss-Prot or FASTA format. More information and the data are available at http://www.uniprot.org/database/download.shtml

6. Access/Data Distribution

The UniProt/TrEMBL release 27 is available at: ftp.ebi.ac.uk/pub/databases/trembl

The biweekly UniProt Knowledgebase release is available for searches and download from http://www.uniprot.org/database/download.shtml

The UniProt Knowledgebase release is also available on CD-ROM from the EBI.

7. General announcements and Forthcoming changes

7.1 Recent and Forthcoming changes documentation for users

We have introduced two new resources that let us communicate with users between releases about what is new in the UniProt Knowledgebase and what is planned for the future. These are available at: http://www.uniprot.org/support/documents.shtml

7.2 TrEMBL enhancements

This release of TrEMBL has been produced from a new relational database system. This new system enables the biweekly synchronization of UniProt/TrEMBL with its source EMBL/DDBJ/GenBank nucleotide sequence databases. It has also facilitated the integration of various bioinformatic tools to enhance the UniProt/TrEMBL annotation. As a result, the annotation in this release differs significantly from that of previous releases, and we are committed to further raising the annotation standards.
We welcome feedback from the user community.