UniProt release 12.7
Published January 15, 2008
Addition of more than 40’000 microbial entries derived from automated annotation in UniProtKB
Thanks to genome sequencing efforts, there has been a tremendous rise in the
number of submitted protein sequences. And this is only the beginning, as
faster and cheaper sequencing methods will greatly increase the rate at which
new genomes are sequenced.
Semi-automated annotation methods are
necessary in order to provide the users with a maximum number of annotated
protein sequences. The approach used by UniProtKB/Swiss-Prot differs from most
other automated methods as the bulk of the annotation procedure is still
performed manually, since we want to make sure that we produce high quality
annotation with a minimal amount of incorrect inferences.
Our automatic annotation project is called HAMAP, which stands for High-quality Automated and Manual Annotation of microbial Proteomes. In the context of this project, proteins
from complete bacterial and archaeal proteomes, together with the related
plastid proteins, are automatically annotated based on manually created family
rules for complete protein annotation, with template-based feature propagation.
We are very aware of the danger posed by automatic annotation procedures and
have been extremely careful in the implementation of the pipeline, establishing
many checks and conditional propagation in order to ensure that automatic
annotation will produce data whose quality matches that of manual curation.
With this release, we have begun automatically integrating into
UniProtKB/Swiss-Prot the entries annotated by the HAMAP automated pipeline;
over 40’000 bacterial and archaeal entries were integrated. This is the largest
number of entries ever integrated in a single release.
It must be noted that the planned introduction of ‘evidence tags’ should
allow us to unambiguously flag whether an information item has been derived
manually or automatically. For the time being, all entries annotated by the HAMAP pipeline have a cross-reference to HAMAP (for an example
see entry Q02JM4).
Cross-references to dictyBase
The DictyBase database was renamed dictyBase.
We changed the database name in the relevant cross-references (DR lines in the flat file) accordingly.
DR dictyBase; DDB0201569; manA.
Cross-references to PDBsum
Cross-references have been added to the PDBsum database. PDBsum provides an
overview of every macromolecular structure deposited in the Protein Data Bank
(PDB), giving schematic diagrams of the molecules in each structure and of the
interactions between them.
The PDBsum database is available at http://www.ebi.ac.uk/pdbsum.
The format of the explicit links in the flat file is:
| Resource identifier | PDB entry name. |
Q07540:
DR PDBsum; 2FQL; -.
DR PDBsum; 2GA5; -.
P78536:
DR PDBsum; 1BKC; -.
DR PDBsum; 1ZXC; -.
DR PDBsum; 2A8H; -.
DR PDBsum; 2DDF; -.
DR PDBsum; 2FV5; -.
DR PDBsum; 2FV9; -.
DR PDBsum; 2I47; -.
Cross-references to VectorBase
Cross-references have been added to VectorBase, the
Invertebrate Vectors of Human Pathogens database. VectorBase is a NIAID Bioinformatics
Resource Center for Invertebrate Vectors of Human Pathogens. VectorBase annotates and
maintains vector genomes, providing an integrated resource for the research community.
The VectorBase database is available at http://www.vectorbase.org/index.php.
The format of the explicit links in the flat file is:
| Resource identifier | VectorBase Gene ID. |
| Optional information 1 | Species name. |
Q17KX3:
DR VectorBase; AAEL001551; Aedes aegypti.
Q7PD39:
DR VectorBase; AGAP005024; Anopheles gambiae.
DR VectorBase; AGAP005025; Anopheles gambiae.
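The DR line layout described above (database name, resource identifier, then optional information fields, separated by semicolons and closed by a period) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_dr_line` and the returned dictionary keys are hypothetical, and it assumes that a lone `-` marks an empty optional field, as in the PDBsum examples.

```python
from typing import Dict, List, Union


def parse_dr_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one flat-file DR line into its semicolon-separated fields.

    Hypothetical helper for illustration; not part of any UniProt tool.
    """
    tag, _, body = line.strip().partition(" ")
    if tag != "DR":
        raise ValueError(f"not a DR line: {line!r}")
    body = body.strip()
    # Drop the single period that terminates the line.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2:
        raise ValueError(f"DR line has no resource identifier: {line!r}")
    database, identifier, *optional = fields
    # Assumption: a lone '-' is a placeholder for an empty optional field.
    optional = [field for field in optional if field != "-"]
    return {"database": database, "identifier": identifier, "optional": optional}


print(parse_dr_line("DR VectorBase; AAEL001551; Aedes aegypti."))
print(parse_dr_line("DR PDBsum; 2FQL; -."))
```

For the VectorBase line the species name ends up in the `optional` list, while the PDBsum line yields an empty list because its only optional field is the `-` placeholder.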
Release of new species-specific documents
There are 9 new documents for several Brucella, Rickettsia and Coxiella complete
proteomes, listing all the UniProtKB/Swiss-Prot entries from these proteomes and their
corresponding gene designations.
The documents contain, for each relevant UniProtKB/Swiss-Prot entry, the corresponding
ordered locus name, entry name, accession number, sequence length and gene name(s).
- Brucella abortus strain 2308: brua2.txt (ftp)
- Brucella abortus: bruab.txt (ftp)
- Brucella melitensis: brume.txt (ftp)
- Brucella suis: brusu.txt (ftp)
- Coxiella burnetii: coxbu.txt (ftp)
- Rickettsia bellii strain RML369-C: ricbr.txt (ftp)
- Rickettsia conorii: riccn.txt (ftp)
- Rickettsia felis (Rickettsia azadi): ricfe.txt (ftp)
- Rickettsia typhi: ricty.txt (ftp)
Changes concerning keywords
- GM2-gangliosidosis -> Gangliosidosis
- Ribosomal frameshift -> Ribosomal frameshifting
Changes in subcellular location controlled vocabulary
New subcellular location:
- Cyanelle stroma
New clustered sequence sets
The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository
specifically developed for metagenomic and environmental data.
We now provide UniMES clusters, i.e. clustered sets of sequences at two resolutions:
100% identity (unimes_cluster100.fasta) and at least 90% identity (unimes_cluster90.fasta).
In unimes_cluster100.fasta, identical sequences and subfragments from unimes.fasta are placed into
a single cluster.
The unimes_cluster90.fasta file is built by clustering the representative sequences of
unimes_cluster100.fasta (the longest sequence in each cluster) using the CD-HIT algorithm
(Li W., Jaroszewski L., and Godzik A., Bioinformatics, 17: 282-283, 2001)
such that each cluster is composed of sequences that have at least 90% sequence identity
to the representative sequence. Only the representative sequences of the clusters are present
in these files.
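The 100% clustering step can be illustrated with a toy Python sketch. It is not the UniProt pipeline and it does not implement CD-HIT: it only shows the idea of placing identical sequences and subfragments in one cluster whose representative is the longest sequence. The function name `cluster_identical_and_fragments` and the sample sequences are made up for illustration, and the quadratic substring search would be far too slow for real metagenomic data.

```python
from typing import Dict, List


def cluster_identical_and_fragments(sequences: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each representative ID to the IDs of all members of its cluster.

    Toy illustration only: a sequence joins the first cluster whose
    representative contains it as an exact substring.
    """
    # Visit longest sequences first so that a representative is always
    # created before any of its fragments; ties are broken by ID.
    ordered = sorted(sequences, key=lambda sid: (-len(sequences[sid]), sid))
    clusters: Dict[str, List[str]] = {}
    for sid in ordered:
        seq = sequences[sid]
        for rep in clusters:
            if seq in sequences[rep]:
                clusters[rep].append(sid)
                break
        else:
            # No existing representative contains this sequence.
            clusters[sid] = [sid]
    return clusters


# Made-up example sequences, not taken from unimes.fasta.
example = {
    "a": "MKTAYIAKQR",
    "b": "TAYIA",       # subfragment of "a"
    "c": "MKTAYIAKQR",  # identical to "a"
    "d": "GGGSW",       # unrelated
}
print(cluster_identical_and_fragments(example))
# {'a': ['a', 'c', 'b'], 'd': ['d']}
```

Writing out only the keys of the returned dictionary would correspond to keeping only the representative sequences in the clustered file.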