Release 14.0
Published July 22, 2008
Headlines
A UniProtKB major release (14.0) on our brand-new UniProt website
After almost one year of beta testing, the UniProt consortium is proud to announce the release of its new official unified website: a new interface, a new search engine and many new options to serve you better. The content of the various databases we provide is unchanged, apart from the improvements we continue to make with each new release, as we have done since the very beginning of our existence, some 20 years ago in the case of our oldest database, UniProtKB/Swiss-Prot. Many documents are available on the Documentation/help page, including FAQs. Please do not hesitate to contact us with any further questions, remarks or update requests.
UniProt Knowledgebase release 14.0 includes Swiss-Prot release 56.0 and TrEMBL release 39.0.
Release 56.0 of 22-Jul-08 of UniProtKB/Swiss-Prot contains 392'667 sequence entries, comprising 141'217'034 amino acids abstracted from 172'036 references. 36'631 sequences have been added since release 55.0, the sequence data of 605 existing entries has been updated and the annotations of 356'036 entries have been revised.
The following improvements were carried out in the last 5 months:
- We have structured the UniProtKB DE lines. The new format includes 3 categories:
  - 'RecName' is the protein name recommended by the UniProt Consortium,
  - 'AltName' represents synonyms found in the literature or in other databases,
  - 'SubName' is the name provided by the submitters of the underlying nucleotide sequence. It is found in UniProtKB/TrEMBL only.
  - Abbreviations and acronyms are available in the 'Short' subcategory,
  - WHO INN (International Nonproprietary Names) are found in the 'INN' subcategory,
  - EC (Enzyme nomenclature) numbers are located in the 'EC' subcategory.
- We have changed the FASTA headers in order to make them compatible with the -o option of the NCBI's program formatdb.
- We have added a new term to the list of valid plastid values in the OG line: Chromatophore.
- We have enriched the controlled vocabulary for post-translational modification descriptions with 20 new terms: 12 for the feature key 'CROSSLNK' and 8 for the feature key 'MOD_RES'.
- We have added cross-references to 7 new databases, bringing the total number of explicit cross-references to 102: AGRICOLA, Candida Genome Database (CGD), HOGENOM, HOVERGEN, NMPDR, ProMEX, and BindingDB. We have removed cross-references to HIV and TRANSFAC.
- We have added 22 new keywords and deleted 2. 12 of the newly created keywords are medical ones.
- We have added 1 new document:
For additional information, see UniProt release notes.
UniProtKB News
Change of the protein names section (DE line in the flat file)
Until now, the UniProtKB protein names section (DE lines in the flat file) listed protein names in a computer-parsable format, but with a minimal amount of structure. In UniProtKB/Swiss-Prot, the description starts with the recommended name of the protein and additional alternative names are indicated between parentheses. In UniProtKB/TrEMBL, the description is derived directly from the underlying nucleotide entry and its accuracy relies on the information provided by the submitter of the nucleotide entry, unless it has been improved by automatic annotation procedures.
Consistent nomenclature is indispensable for communication, literature searching and entry retrieval. The protein names provided in the description lines of UniProtKB/Swiss-Prot are widely used by life scientists and often propagated during the annotation of new genomic sequences. For these reasons we have structured the UniProtKB DE lines more explicitly: we introduced 3 categories, as well as several subcategories, of protein names:
| Category Field | Subcategory Field | Cardinality | Description |
|---|---|---|---|
| RecName: | | 1 in UniProtKB/Swiss-Prot, 0-1 in UniProtKB/TrEMBL | The name recommended by the UniProt consortium. |
| | Full= | 1 | The full name. |
| | Short= | 0-n | An abbreviation of the full name or an acronym. |
| | EC= | 0-n | An Enzyme Commission number. |
| AltName: | | 0-n | A synonym of the recommended name. |
| | Full= | 0-1 | The full name. |
| | Short= | 0-n | An abbreviation of the full name or an acronym. |
| | EC= | 0-n | An Enzyme Commission number. |
| AltName: | Allergen= | 0-1 | See allergen.txt. |
| AltName: | Biotech= | 0-1 | A name used in a biotechnological context. |
| AltName: | CD_antigen= | 0-n | See cdlist.txt. |
| AltName: | INN= | 0-n | The international nonproprietary name: A generic name for a pharmaceutical substance or active pharmaceutical ingredient that is globally recognized and is a public property. |
| SubName: | | 0 in UniProtKB/Swiss-Prot, 0-n in UniProtKB/TrEMBL | A name provided by the submitter of the underlying nucleotide sequence. |
| | Full= | 1 | The full name. |
| | EC= | 0-n | An Enzyme Commission number. |
Each name is shown on a separate line; lines may therefore exceed 75 characters.
A block of DE lines may further contain multiple Includes: and/or Contains: sections and a separate field Flags: to indicate whether the protein sequence is a precursor or a fragment:
| Field | Cardinality | Value |
|---|---|---|
| Includes: | 0-n | A block of protein names as described in the table above. |
| Contains: | 0-n | A block of protein names as described in the table above. |
| Flags: | 0-1 | Precursor and/or Fragment or Fragments |
Examples:
P09919:
Previous format:
DE Granulocyte colony-stimulating factor precursor (G-CSF) (Pluripoietin)
DE (Filgrastim) (Lenograstim).
New format:
DE RecName: Full=Granulocyte colony-stimulating factor;
DE Short=G-CSF;
DE AltName: Full=Pluripoietin;
DE AltName: INN=Filgrastim;
DE AltName: INN=Lenograstim;
DE Flags: Precursor;

Q10743:
Previous format:
DE ADAM 10 precursor (EC 3.4.24.81) (A disintegrin and metalloproteinase
DE domain 10) (Mammalian disintegrin-metalloprotease) (Kuzbanian protein
DE homolog) (CD156c antigen) (Fragment).
New format:
DE RecName: Full=ADAM 10;
DE EC=3.4.24.81;
DE AltName: Full=A disintegrin and metalloproteinase domain 10;
DE AltName: Full=Mammalian disintegrin-metalloprotease;
DE AltName: Full=Kuzbanian protein homolog;
DE AltName: CD_antigen=CD156c;
DE Flags: Precursor; Fragment;

Q07908:
Previous format:
DE Arginine biosynthesis bifunctional protein argJ [Includes: Glutamate
DE N-acetyltransferase (EC 2.3.1.35) (Ornithine acetyltransferase)
DE (Ornithine transacetylase) (OATase); Amino-acid acetyltransferase
DE (EC 2.3.1.1) (N-acetylglutamate synthase) (AGS)] [Contains: Arginine
DE biosynthesis bifunctional protein argJ alpha chain; Arginine
DE biosynthesis bifunctional protein argJ beta chain].
New format:
DE RecName: Full=Arginine biosynthesis bifunctional protein argJ;
DE Includes:
DE RecName: Full=Glutamate N-acetyltransferase;
DE EC=2.3.1.35;
DE AltName: Full=Ornithine acetyltransferase;
DE Short=OATase;
DE AltName: Full=Ornithine transacetylase;
DE Includes:
DE RecName: Full=Amino-acid acetyltransferase;
DE EC=2.3.1.1;
DE AltName: Full=N-acetylglutamate synthase;
DE Short=AGS;
DE Contains:
DE RecName: Full=Arginine biosynthesis bifunctional protein argJ alpha chain;
DE Contains:
DE RecName: Full=Arginine biosynthesis bifunctional protein argJ beta chain;
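To illustrate how the structured DE lines lend themselves to parsing, here is a minimal Python sketch. It is not part of any UniProt tool: the function name `parse_de_block` and the layout of the returned dictionary are assumptions made for this example, and it only handles the categories, subcategories, `Includes:`, `Contains:` and `Flags:` fields described above, with one name per line.

```python
import re

# "Category: Key=value;" or a bare "Key=value;" subcategory line.
DE_NAME = re.compile(r"^(?:(RecName|AltName|SubName):\s*)?(\w+)=(.+);$")


def parse_de_block(lines):
    """Parse structured DE lines into a nested dict (illustrative only)."""
    root = {"names": [], "includes": [], "contains": [], "flags": []}
    section = root  # the block that currently receives names
    for raw in lines:
        # Drop the "DE" line code and surrounding whitespace.
        text = raw[2:].strip() if raw.startswith("DE") else raw.strip()
        if text in ("Includes:", "Contains:"):
            # Each Includes:/Contains: header opens a new block of names.
            section = {"names": []}
            root[text[:-1].lower()].append(section)
            continue
        if text.startswith("Flags:"):
            flags = text[len("Flags:"):].split(";")
            root["flags"] = [f.strip() for f in flags if f.strip()]
            continue
        match = DE_NAME.match(text)
        if match is None:
            raise ValueError(f"unrecognised DE line: {raw!r}")
        category, key, value = match.groups()
        if category:
            # A category prefix starts a new name entry.
            section["names"].append({"category": category, key: [value]})
        else:
            # A bare subcategory (Short=, EC=) extends the latest name.
            section["names"][-1].setdefault(key, []).append(value)
    return root


example = parse_de_block([
    "DE RecName: Full=Granulocyte colony-stimulating factor;",
    "DE Short=G-CSF;",
    "DE AltName: Full=Pluripoietin;",
    "DE AltName: INN=Filgrastim;",
    "DE AltName: INN=Lenograstim;",
    "DE Flags: Precursor;",
])
print(example["names"][0]["Full"], example["flags"])
```

Values are stored as lists because several subcategories (for example `Short=` and `EC=`) have a cardinality of 0-n.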
Changes in the FASTA header line
The UniProtKB FASTA headers were unfortunately incompatible with the -o option of the NCBI's program formatdb. We have been working with the NCBI to remedy this and changes were required on both sides. The new version of formatdb now accepts a database code for UniProtKB/TrEMBL, and we have modified our UniProtKB FASTA headers accordingly. For consistency reasons, we also changed the FASTA headers of the other UniProt databases.
UniProtKB
>db|UniqueIdentifier|EntryName ProteinName OS=OrganismName[ GN=GeneName] PE=ProteinExistence SV=SequenceVersion
Where:
- db is 'sp' for UniProtKB/Swiss-Prot and 'tr' for UniProtKB/TrEMBL.
- UniqueIdentifier is the primary accession number of the UniProtKB entry.
- EntryName is the entry name of the UniProtKB entry.
- ProteinName is the recommended name of the UniProtKB entry as annotated in the RecName field from release 14.0 on. For UniProtKB/TrEMBL entries without a RecName field, the SubName field is used. The 'precursor' attribute is excluded, 'Fragment' is included with the name if applicable.
- OrganismName is the scientific name of the organism of the UniProtKB entry.
- GeneName is the first gene name of the UniProtKB entry. If there is no gene name, OrderedLocusName or ORFname, the GN field is not listed.
- ProteinExistence is the numerical value describing the evidence for the existence of the protein.
- SequenceVersion is the version number of the sequence.
Examples:
>sp|Q8I6R7|ACN2_ACAGO Acanthoscurrin-2 (Fragment) OS=Acanthoscurria gomesiana GN=acantho2 PE=1 SV=1
>sp|P27748|ACOX_RALEH Acetoin catabolism protein X OS=Ralstonia eutropha (strain ATCC 17699 / H16 / DSM 428 / Stanier 337) GN=acoX PE=4 SV=2
>sp|P04224|HA22_MOUSE H-2 class II histocompatibility antigen, E-K alpha chain OS=Mus musculus PE=1 SV=1
>tr|A3SA23|A3SA23_9RHOB TonB dependent, hydroxamate-type ferrisiderophore, outer membrane receptor OS=Sulfitobacter sp. EE-36 GN=EE36_08023 PE=3 SV=1
>tr|Q8N2H2|Q8N2H2_HUMAN CDNA FLJ90785 fis, clone THYRO1001457, moderately similar to H.sapiens protein kinase C mu OS=Homo sapiens PE=2 SV=1

Alternative isoforms (this only applies to UniProtKB/Swiss-Prot):
>sp|IsoID|EntryName Isoform IsoformName of ProteinName OS=OrganismName[ GN=GeneName]
Where:
- IsoID is the isoform identifier as assigned in the ALTERNATIVE PRODUCTS section of the UniProtKB entry.
- IsoformName is the isoform name as annotated in the ALTERNATIVE PRODUCTS Name field of the UniProtKB entry.
Example:
>sp|Q4R572-2|1433B_MACFA Isoform Short of 14-3-3 protein beta/alpha OS=Macaca fascicularis GN=YWHAB
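As an illustration of the UniProtKB header layout described above, the following Python sketch splits a header into its fields with a regular expression. The function name `parse_uniprotkb_header` and the dictionary keys are assumptions made for this example. The sketch also assumes that the protein name never contains the text ` OS=` and that the gene name contains no spaces, which holds for the examples shown here but is not guaranteed by the format description.

```python
import re

# GN, PE and SV are optional so that isoform headers (no PE/SV) also match.
UNIPROTKB_HEADER = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+) "
    r"(?P<protein_name>.+?) OS=(?P<organism>.+?)"
    r"(?: GN=(?P<gene>\S+))?"
    r"(?: PE=(?P<pe>\d+))?"
    r"(?: SV=(?P<sv>\d+))?$"
)


def parse_uniprotkb_header(line):
    """Return the header fields as a dict; missing optional fields are None."""
    match = UNIPROTKB_HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {line!r}")
    fields = match.groupdict()
    for key in ("pe", "sv"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


header = (
    ">sp|P27748|ACOX_RALEH Acetoin catabolism protein X "
    "OS=Ralstonia eutropha (strain ATCC 17699 / H16 / DSM 428 / Stanier 337) "
    "GN=acoX PE=4 SV=2"
)
print(parse_uniprotkb_header(header))
```

The organism group is matched lazily and anchored by the optional trailing fields, so organism names that contain spaces, slashes or parentheses are kept intact.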
UniRef
>UniqueIdentifier ClusterName n=Members Tax=Taxon RepID=RepresentativeMember
Where:
- UniqueIdentifier is the primary accession number of the UniRef cluster.
- ClusterName is the name of the UniRef cluster.
- Members is the number of UniRef cluster members.
- Taxon is the scientific name of the lowest common taxon shared by all UniRef cluster members.
- RepresentativeMember is the entry name of the representative member of the UniRef cluster.
Example:
>UniRef100_A5DI11 Elongation factor 2 n=1 Tax=Pichia guilliermondii RepID=EF2_PICGU
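The UniRef header can be taken apart in the same spirit. The sketch below is only an illustration under the assumption that the cluster name never contains ` n=` followed by digits and ` Tax=`; the function name `parse_uniref_header` and the dictionary keys are hypothetical.

```python
import re

UNIREF_HEADER = re.compile(
    r"^>(?P<cluster_id>\S+) (?P<cluster_name>.+?) n=(?P<members>\d+) "
    r"Tax=(?P<taxon>.+?) RepID=(?P<representative>\S+)$"
)


def parse_uniref_header(line):
    """Split a UniRef FASTA header into its five documented fields."""
    match = UNIREF_HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniRef FASTA header: {line!r}")
    fields = match.groupdict()
    fields["members"] = int(fields["members"])  # cluster size as a number
    return fields


print(parse_uniref_header(
    ">UniRef100_A5DI11 Elongation factor 2 n=1 "
    "Tax=Pichia guilliermondii RepID=EF2_PICGU"
))
```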
UniParc
>UniqueIdentifier status=Status
Where:
- UniqueIdentifier is the primary accession number of the UniParc entry.
- Status is 'active' if the UniParc entry has at least one active cross-reference, and 'inactive' if it does not have any active cross-references.
Example:
>UPI0000000005 status=active
UniMES
>UniqueIdentifier ProteinName OS=OrganismName[ Pep=SourcePeptideIdentifier] SV=SequenceVersion
Where:
- UniqueIdentifier is the primary accession number of the UniMES entry.
- ProteinName is the protein name of the UniMES entry.
- OrganismName is the scientific name of the organism (group) of the UniMES entry.
- SourcePeptideIdentifier is the (optional) peptide identifier provided by the submitter.
- SequenceVersion is the version number of the sequence.
Example:
>MES00000000005 Putative uncharacterized protein GOS_3018412 (Fragment) OS=marine metagenome Pep=JCVI_PEP_1096688850003 SV=1
Archived UniProtKB sequence versions
>db|UniqueIdentifier archived from Release ReleaseNumber ReleaseDate SV=SequenceVersion
Where:
- db is 'sp' for UniProtKB/Swiss-Prot and 'tr' for UniProtKB/TrEMBL.
- UniqueIdentifier is the primary accession number of the UniProtKB entry.
- ReleaseNumber refers to the release from which the sequence was archived (Swiss-Prot or TrEMBL release numbers for releases prior to the first UniProt release, and both UniProt and Swiss-Prot or TrEMBL release numbers for releases after the first UniProt release).
- ReleaseDate is the date of the release from which the sequence was archived.
- SequenceVersion is the version number of the sequence.
Examples:
"pre-UniProt":
>sp|P05067 archived from Release 18.0 01-MAY-1991 SV=3
>tr|Q55167 archived from Release 17.0 01-JUN-2001 SV=1
"post-UniProt":
>sp|P05067 archived from Release 9.2/51.2 28-NOV-2006 SV=3
>tr|A0RTJ8 archived from Release 11.0/36.0 29-MAY-2007 SV=1
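The two release-number styles can be told apart when reading an archived header. The following Python sketch is an illustration only: the function name `parse_archived_header` and the returned keys are assumptions, and the date pattern (DD-MON-YYYY with an upper-case English month) is inferred from the examples above rather than from a formal specification.

```python
import re
from datetime import date

ARCHIVED_HEADER = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>\S+) archived from Release "
    r"(?P<release>\d+\.\d+(?:/\d+\.\d+)?) "
    r"(?P<date>\d{2}-[A-Z]{3}-\d{4}) SV=(?P<sv>\d+)$"
)

# Explicit month table, so parsing does not depend on the system locale.
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}


def parse_archived_header(line):
    """Parse an archived-sequence header; uniprot_release is None pre-UniProt."""
    match = ARCHIVED_HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not an archived UniProtKB header: {line!r}")
    day, month, year = match["date"].split("-")
    # "9.2/51.2" -> ("9.2", "51.2"); "18.0" -> ("", "18.0")
    uniprot_release, _, db_release = match["release"].rpartition("/")
    return {
        "db": match["db"],
        "accession": match["accession"],
        "uniprot_release": uniprot_release or None,
        "db_release": db_release,
        "release_date": date(int(year), MONTHS[month], int(day)),
        "sequence_version": int(match["sv"]),
    }


print(parse_archived_header(
    ">sp|P05067 archived from Release 9.2/51.2 28-NOV-2006 SV=3"
))
```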
Chromatophore: a new organelle value in the 'Encoded on' subsection (OG line in the flat file).
We have added Chromatophore to the list of valid plastid values in the OG (OrGanelle) line in the flat file. The chromatophore is the photosynthetic inclusion found in Paulinella chromatophora, a photosynthetic thecate amoeba. It encodes and houses the machinery necessary for photosynthesis and CO2 fixation; it also has the genetic capacity to synthesize some amino acids, some fatty acids and a few cofactors. It is not yet clear whether the chromatophore derives from the same endosymbiotic event that is thought to have led to all other plastids. The chromatophore genome of P. chromatophora has been sequenced (PubMed:18356055) and been found to be just over 1 Mb, approximately 9 times larger than the average photosynthetic plastid and approximately 1/3 smaller than the smallest cyanobacterial genome.
Example:
OG Plastid; Chromatophore.
On the website, the information is found in the 'Names and origin' section, 'Encoded on' subsection (see for instance B1X5D6).
Cross-references to BindingDB
Cross-references have been added to The Binding Database. BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be drug-targets with small, drug-like molecules.
The Binding Database is available at http://www.bindingdb.org/.
The format of the explicit link in the flat file is:
| Data bank identifier | BindingDB |
|---|---|
| Primary identifier | The primary identifier consists of a UniProtKB accession number. |
| Secondary identifier | None; a dash '-' is stored in that field. |

Examples:
P50613: DR BindingDB; P50613; -.
P68850: DR BindingDB; P68850; -.
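A DR line of this shape can be read with a few lines of Python. This is a minimal sketch assuming semicolon-separated fields and a terminating full stop, as in the examples above; the function name `parse_dr_line` and the returned keys are hypothetical.

```python
def parse_dr_line(line):
    """Split a DR line into database, primary id and secondary ids."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    if not body.endswith("."):
        raise ValueError(f"DR line must end with a full stop: {line!r}")
    fields = [field.strip() for field in body[:-1].split(";")]
    database, primary, *secondary = fields
    return {
        "database": database,
        "primary": primary,
        # A dash marks an empty identifier field.
        "secondary": [None if field == "-" else field for field in secondary],
    }


print(parse_dr_line("DR BindingDB; P50613; -."))
```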
UniProt decoy databases
The target-decoy search strategy, which has become widespread and is recommended in journal guidelines, consists of attaching a decoy database to a forward database and searching MS/MS spectra against this composite database. It is more stringent than a simple search and allows the false discovery rate to be estimated.
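The text above does not specify how the false discovery rate is estimated. One commonly used estimate, shown here purely as an assumption, is the number of decoy matches divided by the number of target matches above a score threshold. In the sketch below, the function name `estimate_fdr`, the (score, is_decoy) input layout and the example scores are all hypothetical.

```python
def estimate_fdr(matches, threshold):
    """Estimate FDR as decoy hits / target hits at or above a score threshold.

    `matches` is an iterable of (score, is_decoy) pairs, one per spectrum
    match, where a higher score means a better match.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in matches:
        if score < threshold:
            continue
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    # Without any target hit there is nothing to report an error rate for.
    return decoys / targets if targets else 0.0


# Made-up scores for illustration only.
matches = [(9.1, False), (8.7, False), (8.2, True),
           (7.9, False), (7.5, False), (6.0, True)]
print(estimate_fdr(matches, threshold=7.0))  # 1 decoy / 4 targets = 0.25
```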
For this strategy to be efficient, the decoy database has to preserve the general composition of the target database while minimizing the peptide sequence overlap between the target and the decoy.
We developed a new algorithm that shuffles proteins and keeps re-shuffling each tryptic peptide until it no longer matches any peptide from the original database. This method ensures that no tryptic peptide is shared between the target and decoy databases.
Decoy versions of UniProtKB/Swiss-Prot, UniProtKB/TrEMBL and UniRef100 can now be retrieved in FASTA format from our public FTP site.
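The re-shuffling idea can be sketched in Python as follows. This is not the algorithm used to build the released decoy databases, whose details are not given here. It is a simplified illustration under several assumptions: trypsin is modelled as cleaving after K or R unless followed by P, residues are shuffled within each tryptic peptide while its last residue stays in place, and a peptide that cannot be made unique within `max_tries` attempts (for instance a single residue) is kept unchanged. The names `tryptic_peptides`, `make_decoys` and the `decoy_` prefix are hypothetical.

```python
import random
import re


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (simplified rule).

    Splitting on a zero-width pattern requires Python 3.7 or later.
    """
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def make_decoys(targets, max_tries=100, seed=0):
    """Build one decoy per target by re-shuffling each tryptic peptide.

    `targets` maps an identifier to a protein sequence.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    target_peptides = {
        pep for seq in targets.values() for pep in tryptic_peptides(seq)
    }
    decoys = {}
    for name, seq in targets.items():
        pieces = []
        for pep in tryptic_peptides(seq):
            # Keep the last residue so the cleavage site stays in place.
            body, last = list(pep[:-1]), pep[-1]
            candidate = pep
            for _ in range(max_tries):
                rng.shuffle(body)
                candidate = "".join(body) + last
                if candidate not in target_peptides:
                    break  # no longer matches any target peptide
            pieces.append(candidate)
        decoys["decoy_" + name] = "".join(pieces)
    return decoys


targets = {"P1": "MTAYLEDKGHWSVNQR", "P2": "ACDEFGHIK"}
for name, seq in make_decoys(targets).items():
    print(name, seq)
```

Because residues only move within a peptide, each decoy keeps the length and amino-acid composition of its target. Moving a proline, or a K or R that precedes one, can change the cleavage pattern, so a production implementation would need to re-digest the decoy and check it again.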
Changes concerning keywords
New keywords:
Deleted keywords:
- Inner membrane
- Outer membrane
Changes in subcellular location controlled vocabulary
New subcellular locations:
- Chromatophore
- Chromatophore membrane
- Chromatophore stroma
- Chromatophore thylakoid
- Chromatophore thylakoid lumen
- Chromatophore thylakoid membrane
Changes in PTM controlled vocabulary
New terms in the 'Amino acid modifications' subsection (feature key 'CROSSLNK' in the flat file):
- (2-aminosuccinimidyl)acetic acid (Asp-Gly)
- 5'-tyrosyl-5'-aminotyrosine (Tyr-Tyr) (interchain)
- Cyclopeptide (Ala-Ile)
- Cyclopeptide (Leu-Trp)
- Cyclopeptide (Pro-Met)
- Cyclopeptide (Pro-Tyr)
- Cyclopeptide (Ser-Gly)
- Isoaspartyl lysine isopeptide (Lys-Asn)
- Lysine tyrosylquinone (Tyr-Lys)
- S-cysteinyl 3-(oxidosulfanyl)alanine (Cys-Cys)
- Threonyl lysine isopeptide (Lys-Thr) (interchain with T-...)
- Threonyl lysine isopeptide (Thr-Lys) (interchain with K-...)
New terms in the 'Amino acid modifications' subsection (feature key 'MOD_RES' in the flat file):
- Aminomalonic acid (Ser)
- Aspartate 1-(chondroitin 4-sulfate)-ester
- Aspartic acid 1-[(3-aminopropyl)(5'-adenosyl)phosphono]amide
- Aspartyl isopeptide (Asp)
- Beta-decarboxylated aspartate
- N-D-glucuronoyl glycine
- O-AMP-tyrosine
- O-UMP-tyrosine



