UniProt release 5.0
Published May 10, 2005
UniProtKB/Swiss-Prot major release (47.0)Release 47.0 of UniProtKB/Swiss-Prot contains 181'571 sequence entries, comprising 65'742'349 amino acids abstracted from 128'438 references. 11'531 sequences have been added since release 46, the sequence data of 841 existing entries has been updated and the annotations of 166'572 entries have been revised. This represents an increase of 6%.
Many improvements were carried out in the last 3 months. In particular, we have introduced 5 new feature keys in order to better describe different types of regions in a protein sequence, and we added an additional qualifier to our cross-references to nucleotide sequence databases, the molecule type.
All the recent changes to the Swiss-Prot format are described in detail in the continuously updated document:
UniProt Knowledgebase release 5.0 includes Swiss-Prot release 47.0 and TrEMBL release 30.0.
Format change in the DR line
The DR (Database cross-Reference) lines are used as pointers to information in external data resources that is related to UniProtKB entries. Until now, the format of a DR line was:
DR DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[; TERTIARY_IDENTIFIER].
We have introduced a forth identifier, changing the DR line format to:
DR DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[; TERTIARY_IDENTIFIER][; QUATERNARY_IDENTIFIER].
The database cross-references to EMBL is affected by this modification (see below).
Changes in the cross-references to EMBL
The biological source of the molecule has been added as quaternary identifier to the cross-reference (DR line) of the EMBL database.
DR EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.
DR EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER; MOLECULE_TYPE.
The molecule type is controlled vocabulary and currently includes:
DR EMBL; M68939; AAA26107.1; -; Genomic_DNA.
DR EMBL; U56386; AAB72034.1; -; mRNA.
Cross-references to PANTHER
Cross-references have been added to PANTHER, which stands for Protein ANalysis THrough Evolutionary Relationships, a classification system that was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. Proteins have been classified according to families and subfamilies, molecular functions, biological processes and pathways. PANTHER is available at https://panther.appliedbiosystems.com/.
The format of the explicit links in the flat file is:
|Resource identifier||PANTHER's unique identifier for a protein family or sub-family.|
|Optional information 1||PANTHER's entry name for a protein family or sub-family.|
|Optional information 2||Number of domains found, which is generally 1, rarely 2 for the fusion of identical domains/proteins.|
O59826: DR PANTHER; PTHR11732:SF69; KCNAB_channel; 1.
New feature (FT) keys and redefinition of existing FT keys
The feature keys DOMAIN and SITE were used to describe distinct types of regions in a protein sequence and we found this situation unsatisfactory. We therefore redefined these two feature keys and introduced 5 new ones.
Redefinition of the feature keys DOMAIN and SITE:
- DOMAIN - Extent of a domain, which is defined as a specific combination of secondary structures
organized into a characteristic three-dimensional structure or fold. Example:
FT DOMAIN 37 94 Ig-like.
- SITE - Any interesting single amino-acid site on the sequence,
that is not defined by another feature key. It can also apply to an
amino acid bond which is represented by the positions of the two
flanking amino acids. Example:
FT SITE 176 177 Cleavage (by trypsin) (Probable).
Description of the 5 new feature keys:
- COILED - Extent of a coiled-coil region. Example:
FT COILED 100 165 Potential.
- COMPBIAS - Extent of a compositionally biased region. Example:
FT COMPBIAS 43 57 Pro/Thr-rich.
- MOTIF - Short (<=20 amino acids) sequence motif of biological interest. Example:
FT MOTIF 153 154 Di-leucine motif.
- REGION - Extent of a region of interest in the sequence. Example:
FT REGION 23 54 Binds to CD4.
- TOPO_DOM - Topological domain. Example:
FT TOPO_DOM 139 182 Cytoplasmic (Potential).
The introduction of these new feature keys allows to establish a clear sorting order for feature tables. The following order is used:
1. Molecule processing * INIT_MET, SIGNAL, PROPEP, TRANSIT, CHAIN, PEPTIDE 2. Regions * TOPO_DOM, TRANSMEM * DOMAIN, REPEAT * CA_BIND, ZN_FING, DNA_BIND, NP_BIND * REGION * COILED * MOTIF * COMPBIAS 3. Sites * ACT_SITE * METAL * BINDING * SITE 4. Amino acid modifications (pre and PTM) * SE_CYS * MOD_RES * LIPID * CARBOHYD * DISULFID * CROSSLNK 5. Natural variations * VARSPLIC * VARIANT 6. Experimental info * MUTAGEN * UNSURE * CONFLICT * NON_CONS * NON_TER 7. Secondary structure * HELIX, TURN, STRAND
Keys of equal priority (listed on one line above) are ordered according to sequence positions.