UniProt release 2010_08

Published July 13, 2010

Headline

Viral reference strains: a virtual vaccine against virus pandemic in sequence databases

Viruses are not only the most abundant biological entities on the planet, they are also the most represented taxonomic group in UniProtKB. The undisputed record holder is HIV-1, with about 350’000 entries. Given that an HIV genome encodes about 9 proteins, these entries are equivalent to about 35’000 complete genomes!

While these numbers reflect the tremendous sequence diversity of viruses, they also make it difficult to find one’s way around, and users looking for general information on a viral species face a dilemma: which one to choose? Retrieving only manually reviewed proteins will still leave the user in doubt as the same viral proteins can be present by the dozen in UniProtKB/Swiss-Prot. For example, which Influenza A Hemagglutinin proteins should be selected preferentially among the 170 reviewed entries?

The UniProt solution to this problem is to define viral reference strains, each representative of one virus genus, to curate them to the highest quality standards and to continuously maintain their annotation. The reference strains that have been selected are those whose genomes belong to the NCBI Reference Sequence collection (RefSeq). Therefore not only their proteomes, but also their genomes are carefully reviewed. The keyword ‘Virus reference strain’ has been created to allow their easy retrieval. So far we have defined 355 viral reference strains. These reference strains contain 12’576 proteins, of which 4’500 entries, most of them from double-stranded DNA viruses, have been tagged with the ‘Virus reference strain’ keyword. We are actively updating the remaining 8’000 entries to provide a full set of tagged entries reflecting the diversity of the virus world.

Reference strains allow users to identify the strain with the best and most up-to-date information for any given virus. For bioinformaticians, they have another interesting feature: they can serve as templates for high-quality automated annotation of other viruses of the same genus, following a pipeline analogous to the one used in UniProtKB for microbial proteins (see HAMAP program).

The viral reference strains are also accessible via the ViralZone fact sheets, which provide links to the corresponding UniProtKB proteome and RefSeq genome (see for instance Influenza A).

UniProtKB News

Format change in the cross-references to WormBase

C. elegans and C. briggsae entries used to have cross-references to both the WormPep and WormBase databases. WormPep is no longer active, and all worm sequences are contained in WormBase, a comprehensive database for biological information on worm sequences and annotation. We have therefore removed cross-references to WormPep and modified the WormBase cross-references to include the transcript and protein identifiers previously provided by WormPep. Proteins with alternative products have one WormBase cross-reference per gene product.

Previous format in the flat file:

DR   WormPep; TranscriptIdentifier; ProteinIdentifier.
DR   WormBase; GeneIdentifier; GeneName.

New format:

DR   WormBase; TranscriptIdentifier; ProteinIdentifier; GeneIdentifier; GeneName.

If there is no GeneName, a dash (‘-’) is stored in that position.
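For users who parse the flat file themselves, the new line layout can be read along the following lines. This is only a minimal sketch in Python, not an official UniProt parser: the names `WormBaseXref` and `parse_wormbase_dr` are hypothetical, and the code assumes that fields are separated by semicolons, that the line ends with a period, and that a dash in the last position means there is no gene name.

```python
from typing import NamedTuple, Optional


class WormBaseXref(NamedTuple):
    """Fields of a new-format WormBase DR line (hypothetical container)."""
    transcript_id: str
    protein_id: str
    gene_id: str
    gene_name: Optional[str]  # None when the flat file stores a dash


def parse_wormbase_dr(line: str) -> WormBaseXref:
    """Parse 'DR   WormBase; Transcript; Protein; Gene; GeneName.'"""
    parts = line.split(None, 1)  # split the 'DR' line code from the rest
    if len(parts) != 2 or parts[0] != "DR":
        raise ValueError("not a DR line: %r" % line)
    body = parts[1].strip()
    if body.endswith("."):       # drop only the terminating period
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 5 or fields[0] != "WormBase":
        raise ValueError("not a new-format WormBase line: %r" % line)
    _, transcript_id, protein_id, gene_id, gene_name = fields
    # Assumption: a lone dash in the last position means "no gene name".
    return WormBaseXref(transcript_id, protein_id, gene_id,
                        None if gene_name == "-" else gene_name)


if __name__ == "__main__":
    xref = parse_wormbase_dr(
        "DR   WormBase; T25E12.4a; CE18967; WBGene00012019; dkf-2.")
    print(xref.transcript_id, xref.gene_name)
```

Returning `None` for the dash keeps the placeholder from being mistaken for a real gene name downstream.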

Example: O45818

Previous format in the flat file:

DR   WormBase; WBGene00012019; dkf-2.
DR   WormPep; T25E12.4a; CE18967.
DR   WormPep; T25E12.4b; CE18283.
DR   WormPep; T25E12.4c; CE42507.

New format:

DR   WormBase; T25E12.4a; CE18967; WBGene00012019; dkf-2.
DR   WormBase; T25E12.4b; CE18283; WBGene00012019; dkf-2.
DR   WormBase; T25E12.4c; CE42507; WBGene00012019; dkf-2.
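The relationship between the two formats in the O45818 example can be sketched as a small conversion routine. This is an illustration only, not the procedure used to produce the release: the function name `convert_old_to_new` is hypothetical, and the sketch assumes that an entry has exactly one old-format WormBase line (gene identifier and gene name) and one WormPep line per gene product.

```python
from typing import List, Optional, Tuple


def _fields(line: str) -> List[str]:
    """Split a DR line into its fields, dropping 'DR' and the final period."""
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]
    return [field.strip() for field in body.split(";")]


def convert_old_to_new(old_lines: List[str]) -> List[str]:
    """Merge old WormBase and WormPep DR lines into new WormBase DR lines."""
    gene: Optional[Tuple[str, str]] = None  # (GeneIdentifier, GeneName)
    products: List[Tuple[str, str]] = []    # (Transcript, Protein) pairs
    for line in old_lines:
        if not line.startswith("DR"):
            continue
        fields = _fields(line)
        if len(fields) != 3:
            continue                        # not an old-format line
        if fields[0] == "WormBase":
            gene = (fields[1], fields[2])
        elif fields[0] == "WormPep":
            products.append((fields[1], fields[2]))
    if gene is None:
        raise ValueError("no old-format WormBase line found")
    gene_id, gene_name = gene
    # One new WormBase cross-reference per gene product.
    return ["DR   WormBase; %s; %s; %s; %s." % (t, p, gene_id, gene_name)
            for t, p in products]


if __name__ == "__main__":
    old = [
        "DR   WormBase; WBGene00012019; dkf-2.",
        "DR   WormPep; T25E12.4a; CE18967.",
        "DR   WormPep; T25E12.4b; CE18283.",
        "DR   WormPep; T25E12.4c; CE42507.",
    ]
    for new_line in convert_old_to_new(old):
        print(new_line)
```

Applied to the four old-format lines of O45818, the sketch yields the three new-format lines shown above, each carrying the same gene identifier and gene name.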

Show all the entries having a cross-reference to WormBase.

Cross-references to WormPep have been removed.

Changes concerning keywords

New keywords:

Changes concerning the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):

  • S-(coelenterazin-3a-yl)cysteine

Deleted terms:

  • Glutamyl lysine isopeptide (Gln-Lys) (interchain with K-...)
  • Glutamyl lysine isopeptide (Lys-Gln) (interchain with Q-...)