How to retrieve a complete set of protein sequences?

Last modified June 28, 2009

Complete set of canonical sequences for organisms whose genomes have been completely sequenced

As described in 'What are Complete Proteome Sets?', UniProtKB provides several complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced.

These sets can be downloaded:

These sets of sequences include only the so-called 'canonical' UniProtKB sequences, i.e. the protein sequence displayed by default in the entries, but none of the 'alternative sequences'.

See 'What is the canonical sequence? Are all isoforms described in one entry? How can I retrieve them?'

Complete set of alternative sequences for a given organism

Alternative sequences, i.e. those produced by alternative promoter usage, alternative splicing, alternative initiation and ribosomal frameshifting, are stored in a separate file and can be downloaded in FASTA format from the file 'Isoform sequences'.

Depending on the organism, adding the alternative sequences to the basic set of protein sequences can make a tremendous difference. For instance, in Homo sapiens, alternative sequences currently represent close to 40% of the total number of annotated human sequences described in UniProtKB/Swiss-Prot.
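To check this proportion for a downloaded file, the headers of a FASTA file that contains both canonical and isoform sequences can be counted. The following is a minimal sketch, not an official UniProt tool. It assumes that headers have the form '>sp|P12345-2|NAME_HUMAN ...' and that a '-N' suffix on the accession marks an isoform sequence; the function name and the sample entries are hypothetical.

```python
import re

# Assumption: in UniProtKB FASTA headers the accession is the second
# '|'-separated field, and isoform accessions end in "-<number>".
ISOFORM_ACCESSION = re.compile(r"-\d+$")


def isoform_fraction(fasta_lines):
    """Return (total, isoforms, fraction) for an iterable of FASTA lines."""
    total = 0
    isoforms = 0
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        total += 1
        fields = line[1:].split("|")
        accession = fields[1] if len(fields) > 1 else ""
        if ISOFORM_ACCESSION.search(accession):
            isoforms += 1
    fraction = isoforms / total if total else 0.0
    return total, isoforms, fraction


# Hypothetical two-entry sample: one canonical sequence and one isoform.
sample = [
    ">sp|P12345|NAME_HUMAN Example protein",
    "MKTAYIAKQR",
    ">sp|P12345-2|NAME_HUMAN Isoform 2 of Example protein",
    "MKTAYIAK",
]
print(isoform_fraction(sample))
```

For a real file, pass an open file handle instead of the sample list, for example 'with open(path) as handle: isoform_fraction(handle)'.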

Set of canonical AND alternative sequences for any organism

It is possible to download a complete set of the protein sequences annotated in UniProtKB for a given organism, i.e. a set including the canonical sequences along with the alternative sequences. To do so, query UniProtKB with the taxonomy identifier (TaxId) of your favorite organism and download the protein sequences in FASTA format using the option 'Canonical and isoform sequence data in FASTA format'.

This can be done for all organisms, including those whose genome is not yet fully sequenced. Note that in this case "complete set" does not mean the complete proteome of the organism of interest, but simply all sequences available in UniProtKB, including isoform sequences.
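The same query can also be prepared programmatically. The sketch below only builds a download URL from a TaxId and does not contact any server. The base URL and the parameter names ('query', 'format', 'includeIsoform') are assumptions based on the present-day UniProt REST service, not something stated on this page, and they may differ from the interface that was available when this page was written. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: endpoint and parameter names of the current UniProt REST
# service; verify them against the UniProt documentation before use.
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_fasta_url(tax_id, include_isoforms=True):
    """Build a FASTA download URL for all UniProtKB entries of one TaxId."""
    params = {
        "query": f"organism_id:{tax_id}",
        "format": "fasta",
    }
    if include_isoforms:
        # Ask for canonical and isoform sequences together.
        params["includeIsoform"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"


# 9606 is the taxonomy identifier of Homo sapiens.
print(build_fasta_url(9606))
```

The resulting URL can be opened in a browser or passed to any HTTP client. Keeping URL construction separate from the download makes it easy to inspect the query before fetching a large file.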

Example of query:

Human complete proteome set, then click on 'Download'

See also: