How to retrieve sets of protein sequences?
Last modified March 21, 2012
UniProtKB entries are available in three file formats - Flat Text, XML and RDF/XML. UniProtKB entries in these formats each contain only one protein sequence, the so-called 'canonical' sequence. UniProtKB canonical sequences are also available in FASTA format, as are additional manually curated isoform sequences that are described in UniProtKB/Swiss-Prot. Below we describe how these sets can be accessed.
See also:
What is the canonical sequence? Are all isoforms described in one entry?
Retrieving sequences from the web site
- Perform your favorite query and view the resulting list of entries (e.g. this query retrieves human UniProtKB entries tagged with the keyword 'Complete proteome': organism:9606 AND keyword:"Complete proteome")
- Click the orange Download button in the query result page
- Choose the desired download format (Flat Text, XML, RDF/XML, or FASTA if additional isoform sequences are desired)
  - Choosing Flat Text, XML, or RDF/XML allows retrieval of all entries (and their canonical sequences) from the result list in the desired format.
  - Choosing Canonical sequence data in FASTA format allows retrieval of all canonical sequences from the query result list. This can include canonical sequences from UniProtKB/Swiss-Prot and/or UniProtKB/TrEMBL entries.
  - Choosing the option Canonical and isoform sequence data in FASTA format allows retrieval of all canonical sequences plus all manually reviewed isoform sequences described within UniProtKB/Swiss-Prot. These manually reviewed isoform sequences are available as distinct sequences in FASTA format only within this expanded downloadable set.
To automate the above, please read the section Downloading data at every UniProt release of our FAQ about programmatic access.
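As a rough illustration of what automating the steps above might look like, the sketch below builds a download URL for a query and optionally saves the result to a file. The base URL, the `query`, `format` and `include` parameter names, and the helper names `build_download_url` and `fetch_to_file` are assumptions made for this example and are not taken from this page; the FAQ about programmatic access remains the authority on the actual interface.

```python
import shutil
import urllib.request
from urllib.parse import urlencode

# Assumption: illustrative endpoint, not stated on this page.
BASE_URL = "https://www.uniprot.org/uniprot/"


def build_download_url(query, fmt="fasta", include_isoforms=False,
                       base_url=BASE_URL):
    """Return a download URL for a query (parameter names are assumed)."""
    params = {"query": query, "format": fmt}
    if include_isoforms:
        # Isoform sequences are only offered as FASTA, so reject other formats.
        if fmt != "fasta":
            raise ValueError("isoform sequences require the FASTA format")
        params["include"] = "yes"  # assumed flag name
    return base_url + "?" + urlencode(params)


def fetch_to_file(url, path):
    """Stream the response body at `url` into the local file `path`."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# The example query from the list above, with isoform sequences requested.
url = build_download_url('organism:9606 AND keyword:"Complete proteome"',
                         include_isoforms=True)
print(url)
```

Calling `fetch_to_file(url, "human_proteome.fasta")` would then save the result locally; the file name is arbitrary. Keeping URL construction separate from the download makes it easy to inspect the request, or to swap in a different endpoint, before any data is transferred.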
Retrieving sequences from the FTP site
The UniProt FTP sites (accessible via the Downloads link located at the top of all UniProt web pages) provide the most frequently requested data sets in each of the aforementioned file formats (Flat Text, XML, RDF/XML, FASTA). The additional manually curated isoform sequences that are described in UniProtKB/Swiss-Prot are available in a separate FASTA file (uniprot_sprot_varsplic.fasta.gz).
Our FTP directory also includes expanded FASTA sets, containing both the canonical and manually reviewed isoform sequences, for a selection of the most widely used complete proteomes.
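Once an expanded FASTA set has been downloaded, canonical and isoform sequences can be told apart by their accessions. The sketch below assumes the usual UniProt FASTA header layout (`>sp|ACCESSION|ENTRY_NAME ...`) and that isoform accessions carry a `-N` suffix (for example `P00000-2`); neither detail is stated on this page, so check them against the files you download. The accession, entry name and sequences in the demo are made up, and the helper names are hypothetical.

```python
import gzip
import io
import re

# Assumed header layout: ">sp|ACCESSION|ENTRY_NAME description".
HEADER_RE = re.compile(r"^>(?:sp|tr)\|([^|]+)\|(\S+)")


def read_fasta(handle):
    """Yield (accession, sequence) pairs from a text-mode FASTA handle."""
    accession, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                yield accession, "".join(chunks)
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError(f"unrecognized FASTA header: {line!r}")
            accession, chunks = match.group(1), []
        elif accession is None:
            raise ValueError("sequence data found before any header")
        else:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


def split_isoforms(records):
    """Separate canonical records from isoforms (accessions with '-N')."""
    canonical, isoforms = {}, {}
    for accession, sequence in records:
        target = isoforms if "-" in accession else canonical
        target[accession] = sequence
    return canonical, isoforms


def read_gzipped_fasta(path):
    """Read a gzip-compressed FASTA file such as uniprot_sprot_varsplic.fasta.gz."""
    with gzip.open(path, "rt") as handle:
        return list(read_fasta(handle))


# Demo on made-up records; real files would go through read_gzipped_fasta().
EXAMPLE = """\
>sp|P00000|EXMP_HUMAN Example protein OS=Homo sapiens
MKTAYIAK
QRQISFVK
>sp|P00000-2|EXMP_HUMAN Isoform 2 of Example protein OS=Homo sapiens
MKTAYIAK
"""
canonical, isoforms = split_isoforms(read_fasta(io.StringIO(EXAMPLE)))
print(sorted(canonical), sorted(isoforms))
```

The parser reads one record at a time, so large files do not need to be held in memory until `split_isoforms` collects them into dictionaries.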
