What is the canonical sequence? Are all isoforms described in one entry?

Last modified December 13, 2021

What is the canonical sequence?

Each UniProtKB/Swiss-Prot entry contains all curated protein products encoded by a given gene in a given species or strain. For each UniProtKB/Swiss-Prot entry, we choose a canonical (or representative) sequence for display that should conform to at least one of the following criteria:

  1. It is functional;
  2. It is widely expressed;
  3. It is encoded by conserved exons found in orthologous sequences;
  4. It is identical to consensus sequences chosen by other resources and genome curation efforts such as CCDS and MANE (see also UniProt's human proteome);
  5. In the absence of any such information, it is the longest sequence.
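
Criterion 5 is the only purely mechanical rule in this list. As a minimal sketch, and not a description of UniProt's actual curation pipeline, the fallback could look like the following. The helper name pick_longest, the identifiers, and the sequences are all invented for illustration.

```python
def pick_longest(isoforms):
    """Return the (identifier, sequence) pair with the longest sequence.

    Hypothetical helper: `isoforms` maps an identifier to its amino-acid
    sequence. On a tie, the first entry in insertion order wins.
    """
    if not isoforms:
        raise ValueError("no candidate sequences")
    return max(isoforms.items(), key=lambda item: len(item[1]))


# Invented identifiers and sequences, not real UniProtKB data.
candidates = {"DEMO-1": "MKTAYI", "DEMO-2": "MKTAYIAKQR", "DEMO-3": "MKT"}
chosen_id, chosen_seq = pick_longest(candidates)
```

This only applies when none of criteria 1 to 4 can be evaluated; the other criteria rely on curated evidence and cannot be reduced to a length comparison.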

Sequences chosen according to these criteria generally allow the description of the majority of functionally important domains, motifs, sites, and post-translational modifications, naturally occurring variants with functional and clinical significance, and other sequence features.

Additional information can be found in the 'Alternative sequence' subsection.

The various UniProtKB distribution formats (Flat Text, XML, RDF/XML) display only the canonical sequence. The website's 'Sequences' section displays the canonical sequence and, for convenience, also offers a view of the isoforms that are described in the 'Alternative sequence' subsection.

Are all isoforms described in one UniProtKB/Swiss-Prot entry?

Whenever possible, all the protein products encoded by one gene in a given species are described in a single UniProtKB/Swiss-Prot entry, including isoforms generated by alternative splicing, alternative promoter usage, and alternative translation initiation (*). However, some alternative splicing isoforms derived from the same gene share only a few exons, if any at all; the same is true of some 'trans-splicing' events. In these cases, the sequences diverge too much to be merged into a single entry, and the isoforms are described in separate 'external' entries.

Example: isoforms derived from the lola gene (Drosophila melanogaster)

(*) Important remark: Due to the increase in sequence data coming from large-scale sequencing projects, UniProtKB/TrEMBL may contain additional predicted sequences encoded by genes that are described in a UniProtKB/Swiss-Prot entry.

How can I retrieve isoform sequences?

Alternative sequences, whether described in a single entry or in separate entries, are all available for BLAST searches.

Isoform sequences can be downloaded in FASTA format from our FTP download index page (choose the file: 'Isoform sequences').

Query-derived sets of canonical sequences alone, or of canonical and isoform sequences together, can also be downloaded in FASTA format (see How to retrieve sets of UniProtKB protein sequences?).
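
Once a FASTA file containing both canonical and isoform sequences has been downloaded, the two kinds of record can be separated locally. The sketch below rests on two assumptions that should be checked against the downloaded file: that headers follow the layout ">db|ACCESSION|ENTRY_NAME description", and that isoform records carry an accession with a hyphen-number suffix (for example "-2") while canonical records carry the bare accession. The function names and the accession X00001 are invented for illustration.

```python
import re

# Assumed header layout: >db|ACCESSION|ENTRY_NAME description
HEADER = re.compile(r"^>(?P<db>\w+)\|(?P<acc>[^|]+)\|(?P<name>\S+)")
# Assumed isoform marker: accession ending in a hyphen and a number.
ISOFORM_SUFFIX = re.compile(r"-\d+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_isoform(lines):
    """Return two dicts (canonical, isoforms), each mapping accession to sequence."""
    canonical, isoforms = {}, {}
    for header, seq in read_fasta(lines):
        match = HEADER.match(header)
        if match is None:
            continue  # skip headers that do not fit the assumed layout
        acc = match.group("acc")
        target = isoforms if ISOFORM_SUFFIX.search(acc) else canonical
        target[acc] = seq
    return canonical, isoforms


# Invented records, not real UniProtKB data.
example = """\
>sp|X00001|DEMO_HUMAN Hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>sp|X00001-2|DEMO_HUMAN Isoform 2 of Hypothetical protein
MKTAYIAK
""".splitlines()

canonical, isoforms = split_by_isoform(example)
```

For a real file, the same function can be given an open file handle instead of the example lines, since it only needs an iterable of text lines.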
