UniRef

The UniRef databases provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB, including splice variants and isoforms) and selected UniParc records. The aim is complete coverage of sequence space at several resolutions, with redundant sequences (but not their descriptions) hidden from view. Unlike in UniParc, sequence fragments are merged in UniRef.

The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues (from any organism) into a single UniRef entry. Each entry displays the sequence of a representative protein, the accession numbers of all the merged entries, and links to the corresponding UniProtKB and UniParc records.

UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues using the CD-HIT algorithm (Li W., Jaroszewski L., and Godzik A., Bioinformatics, 17: 282-283, 2001), such that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the longest sequence (the UniRef seed sequence). UniRef90 and UniRef50 reduce the database size by approximately 40% and 65%, respectively, which makes sequence searches significantly faster.

All the sequences in each cluster are ranked to facilitate the selection of a representative sequence for the cluster. The sequences are ranked as follows:

  1. quality of the entry: member entries from UniProtKB/Swiss-Prot are preferred,
  2. meaningful name (entries with names that do not contain words such as hypothetical, probable, etc. are preferred),
  3. organism (entries from model organisms preferred), and
  4. length of the sequence (longest sequence preferred).
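
The four ranking criteria above can be read as a lexicographic sort key. The following is a minimal sketch under stated assumptions, not the UniRef production code: the `Member` record, the word list beyond "hypothetical" and "probable", and the set of model organisms are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumptions: the source names only "hypothetical" and "probable" as examples
# and does not list the model organisms, so both sets here are illustrative.
UNINFORMATIVE_WORDS = {"hypothetical", "probable"}
MODEL_ORGANISMS = {"Homo sapiens", "Mus musculus", "Escherichia coli"}


@dataclass
class Member:
    """Hypothetical cluster member record."""
    accession: str
    name: str
    organism: str
    sequence: str
    is_swiss_prot: bool


def rank_key(member):
    """Build a tuple that sorts preferred members first (False sorts before True)."""
    lowered = member.name.lower()
    # Simple substring check against the uninformative words.
    uninformative = any(word in lowered for word in UNINFORMATIVE_WORDS)
    return (
        not member.is_swiss_prot,                  # 1. UniProtKB/Swiss-Prot preferred
        uninformative,                             # 2. meaningful names preferred
        member.organism not in MODEL_ORGANISMS,    # 3. model organisms preferred
        -len(member.sequence),                     # 4. longest sequence preferred
    )


def pick_representative(members):
    """Return the best-ranked member of a cluster."""
    return min(members, key=rank_key)


cluster = [
    Member("A1", "Hypothetical protein", "Homo sapiens", "M" * 50, False),
    Member("B2", "Kinase", "Danio rerio", "M" * 20, True),
    Member("C3", "Kinase", "Homo sapiens", "M" * 30, False),
]
representative = pick_representative(cluster)
```

In this toy cluster the Swiss-Prot member wins even though it is the shortest, because entry quality is compared before name, organism, and length.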

UniRef100

UniRef100 contains all the records in the UniProt Knowledgebase and selected UniParc records (see below). In UniRef100, identical sequences and sub-fragments with 11 or more residues are placed into a single record. UniRef50 and UniRef90 are built from UniRef100.
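
The merging rule can be illustrated with a small, quadratic-time sketch. It assumes that identical sequences are always merged and that the 11-residue minimum applies to sub-fragments only; the function name and the input layout (a dict of accession to sequence) are hypothetical, and a fragment contained in several longer sequences is simply assigned to the first one found.

```python
MIN_FRAGMENT_LENGTH = 11  # sub-fragments shorter than this are not merged


def merge_identical_and_fragments(records):
    """Group accessions whose sequences are identical or contained sub-fragments.

    records: dict mapping accession -> amino acid sequence.
    Returns a dict mapping each retained sequence -> list of merged accessions.
    """
    clusters = {}
    # Process longest sequences first so that a fragment is always seen
    # after any sequence that could contain it.
    ordered = sorted(records.items(), key=lambda item: (-len(item[1]), item[0]))
    for accession, sequence in ordered:
        if sequence in clusters:  # identical sequence already present
            clusters[sequence].append(accession)
            continue
        host = None
        if len(sequence) >= MIN_FRAGMENT_LENGTH:
            # Look for an existing sequence that contains this one.
            host = next((rep for rep in clusters if sequence in rep), None)
        if host is not None:
            clusters[host].append(accession)
        else:
            clusters[sequence] = [accession]
    return clusters


records = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRQ",  # identical to P1
    "F1": "AKQRQISFVKSH",            # 12-residue sub-fragment of P1
    "S1": "AKQRQ",                   # too short to be merged as a fragment
}
merged = merge_identical_and_fragments(records)
```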

The UniRef100 identifier is generated by placing the "UniRef100_" prefix before the UniProtKB accession or UniParc identifier of the representative UniProtKB or UniParc entry, e.g. "UniRef100_Q8WZ42" or "UniRef100_UPI0000000F90".
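
As a minimal sketch of this naming scheme, the helpers below build and take apart a UniRef100 identifier. The function names are hypothetical, and telling UniParc identifiers apart by their "UPI" prefix is an assumption based on the example above.

```python
UNIREF100_PREFIX = "UniRef100_"


def uniref100_id(representative_id):
    """Prefix the representative's UniProtKB accession or UniParc identifier."""
    return UNIREF100_PREFIX + representative_id


def split_uniref100_id(cluster_id):
    """Return (representative identifier, source database) for a UniRef100 identifier."""
    if not cluster_id.startswith(UNIREF100_PREFIX):
        raise ValueError(f"not a UniRef100 identifier: {cluster_id!r}")
    representative_id = cluster_id[len(UNIREF100_PREFIX):]
    # Assumption: UniParc identifiers start with "UPI"; anything else is
    # treated as a UniProtKB accession.
    source = "UniParc" if representative_id.startswith("UPI") else "UniProtKB"
    return representative_id, source


example_id = uniref100_id("Q8WZ42")
example_parts = split_uniref100_id("UniRef100_UPI0000000F90")
```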

In addition to UniProtKB records, UniRef100 also includes the UniParc entries that are not covered by UniProtKB and contain cross-references to the following databases:

UniRef90

UniRef90 is generated by clustering UniRef100 sequences.

UniRef100 sequences shorter than 11 residues are excluded from UniRef90 clusters. Each UniRef90 cluster has one representative sequence from the UniRef100 database.
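
The clustering idea (longest sequence first, each sequence joining the first seed it matches at or above the threshold) can be sketched as a greedy loop. This is a toy illustration, not CD-HIT itself: `toy_identity` is a hypothetical stand-in that compares left-aligned, ungapped positions, whereas CD-HIT uses its own filtering and alignment-based identity.

```python
MIN_LENGTH = 11  # shorter sequences are excluded from clustering


def toy_identity(sequence, seed):
    """Fraction of positions in `sequence` that match `seed` (left-aligned, no gaps)."""
    matches = sum(a == b for a, b in zip(sequence, seed))
    return matches / len(sequence)


def greedy_cluster(sequences, threshold, identity=toy_identity):
    """Cluster sequences against seed sequences, longest first.

    sequences: dict mapping identifier -> sequence.
    Returns a dict mapping seed identifier -> member identifiers (seed first).
    """
    eligible = {i: s for i, s in sequences.items() if len(s) >= MIN_LENGTH}
    # Longest first, so every new sequence is no longer than any existing seed.
    order = sorted(eligible, key=lambda i: (-len(eligible[i]), i))
    clusters = {}
    for seq_id in order:
        for seed_id in clusters:
            if identity(eligible[seq_id], eligible[seed_id]) >= threshold:
                clusters[seed_id].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]  # no seed matched: start a new cluster
    return clusters


sequences = {
    "A": "MKTAYIAKQRQISFVKSHFS",
    "B": "MKTAYIAKQRQISFVKSHFA",  # 19 of 20 positions match A
    "C": "MKTAYIAKQRGGGGGGGGGG",  # 10 of 20 positions match A
    "D": "MKT",                   # shorter than 11 residues, excluded
}
clusters_90 = greedy_cluster(sequences, threshold=0.9)
clusters_50 = greedy_cluster(sequences, threshold=0.5)
```

At the 90% threshold, C starts its own cluster; at 50% it joins the cluster seeded by A, which shows how the lower threshold yields fewer, larger clusters.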

UniRef90 cluster titles and identifiers are derived from the representative UniRef100 entry. The UniRef90 identifier is generated by replacing the "UniRef100_" prefix of the representative with "UniRef90_", e.g. "UniRef90_Q8WZ42".
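
The prefix replacement can be sketched as a single helper. The function name and the `level` parameter are hypothetical; the same replacement is described for UniRef50 identifiers, so the sketch accepts both levels.

```python
UNIREF100_PREFIX = "UniRef100_"


def derive_cluster_id(representative_uniref100_id, level):
    """Replace the "UniRef100_" prefix with "UniRef90_" or "UniRef50_"."""
    if level not in (90, 50):
        raise ValueError("level must be 90 or 50")
    if not representative_uniref100_id.startswith(UNIREF100_PREFIX):
        raise ValueError(f"not a UniRef100 identifier: {representative_uniref100_id!r}")
    suffix = representative_uniref100_id[len(UNIREF100_PREFIX):]
    return f"UniRef{level}_{suffix}"


uniref90_example = derive_cluster_id("UniRef100_Q8WZ42", 90)
uniref50_example = derive_cluster_id("UniRef100_Q10466", 50)
```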

UniRef50

UniRef50 is generated by clustering UniRef100 sequences.

UniRef50 cluster titles and identifiers are derived from the representative UniRef90 entry. The UniRef50 identifier is generated by replacing the "UniRef100_" prefix of the representative with "UniRef50_", e.g. "UniRef50_Q10466".

Further information