UniProt release 2017_08

Published August 30, 2017

Headline

Curation of human immunoglobulin genes: a fruitful collaboration between UniProtKB/Swiss-Prot and IMGT®

The existence of an agent in the blood that could neutralize diphtheria toxin was reported as early as 1890. Over a century after this major discovery, much is known about immunoglobulins (IG) or antibodies. They are large heterodimeric proteins made up of 2 heavy (H) chains and 2 light (L) kappa or lambda chains, held together by disulfide bonds to form a ‘Y’ shaped molecule. Each chain comprises one variable (V) domain at the N-terminal end and one or several (for L and H, respectively) constant (C) domains. The antigen binding site is formed by the V domain of one H chain, together with that of its associated L chain. Thus, each immunoglobulin has 2 antigen binding sites with remarkable affinity for a particular antigen. Each variable domain is encoded by a variable (V) gene, a diversity (D) gene (only for H) and a joining (J) gene, which are assembled by a process called V-(D)-J rearrangement and can then be subjected to somatic hypermutations which, after exposure to antigen and selection, allow affinity maturation for a particular antigen. The resulting rearranged V-(D)-J genes are further spliced to C genes. The C region determines the effector properties and the mechanism used to destroy the antigen, such as activation of complement or binding to Fc receptors. An immunoglobulin is encoded by 7 genes (IGHV, IGHD, IGHJ and IGHC for the H chain, and IGKV, IGKJ and IGKC for a kappa L chain or IGLV, IGLJ and IGLC for a lambda L chain). The human genome contains 176 functional immunoglobulin genes clustered in 3 loci: IGH on chromosome 14 (50 V, 23 D, 6 J and 9 C), IGK on chromosome 2 (40 V, 5 J and 1 C) and IGL on chromosome 22 (32 V, 5 J and 5 C).
During the development of B cells, the mechanisms of diversity involved in the immunoglobulin synthesis (combinatorial V-(D)-J diversity, junctional diversity and somatic hypermutations) lead to the huge potential antibody repertoire of each individual, estimated to comprise 10^12 different immunoglobulins, the limiting factor being only the number of B cells that an organism is genetically programmed to produce.
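
The gene counts quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical, and the "combinatorial pairings" figure is a deliberately naive product of V, D and J gene counts that ignores junctional diversity, somatic hypermutation and any biological constraints on which genes pair. It shows that combinatorial V-(D)-J diversity alone accounts for only a small part of the estimated repertoire.

```python
# Functional gene counts per locus, copied from the paragraph above.
FUNCTIONAL_GENES = {
    "IGH": {"V": 50, "D": 23, "J": 6, "C": 9},
    "IGK": {"V": 40, "J": 5, "C": 1},
    "IGL": {"V": 32, "J": 5, "C": 5},
}


def locus_total(locus):
    """Number of functional genes in one locus."""
    return sum(FUNCTIONAL_GENES[locus].values())


def total_functional_genes():
    """Number of functional immunoglobulin genes over all three loci."""
    return sum(locus_total(locus) for locus in FUNCTIONAL_GENES)


def combinatorial_vdj(locus):
    """Naive count of V-(D)-J combinations for one locus (assumption:
    every V, D and J gene can combine freely; loci without D genes
    contribute a factor of 1)."""
    counts = FUNCTIONAL_GENES[locus]
    return counts["V"] * counts.get("D", 1) * counts["J"]


def combinatorial_pairings():
    """Naive count of heavy/light variable-domain pairings."""
    heavy = combinatorial_vdj("IGH")
    light = combinatorial_vdj("IGK") + combinatorial_vdj("IGL")
    return heavy * light


if __name__ == "__main__":
    print(total_functional_genes())   # 176
    print(combinatorial_pairings())   # 2484000
```

Under these simplifying assumptions the combinatorial term is on the order of a few million, so most of the potential repertoire has to come from junctional diversity and somatic hypermutation.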

In 2008, we announced the first draft of the complete human proteome in UniProtKB/Swiss-Prot, and have been continuing to update this resource ever since. Recent work performed in collaboration with the IMGT® team has included a thorough review and update of the immunoglobulin genes, for which we now present a representative set of full-length germline immunoglobulin protein sequences. Fifteen entries showing the sequences of all C gene products and 122 entries representing all V gene products are now publicly available. These entries can be retrieved with the keywords ‘Immunoglobulin C region’ and ‘Immunoglobulin V region’, respectively. D and J gene products are extremely small, with an average of 5 amino acids for D genes and 15-30 for J genes. In other words, they are too short to be informative on their own. We have therefore decided to curate a single representative peptide for D gene products and 3 for J gene products: one for H chains and one each for kappa and lambda L chains. As for other human proteins, the sequences shown match the translation of the reference genome (Genome Reference Consortium GRCh38/hg38). The nomenclature used is the official one from IMGT/GENE-DB, approved by HGNC and endorsed by NCBI Gene and the IUIS-Nomenclature SubCommittee. Cross-references were implemented in the 141 UniProtKB/Swiss-Prot immunoglobulin entries, providing direct access to the dedicated IMGT® resource and its comprehensive sequence repertoire, which currently describes 927 alleles from 462 functional and non-functional genes, together with a wealth of additional information concerning immunoglobulins. Reciprocal links to UniProtKB from IMGT® ensure easy navigation between both resources.

We also provide several examples of full-length rearranged immunoglobulins. Among the 10^12 predicted sequences, we have selected some of those that have been entirely sequenced at the amino acid level. However, the representation of the full repertoire is beyond the scope of our knowledgebase and UniProtKB users interested in these complex molecules are advised to visit IMGT®.

We would like to take this opportunity to thank Marie-Paule Lefranc, Sofia Kossida and the IMGT® team for this fruitful collaboration, which is beneficial not only for both resources, but hopefully also for the scientific community as a whole.

Cross-references to ELM

Cross-references have been added to the Eukaryotic Linear Motif (ELM) resource for functional sites in proteins.

ELM is available at http://elm.eu.org.

The format of the explicit links is:

Resource abbreviation ELM
Resource identifier UniProtKB accession number

Example: P12931

Show all entries having a cross-reference to ELM.

Text format

Example: P12931

DR   ELM; P12931; -.
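
As a rough illustration of how such a line might be consumed, the sketch below splits a `DR` line into its resource abbreviation and remaining fields. The function name `parse_dr_line` is hypothetical, and the parser only assumes the simple layout shown in the example above (a `DR` line code, semicolon-separated fields and a terminating period); it is not a full flat-file parser.

```python
def parse_dr_line(line):
    """Split a flat-file DR line into (resource, [other fields]).

    Assumes the simple layout 'DR   RESOURCE; field; field.' shown in
    the example above; other DR line variants are not handled.
    """
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: %r" % line)
    body = line[5:].strip()
    # Drop the terminating period of the line.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    return fields[0], fields[1:]


if __name__ == "__main__":
    resource, fields = parse_dr_line("DR   ELM; P12931; -.")
    print(resource, fields)
```

For the ELM example, the first remaining field is the UniProtKB accession number used as the resource identifier.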

XML format

Example: P12931

<dbReference type="ELM" id="P12931"/>
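
A minimal sketch of reading this cross-reference with Python's standard library follows. The helper name `elm_ids` is hypothetical, and the fragment is wrapped in a bare, namespace-free `<entry>` element for the sake of a self-contained example; real UniProtKB XML files declare an XML namespace, so the tag lookups would need to be adapted.

```python
import xml.etree.ElementTree as ET


def elm_ids(fragment):
    """Return the ids of all ELM dbReference elements in an XML fragment.

    The fragment is wrapped in a namespace-free <entry> element so that
    it can be parsed on its own (an assumption made for this sketch).
    """
    root = ET.fromstring("<entry>" + fragment + "</entry>")
    return [
        ref.get("id")
        for ref in root.iter("dbReference")
        if ref.get("type") == "ELM"
    ]


if __name__ == "__main__":
    print(elm_ids('<dbReference type="ELM" id="P12931"/>'))
```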

RDF format

Example: P12931

uniprot:P12931
  rdfs:seeAlso <http://purl.uniprot.org/elm/P12931> .
<http://purl.uniprot.org/elm/P12931>
  rdf:type up:Resource ;
  up:database <http://purl.uniprot.org/database/ELM> .

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted diseases:

  • Mental retardation, X-linked, syndromic, 10

Changes in subcellular location controlled vocabulary

New subcellular locations:

UniParc news

UniParc XSD change for InterPro annotations

To reduce sequence redundancy in UniProtKB, we apply a procedure that identifies highly redundant proteomes within selected species groups and excludes them from UniProtKB. Their sequences remain available for download from the UniParc sequence archive, which stores protein sequences that are 100% identical and of the same length in a single record, with cross-references to the source databases in which the protein exists. UniParc also includes basic annotation data (taxonomy, gene and protein names, proteome identifier and component) so that users interested in redundant proteomes can retrieve meaningful data sets. We have now further enhanced UniParc with InterPro annotations and, for this purpose, extended the UniParc XSD with the new elements and types shown below:

    <xs:element name="entry">
        <xs:complexType>
            <xs:sequence>
                ...
                <xs:element name="signatureSequenceMatch" type="seqFeatureType" minOccurs="0" maxOccurs="unbounded"/>
                ...
            </xs:sequence>
            ...
        </xs:complexType>
    </xs:element>
    ...
    <xs:complexType name="seqFeatureType">
        <xs:sequence>
            <xs:element name="ipr" type="seqFeatureGroupType" minOccurs="0" maxOccurs="1"/>
            <xs:element name="lcn" type="locationType" minOccurs="1" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="database" type="xs:string" use="required"/>
        <xs:attribute name="id" type="xs:string" use="required"/>
    </xs:complexType>

    <xs:complexType name="seqFeatureGroupType">
        <xs:attribute name="name" type="xs:string"/>
        <xs:attribute name="id" type="xs:string" use="required"/>
    </xs:complexType>

    <xs:complexType name="locationType">
        <xs:attribute name="start" type="xs:int" use="required"/>
        <xs:attribute name="end" type="xs:int" use="required"/>
    </xs:complexType>
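
To make the schema fragment more concrete, here is a sketch of an instance document that would fit it, together with a small reader. Everything in the sample is a placeholder: the database names, identifiers and coordinates are invented for illustration, the helper name `signature_matches` is hypothetical, and the sample omits the XML namespace and all other `entry` content that a real UniParc file would carry.

```python
import xml.etree.ElementTree as ET

# Placeholder instance: all attribute values are invented for illustration.
SAMPLE = """
<entry>
  <signatureSequenceMatch database="ExampleDB" id="SIG0001">
    <ipr name="Example domain" id="IPR000000"/>
    <lcn start="12" end="87"/>
  </signatureSequenceMatch>
  <signatureSequenceMatch database="OtherDB" id="SIG0002">
    <lcn start="5" end="20"/>
    <lcn start="40" end="55"/>
  </signatureSequenceMatch>
</entry>
"""


def signature_matches(xml_text):
    """Collect signatureSequenceMatch elements as plain dictionaries.

    Follows the schema fragment above: 'ipr' is optional (0..1) and
    'lcn' occurs one or more times with integer start/end attributes.
    """
    root = ET.fromstring(xml_text)
    matches = []
    for match in root.iter("signatureSequenceMatch"):
        ipr = match.find("ipr")
        matches.append({
            "database": match.get("database"),
            "id": match.get("id"),
            "ipr_id": ipr.get("id") if ipr is not None else None,
            "ipr_name": ipr.get("name") if ipr is not None else None,
            "locations": [
                (int(lcn.get("start")), int(lcn.get("end")))
                for lcn in match.findall("lcn")
            ],
        })
    return matches


if __name__ == "__main__":
    for match in signature_matches(SAMPLE):
        print(match)
```

The second placeholder match has no `ipr` child and two `lcn` children, which exercises the `minOccurs="0"` and `maxOccurs="unbounded"` settings in the schema.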