"For centuries, medicine has been concerned with treating illness. Today it has set itself the goal of preventing rather than curing."
Prof. Jean Dausset, Nobel Prize in Medicine, 1980
The Fondation Jean Dausset - Centre d'Etude du Polymorphisme Humain takes part in national and international research efforts to better determine the role of genetic polymorphism in humans, particularly in complex diseases, in order to better understand and diagnose them and to contribute to the development of personalized medicine.

HGDP-CEPH Human Genome Diversity Cell Line Panel

March 2020: the HGDP-CEPH panel distributed by the CEPH Biobank has been sequenced by the Wellcome Sanger Institute.

Global human genomes reveal rich genetic diversity shaped by complex evolutionary history.

The results were published in the journal Science on March 18, 2020.

Introduction to the HGDP-CEPH Panel

A resource of 1063 lymphoblastoid cell lines (LCLs) from 1050 individuals in 52 world populations and corresponding milligram quantities of DNA is banked at the Foundation Jean Dausset-CEPH in Paris. These LCLs were collected from various laboratories by the Human Genome Diversity Project (HGDP) and CEPH in order to provide unlimited supplies of DNA and RNA for studies of sequence diversity and history of modern human populations. Information for each LCL is limited to sex of the individual, population and geographic origin.
The table provides details of all the LCLs in the resource, uncorrected for 13 pairs of duplicate LCLs, LCLs from two genetically atypical individuals, and LCLs from 96 pairs of close relatives (first- and/or second-degree relative pairs). For sixteen LCLs, the sex indicated on records differs from that determined by molecular typing. All samples used for this resource were collected with proper informed consent.

The DNAs have been distributed to 113 investigators for genotyping and/or resequencing; the results are contributed to a central database. To date, the DNAs have been typed genome wide with almost 1 million SNPs, 843 microsatellites, and 51 small indel loci. Some 10,000 CNV calls from two different laboratories are included in the database. Nuclear and mitochondrial DNA regions have been resequenced. High throughput sequencing of entire HGDP-CEPH genomes and array captured targets is underway.

For more information contact the HGDP Manager.

The HGDP-CEPH Database

The HGDP-CEPH Diversity Panel Database is designed to receive and store polymorphic marker genotypes, copy number variant (CNVs) calls, and Sanger DNA sequences generated by users of the DNAs of the HGDP-CEPH Diversity Panel.

V3.0 of the database contains the following datasets :

HGDP-CEPH database, 17 collaborators : (December 1, 2010)
Datasets available as text flat files :
Universidade de São Paulo (April 24, 2018)
Stanford University (September 20, 2007)
NIH-UMichigan (UMich-NIH) (October 15, 2007)
MPI-EVA (October 22, 2009)
U-Washington (UWash-NIH) (March 3, 2009)
NIH-UMichigan (UMich-NIH) (October 15, 2007)
University New Mexico (UNM) (June 3, 2008)
MPI-EVA-Neandertal/Denisova (March 9, 2010)
MPI-EVA-Neandertal/aa-capture (March 9, 2010)
Erasmus Forensic MOL BIOL (April 6, 2011)
Harvard (August 12, 2011)
MPI-EVA-Denisova (March 21, 2013)
Children's Hospital Oakland Research Institute, Oakland, CA (July 10, 2014)
Max Planck Institute, Leipzig : MPI-EVA (August 29 2014)
Los Angeles Biomedical Research Institute at Harbor/UCLA Medical Center (July 11 2014)
Institute of Clinical Pharmacology, University Medical Center Goettingen, Germany (April 28 2015)
Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA. (September 2016)
Unit of Forensic Genetics, Centre universitaire romand de médecine légale, Lausanne, Switzerland (21 April 2018)
Wellcome Sanger Institute, Hinxton CB10 1SA, UK. (20 March 2020)

Dataset 1 generated by 17 collaborators (HGDP-CEPH Database)

  • 5662 SNPs,
  • 843 microsatellites
  • 51 small indels
  • one gene deletion and duplication polymorphism (CYP2D6).

Download dataset in flat files format or access the HGDP-CEPH database web interface.

Dataset 1b generated by the Department of Genetics and Evolutionary Biology, Instituto de Biociências, Universidade de São Paulo, São Paulo, São Paulo, Brazil.
Inbreeding is observed in almost all the populations of the HGDP-CEPH panel, at different levels and frequencies. (PMID: 21364699)

Access to the published data.

Dataset 2 Stanford University
Genotypes (flat files) for ~ 660,918 tag SNPs (Illumina HuHap 650k), in autosomes, chromosome X and Y, the pseudoautosomal region and mitochondrial DNA, typed across 1043 individuals from all panel populations (Li JZ et al. Science 319: 1100-4, 2008).

Download dataset in flat files format

Dataset 3 University of Michigan (UMich-NIH)
Genotypes (flat files) (and CNV calls, supplement 5, below) for some 525,910 tag SNPs (Illumina HuHap550k), all of which are included in the HuHap 650k genotyping panel, typed across 485 HGDP-CEPH individuals from 29 populations (Jakobsson M et al. Nature 451: 998-1003, 2008).

Download dataset in flat files format

Dataset 4 Max Planck Institute, Leipzig : MPI-EVA
Genotypes (flat files) for 488,755 SNPs (Affymetrix GeneChip Human Mapping 500 K Array Set), typed across 255 individuals from all 52 HGDP-CEPH populations (5 samples per population) (López Herráez D et al. PLoS One. 2009 Nov 18;4(11):e7888). After merging the Affymetrix and Illumina (data supplement 1) non-overlapping datasets for 250 of these same individuals (no filters applied), we find genotypes for 939,383 unique SNPs.

Download dataset in flat files format

Dataset 5 University of Washington (UWash-NIH)
Calls for 6538 copy number variants (CNVs), size range 225-5,470,050 bp. These calls were ascertained in 883 unrelated HGDP-CEPH individuals (all panel populations) from SNP intensity data (data supplement 1), using rigorous statistical criteria and direct validation with CGH oligonucleotide arrays for CNV discovery on 12 panel individuals with 98 CNVs (Itsara A et al. Amer J Hum Genet 84: 148-161, 2009). This study is also found at www.ebi.ac.uk/dgva/ or www.ncbi.nlm.nih.gov/dbvar, study number nstd27.

Download dataset in flat files format

Dataset 6 University of Michigan (UMich-NIH)
3436 CNV calls, size range (2,019-998,213 bp) for 438 individuals in 29 populations. These calls were based on SNP intensity data (data supplement 2) and quality thresholds of the PennCNV algorithm. This study is found at www.ebi.ac.uk or www.ncbi.nlm.nih.gov, study number nstd30.

Download dataset in flat files format

Dataset 7 University of New Mexico (UNM)
Sequences from the D-loop region of mitochondrial DNA for 1064 HGDP-CEPH individuals. The number of base pairs sequenced per individual ranges from 1021 to 1047 (average 1044.4, median 1045). These sequences are found at www.ncbi.nlm.nih.gov.

Dataset 8 Max Planck Institute, Leipzig : MPI-EVA-Neandertal/Denisova
Whole-genome, shotgun sequences at 4-6x coverage (Illumina GAII platform) for five HGDP-CEPH individuals, HGDP00778 (Han), HGDP00542 (Papuan), HGDP00927 (Yoruban), HGDP01029 (San) and HGDP00521 (French), as part of the Neandertal Genome Project (Green RE et al. Science 328: 710-722, 2010).
In addition, this data supplement also contains whole genome sequences at 1.3-1.9x coverage for seven HGDP-CEPH DNAs, from HGDP00456, (Mbuti Pygmy), 00998 (Karitiana), 00665 (Sardinian), 00491 (Melanesian from Bougainville Island), 00711 (Cambodian), 01224 (Mongola), and 00551 (Papuan), as part of the characterization of an archaic hominin from Denisova Cave, Siberia (Reich D et al Nature 468: 1053-1060, 2010). Sequences from these seven HGDP-CEPH genomes are available from the NCBI SRA, www.ncbi.nlm.nih.gov/sra/?term=hgdp-ceph.

Dataset 9 Max Planck Institute, Leipzig : MPI-EVA-Neandertal/aa-capture
Sequences from 50 different HGDP-CEPH populations, covering ~14,000 protein-coding human lineage positions (Burbano HA et al. Science 328: 723-725, 2010). Paired end sequences, derived from bar-coded genomic libraries, were captured on a single microarray containing these positions. Experiment ERX004007 contains all the HGDP-CEPH sequences in fastq files from runs ERR011028-ERR0011032 and can be downloaded from www.ebi.ac.uk/ena/data/view/ERX004007. For links between the sequence barcodes and the corresponding HGDP-CEPH identifiers, click on "View XML" found at the top left of the web page.

Dataset 10 Erasmus Forensic University, Rotterdam
Genotypes for 76 Y-STRs for HGDP-CEPH males. Genotypes are presented as repeat numbers. Descriptions of the Y-STRs used can be found in Kayser et al 2004 Am J Hum Genet 74: 1183-1197 and Ballantyne et al Am J Hum Genet 2010 87: 341-353. Genotyping procedures are as described in Vermeulen et al Forensic Sci Int Genet 2009 3: 205-213, and Ballantyne et al Forensic Science Int Genet 2011. doi:10.1016/j.fsigen.2011.04.017.

Download dataset in flat files format

Dataset 11 Harvard Genetic Department
Data from 629,443 SNPs that were obtained by genotyping 934 unrelated HGDP-CEPH individuals with the soon to be released Affymetrix Axiom® Human Origins Array Plate, and merging the genotypes with data from Neandertal, Denisova and chimpanzee. The SNP data are divided among 14 partially overlapping datasets, 13 of which are of value for analysis of different population genetics scenarios. For each of datasets 1-12, SNPs were discovered as heterozygotes by whole-genome shotgun sequencing of a different HGDP-CEPH individual of known ancestry, as per Keinan et al. Nature Genetics 39: 1251-1255, 2007. Dataset 13 contains heterozygote SNPs for each of which a random Denisovan allele matches that of chimpanzee, and the random San Bushman allele is derived. Dataset 14, which is valuable for studying population structure using a maximum number of SNPs, does not allow demographic modeling. This dataset combines all SNPs along with an additional 87,044 SNPs chosen to allow haplotype inference at mitochondrial DNA and the Y chromosome, and to provide overlap with previous Affymetrix and Illumina genotyping arrays so that users can merge the data available here with previously published datasets. IMPORTANT: Read the detailed technical document before using these data in order to avoid pitfalls. The array was developed by David Reich and colleagues in collaboration with Affymetrix for the purpose of generating data with clearly documented ascertainment.

Download dataset in flat files format

Dataset 12 Max Planck Institute, Leipzig : MPI-EVA-Denisova
High-coverage sequences of 10 HGDP-CEPH genomes: HGDP00456-Mbuti Pygmy (24.3x), HGDP00521-French (26.7x), 00542-Papuan (25.9x), 00665-Sardinian (24.7x), 00778-Han (27.7x), 00927-Yoruba (32.1x), 00998-Karitiana (26.0x), 01029-San (32.7x), 01284-Mandenka (24.5x), and 01307-Dai (28.3x) (rounded averages).
These sequences were determined for comparison with the genome of an archaic Denisovan individual (Meyer, M. Science 338:222-6, 2012).
The raw human sequences and alignments to hg19 are available in BAM format from http://cdna.eva.mpg.de/denisova/BAM/human/. The BAM files may be analyzed with sequence toolkits, e.g. SAMtools and Picard.

Dataset 13 Children's Hospital Oakland Research Institute, Oakland, CA
Genotypic data on presence/absence information for 16 genes at the Killer Immunoglobulin-like Receptor (KIR) locus obtained on 976 HGDP-CEPH individuals (Hollenbach et al 2012 Immunogenetics 64: 719-737).

Download dataset in flat files format

Dataset 14 Max Planck Institute, Leipzig : MPI-EVA

Sequencing data from 500 kb of chromosome Y generated on 623 males from the HGDP-CEPH panel. A total of 2228 SNPs were identified; each individual's genotype is given according to the human genome GRCh37/hg19 assembly (Lippold S. et al. 2014).

Download dataset in text files format

Dataset 15 Institute for Translational Genomics and Population Sciences Los Angeles Biomedical Research Institute at Harbor/UCLA Medical Center

Genotypes (plink files) for 143,945 markers (Illumina ImmunoChip), typed across 889 individuals from all 52 HGDP-CEPH populations.

Download dataset in plink bfiles format

Dataset 16 Institute of Clinical Pharmacology, University Medical Center Goettingen, Germany

Genotypes and FASTQ sequences for 21 coding polymorphisms causing amino acid substitutions in OCT1 gene, typed across 962 individuals from HGDP-CEPH populations (Tina Seitz, Robert Stalmann, Nawar Dalila, Jiayin Chen, Sherin Pojar, Joao N. Dos Santos Pereira, Ralph Krätzner, Jürgen Brockmöller and Mladen V. Tzvetkov Global genetic analyses reveal strong inter-ethnic variability in the loss of activity of the organic cation transporter OCT1 Genome Medicine 2015, 7:56 doi:10.1186/s13073-015-0172-0).

Download dataset : genotypes in Excel format, rar archive of FASTQ sequences from Ion Torrent PGM sequencing of this coding region

Dataset 17 Genetics Department Harvard Medical School, Boston, Massachusetts 02115, USA.

High-quality genomes from NGS sequencing of 300 individuals (including 132 individuals from the HGDP-CEPH panel), including at least 5.8 million base pairs that are not present in the human reference genome (The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature. 2016 Sep 21. doi: 10.1038/nature18964).

Dataset access

Dataset 18 Unit of Forensic Genetic, Centre universitaire romand de médecine légale, Lausanne, Switzerland

Genotyping data on a set of DIP-STR markers. These markers are phased haplotypes comprising one Indel (DIP) and a closely located STR. Genotypes are expressed with numbers like: S135, L142. S and L indicate the Indel allele of the haplotype (Small, Long) and 135, 142 represent the size of the STR. Increasing sizes correspond to increasing number of repeats.
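As an illustration of the notation described above, the following Python sketch splits a DIP-STR label into its indel allele and STR size. The names `parse_dip_str` and `parse_genotype` are hypothetical and not part of the dataset. The sketch assumes each label is a single S or L followed by an integer size, and that the haplotypes of a genotype are separated by commas.

```python
import re

# Assumed label shape: one letter (S or L) followed by an integer STR size.
_DIP_STR = re.compile(r"([SL])(\d+)")

def parse_dip_str(label):
    """Split a DIP-STR label such as 'S135' into (indel allele, STR size)."""
    match = _DIP_STR.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"not a DIP-STR label: {label!r}")
    indel = "Small" if match.group(1) == "S" else "Long"
    return indel, int(match.group(2))

def parse_genotype(text):
    """Parse a comma-separated genotype such as 'S135, L142'."""
    return [parse_dip_str(part) for part in text.split(",")]

print(parse_genotype("S135, L142"))  # [('Small', 135), ('Long', 142)]
```

Keeping the indel allele and the STR size as separate fields makes it straightforward to compare STR sizes within one indel background.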

Dataset 19, Wellcome Sanger Institute, Hinxton CB10 1SA, UK.

High coverage sequencing of the Human Genome Diversity Project (HGDP) Cell Line Panel samples on the Illumina X10.

Dataset access


Genotype Submission

Panel users who want to submit marker genotypes and related information should first contact the HGDP Manager for a password. It is then possible to submit electronic files directly or by arrangement with the DB manager. Data submissions should provide identifying, genetic and genomic information for markers and sequences, e.g.:

  • official (HUGO/NCBI) nomenclature, GenBank identifier, dbSNP identifier (rs or ss number) and local name
  • type of marker (SNP, short indel, STRP etc.)
  • the allelic nucleotides for SNP loci and indels (as A,T,G,C,-) and allelic repeat sequences for STRPs
  • ancestral alleles
  • genetic map position defined by chromosome number and genome sequence coordinates (current NCBI build number required)
  • 100bp of sequence flanking the allelic nucleotides
  • genotyping technique(s)
  • the actual genotypes for each HGDP-CEPH individual (HGDP identifier), as allelic nucleotides or numeric code (1 and 2 for SNPs and 1, 2, etc. for short indels), or as the number of nucleotides in the allelic repeat or numeric code for STRP markers, with the correspondence between numeric alleles and nucleotide alleles indicated in the file

For CNVs, provide the DGVa or dbVar study ID (nstd number), variant ID (nsv number) and supporting variant ID (nssv number) for each call (see www.ebi.ac.uk/dgva or www.ncbi.nlm.nih.gov/dbvar, respectively). Also, for each CNV, give start and stop coordinates in the genome sequence and the typing method used.
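The correspondence between numeric and nucleotide alleles mentioned above can be sketched in Python as follows. This is only an illustration: the function name `decode_genotype`, the "/" separator between the two alleles and the example mapping are assumptions, not a format required by the database.

```python
def decode_genotype(coded, allele_map):
    """Translate a numeric genotype such as '1/2' into nucleotide alleles.

    allele_map is the per-marker correspondence table that the submitted
    file is expected to document (numeric code -> nucleotide allele).
    """
    return "/".join(allele_map[code] for code in coded.split("/"))

# Hypothetical correspondence for one SNP: code 1 is allele A, code 2 is allele G.
snp_map = {"1": "A", "2": "G"}
# Hypothetical correspondence for one short indel: code 2 is the deletion allele.
indel_map = {"1": "T", "2": "-"}

print(decode_genotype("1/2", snp_map))    # A/G
print(decode_genotype("2/2", indel_map))  # -/-
```

A code that is missing from the table raises a KeyError, which is one simple way to catch a file whose correspondence table is incomplete.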

Sequence Submission

Sanger sequencing: Send sequences in FASTA or GenBank format with phred scores or other evaluations of base call quality, and sequencing error rates.

Short reads (Illumina, Solid, 454...):
At present we recommend submission of short read sequences from HGDP-CEPH samples to sequence read archives (SRA) at NCBI www.ncbi.nlm.nih.gov, EBI www.ebi.ac.uk, or DDBJ trace.ddbj.nig.ac.jp. When you submit, you will be asked for a STUDY TITLE, and an ANONYMIZED NAME for each sample sequenced. Please include in the study title the name of the resource, "HGDP-CEPH Human Genome Diversity Panel". A study title might read, "whole genome resequencing 10x of HGDP-CEPH Human Genome Diversity Panel samples". The anonymized name for each sample sequenced should be the HGDP-CEPH identifier, e.g. HGDP00989. The sample database link for HGDP-CEPH identifiers is www.cephb.fr/common/HGDPid_populations.xls. For each sample submission XML file, please add a sample link to the HGDP-CEPH database in xml format.
For the NCBI SRA:

For the EBI SRA:
    <TAG>HGDP-CEPH Database Link</TAG>
    <VALUE>http://www.cephb.fr/common/HGDPid_populations.xls</VALUE>

Use of the Panel-related study title, sample, anonymized name and sample link will permit us to track all SRA short read submissions for HGDP-CEPH samples, and include a list of them in a dedicated section in the panel database.
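Before submitting to a sequence read archive, the naming conventions above could be checked with a small Python sketch such as the one below. The function name `check_submission` is hypothetical, and the identifier pattern (the prefix HGDP followed by five digits) is an assumption inferred from the example HGDP00989. It is not an official validation rule.

```python
import re

RESOURCE_NAME = "HGDP-CEPH Human Genome Diversity Panel"
# Assumed identifier shape, inferred from the example HGDP00989.
HGDP_ID = re.compile(r"HGDP\d{5}")

def check_submission(study_title, sample_names):
    """Return a list of problems found in a planned SRA submission."""
    problems = []
    if RESOURCE_NAME not in study_title:
        problems.append("study title does not name the resource")
    for name in sample_names:
        if HGDP_ID.fullmatch(name) is None:
            problems.append(f"unexpected sample name: {name}")
    return problems

title = "whole genome resequencing 10x of HGDP-CEPH Human Genome Diversity Panel samples"
print(check_submission(title, ["HGDP00989"]))  # []
```

An empty list means that the study title names the resource and every anonymized name looks like an HGDP-CEPH identifier.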

Access Policy to the HGDP-CEPH

The main goal of the Panel is to allow further research in human population genetics. A resource of 1063 lymphoblastoid cell lines (LCLs) from 1050 individuals in 52 world populations is presently banked at the Foundation Jean Dausset (CEPH). DNAs have been produced from these LCLs and organized into a panel at CEPH that is available for distribution to qualified, non-commercial, academic research laboratories on a collaborative basis.

Panel H952 contains no pairs of relatives closer than first cousins with a few possible exceptions, no duplicate pairs and no atypical samples.

Researchers who request the panel DNAs must commit to type at least all DNAs of H952 with each marker used (at least 50 common markers), and to communicate the results to CEPH no later than 6 months after completion of typing the DNAs or the time of publication (please mention Fondation Jean Dausset - CEPH in the acknowledgements), for inclusion in a central database (www.cephb.fr/hgdp-cephdb), available to diversity panel users as well as to the public. If these two conditions cannot be met, please provide a scientific justification. Collaborators must agree to use the DNA samples for academic research only and not to transfer DNA samples to other laboratories without permission from HGDP and CEPH. We will need your agreement to each of these conditions (original or modified as above), which should be specifically mentioned in writing to the HGDP Manager before we can send the DNAs.

Some laboratories may wish to use the HGDP-CEPH panel for resequencing projects. We recognize that the requirements posed for resequencing all 1050 DNAs may be prohibitive for such undertakings, given the present technical limits. We encourage these laboratories to contact us and propose the requirements for the work that they wish to undertake.

We would appreciate learning about the research for which you propose to use the diversity panel DNAs. One or two paragraphs will be sufficient. Genetic markers to be used should be described or preferably listed if practical; use official nomenclature and give their genome positions. For a resequencing study, indicate the genome region(s), the size of each region and how many individuals and corresponding populations to be sequenced.

In general, panel DNAs, dissolved in TE (10:1), will be sent in 96 well microtiter plates, at a concentration of ~60ng/µl. The quantity of DNA to be shipped will be ~5.0 micrograms per well. If you require more than 5.0 µg of each panel DNA, please contact us with the details of what you need. As we do not have a specific budget to support managing the LCLs, DNA production, quality control and formatting, we charge for DNAs that you receive on a cost price recovery basis.
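As a rough worked example of the figures above, the volume per well follows from dividing the quantity shipped by the concentration. The Python sketch below uses the approximate values quoted in the text. The function name `volume_ul` is hypothetical, and actual volumes will vary with the measured concentration of each DNA.

```python
def volume_ul(quantity_ug, concentration_ng_per_ul):
    """Volume in microlitres holding quantity_ug of DNA at the given concentration."""
    # Convert micrograms to nanograms, then divide by ng per microlitre.
    return quantity_ug * 1000.0 / concentration_ng_per_ul

# Approximate values from the text: ~5.0 micrograms per well at ~60 ng/µl.
print(round(volume_ul(5.0, 60.0), 1))  # 83.3
```

At the nominal concentration, each well therefore holds roughly 83 µl.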

The CEPH can also provide RNA extracted from lymphoblastoid cell lines. If you have interest in expression studies on the HGDP-CEPH panel please send an e-mail to HGDP Manager.

Cell lines are not distributed.

The LCLs of the HGDP-CEPH panel were produced by different laboratories in various countries over the past 20-30 years. In this project, DNA from these cell lines will be used by many laboratories in different countries. The HGDP-CEPH collaboration has determined that all of the blood samples used to produce these LCLs were collected with appropriate informed consent for the time and place of their collection. Recipients of DNA from these LCLs are, of course, responsible for ensuring that its use complies with legal standards that govern their laboratories.

Researchers who wish to participate in the project as outlined above should use the following procedure: send an e-mail to the HGDP Manager (specifically indicating their agreement with the terms of collaboration as above).
After approval a purchase order should be sent by e-mail to the BRC Manager or by fax (+33 1 53 72 51 58) giving the following information :

  • PO date and number
  • billing address
  • intra-European VAT number if applicable
  • panel ID (H1063 or H952) or requested DNA IDs (e.g. HGDP01340) in an Excel file
  • requested quantity in µg (5µg multiples only)
  • international courier account number

You will be informed by e-mail of the estimated delivery time of the DNAs. Any feedback on the quality of the samples would be appreciated. Please mention the CEPH Biobank in the acknowledgements of your publications using the HGDP-CEPH panel.
