The HGDP-CEPH Diversity Panel Database is designed to receive and store polymorphic marker genotypes, copy number variant (CNV) calls, and Sanger DNA sequences generated by users of the DNAs of the HGDP-CEPH Diversity Panel. The data are publicly accessible via this website. The panel DNAs are presently produced from 1063 lymphoblastoid cell lines (LCLs) representing 1050 individuals sampled from 52 populations throughout the world. Each blood sample used for this resource was freely donated under conditions of informed consent. In addition to sequence-based information, the DB includes the geographic and population origin and the gender of each of the participating volunteers, who are identified by code numbers only (HGDP identifiers). The panel contains 13 pairs of duplicate LCLs and LCLs from two genetically atypical individuals. The DB also contains a link to a file that identifies the LCLs of pairs of close relatives in the panel.
Panel users who want to submit marker genotypes and related information should apply for a password. It is then possible to submit electronic files directly or by arrangement with the DB manager. Data submissions should provide identifying, genetic and genomic information for markers and sequences, e.g. official (HUGO/NCBI) nomenclature, GenBank identifier, dbSNP identifier (rs or ss number), local name, type of marker (SNP, short indel, STRP etc.), the allelic nucleotides for SNP loci and indels (as A, T, G, C, -) and allelic repeat sequences for STRPs, ancestral alleles, genetic map position defined by chromosome number and genome sequence coordinates (current NCBI build number required), 100 bp of sequence flanking the allelic nucleotides, and genotyping technique(s). Submissions should also include, of course, the actual genotypes for each HGDP-CEPH individual (HGDP identifier): for SNPs and short indels, as allelic nucleotides or numeric codes (1 and 2 for SNPs and 1, 2, etc. for short indels); for STRP markers, as the number of nucleotides in the allelic repeat or a numeric code. The correspondence between numeric alleles and nucleotide alleles should be indicated in the file. For CNVs, provide the DGVa or dbVar study ID (nstd number), variant ID (nsv number) and supporting variant ID (nssv number) for each call (see www.ebi.ac.uk/dgva or www.ncbi.nlm.nih.gov/dbvar, respectively). Also, for each CNV, give start and stop coordinates in the genome sequence and the typing method used.
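As an illustration of the correspondence between numeric and nucleotide alleles, the following minimal Python sketch decodes numerically coded SNP genotypes. The function name, the in-memory layout, and the example alleles and genotypes are hypothetical; the database does not prescribe this representation.

```python
def decode_genotype(coded, allele_map):
    """Translate a numeric genotype such as ("1", "2") into nucleotides.

    allele_map is the correspondence declared in the submission file,
    e.g. {"1": "A", "2": "G"}. A code with no declared allele raises
    a ValueError instead of being silently dropped.
    """
    decoded = []
    for code in coded:
        if code not in allele_map:
            raise ValueError(f"no nucleotide allele declared for code {code!r}")
        decoded.append(allele_map[code])
    return tuple(decoded)


# Hypothetical SNP with alleles A (code 1) and G (code 2).
snp_alleles = {"1": "A", "2": "G"}

# Hypothetical coded genotypes, keyed by HGDP identifier.
coded_genotypes = {
    "HGDP00989": ("1", "2"),
    "HGDP00521": ("2", "2"),
}

decoded_genotypes = {
    sample: decode_genotype(genotype, snp_alleles)
    for sample, genotype in coded_genotypes.items()
}
print(decoded_genotypes)
```

Raising an error on an undeclared code is a design choice for the sketch: it flags a file whose correspondence table is incomplete before any genotype is misread.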
Submission of DNA Sequences (new)
Sanger sequencing: Send sequences in FASTA or GenBank format with phred scores or other evaluations of base call quality, and sequencing error rates.
Short reads (Illumina, SOLiD, 454...):
At present we recommend submission of short read sequences from HGDP-CEPH samples to sequence read archives (SRA) at NCBI www.ncbi.nlm.nih.gov, EBI www.ebi.ac.uk, or DDBJ trace.ddbj.nig.ac.jp. When you submit, you will be asked for a STUDY TITLE, and an ANONYMIZED NAME for each sample sequenced. Please include in the study title the name of the resource, "HGDP-CEPH Human Genome Diversity Panel". A study title might read, "whole genome resequencing 10x of HGDP-CEPH Human Genome Diversity Panel samples". The anonymized name for each sample sequenced should be the HGDP-CEPH identifier, e.g. HGDP00989. The sample database link for HGDP-CEPH identifiers is www.cephb.fr/common/HGDPid_populations.xls.
For each sample submission XML file, please add a sample link to the HGDP-CEPH database in XML format.
For the NCBI SRA:
For the EBI SRA:
<TAG>HGDP-CEPH Database Link</TAG>
Use of the Panel-related study title, sample, anonymized name and sample link will permit us to track all SRA short read submissions for HGDP-CEPH samples, and include a list of them in a dedicated section in the panel database.
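Before submitting, the naming conventions above can be checked with a short script. This is a minimal Python sketch: the identifier pattern ("HGDP" followed by five digits) is an assumption inferred from examples such as HGDP00989, and the function names are hypothetical.

```python
import re

# Assumed pattern, inferred from examples such as HGDP00989:
# the prefix "HGDP" followed by exactly five digits.
HGDP_ID = re.compile(r"HGDP\d{5}")

# Resource name that the study title is asked to include.
RESOURCE_NAME = "HGDP-CEPH Human Genome Diversity Panel"


def normalize_hgdp_id(raw):
    """Return a full identifier such as 'HGDP00998' from '00998' or 'hgdp00998'."""
    text = raw.strip().upper()
    if text.isdigit():
        # Bare numbers are padded to five digits and given the prefix.
        text = "HGDP" + text.zfill(5)
    if not HGDP_ID.fullmatch(text):
        raise ValueError(f"not a recognizable HGDP identifier: {raw!r}")
    return text


def title_names_resource(study_title):
    """Check that a study title mentions the resource name (case-insensitive)."""
    return RESOURCE_NAME.lower() in study_title.lower()


print(normalize_hgdp_id("00998"))
print(title_names_resource(
    "whole genome resequencing 10x of HGDP-CEPH Human Genome Diversity Panel samples"
))
```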
V3.0 contains 12 datasets:
1) Dataset 1, a diversity genotype database (flat file and web format), contains genotypes generated by 17 HGDP-CEPH collaborators for 5662 SNPs, 843 microsatellites and 51 short indels, and for one gene deletion and duplication polymorphism (CYP2D6).
2) Dataset 2 (Stanford U) contains genotypes (flat files) for ~660,918 tag SNPs (Illumina HuHap 650k) in autosomes, chromosomes X and Y, the pseudoautosomal region and mitochondrial DNA, typed across 1043 individuals from all panel populations (Li JZ et al. Science 319: 1100-4, 2008).
3) Dataset 3 (NIH-UMich) contains genotypes (flat files) (and CNV calls, Dataset 6, below) for some 525,910 tag SNPs (Illumina HuHap 550k), all of which are included in the HuHap 650k genotyping panel, typed across 485 HGDP-CEPH individuals from 29 populations (Jakobsson M et al. Nature 451: 998-1003, 2008).
4) Dataset 4 (MPI-EVA) contains genotypes (flat files) for 488,755 SNPs (Affymetrix GeneChip Human Mapping 500K Array Set), typed across 255 individuals from all 52 HGDP-CEPH populations (5 samples per population) (López Herráez D et al. PLoS One 4(11): e7888, 2009). After merging the non-overlapping Affymetrix and Illumina (Dataset 2) datasets for 250 of these same individuals (no filters applied), we find genotypes for 939,383 unique SNPs.
5) Dataset 5 (UWash, flat files) contains calls for 6538 copy number variants (CNVs), size range 225-5,470,050 bp. These calls were ascertained in 883 unrelated HGDP-CEPH individuals (all panel populations) from SNP intensity data (Dataset 2), using rigorous statistical criteria and direct validation with CGH oligonucleotide arrays for CNV discovery on 12 panel individuals with 98 CNVs (Itsara A et al. Am J Hum Genet 84: 148-161, 2009). This study is also found at www.ebi.ac.uk/dgva/page.php or www.ncbi.nlm.nih.gov/dbvar, study number nstd27.
6) Dataset 6 (NIH-UMich, flat files) contains 3436 CNV calls, size range 2,019-998,213 bp, for 438 individuals in 29 populations. These calls were based on SNP intensity data (Dataset 3) and quality thresholds of the PennCNV algorithm. This study is found at www.ebi.ac.uk or www.ncbi.nlm.nih.gov, study number nstd30.
7) Dataset 7 (UNM) contains sequences from the D-loop region of mitochondrial DNA for 1064 HGDP-CEPH individuals. The number of base pairs sequenced per individual ranges from 1021 to 1047 (average 1044.4, median 1045). These sequences are found at www.ncbi.nlm.nih.gov.
8) Dataset 8 (MPI-EVA-Neandertal/Denisova) contains whole-genome, shotgun sequences at 4-6x coverage (Illumina GAII platform) for five HGDP-CEPH individuals, HGDP00778 (Han), HGDP00542 (Papuan), HGDP00927 (Yoruban), HGDP01029 (San) and HGDP00521 (French), as part of the Neandertal Genome Project (Green RE et al. Science 328: 710-722, 2010).
In addition, this dataset contains whole-genome sequences at 1.3-1.9x coverage for seven HGDP-CEPH DNAs, from HGDP00456 (Mbuti Pygmy), 00998 (Karitiana), 00665 (Sardinian), 00491 (Melanesian from Bougainville Island), 00711 (Cambodian), 01224 (Mongola), and 00551 (Papuan), as part of the characterization of an archaic hominin from Denisova Cave, Siberia (Reich D et al. Nature 468: 1053-1060, 2010). Sequences from these seven HGDP-CEPH genomes are available from the NCBI SRA, www.ncbi.nlm.nih.gov/sra/?term=hgdp-ceph.
9) Dataset 9 (MPI-EVA-Neandertal/aa-capture) contains sequences from 50 different HGDP-CEPH populations, covering ~14,000 protein-coding human lineage positions (Burbano HA et al. Science 328: 723-725, 2010). Paired-end sequences, derived from bar-coded genomic libraries, were captured on a single microarray containing these positions. Experiment ERX004007 contains all the HGDP-CEPH sequences in fastq files from runs ERR011028-ERR0011032 and can be downloaded from www.ebi.ac.uk/ena/data/view/ERX004007. For links between the sequence barcodes and the corresponding HGDP-CEPH identifiers, click on "View XML" at the top left of the web page.
10) Dataset 10 contains genotypes for 176 Y-STRs for HGDP-CEPH males. Genotypes are presented as repeat numbers. Descriptions of the Y-STRs used can be found in Kayser et al 2004 Am J Hum Genet 74: 1183-1197 and Ballantyne et al Am J Hum Genet 2010 87: 341-353. Genotyping procedures are as described in Vermeulen et al Forensic Sci Int Genet 2009 3: 205-213, and Ballantyne et al Forensic Science Int Genet 2011. doi:10.1016/j.fsigen.2011.04.017.
11) Dataset 11 (Harvard, flat files) contains data from 629,443 SNPs that were obtained by genotyping 934 "unrelated" HGDP-CEPH individuals with the soon-to-be-released Affymetrix Axiom® Human Origins Array Plate, and merging the genotypes with data from Neandertal, Denisova and chimpanzee. The SNP data are divided among 14 partially overlapping datasets, 13 of which are of value for analysis of different population genetics scenarios. For each of datasets 1-12, SNPs were discovered as heterozygotes by whole-genome shotgun sequencing of a different HGDP-CEPH individual of known ancestry, as per Keinan et al. Nature Genetics 39: 1251-1255, 2007. Dataset 13 contains heterozygote SNPs for each of which a random Denisovan allele matches that of chimpanzee, and the random San Bushman allele is derived. Dataset 14, which is valuable for studying population structure using a maximum number of SNPs, does not allow demographic modeling. This dataset combines all SNPs along with an additional 87,044 SNPs chosen to allow haplotype inference at mitochondrial DNA and the Y chromosome, and to provide overlap with previous Affymetrix and Illumina genotyping arrays so that users can merge the data available here with previously published datasets. IMPORTANT: Read the detailed technical document before using these data in order to avoid pitfalls. The array was developed by David Reich and colleagues in collaboration with Affymetrix for the purpose of generating data with clearly documented ascertainment.
12) Dataset 12 (MPI-EVA Denisova) contains high-coverage sequences of 10 HGDP-CEPH genomes: HGDP00456-Mbuti Pygmy (24.3x), HGDP00521-French (26.7x), 00542-Papuan (25.9x), 00665-Sardinian (24.7x), 00778-Han (27.7x), 00927-Yoruba (32.1x), 00998-Karitiana (26.0x), 01029-San (32.7x), 01284-Mandenka (24.5x), and 01307-Dai (28.3x) (rounded averages).
These sequences were determined for comparison with the genome of an archaic Denisovan individual (Meyer M et al. Science 338: 222-226, 2012).
The raw human sequences and alignments to hg19 are available in BAM format from http://cdna.eva.mpg.de/denisova/BAM/human/. The BAM files may be analyzed with sequence tool kits such as SAMtools and Picard.
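As a starting point for working with a downloaded BAM file, the minimal Python sketch below assembles a few standard SAMtools commands: an integrity check, a header dump and alignment summary statistics. The file name is hypothetical, the helper names are illustrative, and running the commands assumes that `samtools` is installed and on the PATH.

```python
import shlex
import shutil
import subprocess


def samtools_commands(bam_path):
    """Return basic SAMtools invocations for a first look at a BAM file."""
    return [
        ["samtools", "quickcheck", bam_path],   # verify the file is intact
        ["samtools", "view", "-H", bam_path],   # print the header only
        ["samtools", "flagstat", bam_path],     # summarize alignment flags
    ]


def run_checks(bam_path):
    """Run the commands in order, stopping at the first failure."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on PATH")
    for command in samtools_commands(bam_path):
        subprocess.run(command, check=True)


# Hypothetical local file name; substitute the BAM file actually downloaded.
for command in samtools_commands("HGDP00521.bam"):
    print(shlex.join(command))
```

The sketch only prints the commands; calling `run_checks("HGDP00521.bam")` would execute them against a real file.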