|
The HGDP-CEPH Diversity Panel Database is designed to receive and store polymorphic marker genotypes generated by users of the DNAs of the HGDP-CEPH Diversity Panel. The data are publicly accessible via a web interface (database V2.0 only) and/or as flat files (V2.0; Supplements 1 and 2). The panel DNAs are presently produced from 1063 lymphoblastoid cell lines (LCLs) representing some 1050 individuals sampled from 51 populations throughout the world. Each blood sample used for this resource was freely donated under conditions of informed consent. In addition to genotypes, the DB includes the geographic and population origin and the gender of each participating volunteer; volunteers are identified by code numbers only (HGDP identifiers). The panel contains 13 pairs of duplicate LCLs and LCLs from two genetically atypical individuals. The DB also contains a link to a file that identifies the LCLs of pairs of close relatives in the panel.
Panel users who want to submit marker genotypes and related information should apply for a password. It is then possible to submit electronic files directly via the DB web interface or by arrangement with the DB manager. Data submissions should provide identifying, genetic and genomic information for each marker that has been genotyped, i.e. official (HUGO/NCBI) marker/gene nomenclature, GenBank identifier, dbSNP identifier (rs or ss number), local marker name, type of marker (SNP, indel, STRP etc.), the allelic nucleotides for SNP loci and indels (as A,T,G,C,-) and allelic repeat sequences for STRPs, ancestral alleles, genetic map position defined by chromosome number and genome sequence coordinates (current NCBI build number required), 100bp of sequence flanking the allelic nucleotides, and genotyping technique(s). Submissions should also, of course, include the actual genotypes for each HGDP-CEPH individual (HGDP identifier): as allelic nucleotides (A,T,G,C,-) or numeric code (1 and 2) for SNPs and indels, and as the number of nucleotides in the allelic repeat or a numeric code for STRP markers, with the correspondence between numeric alleles and nucleotide alleles indicated in the file. The quality of submitted data will be checked at CEPH before the data are incorporated into the DB.
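As an illustration of the SNP and indel genotype coding described above, the following minimal Python sketch checks that a genotype uses either the nucleotide symbols (A,T,G,C,-) or the numeric code (1 and 2), and translates numeric alleles back to nucleotides through a correspondence table. The function names, the representation of a genotype as a pair of strings, and the example correspondence are assumptions made for illustration only; they do not describe the actual submission file layout or the quality checks performed at CEPH, and missing-genotype coding is not handled.

```python
# Hypothetical helpers: names and data layout are illustrative assumptions,
# not the DB's actual submission format.
NUCLEOTIDE_ALLELES = set("ATGC-")   # allelic nucleotides; '-' as listed in the text
NUMERIC_ALLELES = {"1", "2"}        # numeric code for SNPs and indels


def is_valid_snp_genotype(allele_a: str, allele_b: str) -> bool:
    """Return True if both alleles use the same, recognised coding scheme."""
    pair = {allele_a, allele_b}
    return pair <= NUCLEOTIDE_ALLELES or pair <= NUMERIC_ALLELES


def decode_numeric(genotype: tuple, correspondence: dict) -> tuple:
    """Translate numeric alleles to nucleotides using the correspondence
    that the submitter indicates in the file."""
    return tuple(correspondence[allele] for allele in genotype)


if __name__ == "__main__":
    print(is_valid_snp_genotype("A", "G"))                    # nucleotide coding
    print(is_valid_snp_genotype("A", "1"))                    # mixed coding
    print(decode_numeric(("1", "2"), {"1": "A", "2": "G"}))   # assumed mapping
```

Mixed codings within one genotype are rejected here because the numeric alleles only have meaning together with the stated correspondence.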
V2.0 contains public genotypes for some 4991 markers generated on the HGDP-CEPH Diversity Panel. These include 835 STRPs genotyped at the Marshfield Clinic Center for Human Genetics, and 4155 SNPs and indels, for a total of approximately 4.9 million genotypes. Links to NCBI, for genetic markers identified by official nomenclature and code, are included in the web version of V2.0.
Supplement 1 contains flat files of genotypes for 659,000 SNPs generated on the Illumina 650K platform by the panel user and colleagues at Stanford University. Supplement 2 contains flat files of genotypes for some 525,000 SNPs typed on the Illumina 550K platform, and for 1209 CNVs derived from the genotypes, by collaborating panel users at NIH and the University of Michigan. There is considerable overlap between the SNPs contained in the two platforms.
Data contributors who wish to have their genotypes protected temporarily should so indicate at the time of submission. Absence of such an indication will be taken as permission to consider the data public and to permit access to them via the DB website and flat files. Protection of data will end upon a relevant publication, 6 months after their submission, or upon notification by the contributor, whichever comes first. Contributors of protected data are kindly requested to send the DB manager citations of publications based on these data.
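The "whichever comes first" rule can be sketched as taking the earliest of the applicable dates. The Python example below is only an illustration: the function and parameter names are hypothetical, and reading "6 months" as six calendar months (with the day clamped to the end of shorter months) is an assumption, not a stated DB policy.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of shorter months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def protection_end(submitted: date,
                   published: Optional[date] = None,
                   released_by_contributor: Optional[date] = None) -> date:
    """Earliest of: publication, 6 months after submission, contributor notice."""
    candidates = [add_months(submitted, 6)]  # assumed: calendar months
    candidates += [d for d in (published, released_by_contributor) if d is not None]
    return min(candidates)


if __name__ == "__main__":
    # No publication or notice: protection lapses six months after submission.
    print(protection_end(date(2008, 1, 15)))
    # An earlier publication ends protection sooner.
    print(protection_end(date(2008, 1, 15), published=date(2008, 3, 1)))
```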
All panel users and visitors will be able to browse V2.0 public data on the web site, export them in LINKAGE-like and/or ARLEQUIN files and prepare data summaries, e.g. by population or by marker. The exported genotype files, especially those in ARLEQUIN format, will support a wide variety of population genetics analyses. In this web version of the DB, automatic analyses of public data will include allele, observed heterozygote and haplotype frequencies (for all genetic markers), and pairwise linkage disequilibrium measures (D', r²) and LD blocks for available chromosome regions (limited to SNPs) for pooled regional populations (caution!). Results of these analyses will be displayed on the DB web site.
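For readers who want to reproduce such summary statistics from exported genotypes, the following Python sketch computes allele frequencies, the observed heterozygote frequency, and the pairwise LD measures |D'| and r² using their standard textbook definitions. It is not the DB's own implementation. The function names and the representation of genotypes as pairs of allele symbols are assumptions, and the LD function takes haplotype and allele frequencies as given; estimating haplotype frequencies from unphased genotypes is a separate step that is not shown.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Relative frequency of each allele over all diploid genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


def pairwise_ld(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


if __name__ == "__main__":
    sample = [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G")]  # made-up genotypes
    print(allele_frequencies(sample))
    print(observed_heterozygosity(sample))
    print(pairwise_ld(0.5, 0.5, 0.5))
```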
The DB will contain DNA sequences (resequencing of panel DNAs) in future versions.
We depend on advice from HGDP-CEPH Diversity Panel users to improve the DB.
(Note that in this DB, chromosomes 23 and 24 refer to the unique portions of the X and the Y, respectively, and chromosome 25 to the pseudoautosomal regions.)
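When processing exported files, the numeric chromosome codes can be translated into conventional labels. The sketch below is illustrative only: the function name and the "PAR" label for the pseudoautosomal regions are assumptions, and only the meaning of codes 23, 24 and 25 comes from the note above.

```python
# Codes 23-25 as defined in the note above; the label strings are assumed.
SPECIAL_CHROMOSOMES = {
    23: "X",    # unique portion of the X
    24: "Y",    # unique portion of the Y
    25: "PAR",  # pseudoautosomal regions (label chosen for illustration)
}


def chromosome_label(code: int) -> str:
    """Map a numeric chromosome code to a readable label."""
    if 1 <= code <= 22:
        return str(code)  # autosomes keep their number
    if code in SPECIAL_CHROMOSOMES:
        return SPECIAL_CHROMOSOMES[code]
    raise ValueError(f"unknown chromosome code: {code}")


if __name__ == "__main__":
    for code in (1, 22, 23, 24, 25):
        print(code, chromosome_label(code))
```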
|