Frequently Asked Questions
- What is the Atlas of Variant Age?
- What is the Shared Ancestry Database?
- How to cite anything found on human.genome.dating?
- How to interpret age estimation profiles?
- How to interpret the cumulative coalescent function (CCF)?
- How to interpret the coalescent intensity function (CIF)?
- How was the ancestral/derived state determined?
- How were the CCFs aggregated per individual?
- How are the figures generated?
- How to download figures?
- How to download data?
- How much data is there?
How to interpret age estimation profiles?
...
How to interpret the cumulative coalescent function (CCF)?
...
How to interpret the coalescent intensity function (CIF)?
...
How was the ancestral/derived state determined?
...
How are the figures generated?
Every figure displayed on this website is dynamically generated in your browser using the data fetched for a given component from the human.genome.dating database. The underlying plotting library is Vega v4.4, which is built on D3 (Data-Driven Documents).
If you encounter problems with the visualisation of any figure, please try again using another browser. Most modern browsers (e.g. Chrome, Safari, Firefox, Opera) should be able to correctly display any of the figures by default. Note that JavaScript must be enabled in your browser.
How to download figures?
Every figure displayed on this website is dynamically generated in your browser and can be downloaded in PNG format.
Most modern browsers (e.g. Chrome, Safari, Firefox, Opera) should be able to correctly display and download any figure. However, some browsers may show errors when attempting to download a figure, because of the large number of visual components that need to be converted into a downloadable graphical format. For example, Chrome is known to block such requests if the number of components exceeds a certain threshold, resulting in a "network error". If you encounter problems with PNG downloads, please try again using another browser. Note that JavaScript must be enabled in your browser.
How to download data?
A download button is provided on every page that displays a figure.
Clicking the download button dynamically generates a file for the currently viewed component, and the download should start automatically.
The file is stored locally on your computer under the filename displayed next to the download button.
By default, all data files are generated in CSV format.
If you encounter problems with the downloading function, please try again using another browser. Note that you do not need to have JavaScript enabled to download data.
When using the Safari browser, file downloads work just fine, but the Safari console emits an error, which is a known issue in Safari.
How much data is there?
Data in the Atlas of Variant Age has an approximate size of 7.5 Terabytes.
Data in the Shared Ancestry Database has an approximate size of 22.1 Terabytes for the 1000 Genomes Project (TGP) sample, and 271.7 Gigabytes for the Simons Genome Diversity Project (SGDP) sample.
These numbers refer to the approximate total disk space required to store all downloads provided for age estimation profiles or pairwise shared ancestry results. The underlying database itself is smaller, approximately 300 Gigabytes, because all data in it are highly compressed, clustered, and cross-referenced. The framework includes several MySQL and SQLite3 databases, as well as static data files. A direct download of this database framework is not provided.
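To put these figures side by side, the short calculation below sums the three download sizes quoted above and compares the total with the size of the underlying database. It is only a back-of-the-envelope sketch: it assumes decimal units (1 Terabyte = 1000 Gigabytes), which the text does not specify, and all inputs are approximate.

```python
# Sizes quoted above, converted to gigabytes.
# ASSUMPTION: decimal units, i.e. 1 TB = 1000 GB.
atlas_gb = 7.5 * 1000          # Atlas of Variant Age
shared_tgp_gb = 22.1 * 1000    # Shared Ancestry Database, TGP sample
shared_sgdp_gb = 271.7         # Shared Ancestry Database, SGDP sample
database_gb = 300.0            # underlying compressed database

total_download_gb = atlas_gb + shared_tgp_gb + shared_sgdp_gb
ratio = total_download_gb / database_gb

print(f"{total_download_gb / 1000:.1f} TB of downloads, "
      f"about {ratio:.0f}x the database size")
```

Under that assumption, the downloads add up to just under 30 Terabytes, roughly one hundred times the size of the stored database.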
Data formats
Variant age profile data
Variant age estimation profiles can be downloaded by variant locus.
Profile data consists of pairwise inference results for the haplotype pairs that were analysed to estimate allele age.
Relative to a given variant locus, GEVA estimates the local haplotype segment shared between a pair of haplotypes (i.e. the position of recombination breakpoints), from which the TMRCA (time to the most recent common ancestor) is inferred.
Information from a larger number of pairs is combined to estimate allele age.
All data files are provided in tabular CSV format.
Each file has a meta-header (lines beginning with ##), which contains the download date, variant ID (rsID), allele information, and genomic location.
The first line following the meta-header is the actual CSV header that defines the number and names of each column in the data table.
Column names GammaAlpha_*, GammaBeta_*, and MeanTMRCA_* are distinguished by their suffix, which indicates the clock model used: mutation clock (Mut), recombination clock (Rec), or joint clock (Jnt).
The following table lists and describes each column.
Pair1st | Sample ID (as defined in a given data source) of the first diploid individual in the haplotype pair. Haplotypes are distinguished by ˜A and ˜B, referring to the first or second phased haplotype of a given individual (in order of appearance in the data set). |
Pair2nd | As above, but for the second haplotype in the pair. |
PairType | Type of pair; either Concordant (both haplotypes carry the focal allele) or Discordant (one carrier and one non-carrier). |
Source | Abbreviation of the data source; 1000 Genomes Project (TGP) or Simons Genome Diversity Project (SGDP). |
SegmentBreakLHS | Physical position (GRCh37) of inferred recombination breakpoint on left-hand side of focal variant position. Floating point values (as opposed to position integers) are given for consistency with the inferred physical length of a shared haplotype segment, which is bound by recombination occurring in between sites (rounded to the nearest 0.5 distance). |
SegmentBreakRHS | Physical position (GRCh37) of inferred recombination breakpoint on right-hand side of focal variant position. Floating point values (as opposed to position integers) are given for consistency with the inferred physical length of a shared haplotype segment, which is bound by recombination occurring in between sites (rounded to the nearest 0.5 distance). |
SegmentLength | Physical length of locally inferred shared haplotype segment. |
GeneticLength | Genetic length (in units of Morgan) of locally inferred shared haplotype segment, based on HapMap genetic maps. |
GammaAlpha_Mut | Inferred α parameter of Gamma distribution (describing posterior probability of TMRCA over time) in the mutation clock model; derived from the number of pairwise differences between the two haplotypes along the shared haplotype segment (after applying a correction to make this number consistent with expectations under the infinite-sites model), plus 1 due to the prior expectation of exponential coalescent times with rate = 1. |
GammaAlpha_Rec | Inferred α parameter of Gamma distribution (describing posterior probability of TMRCA over time) in the recombination clock model; derived from the number of inferred recombination breakpoints that delimit the shared haplotype segment (0 if it stretches along the whole chromosome, 1 if one-sided, and 2 if breakpoints were inferred on both sides), plus 1 due to the prior expectation of exponential coalescent times with rate = 1. |
GammaAlpha_Jnt | Inferred α parameter of Gamma distribution (describing posterior probability of TMRCA over time) in the joint clock model (which considers both mutational and recombinational information), plus 1 due to the prior expectation of exponential coalescent times with rate = 1. |
GammaBeta_Mut | Inferred β parameter of Gamma distribution (describing posterior probability of TMRCA over time) in the mutation clock model; derived from the physical length of the shared haplotype segment, the mutation rate (µ = 1.2 × 10⁻⁸ per site per generation), and Ne, where Ne = 10,000. |
GammaBeta_Rec | Inferred β parameter of Gamma distribution (describing posterior probability of TMRCA over time) in the recombination clock model; derived from the genetic length and, thus, variable recombination rates (based on HapMap genetic maps) along the shared haplotype segment and Ne, where Ne = 10,000. |
GammaBeta_Jnt | Inferred β parameter of Gamma distribution (describing posterior probability of TMRCA over time) in the joint clock model (which considers both mutational and recombinational information), derived from both the mutation and recombination rates (as given above) and Ne, where Ne = 10,000. |
MeanTMRCA_Mut | Mean posterior density of TMRCA of the inferred Gamma distribution under the mutation clock model (with parameters as given above), scaled by 2Ne, where Ne = 10,000. |
MeanTMRCA_Rec | Mean posterior density of TMRCA of the inferred Gamma distribution under the recombination clock model (with parameters as given above), scaled by 2Ne, where Ne = 10,000. |
MeanTMRCA_Jnt | Mean posterior density of TMRCA of the inferred Gamma distribution under the joint clock model (with parameters as given above), scaled by 2Ne, where Ne = 10,000. |
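The "plus 1" rule for the α parameter described in the table can be written out in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the mean is computed under the assumption that β is a rate parameter (so that the Gamma mean is α/β), which the table does not state explicitly. How the scaling by 2Ne enters the MeanTMRCA_* columns is not restated here, so it is left out.

```python
def gamma_shape(event_count):
    """Shape (alpha) = observed event count + 1.

    The +1 reflects the prior expectation of exponential coalescent
    times with rate 1, as described in the table above.
    """
    if event_count < 0:
        raise ValueError("event_count must be non-negative")
    return event_count + 1


def gamma_mean(alpha, beta):
    """Mean of a Gamma distribution.

    ASSUMPTION: beta is a rate parameter, so the mean is alpha / beta.
    If beta were a scale parameter, the mean would be alpha * beta.
    """
    return alpha / beta


# Recombination clock: 0, 1 or 2 inferred breakpoints delimit the segment.
shapes = [gamma_shape(n) for n in (0, 1, 2)]
print(shapes)
```

For the mutation clock, the same shape rule applies to the (corrected) number of pairwise differences along the shared segment.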
See example download on page: rs182549
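The file layout described above (meta-header lines beginning with ##, then a CSV header, then data rows) can be read with the Python standard library alone. The following is a minimal sketch, not an official reader: the function name is hypothetical, a comma delimiter is assumed, and the example content is made up for illustration (only a few of the documented columns are shown).

```python
import csv
import io


def read_profile_csv(text):
    """Split a downloaded file into meta-header lines and data rows.

    Lines starting with '##' are kept as raw meta-header strings; the
    first remaining line is treated as the CSV header.
    """
    meta, body = [], []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line[2:].strip())
        elif line.strip():
            body.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(body))))
    return meta, rows


# Made-up file content; sample IDs and values are placeholders.
example = """## made-up meta line
## another made-up meta line
Pair1st,Pair2nd,PairType,Source,MeanTMRCA_Jnt
S1,S2,Concordant,TGP,100.0
S1,S3,Discordant,TGP,900.0
"""

meta, rows = read_profile_csv(example)
concordant = [r for r in rows if r["PairType"] == "Concordant"]
print(len(meta), len(rows), len(concordant))
```

All values come back as strings, so numeric columns such as MeanTMRCA_Jnt need to be converted with float() before use.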
Variant age summary data
Variant dating results can be downloaded for the variants in a given gene or genomic region (as well as by chromosome; see bulk downloads).
Summary results are provided as a point-estimate of allele age for each variant (per data source).
Full results (variant age profiles) can be downloaded separately for each variant.
All data files are provided in tabular CSV format.
Each file has a meta-header (lines beginning with ##), which contains information about the download date and genomic location (as well as gene names if downloaded for a specific gene).
The first line following the meta-header is the actual CSV header that defines the number and names of each column in the data table.
Columns starting with AgeMode_*, AgeMean_*, AgeMedian_*, AgeCI95Lower_*, AgeCI95Upper_*, and QualScore_* are distinguished by their suffix, which indicates the clock model used: mutation clock (Mut), recombination clock (Rec), or joint clock (Jnt).
The following table lists and describes each column.
VariantID | Genetic variant rsID. Note that some variant IDs begin with X followed by a unique numeric string; this is because the data source from which allele age has been estimated either did not contain rsID information or matching of the variant to reference data (Ensembl) was inconclusive (matched by genomic location and allelic states). |
Chromosome | Human chromosome 1 to 22 (i.e. autosome) on which a given variant is located. |
Position | Physical position of variant on chromosome (GRCh37). |
AlleleRef | Reference allele. |
AlleleAlt | Alternate allele. Note that the alternate allele was assumed to be derived in all analyses, which may not correctly distinguish ancestral and derived states. Dating results reflect the estimated age of the alternate allele. |
AlleleAnc | Ancestral allele according to external reference data (Ensembl), or . if unknown at the time of data upload. |
DataSource | Abbreviation of the data source used to date a given variant; 1000 Genomes Project (TGP), Simons Genome Diversity Project (SGDP), or combined from both (Combined). For the latter, results from the pairwise inference of TMRCA between haplotype pairs (independently in each data set) were combined to re-estimate allele age. |
NumConcordant | Number of concordant haplotype pairs (both carrying the derived/alternate allele) available that were analysed (shared haplotype detection and inference of TMRCA) to eventually estimate allele age. All concordant pairs were sampled at random from the set of possible concordant pairs for a given variant. |
NumDiscordant | Number of discordant haplotype pairs (carrier and non-carrier haplotypes) available. Discordant pairs were sampled after applying a "relaxed" prioritisation algorithm to identify non-carrier haplotypes that are the nearest genealogical neighbours to the focal sub-tree (carriers). Effectively, on average, half the pairs were selected from a prioritised set, and the other half was sampled at random. |
AgeMode_* | Allele age estimate taken at the mode of the composite posterior distribution, resulting from combining TMRCA information across available haplotype pairs (after filtering) under the mutation clock (suffix Mut), recombination clock (suffix Rec), or joint clock (suffix Jnt) model. |
AgeMean_* | Allele age estimate taken at the mean of the composite posterior distribution; as above. |
AgeMedian_* | Allele age estimate taken at the median of the composite posterior distribution; as above. |
AgeCI95Lower_* | 95% confidence interval, lower bound; estimated by computing the cumulative composite posterior distribution. |
AgeCI95Upper_* | 95% confidence interval, upper bound; estimated by computing the cumulative composite posterior distribution. |
QualScore_* | Quality score, calculated from the proportion of concordant/discordant pairs retained after filtering of outlier haplotype pairs. |
See example download on page: LCT
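Because each statistic appears once per clock model, it can be convenient to collect the columns for a single clock into one record. The sketch below shows one way to do that with the standard library; the helper name is hypothetical, the variant ID and numbers are placeholders, and only a subset of the documented columns is included in the made-up example.

```python
import csv

CLOCKS = ("Mut", "Rec", "Jnt")
STATS = ("AgeMode", "AgeMean", "AgeMedian",
         "AgeCI95Lower", "AgeCI95Upper", "QualScore")


def estimates_for(row, clock):
    """Return {statistic: float} for one clock model from one CSV row.

    Statistics whose column is absent from the row are skipped.
    """
    if clock not in CLOCKS:
        raise ValueError(f"unknown clock model: {clock}")
    return {stat: float(row[f"{stat}_{clock}"])
            for stat in STATS if f"{stat}_{clock}" in row}


# Made-up file content; the ID and all values are placeholders.
example = """## made-up meta line
VariantID,DataSource,AgeMode_Jnt,AgeCI95Lower_Jnt,AgeCI95Upper_Jnt,QualScore_Jnt
rs_example,TGP,250.0,200.0,310.0,0.9
rs_example,SGDP,240.0,150.0,400.0,0.8
"""

# Drop the meta-header, then parse the remaining lines as CSV.
data_lines = [ln for ln in example.splitlines()
              if ln and not ln.startswith("##")]
rows = list(csv.DictReader(data_lines))

# One row per variant and data source, so filter on DataSource first.
tgp = [r for r in rows if r["DataSource"] == "TGP"]
jnt = estimates_for(tgp[0], "Jnt")
print(jnt["AgeMode"], jnt["AgeCI95Lower"], jnt["AgeCI95Upper"])
```

Filtering on DataSource before comparing estimates matters because the same variant can appear once per data source, and once more for the Combined result.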
Ancestry shared between two individuals
...
Sample-wide shared ancestry
...