Unlocking the potential of large prospective biobank cohorts for -omics data analysis: aspects of study design, prediction and causality

author: Krista Fischer, University of Tartu
published: July 18, 2016, recorded: May 2016




The past decade has seen a tremendous increase in the availability of data from large population-based biobank cohorts. Such datasets include various types of -omics data (genomics, transcriptomics, metabolomics, etc.) as well as extensive data on participants' health, lifestyle, and demographics at recruitment, and often also detailed follow-up data from electronic health registries and other databases. This talk discusses aspects of study design and statistical analysis for such datasets.

First, options for the analysis of follow-up data are discussed, with the aim of evaluating potential -omics-based predictive biomarkers. One important issue is the choice of timescale. Unlike in traditional survival analysis projects, the time of recruitment does not mark any important event (such as the diagnosis of a serious disease) in a participant's life course, and therefore time since recruitment may not be the optimal timescale to use. The best choice also depends on the type of biomarker considered: whether it reflects the participant's current health status (as metabolomics data do, for instance) or is determined at birth (DNA-based markers). We illustrate the concepts using both simulated data and the Estonian Biobank cohort, in order to understand the optimal analysis strategy in each of the situations considered. Another issue is study design, especially when only a subset of a large cohort can be selected for genotyping or other sample processing to obtain the relevant -omics data. Here, the potential of the nested case-control study design is discussed.
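As a minimal sketch (not taken from the talk) of the nested case-control idea: for each case, a fixed number of controls is drawn from the risk set, i.e. cohort members still under follow-up and event-free at the case's event time, and only this sampled subset then needs expensive -omics profiling. The data layout and field names below are illustrative assumptions.

```python
import random

def nested_case_control(cohort, m, seed=0):
    """Sample m controls per case from the case's risk set.

    cohort: list of dicts with 'id', 'time' (exit time) and 'event' (0/1).
    Returns {case_id: [control_ids]}.
    """
    rng = random.Random(seed)
    sampled = {}
    for case in (p for p in cohort if p["event"] == 1):
        # Risk set: everyone still at risk at the case's event time.
        risk_set = [p["id"] for p in cohort
                    if p["time"] >= case["time"] and p["id"] != case["id"]]
        sampled[case["id"]] = rng.sample(risk_set, min(m, len(risk_set)))
    return sampled

# Toy cohort: participants 1 and 3 experience the event.
cohort = [
    {"id": 1, "time": 2.0, "event": 1},
    {"id": 2, "time": 5.0, "event": 0},
    {"id": 3, "time": 3.5, "event": 1},
    {"id": 4, "time": 6.0, "event": 0},
    {"id": 5, "time": 4.0, "event": 0},
]
print(nested_case_control(cohort, m=2))
```

If age rather than time-on-study were used as the timescale, the risk-set condition would compare ages at event, with entry age handled as left truncation.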

The second topic is the use of genetic data in personalized risk prediction. Large biobank cohorts provide data to compare and validate such predictors. For common complex diseases, the polygenic nature of the disease has to be taken into account, and multimarker scores therefore have considerably better predictive ability than any single SNP. It is important to reach an optimal decision on the choice of genetic markers for the score as well as on the weights used to combine them. The concept is illustrated with the example of Type 2 Diabetes risk prediction in the Estonian Biobank data.
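A multimarker score of this kind is, at its core, a weighted sum of allele dosages; the weights are typically taken from published GWAS effect sizes. The sketch below uses made-up numbers purely for illustration and is not the scoring method used in the talk.

```python
import numpy as np

def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages, one score per individual.

    dosages: (n_individuals, n_snps) matrix of allele counts (0, 1 or 2).
    weights: per-SNP effect sizes, e.g. log odds ratios from a GWAS.
    """
    return np.asarray(dosages) @ np.asarray(weights)

# Two individuals, three SNPs; effect sizes are invented for the example.
dosages = np.array([[0, 1, 2],
                    [2, 0, 1]])
weights = np.array([0.10, -0.05, 0.20])
print(polygenic_score(dosages, weights))  # scores 0.35 and 0.40
```

The two open choices the talk highlights, which markers enter the score and how they are weighted, correspond here to which columns of `dosages` are kept and what goes into `weights`.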

Finally, some aspects of causal modeling are discussed. The availability of large cohorts has encouraged many researchers to use Mendelian randomization methodology to estimate causal effects of various lifestyle and clinical parameters on outcomes. However, causal inference techniques always rely on untestable assumptions, and these are often forgotten. We discuss whether it is possible to distinguish between alternative causal scenarios (such as mediation and pleiotropy) in the case of one genetic and two non-genetic variables.
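To make the Mendelian randomization idea concrete, here is a hedged simulation sketch of the Wald ratio estimator: the causal effect of an exposure X on an outcome Y is estimated as the ratio of the genotype-outcome and genotype-exposure regression slopes. The simulation parameters are invented, and the data are generated so that the core (in practice untestable) assumption holds, namely that the genetic variant G affects Y only through X, with no pleiotropy.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

g = rng.binomial(2, 0.3, n)            # SNP dosage, minor allele freq. 0.3
x = 0.5 * g + rng.normal(size=n)       # exposure, instrumented by g
y = 0.8 * x + rng.normal(size=n)       # outcome; true causal effect is 0.8

# Slopes of the two instrument regressions, then their ratio.
beta_gx = np.cov(g, x)[0, 1] / np.var(g, ddof=1)
beta_gy = np.cov(g, y)[0, 1] / np.var(g, ddof=1)
print(beta_gy / beta_gx)               # close to the true effect 0.8
```

Under mediation the ratio recovers the causal effect, but if G also affected Y directly (pleiotropy), the same observed associations could arise with a different causal effect, which is exactly the identifiability question the talk raises.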

Download slides: ESHGsymposium2016_fischer_biobank_01.pdf (4.9 MB)
