Machine Learning in Systems Biology
Pascal

Discovering Common Sequence Variation in Arabidopsis thaliana

author: Gunnar Rätsch, Max Planck Institute

Description

To characterize natural sequence variation in 20 strains of the model plant Arabidopsis thaliana, whole-genome resequencing with high-density oligonucleotide arrays was performed in collaboration with Perlegen Sciences Inc. Array data were analyzed with a combination of existing model-based (MB; Hinds et al., Science, 2005) and novel machine learning (ML) methods. For the identification of single nucleotide polymorphisms (SNPs), we developed an algorithm based on support vector machines. Training and evaluation were performed on published alignments (Nordborg et al., PLoS Biology, 2005). At the same false discovery rate (FDR) as MB, the ML algorithm identifies significantly more true SNPs, especially in regions of high polymorphism density and/or low hybridization quality. The union of SNP predictions from both methods contains on average 143,572 SNPs per strain at an FDR of 2.8% (648,570 non-redundant SNPs). Furthermore, a machine learning algorithm was developed to detect polymorphic regions containing insertions, deletions, and variational hotspots, where SNP detection algorithms typically fail to identify individual SNPs. It discovers the approximate location of a substantial additional proportion of polymorphisms (54% of deleted nucleotides and 33% of insertion sites). With a combination of all three methods, 74% of SNPs can be directly called or are contained in a polymorphic region prediction (Zeller et al., in preparation). We examined the patterns of, and forces shaping, sequence variation in Arabidopsis (Clark et al., Science, 2007): for example, significant differences were observed between gene families, and genes mediating interaction with the biotic environment harbor exceptional polymorphism levels.
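The abstract compares two SNP call sets by their false discovery rate and then reports figures for their union. The following is a minimal sketch of that bookkeeping, not the authors' pipeline: the function names and the toy positions are hypothetical and chosen only for illustration. It assumes SNP calls are represented as sets of genomic positions and that a reference set of known SNPs (such as the published alignments mentioned above) is available for evaluation.

```python
def false_discovery_rate(predicted, truth):
    """Fraction of predicted SNP positions that are absent from the reference set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    false_calls = predicted - set(truth)
    return len(false_calls) / len(predicted)


def recall(predicted, truth):
    """Fraction of reference SNP positions that were recovered by the predictions."""
    truth = set(truth)
    if not truth:
        return 0.0
    return len(set(predicted) & truth) / len(truth)


# Toy data (hypothetical positions, not taken from the study).
truth = {10, 25, 40, 57, 83, 91}      # known SNPs in the evaluation region
mb_calls = {10, 25, 40, 99}           # calls from a model-based method
ml_calls = {10, 40, 57, 83, 120}      # calls from a machine learning method

# The union keeps every position called by at least one method.
union_calls = mb_calls | ml_calls

for name, calls in [("MB", mb_calls), ("ML", ml_calls), ("union", union_calls)]:
    print(name,
          "FDR=%.3f" % false_discovery_rate(calls, truth),
          "recall=%.3f" % recall(calls, truth))
```

In this toy example the union recovers more reference SNPs than either method alone, while also accumulating the false calls of both. That trade-off is why the abstract reports the FDR of the union separately.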

Slides
0:00 - Discovering Common Sequence Variations in Arabidopsis thaliana - Announcement
1:34 Discovering Common Sequence Variations in Arabidopsis thaliana
1:46 Introduction - 1
3:07 Introduction - 2
4:02 Introduction - 3
5:02 Introduction - 4
6:00 Resequencing Array Basics I
7:30 Resequencing Array Basics II - 1
8:11 Resequencing Array Basics II - 2
8:52 Resequencing Data - 1
10:00 Resequencing Data - 2
11:23 Resequencing Data - 3
12:10 Support Vector Machines for SNP Identification - 1
13:00 Support Vector Machines for SNP Identification - 2
13:40 Support Vector Machines for SNP Identification - 3
13:59 Support Vector Machines for SNP Identification - 4
14:07 Support Vector Machines for SNP Identification - 5
14:57 2-Layered Architecture for Inter-Strain Integration - 1
15:31 2-Layered Architecture for Inter-Strain Integration - 2
16:20 2-Layered Architecture for Inter-Strain Integration - 3
18:15 Application to SNP Discovery
21:20 Limitations of the Technique - 1
21:30 Limitations of the Technique - 2
22:57 Limitations of the Technique - 3
25:33 For this Work We Used the Shogun Toolbox
25:58 JMLR - Machine Learning Open Source Publications
26:49 New Problems and Methods in Computational Biology
27:18 Modeling Polymorphic Regions - 1
28:01 Modeling Polymorphic Regions - 2
28:46 Modeling Polymorphic Regions - 3
29:28 Example - 1
30:25 Learning to Predict Segmentations
32:35 Example - 1
33:34 Example - 2
33:35 Detection Performance
34:21 Complementing SNP Calls
36:28 Polymorphism Distribution - 1
38:00 Polymorphism Distribution at Gene Boundaries
39:32 Polymorphism Distribution - 2
41:03 Modeling Polymorphic Regions - 3
41:57 Predicted Effects on Gene Products - 1
43:08 Predicted Effects on Gene Products - 2
43:47 Effects on Genes - 1
44:37 Effects on Genes - 2
45:17 Effects on Genes - 3
46:29 Ab initio Gene Finding - 1
46:59 Ab initio Gene Finding - 2
47:42 Predicted Effects by Gene Finding
48:38 Example of Predicted Splice Form Change
49:08 Conclusions - 1
49:45 Conclusions - 2
50:40 Conclusions - 3
50:56 Conclusions - 4
51:31 - Questions
