Predicting anti-cancer molecule activity using machine learning algorithms
published: April 17, 2008, recorded: March 2008, views: 419264
In this paper we study the anti-cancer activity of approximately 4,000 unique compounds against a set of 60 cell lines (e.g. Leukemia, Prostate, Breast). Small molecules play an important role in biology: they can be used as building blocks for more complex molecules, and they interact with proteins, inhibiting or promoting their action. The consequence of adding such a compound to a cell can be far-reaching, as the protein may be involved in a very complex chain reaction. It is therefore possible to design small molecules that are useful drugs. Here we concentrate only on predicting one property of a given molecule: whether it will show anti-cancer activity (measured as causing at least 50% cell growth inhibition) against a given cancerous cell line. This computational prediction is important because the number of small molecules in databases worldwide is growing while the capacity for proper lab testing is limited. For instance, the In Vitro Cell Line Screening Project at the National Cancer Institute (NCI) can currently evaluate only up to 3000 compounds per year for potential anti-cancer activity. From a machine learning perspective, biological problems are a good application because datasets are abundant, the data is real, the type of algorithm most suitable for a particular problem may vary substantially, and it is not unusual for a problem to highlight research needs in machine learning. Finally, helping to solve biological problems may have a big impact on the wider scientific community.

The molecule dataset we used is publicly available at the NCI site. We applied a range of data mining classification algorithms to this problem: Decision Trees, Inductive Logic Programming (ILP) and Support Vector Machines (SVMs). As molecular features for learning we used molecular weight, the octanol-water partition coefficient (logP) and fragment counts. A fragment is a set of connected atoms in which each atom is identified simply by its type (e.g. carbon). If we look at the molecule as a graph, the fragment list consists of all connected components with diameter two.

The experiments demonstrate that our results using support vector machines (with an RBF kernel) match previously published state-of-the-art work, yielding an average predictive accuracy of 73% (against a baseline of 54%). We noticed, however, to our surprise, that if we use only atom counts instead of fragment counts the results are nearly identical (about 1% lower accuracy, although the difference is statistically significant). An important point is that, although numerical black-box algorithms like SVMs tend to be slightly more accurate than logic models (Decision Trees and ILP on this dataset have an accuracy 3% to 4% below SVMs), the relevance of this difference in predictive accuracy for important practical applications like drug design is debatable. In a drug design setting, what is useful is a set of rules that describe what a "good" compound should look like. That goal is much more easily achieved with a human-readable logic model like the ones we also describe in the paper.
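
The two feature types compared in the abstract, atom counts and fragment counts, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a molecule is given as a list of atom types plus a list of bonds (index pairs), and it reads "diameter two" as a chain of three connected atoms, which is only one possible interpretation of the description. The function names `atom_counts` and `fragment_counts` are hypothetical.

```python
from collections import Counter


def atom_counts(atoms):
    """Count how often each atom type occurs in the molecule."""
    return Counter(atoms)


def fragment_counts(atoms, bonds):
    """Count three-atom chains (end, centre, end), labelled by atom type.

    Assumption: a "fragment of diameter two" is taken to mean a centre
    atom together with two of its distinct neighbours. Bond types are
    ignored, since atoms are identified only by their type.
    """
    # Build an undirected adjacency map from the bond list.
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    counts = Counter()
    for centre, nbrs in neighbours.items():
        ordered = sorted(nbrs)
        # Every unordered pair of neighbours gives one chain through centre.
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                # Sort the end labels so C-C-O and O-C-C share one key.
                first, last = sorted((atoms[left], atoms[right]))
                counts[(first, atoms[centre], last)] += 1
    return counts


# Heavy atoms of propan-2-ol: a central carbon bonded to two carbons
# and one oxygen (hydrogens omitted for brevity).
atoms = ["C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (1, 3)]

print(atom_counts(atoms))
print(fragment_counts(atoms, bonds))
```

Either count dictionary can be turned into a fixed-length numeric vector (one column per atom type or fragment label seen in the dataset) and combined with molecular weight and logP before being passed to a classifier.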