Predicting anti-cancer molecule activity using machine learning algorithms
Description
In this paper we study the anti-cancer activity of approximately 4,000 unique compounds against a set of 60
cell lines (e.g. Leukemia, Prostate, Breast). Small molecules play an important role in biology as
they can be used as building blocks for more complex molecules and also interact with proteins
inhibiting or promoting their action. In this case the consequence of adding such a compound to a
cell can be far reaching as the protein may be involved in a very complex chain reaction. As such
it is possible to design small molecules which can be useful drugs. Here we concentrate only on
predicting a property of a given molecule: whether it will show anti-cancer activity (measured as
causing at least 50% cell growth inhibition) against a given cancerous cell line. This computational
prediction is important as there is a growing number of small molecules in databases worldwide
and the capacity for proper lab testing is limited. For instance, the In Vitro Cell Line Screening
Project at the National Cancer Institute (NCI) can currently evaluate (only) up to 3000 compounds
per year for potential anti-cancer activity. From a machine learning perspective, biological
problems are a good application because datasets are abundant, the data is real, the type of
algorithms most suitable for a particular problem may vary substantially, and it is not unusual for
a problem to highlight research needs in machine learning. Finally, helping to solve biological
problems may have a big impact on the wider scientific community. The molecule dataset we used
is publicly available at the NCI site. We applied a range of data mining classification algorithms
to this problem: Decision Trees, Inductive Logic Programming and Support Vector Machines
(SVMs). The molecular features used for learning were molecular weight, the octanol-water
partition coefficient (logP), and fragment counts. A fragment is a set of connected atoms
where each atom in a fragment is identified simply by its type (e.g. carbon). If we look at the
molecule as a graph, the fragment list consists of all connected components with diameter two.
The experiments demonstrate that our results using support vector machines (with an RBF kernel)
are identical to previously published state-of-the-art work, yielding an average predictive
accuracy of 73% (against a baseline of 54%). We noticed, however, to our surprise, that if instead of
using fragment counts we use only atom counts the results are nearly identical (about 1% lower
accuracy, although the difference is statistically significant). An important point that must be made
is that, although numerical black-box algorithms like SVMs tend to be slightly more accurate than
logic models (Decision Trees and ILP on this dataset have an accuracy 3% to 4% below SVMs),
the relevance of this predictive accuracy for important practical applications like
drug design is arguable. In a drug design setting what is useful is a set of rules that describe what
a "good" compound should look like. That goal is much more easily achieved with a human-readable
logic model like the ones we also describe in the paper.
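The two feature types compared in the abstract, atom counts and fragment counts, can be illustrated with a minimal sketch. It is not the authors' code. It assumes a molecule is given as a list of atom types plus a list of bonds (index pairs), and it reads "diameter two" in the simplest way, as linear three-atom paths identified only by atom type; the function names `atom_counts` and `fragment_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def atom_counts(atoms):
    """Count how many atoms of each type the molecule contains."""
    return Counter(atoms)


def fragment_counts(atoms, bonds):
    """Count three-atom paths (two bonds), keyed by atom types only.

    Assumption: a 'fragment of diameter two' is taken to be a linear path
    end-centre-end. The two ends are sorted so that C-C-O and O-C-C are
    counted as the same fragment.
    """
    # Build an adjacency map from the bond list.
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    counts = Counter()
    for centre, nbrs in neighbours.items():
        # Every pair of distinct neighbours of the centre gives one path.
        for left, right in combinations(sorted(nbrs), 2):
            end_a, end_b = sorted((atoms[left], atoms[right]))
            counts[(end_a, atoms[centre], end_b)] += 1
    return counts


# Heavy atoms of acetic acid: CH3-C(=O)-OH (hydrogens and bond orders ignored).
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1), (1, 2), (1, 3)]

print(atom_counts(atoms))             # C: 2, O: 2
print(fragment_counts(atoms, bonds))  # (C, C, O): 2, (O, C, O): 1
```

Under these assumptions, either counter can be turned into a fixed-length numeric vector (one column per atom type or fragment type seen in the dataset) and passed to a classifier. The atom-count vector is the much smaller of the two, which is what makes the reported gap of about 1% notable.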
| Time | Slide |
| 0:00 | Predicting anti-cancer molecule activity using machine learning algorithms |
| 0:29 | Problem |
| 0:43 | Motivation |
| 1:07 | National Cancer Institute dataset |
| 2:15 | Compound information - 1 |
| 2:45 | Compound information - 2 |
| 3:45 | Data example |
| 4:14 | Machine learning algorithms |
| 4:26 | Decision trees: C5.0 |
| 5:27 | Inductive logic programming: Progol |
| 7:43 | Support Vector Machines: LIB SVM |
| 8:57 | Comparison with other work |
| 9:56 | Some classification results per cell line |
| 10:22 | Overall classification results |
| 11:00 | Generalizing fragment counts - 1 |
| 11:28 | Compound information - 2 |
| 11:46 | Generalizing fragment counts - 2 |
| 12:35 | Questions |