Quantitative Structure-Activity Relationship (QSAR) models attempt to relate the structure of a molecule to its chemical and biological activity. The literature reports that some physical properties, such as the boiling point of a compound, can be predicted very precisely by a neural network trained on inputs derived from the compound's molecular structure.
Just as various atomic descriptors distinguish the elements of the periodic table, molecular descriptors capture the differences in structure between molecules. These descriptors relate the structure of a molecule to its biological activity. At openscience.org, there are at least 50 descriptor types that give over 280 measurements of each molecule. Using classification algorithms, one may try to determine which descriptors, if any, predict the specific activity. If a suite of algorithms applied to these descriptors still cannot predict the response variable well enough, the conclusion we may draw is that new molecular descriptors are required.
The dataset in question has 9204 molecules in SMILES format. However, converting it from PDF to CSV format has been quite an obstacle. I tried both free and commercial converters, but the conversion always produced errors, which led to data loss in the form of incorrect rows in the dataset. Finally, Tabula from Tabula Technology appears to do the job well. I found the link to Tabula on the openscience.org website, which champions open-source solutions for science projects.
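One way to catch the kind of incorrect rows described above is to check that every row of the converted CSV has the same number of fields as the header. The following is only a minimal Python sketch of that check. The function name `find_malformed_rows`, the two-column layout, and the sample rows are hypothetical and are not taken from the actual dataset.

```python
import csv
import io


def find_malformed_rows(csv_text):
    """Return (line_number, field_count) for each row whose width differs from the header's."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    expected = len(header)
    problems = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the number of the source line just read
            problems.append((reader.line_num, len(row)))
    return problems


# Hypothetical sample: line 3 lost its class label, line 4 gained a stray field.
sample = (
    "smiles,class\n"
    "CCO,Blocker\n"
    "c1ccccc1\n"
    "CC(=O)O,Non-Blocker,extra\n"
)
problems = find_malformed_rows(sample)
print(problems)
```

A check like this only detects rows with the wrong width. Rows where a value was shifted into the wrong column but the field count is still correct would need a separate test, for example validating that the class column contains only the two expected labels.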
These molecules need to be classified into two categories, which are given for 9107 of the molecules. For the remaining 97 molecules, the category has been blinded. Naïve Bayes classification is traditionally the first approach to try, but with 287 input variables it may be difficult to assume that the variables are independent. Logistic regression and support vector machine (SVM) models may also be tried before resorting to deep neural networks. The rattle package in R offers many of these functionalities.
An initial trial shows that the R² is no greater than 0.51. Does an R² of 0.51 mean that one could just as well toss a coin to guess the class of a molecule? It does not. One can conclude that the descriptors explain only about 51% of the variation in the category, and that may be good news if the association reflects causation.
Tables 1 and 2 display the classification accuracy on the validation and test data, each of which was 15% of the total dataset. The logistic regression classifies the blocker molecules much more accurately than the non-blocker molecules. In the given dataset, 87% of the molecules are of one class and 13% are of the other. The classification algorithm is therefore predisposed to fit the majority class more accurately. If the weighting can be adjusted to give more significance to the smaller class, the classification algorithm should do better on that class.
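To make the weighting idea concrete, here is a small Python sketch (not the rattle workflow used for the trial) that computes inverse-frequency class weights for an 87% / 13% split, together with the accuracy that a trivial "always predict the majority class" rule would reach. The labels `majority` and `minority` and the 100-item sample are illustrative assumptions. The formula `n_samples / (n_classes * class_count)` is one common convention, and to my knowledge it is the one behind scikit-learn's `class_weight="balanced"` option.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels mirroring the 87% / 13% split (not the real data).
labels = ["majority"] * 87 + ["minority"] * 13

weights = balanced_class_weights(labels)
# Accuracy of a rule that always predicts the most frequent class.
majority_baseline = max(Counter(labels).values()) / len(labels)

print(weights)
print(majority_baseline)
```

With these weights, each minority-class molecule counts roughly 6.7 times as much as a majority-class molecule in the fitting criterion. The baseline also shows why raw accuracy is a weak yardstick here: a classifier has to beat 87%, not 50%, to improve on guessing the majority class.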
Table 1: Validation data (rows are actual classes, columns are predicted classes)
ACTUAL | PREDICTED Blocker | PREDICTED Non-Blocker | Error (%) |
Blocker | 1061 | 34 | 3.1 |
Non-Blocker | 145 | 76 | 65.6 |
Table 2: Test data (rows are actual classes, columns are predicted classes)
ACTUAL | PREDICTED Blocker | PREDICTED Non-Blocker | Error (%) |
Blocker | 1053 | 30 | 2.8 |
Non-Blocker | 171 | 63 | 73.1 |
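The per-class error percentages in the two tables above can be reproduced from the counts alone. The Python sketch below does that and also reports overall accuracy and balanced accuracy (the mean of the per-class recalls). The function name `summarize` and the dictionary layout are illustrative choices, and labelling the first table as validation and the second as test is an assumption based on the order in which the text mentions them.

```python
def summarize(matrix):
    """matrix maps each actual class to a dict of predicted class -> count."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[actual] for actual, row in matrix.items())
    # Recall per class: correctly predicted / all molecules of that actual class.
    recalls = {actual: row[actual] / sum(row.values()) for actual, row in matrix.items()}
    return {
        "accuracy": correct / total,
        "error_pct": {cls: round(100 * (1 - r), 1) for cls, r in recalls.items()},
        "balanced_accuracy": sum(recalls.values()) / len(recalls),
    }


# Counts copied from the two tables (assumed order: validation, then test).
validation = {
    "Blocker": {"Blocker": 1061, "Non-Blocker": 34},
    "Non-Blocker": {"Blocker": 145, "Non-Blocker": 76},
}
test = {
    "Blocker": {"Blocker": 1053, "Non-Blocker": 30},
    "Non-Blocker": {"Blocker": 171, "Non-Blocker": 63},
}

validation_summary = summarize(validation)
test_summary = summarize(test)
print(validation_summary)
print(test_summary)
```

Overall accuracy comes out at roughly 0.86 and 0.85, while balanced accuracy is only about 0.66 and 0.62. The gap reflects how strongly the majority class dominates the headline figure.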
Author:
Dr. Badri Toppur
Associate Professor, Rajalakshmi School of Business, Chennai
Email – badri.toppur@rsb.edu.in