A Hybrid Approach to NLP in Drug Discovery
Artificial intelligence-powered technologies such as natural language processing (NLP) are becoming critical to the pharmaceutical and life sciences industries, which are overwhelmed by volumes of data, almost 80 per cent of which exists as unstructured text that is difficult to access and use. The availability of domain-driven, easy-to-use NLP technologies plays a central role in enabling businesses to mobilise unstructured data at scale and to embrace a truly data-driven approach to insight generation and innovation.
NLP solutions are now being used at all stages of drug discovery, from analyzing clinical trial digital pathology data to identifying predictive biomarkers. These technologies have been proven to significantly reduce cost and cycle times, enhance the scope and accuracy of analysis and provide new insights that accelerate the development of new drugs.
However, NLP in drug discovery is not a monolithic concept. There are several possible approaches, each of which may be particularly suited for specific applications. Moreover, any comprehensive solution for integrated enterprise-wide analysis will likely require a blended or hybrid NLP approach.
So here's a quick dive into some of the key approaches to NLP in drug discovery.
Key NLP approaches
NLP consists of two main phases: data preprocessing and algorithm development. NLP algorithms can be classified under three main types: rules-based, machine learning (ML)-based and hybrid approaches.
Rules-based systems depend on carefully curated sets of linguistic rules designed by experts to classify content into relevant categories. This approach emerged during the early days of NLP development and is still in use today. However, a rules-based approach requires a lot of manual input and is best suited for linguistic tasks where the rule base is readily available and/or manageably small. It becomes practically impossible to manually generate and maintain rules for complex environments.
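To illustrate the rules-based approach described above, here is a minimal Python sketch that classifies sentences with a handful of hand-written regular expressions. The category names and patterns are assumptions invented for this example, not rules taken from any real system.

```python
import re

# Hypothetical, hand-written rules: each category maps to one regex pattern.
RULES = {
    "adverse_event": re.compile(r"\b(nausea|headache|rash|toxicity)\b", re.IGNORECASE),
    "dosage": re.compile(r"\b\d+(\.\d+)?\s?(mg|ml|mcg)\b", re.IGNORECASE),
}

def classify(sentence):
    """Return the sorted list of categories whose rule matches the sentence."""
    return sorted(label for label, pattern in RULES.items() if pattern.search(sentence))

print(classify("Patients received 50 mg daily and reported mild nausea."))
```

Every new category or phrasing needs another hand-written pattern, which is why the approach only scales to small, well-understood rule bases.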
ML-based algorithms use statistical methods to learn from large training datasets. In the classic setting, these algorithms learn from pre-labeled examples to understand the relationships between different parts of texts and make the connections between specific inputs and required outputs.
Based on their approach to learning, ML-based methods can be further classified under supervised, unsupervised and self-supervised NLP.
Supervised NLP models are trained using well-labeled, or tagged, data. These models learn to map the function between known data inputs and outputs and then use this to predict the best output that corresponds to new incoming data. Supervised NLP works best with large volumes of readily available labelled data. However, building, deploying, and maintaining these models requires a lot of time and technical expertise.
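The supervised setting can be sketched with a tiny multinomial Naive Bayes text classifier written in plain Python. Naive Bayes is only one possible choice of supervised model, and the labelled sentences and label names below are toy data invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothed log likelihood
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labelled data, invented for illustration only.
TRAINING = [
    ("compound inhibits kinase activity", "efficacy"),
    ("drug reduced tumour growth", "efficacy"),
    ("patients reported severe nausea", "safety"),
    ("treatment caused liver toxicity", "safety"),
]

model = train(TRAINING)
print(predict(model, "the compound reduced tumour size"))
```

The model learns the input-to-label mapping from the tagged examples alone, so its quality depends directly on how much labelled data is available.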
Unsupervised NLP is a more advanced and computationally complex approach to analyzing, clustering and discovering patterns in unlabeled data without the need for any manual intervention. It enables the extraction of value from the large body of unlabeled text that makes up most available data, and it can be especially important for common NLP tasks like PoS tagging or syntactic parsing. However, unsupervised NLP methods cannot be used for tasks like classification without substantial retraining with annotated data.
Self-supervised learning is still a relatively new concept, but it has already had a significant impact on NLP. In this technique, part of an input dataset is concealed, and self-supervised learning algorithms then analyse the visible part to create the rules that will enable them to predict the hidden data. This process, also known as predictive or pretext learning, auto-generates the labels required for the system to learn, thereby converting an unsupervised problem into a supervised problem. A key distinction between unsupervised and self-supervised learning is that in the former the focus is on the model rather than on the data, while in the latter it is the other way around.
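The label auto-generation step can be sketched in a few lines of Python. This example hides one token at a time and uses the hidden token as the training label; the `[MASK]` placeholder and the function name are assumptions chosen for illustration.

```python
MASK = "[MASK]"  # placeholder token, an assumption for this sketch

def masked_pairs(sentence):
    """Turn one unlabelled sentence into (masked_input, hidden_token) pairs."""
    tokens = sentence.split()
    pairs = []
    for i, token in enumerate(tokens):
        # Conceal the token at position i; the concealed token is the label.
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        pairs.append((" ".join(masked), token))
    return pairs

for masked_input, label in masked_pairs("aspirin inhibits COX"):
    print(masked_input, "->", label)
```

No human annotation is involved: the raw text supplies both the inputs and the labels, which is what turns the unsupervised problem into a supervised one.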
In recent times, ML-based approaches have evolved into the deep learning age of NLP, driven by the explosion in digital text, increased processing power in the form of GPUs and TPUs, and improved activation functions for neural networks. As a result, deep learning (DL) has become the dominant approach for a variety of NLP tasks. Today, there is a lot of focus on developing DL techniques for NLP tasks that are best expressed with a graph structure. One of the biggest breakthroughs in NLP in recent times has been the transformer, a deep learning model that leverages attention mechanisms to reinvent textual analytics. DL may not be the most efficient or effective solution for simple NLP tasks, but it has produced some groundbreaking results in named entity recognition, document classification and sentiment analysis.
With hybrid NLP, the focus is on combining the best of rule- and ML-based approaches without having to compromise between the advantages and drawbacks of each. A hybrid system could integrate a machine-learning root classifier with a rules-based system, with rules added to the latter for tags that have been incorrectly modelled by the former. Techniques like self-supervised learning can help reduce the human effort required for building models, and the effort saved can in turn be channelled into creating more scalable and accurate solutions. Combining top-down, symbolic, structured knowledge-based approaches with bottom-up, data-driven neural models will enable organizations to optimize resource usage, increase the flexibility of their models and accelerate time to insight.
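One way to picture such a hybrid system is the Python sketch below, in which hand-written rules correct tags that the statistical model is known to get wrong. The `ml_predict` function is a hypothetical stand-in for a trained classifier, and the override rule is invented for illustration.

```python
import re

def ml_predict(text):
    """Hypothetical stand-in for a trained ML classifier; returns a label."""
    lowered = text.lower()
    if "toxicity" in lowered or "nausea" in lowered:
        return "safety"
    return "efficacy"

# Hand-written correction rules for tags the model is assumed to mislabel.
OVERRIDE_RULES = [
    (re.compile(r"\bcontraindicated\b", re.IGNORECASE), "safety"),
]

def hybrid_predict(text):
    """Apply override rules first; otherwise fall back to the ML prediction."""
    for pattern, label in OVERRIDE_RULES:
        if pattern.search(text):
            return label, "rule"
    return ml_predict(text), "model"

print(hybrid_predict("The drug is contraindicated in pregnancy."))
print(hybrid_predict("The compound reduced tumour growth."))
```

Returning the source of each decision ("rule" or "model") keeps the combined system auditable, since a reviewer can see which component produced a given tag.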
Hybrid NLP with BioStrand Lensai
The BioStrand Lensai Platform is not designed as a monolithic solution built around a single technique. Instead, it is a careful amalgamation of different knowledge-based and neural models integrated seamlessly within a single pipeline. The core design philosophy is to provide researchers with access to the NLP components, techniques and models that are most relevant to the objectives and outcomes of their project.
For instance, we took a rules-based approach to semantic parsing. The semantic rules that have been encoded into the engine are based on linguistics and have been refined over a period of more than 10 years. This kind of algorithm behaves much like a conventional algorithm: it requires no training and is easily understood by human beings. For gene enrichment, our technology uses standard statistical methods. The query-based graph extractors simply transform the data stored in relational tables into a graph format.
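As a rough illustration of the last point, the Python sketch below turns rows from a relational table into an adjacency-list graph. It is a generic example under stated assumptions, not the Lensai implementation: the table layout, the relation name and the function name are all hypothetical.

```python
# Hypothetical relational rows: (source entity, relation, target entity).
ROWS = [
    ("BRCA1", "associated_with", "breast cancer"),
    ("TP53", "associated_with", "breast cancer"),
    ("TP53", "associated_with", "lung cancer"),
]

def rows_to_graph(rows):
    """Build an adjacency list: node -> list of (relation, neighbour) edges."""
    graph = {}
    for source, relation, target in rows:
        graph.setdefault(source, []).append((relation, target))
        graph.setdefault(target, [])  # make sure the target exists as a node
    return graph

graph = rows_to_graph(ROWS)
for node, edges in graph.items():
    print(node, edges)
```

Each row becomes one directed, labelled edge, so no information is lost in the conversion, and the result can be traversed with ordinary graph algorithms.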
Lensai, therefore, is not a technology- or technique-based approach to biomedical NLP. It is an outcome-based hybrid NLP model designed to maximize analytical productivity by automatically mapping the best NLP technologies and techniques to the task and objectives at hand.