Natural Language Understanding (NLU) - Basics and Applications in Bioinformatics
Natural language understanding (NLU) is an AI-powered technology that allows machines to understand the structure and meaning of human languages.
NLU, like natural language generation (NLG), is a subset of Natural Language Processing (NLP) that focuses on assigning structure, rules, and logic to human language so machines can understand the intended meaning of words, phrases, and sentences in text. NLG, on the other hand, deals with generating realistic written/spoken human-understandable information from structured and unstructured data.
Since the development of NLU is based on theoretical linguistics, the process can be explained in terms of the following linguistic levels of language comprehension.
Linguistic levels in NLU
Phonology is the study of sound patterns in different languages/dialects, and in NLU it refers to the analysis of how sounds are organized, and their purpose and behavior.
Lexical or morphological analysis examines morphemes, the smallest indivisible units of language that carry their own meaning, one at a time. A lexical morpheme, that is, an indivisible word with its own meaning (e.g., work), can be combined with a plural morpheme (e.g., works) or with grammatical morphemes (e.g., worked, working) to create word forms. Lexical analysis identifies relationships between morphemes and converts words into their root form.
Syntactic analysis, or syntax analysis, is the process of applying grammatical rules to word clusters and organizing them on the basis of their syntactic relationships in order to determine meaning. This also involves detecting grammatical errors in sentences.
While syntactic analysis involves extracting meaning from the grammatical syntax of a sentence, semantic analysis looks at the context and purpose of the text. It helps capture the true meaning of a piece of text by identifying text elements as well as their grammatical role.
Discourse analysis expands the focus from sentence-length units to look at the relationships between sentences and their impact on overall meaning. Discourse refers to coherent groups of sentences that contribute to the topic under discussion.
Pragmatic analysis deals with aspects of meaning not reflected in syntactic or semantic relationships. Here the focus is on identifying the meaning intended for readers by analyzing literal and non-literal components against the context of background knowledge.
Common tasks/techniques in NLU
There are several techniques that are used in the processing and understanding of human language. Here’s a quick run-through of some of the key techniques used in NLU and NLP.
Tokenization is the process of breaking down a string of text into smaller units called tokens. For instance, a text document could be tokenized into sentences, phrases, words, subwords, or characters. This is a critical preprocessing task: once text is split into tokens, each token can be mapped to a numerical representation for further analysis.
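As a rough illustration of the idea, the sketch below splits text into sentences and words with regular expressions and then maps each distinct token to an integer id. The function names, the regular expressions, and the sample text are assumptions made for this example; production tokenizers handle abbreviations, subwords, and many other cases that this sketch ignores.

```python
import re

def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Return words (with optional internal hyphens/apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

def build_vocab(tokens):
    """Give each distinct lowercased token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok.lower(), len(vocab))
    return vocab

# Illustrative input, invented for this example
text = "BRCA1 repairs DNA. Damaged DNA raises cancer risk!"
sentences = split_sentences(text)
words = [w for s in sentences for w in split_words(s)]
vocab = build_vocab(words)
ids = [vocab[w.lower()] for w in words]  # the numerical form of the text
```

Note that the repeated word "DNA" receives the same id both times, which is what lets later steps count or compare tokens.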
Stemming and lemmatization are two different approaches with the same objective: to reduce a particular word to its root word. In stemming, characters are removed from the end of a word to arrive at the “stem” of that word. Algorithms determine the number of characters to be eliminated for different words even though they do not explicitly know the meaning of those words. Lemmatization is a more sophisticated approach that uses complex morphological analysis to arrive at the root word, or lemma.
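The difference between the two approaches can be seen in a toy example. The suffix list and the lookup table below are assumptions invented for illustration; real stemmers (such as the Porter stemmer) apply many more rules, and real lemmatizers rely on full morphological dictionaries rather than a hand-written table.

```python
# Toy stemmer: strip the first matching suffix, knowing nothing about word meaning.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set

def simple_stem(word):
    w = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# Toy lemmatizer: look the word up in a (hypothetical) morphological table.
LEMMA_TABLE = {
    "works": "work", "worked": "work", "working": "work",
    "studies": "study", "mice": "mouse", "better": "good",
}

def simple_lemmatize(word):
    w = word.lower()
    return LEMMA_TABLE.get(w, w)

for word in ["working", "studies", "mice"]:
    print(word, simple_stem(word), simple_lemmatize(word))
```

The stemmer turns "studies" into the non-word "studi" and leaves "mice" untouched, while the lemmatizer returns "study" and "mouse". That is the trade-off described above: stemming is cheap but blind to meaning, lemmatization needs linguistic knowledge.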
Parsing is the process of extracting the syntactic information of a sentence based on the rules of formal grammar. Based on the type of grammar applied, the process can be classified broadly into constituency and dependency parsing. Constituency parsing, based on context-free grammar, involves dividing a sentence into sub-phrases, or constituents, that belong to a specific grammar category, such as noun phrases or verb phrases. Dependency parsing defines the syntax of a sentence not in terms of constituents but in terms of the dependencies between the words in a sentence. The relationship between words is depicted as a dependency tree where words are represented as nodes and the dependencies between them as edges.
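To make the nodes-and-edges picture concrete, here is a minimal sketch of how a dependency tree might be stored and printed. It does not parse anything: the arcs for the sample sentence are written by hand, and the labels (det, nsubj, obj) follow the Universal Dependencies naming convention as an assumption of this example.

```python
from dataclasses import dataclass

ROOT = -1  # sentinel head index marking the root of the sentence

@dataclass(frozen=True)
class Arc:
    head: int   # index of the governing word (or ROOT)
    dep: int    # index of the dependent word
    label: str  # type of dependency

# Hand-annotated example sentence (not the output of a parser)
tokens = ["The", "protein", "binds", "DNA"]
arcs = [
    Arc(head=1, dep=0, label="det"),      # The     <- protein
    Arc(head=2, dep=1, label="nsubj"),    # protein <- binds
    Arc(head=ROOT, dep=2, label="root"),  # binds is the root
    Arc(head=2, dep=3, label="obj"),      # DNA     <- binds
]

def dependents(arcs, head):
    """All (dependent index, label) pairs attached to the given head."""
    return [(a.dep, a.label) for a in arcs if a.head == head]

def render(tokens, arcs, head=ROOT, depth=0):
    """Walk the tree from the root and return one indented line per word."""
    lines = []
    for dep, label in dependents(arcs, head):
        lines.append("  " * depth + f"{tokens[dep]} ({label})")
        lines.extend(render(tokens, arcs, dep, depth + 1))
    return lines

tree_lines = render(tokens, arcs)
print("\n".join(tree_lines))
```

Each word appears exactly once as a dependent, so the arcs form a tree rooted at the main verb "binds".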
Part-of-speech (POS) tagging, or grammatical tagging, is the process of assigning a grammatical classification, like noun, verb, adjective, etc., to words in a sentence. Automatic tagging can be broadly classified as rule-based, transformation-based, and stochastic POS tagging. Rule-based tagging uses a dictionary, as well as a small set of rules derived from the formal syntax of the language, to assign POS. Transformation-based tagging, or Brill tagging, leverages transformation-based learning for automatic tagging. Stochastic refers to any model that uses frequency or probability, e.g. word frequency or tag sequence probability, for automatic POS tagging.
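The stochastic family can be illustrated with its simplest member, a most-frequent-tag (unigram) baseline: each word receives the tag it carried most often in training data. The three-sentence training set, the tag names, and the default tag for unseen words are all assumptions invented for this sketch; practical taggers also model tag sequence probabilities and train on large annotated corpora.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration only
TRAIN = [
    [("the", "DET"), ("gene", "NOUN"), ("encodes", "VERB"), ("a", "DET"), ("protein", "NOUN")],
    [("the", "DET"), ("protein", "NOUN"), ("binds", "VERB"), ("dna", "NOUN")],
    [("cells", "NOUN"), ("express", "VERB"), ("the", "DET"), ("gene", "NOUN")],
]

def train_unigram_tagger(sentences):
    """Count how often each word carries each tag and keep the most frequent one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NOUN"):
    """Tag each word with its most frequent training tag; unseen words get the default."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_unigram_tagger(TRAIN)
tagged = tag_words(["The", "gene", "binds", "RNA"], model)
print(tagged)
```

The word "RNA" never appears in the training data, so it falls back to the default tag. Handling unseen and ambiguous words well is where sequence-aware models improve on this baseline.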
Named Entity Recognition (NER) is an NLP subtask that is used to detect, extract, and categorize named entities, such as names, organizations, locations, themes, topics, and monetary values, from large volumes of unstructured data. There are several approaches to NER, including rule-based systems, statistical models, dictionary-based systems, ML-based systems, and hybrid models.
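A small sketch can show how the dictionary-based and rule-based approaches combine into a simple hybrid. The gazetteer entries, the money pattern, and the sample sentence are assumptions chosen for illustration; statistical and ML-based systems learn such patterns from annotated data instead of having them written by hand.

```python
import re

# Dictionary-based component: a (hypothetical) gazetteer of known names
GAZETTEER = {"Pfizer": "ORG", "Boston": "LOC"}

# Rule-based component: a simple pattern for dollar amounts
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?")

def find_entities(text):
    """Return (entity text, label) pairs in the order they appear in the text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for match in MONEY.finditer(text):
        found.append((match.start(), match.group(), "MONEY"))
    return [(entity, label) for _, entity, label in sorted(found)]

# Invented sentence, used only to exercise the code
entities = find_entities("Pfizer opened a lab in Boston for $25 million.")
print(entities)
```

The gazetteer can only find names it already contains, and the pattern only finds amounts written in one style, which is why hybrid and learned models are used when coverage matters.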
These are just a few examples of some of the most common techniques used in NLU. Several other techniques, such as word sense disambiguation, semantic role labeling, and semantic parsing, focus on different levels of semantic abstraction.
NLP/NLU in biomedical research
NLP/NLU technologies are a strategic fit for biomedical research, with its vast volumes of unstructured data: 3,000-5,000 papers published each day, clinical text data from EHRs, diagnostic reports, medical notes, and lab data, as well as non-standardized digital real-world data.
NLP-enabled text mining has emerged as an effective and scalable solution for extracting biomedical entity relations from vast volumes of scientific literature. Techniques like named entity recognition (NER) are widely used in relation extraction tasks in biomedical research, with conventional named entities, such as names, organizations, and locations, replaced by gene sequences, proteins, biological processes and pathways, drug targets, and so on.
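One of the simplest baselines for this kind of relation extraction is sentence-level co-occurrence: tag biomedical entities with a lexicon, then count how often two entities of different types appear in the same sentence. The sketch below assumes a tiny hand-made lexicon and three invented sentences; it is a baseline for illustration, not the method of any particular BioNLP framework, and co-occurrence alone says nothing about what kind of relation holds.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon mapping lowercase terms to entity types
LEXICON = {
    "brca1": "GENE",
    "tp53": "GENE",
    "olaparib": "DRUG",
    "breast cancer": "DISEASE",
}

def tag_entities(sentence):
    """Return (term, type) pairs found in the sentence, in order of appearance."""
    lowered = sentence.lower()
    hits = []
    for term, etype in LEXICON.items():
        match = re.search(r"\b" + re.escape(term) + r"\b", lowered)
        if match:
            hits.append((match.start(), term, etype))
    return [(term, etype) for _, term, etype in sorted(hits)]

def cooccurrence_relations(sentences):
    """Count pairs of differently typed entities that share a sentence."""
    pairs = Counter()
    for sentence in sentences:
        entities = tag_entities(sentence)
        for (a, type_a), (b, type_b) in combinations(entities, 2):
            if type_a != type_b:
                pairs[tuple(sorted((a, b)))] += 1
    return pairs

# Invented example sentences, not taken from any real corpus
sentences = [
    "Olaparib inhibits PARP in BRCA1-mutated tumours.",
    "BRCA1 mutations raise breast cancer risk.",
    "TP53 is frequently mutated.",
]
relations = cooccurrence_relations(sentences)
print(relations)
```

Pairs that co-occur often across a large corpus become candidate relations, which downstream models or curators can then classify and verify.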
The unique vocabulary of biomedical research has necessitated the development of specialized, domain-specific BioNLP frameworks. At the same time, the capabilities of NLU algorithms have been extended to the language of proteins and that of chemistry and biology itself. A 2021 article detailed the conceptual similarities between proteins and language that make them ideal for NLP analysis. More recently, an NLP model was trained to correlate amino acid sequences from the UniProt database with English language words, phrases, and sentences used to describe protein function to annotate over 40 million proteins. Researchers have also developed an interpretable and generalizable drug-target interaction model inspired by sentence classification techniques to extract relational information from drug-target biochemical sentences.
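One way to see why protein sequences lend themselves to NLP methods is to treat a sequence as text and cut it into overlapping k-mers that play the role of words. This is only one possible tokenization and an assumption of this sketch, not necessarily the approach taken in the work mentioned above; the short peptide string is arbitrary and chosen purely for illustration.

```python
from collections import Counter

def kmer_tokens(sequence, k=3):
    """Split an amino acid sequence into overlapping k-mers, the 'words' of the sequence."""
    sequence = sequence.upper()
    return [sequence[i : i + k] for i in range(len(sequence) - k + 1)]

# Arbitrary ten-residue string in one-letter amino acid code (illustrative only)
peptide = "MKTAYIAKQR"
tokens = kmer_tokens(peptide, k=3)
counts = Counter(tokens)  # a bag-of-k-mers, analogous to a bag-of-words

print(tokens)
```

Once a sequence is expressed as tokens, the same machinery used for text, such as vocabularies, embeddings, and language models, can in principle be applied to it.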
Large neural language models and transformer-based language models are opening up transformative opportunities for biomedical NLP applications across a range of bioinformatics fields including sequence analysis, genome analysis, multi-omics, spatial transcriptomics, and drug discovery.
Most importantly, NLP technologies have helped unlock the latent value in huge volumes of unstructured data to enable more integrative, systems-level biomedical research.