The implications of the COVID-19 crisis for infectious disease analysis
In 2020, we have seen clearly how capable disease identification has become. Multiple tools and technologies enable us to quickly identify the building blocks of viruses, and even in the early stages of a global pandemic we were able to quickly determine the genetic code of the coronavirus causing COVID-19. This first step of identification is critical in the chain of events that follows. Only once we know the exact nucleotide sequence can we begin to create effective diagnostic tests or develop vaccines and treatments.
Gathering data, then, is not the bottleneck when it comes to addressing disease. Globally accessible databases enable researchers to quickly share and collaborate on existing data, meaning that as soon as any data is available, the whole world can take a look at it. When starting research on these data, however, we are almost instantly faced with a major problem: it is very hard to quickly compare a genetic sequence with all the other sequences in these databases, and with their related metadata.
Why data analysis is a bottleneck for infectious disease researchers
Today, once a virus is sequenced, it takes too much time to conduct all the downstream analyses, and most importantly, valuable results can be missed due to lack of accuracy. The algorithms used to detect variations are approximations, and their capability of processing vast amounts of data is generally insufficient.
By comparing a new virus to known data, we can identify characteristics already known from other viruses, such as certain functions. If we can quickly identify potentially dangerous characteristics, we can enable quicker and more coordinated global responses to these threats.
Based on similarities, we can determine which ancestors the virus originated from. This is crucial for checking whether treatments or cures that were successful in prior outbreaks can be replicated for the current one. But even more importantly, comparative analysis allows us to see which parts of the new virus have mutated and which have remained stable. Being able to compare massive amounts of data in a robust way and cluster them intelligently is key.
If the target regions for prior treatments and diagnostics have changed, new diagnostic and treatment targets that are specific to the new virus have to be discovered. Fast and accurate detection of mutations is also necessary to track and trace the spread of an epidemic. Mutations serve as fingerprint patterns that can be used to follow viruses within neighbourhoods and across the globe. Providing fast and accurate analyses is therefore also a key factor in controlling outbreaks.
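To make the idea of mutations as fingerprints concrete, here is a minimal sketch in Python. It assumes the sequences are already aligned, have equal length, and differ only by substitutions; real genomes also contain insertions and deletions and need a proper alignment step first. The function names and the toy sequences are illustrative assumptions, not real viral data or an established tool.

```python
from typing import List, Tuple

# (1-based position, reference base, sample base)
Mutation = Tuple[int, str, str]


def point_mutations(reference: str, sample: str) -> List[Mutation]:
    """List substitutions between two pre-aligned, equal-length sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


def shared_fingerprint(a: List[Mutation], b: List[Mutation]) -> List[Mutation]:
    """Mutations carried by both samples, which hints at a common lineage."""
    return sorted(set(a) & set(b))


# Toy sequences (hypothetical, for illustration only).
reference = "ATGGCGTACGTTAGC"
sample_a = "ATGGCATACGTTAGC"
sample_b = "ATGGCATACGTCAGC"

mutations_a = point_mutations(reference, sample_a)  # [(6, 'G', 'A')]
mutations_b = point_mutations(reference, sample_b)  # [(6, 'G', 'A'), (12, 'T', 'C')]
common = shared_fingerprint(mutations_a, mutations_b)  # [(6, 'G', 'A')]
```

In this toy case both samples share the substitution at position 6, while the second sample carries an additional one at position 12, which is the kind of pattern that could suggest it descends from the first.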
Running through this process of discovery and comparison only once is not enough. It must become a continuous process to detect how a virus is mutating. We need this continuously updated information to adequately track, trace, and control outbreaks, and to determine whether the cures and diagnostics remain useful. And importantly, we must be able to quickly change the vaccine development process based on essential mutations that are uncovered along the way.
This is where we run into issues with our current systems: when processing is slow and database comparison can take months, how can we adapt our diagnostics and vaccine development so that they are as accurate and effective as possible in treating disease? While generating the genetic data about a disease is relatively simple, we do not have the means to conduct proper analyses in a way that is good enough in the face of a widespread disease such as COVID-19.
How we can change genetic data analysis for future events, after COVID-19
What we need is a continuous, general virus monitoring process. As further pandemics are likely, detecting potentially ‘malignant’ and dangerous mutations of known viruses at an early stage would help with prevention and treatment should there be a subsequent outbreak.
To close the gaps and gain efficiency in the development of diagnostic tests and treatments of infectious diseases, scalable solutions are needed to process all the data available in a fast and accurate way.
The current genomic data analysis algorithms all take shortcuts, such as heuristics, to make the calculations computationally feasible. But it is precisely these shortcuts that lead to a lack of accuracy.
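As a toy illustration of this trade-off, the Python sketch below contrasts an exact comparison (edit distance computed by dynamic programming, which takes time proportional to the product of the two sequence lengths) with a cheap seed-style shortcut that only keeps candidates sharing at least one exact k-mer with the query. The shortcut, the function names, and the sequences are assumptions chosen for illustration; they do not reproduce the algorithm of any particular tool.

```python
from typing import Set


def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_seed(a: str, b: str, k: int) -> bool:
    """Heuristic filter: True if the sequences share at least one exact k-mer."""
    return not kmers(a, k).isdisjoint(kmers(b, k))


# Hypothetical toy sequences.
query = "AAAAAAAA"
candidate = "AACAACAA"

passes_filter = shares_seed(query, candidate, k=3)  # False: no 3-mer in common
true_distance = edit_distance(query, candidate)     # 2: only two substitutions apart
```

Here the shortcut discards a candidate that is only two substitutions away from the query, because the substitutions happen to break every 3-mer. The exact method finds the relationship, but at a cost that grows quickly with sequence length and database size.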
To make real progress, we desperately need to integrate knowledge in order to reduce these shortcuts. In other words, we must be able to merge accurate data analysis results with knowledge about disease mechanisms and pathways, chemical data, protein structure prediction, and more, in an intelligent and comprehensive way.
There is an urgent need for an approach that can finally handle the critical computational processes needed to extract knowledge from global databases without having to wait months.
As we navigate COVID-19 and look to develop a vaccine and potential treatments, plus more accurate diagnostics, we are made aware of how woefully ill-equipped our current disease analysis systems are.
While we are able to quickly identify the genetic code of a disease-causing virus, the downstream analyses and knowledge extraction need to improve in both speed of execution and accuracy. Only then will researchers be able to gain precious time and make fundamental progress in developing diagnostics, vaccines, treatments, and more.
This bottleneck in analysis must be addressed if we are to handle future disease outbreaks and global pandemics in a methodical, effective, and efficient way.