Paper Summary - Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset

Paper Summary of covidex.ai
information-retrieval
deep-learning
papers
Published: May 25, 2020


Overview of covidex.ai

This is a summary of a paper on rapidly deploying a neural search engine to answer questions over the COVID-19 Open Research Dataset (CORD-19). From the covidex.ai site:

“Neural Covidex applies state-of-the-art neural network models and artificial intelligence (AI) techniques to answer questions using the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI (data release of April 3, 2020). This project is led by Jimmy Lin from the University of Waterloo and Kyunghyun Cho from NYU, with a small team of wonderful students: Edwin Zhang and Nikhil Gupta. Special thanks to Colin Raffel for his help in pretraining T5 models for the biomedical domain.”


Motivation

The ongoing pandemic poses a huge challenge for public health officials, clinicians, researchers, and virologists who need timely information. To respond to this challenge, the Allen Institute for AI published the COVID-19 Open Research Dataset (CORD-19) in collaboration with other research groups. The dataset is sourced from research articles about the coronavirus as well as other related research articles. The aim of this effort is to enable researchers:

  • to apply language processing techniques to generate insights and make data-driven decisions.
  • to give front-line workers a digestible way to consume recent developments and apply them in the field.

Outcomes

Jimmy Lin and his research team responded to this call. The two strategies adopted were:

  • Real-time users should be able to find answers to questions associated with COVID-19.
  • Other researchers should be able to reuse the components the team builds; providing modular and reusable components was set as part of the requirements.

The team decided to build an end-to-end, real-time search application called covidex.ai. In a short span of time, they developed the components that power this engine to meet the information-retrieval need:

  • A keyword-based search interface: this also provides faceted navigation in the form of filters (author, article source, time range) and highlights the words in the results that match the user's query.
  • A neural ranking component that sorts the results so that the top results answer the user's question. (A minimal sketch of this two-stage flow follows this list.)
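
To make the two-stage design concrete, here is a minimal sketch of the first stage using Pyserini, the Python interface to the team's Anserini toolkit. The prebuilt index name below is an assumption for illustration; the actual Covidex index and settings differ.

```python
# First-stage keyword retrieval with Pyserini (Python front end to
# Anserini). The index name 'cord19-abstract' is an assumption.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('cord19-abstract')

query = 'what is the incubation period of covid-19?'
hits = searcher.search(query, k=50)  # top-50 BM25 candidates

for hit in hits[:5]:
    print(f'{hit.docid}\t{hit.score:.3f}')

# These top-k candidates are handed to the neural reranker
# described in the next section.
```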

Neural Covidex for reranking

The research group had already been working on neural architectures, specifically applying transfer learning to retrieval and ranking problems:

  • BERT for query-based passage ranking: applying transfer learning for passage reranking with a BERT model trained on the MS MARCO dataset.
  • BERTserini for retrieval-based question answering: incorporating the Anserini retriever to fetch the top-k text segments, followed by a pre-trained BERT model that extracts the answer span (a sketch follows this list).
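
As a rough sketch of this retrieve-then-read pattern, assuming a prebuilt Pyserini index and a SQuAD-tuned reader from Hugging Face (both names are illustrative assumptions, and BERTserini proper also interpolates retriever and reader scores):

```python
# BERTserini-style pipeline sketch: retrieve top-k segments with
# Pyserini/Anserini, then extract an answer span with a QA reader.
# Index and model names are assumptions for illustration.
from pyserini.search import SimpleSearcher
from transformers import pipeline

searcher = SimpleSearcher.from_prebuilt_index('cord19-abstract')
qa = pipeline('question-answering',
              model='distilbert-base-cased-distilled-squad')

question = 'How long can the coronavirus survive on surfaces?'
hits = searcher.search(question, k=10)

candidates = []
for hit in hits:
    context = searcher.doc(hit.docid).contents()
    span = qa(question=question, context=context)
    # BERTserini interpolates retriever and reader scores;
    # for simplicity we keep only the reader score here.
    candidates.append((span['answer'], span['score']))

print(max(candidates, key=lambda c: c[1]))
```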

Typically, reranking turns the problem into a classification task: take a (query, candidate_document) pair and predict the target as (relevant, not-relevant). To avoid the costly operation of running this classification over the entire corpus, it is applied only at the reranking stage: the engine takes the top-k documents from the previous retrieval stage and reranks them with a machine learning model. For this stage, the team leveraged a sequence-to-sequence model for reranking (Nogueira et al., 2020).
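
To illustrate the classification framing (note this is a BERT-style cross-encoder, not the sequence-to-sequence model Covidex actually used), here is a hedged sketch; the checkpoint name is an assumption:

```python
# Reranking as classification: score each (query, document) pair
# with a cross-encoder and sort by P(relevant). The checkpoint
# name is an assumption (a monoBERT model fine-tuned on MS MARCO).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = 'castorini/monobert-large-msmarco'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def rerank(query, candidates):
    scores = []
    for doc in candidates:
        inputs = tokenizer(query, doc, truncation=True,
                           max_length=512, return_tensors='pt')
        with torch.no_grad():
            logits = model(**inputs).logits  # shape (1, 2)
        # probability of the 'relevant' class (index 1 by convention here)
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    # highest relevance probability first
    return sorted(zip(candidates, scores), key=lambda p: -p[1])
```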

Stages involved in training the reranking model using transfer learning

Pre-training → Fine-tuning → Inference

  • [Pre-training] A Transformer-based sequence-to-sequence language model (T5) is pre-trained with a self-supervised objective on a large unlabeled corpus.
  • [Fine-tuning] Given a query q and a document d, the model is fine-tuned on the MS MARCO dataset to produce either "true" or "false" as the target, indicating relevance.
  • [Inference] At reranking time, for each candidate document, the model predicts the probability distribution over ("true", "false"), and candidates are sorted by the probability of "true" alone.

In the usual transfer-learning methodology, a language model is trained and its encoder is then reused for downstream tasks such as classification. Applying a sequence-to-sequence model (based on T5) to the document ranking setting, by contrast, was quite new (Nogueira et al., 2020).

The reasoning provided is that the predicted target words can capture relatedness learned during pre-training. T5 is an encoder-decoder architecture trained with a similar masked language modeling objective. Given a query and a document, the model is fine-tuned to produce "true" if the document is relevant to the query and "false" otherwise.
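
A minimal sketch of this true/false scoring, assuming a T5 checkpoint fine-tuned on MS MARCO in this prompt format (the model name is an assumption, and the Covidex team additionally adapted T5 to the biomedical domain):

```python
# Sequence-to-sequence reranking in the style of Nogueira et al.
# (2020): prompt T5 with the query/document pair and score by the
# probability of generating 'true'. Checkpoint name is an assumption.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL = 'castorini/monot5-base-msmarco'
tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5ForConditionalGeneration.from_pretrained(MODEL)
model.eval()

# token ids for the target words 'true' and 'false'
TRUE_ID = tokenizer.encode('true')[0]
FALSE_ID = tokenizer.encode('false')[0]

def relevance_score(query, document):
    text = f'Query: {query} Document: {document} Relevant:'
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors='pt')
    # ask the decoder for its first output token only
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits
    # softmax over just the true/false logits; score = P(true)
    pair = logits[0, 0, [TRUE_ID, FALSE_ID]]
    return torch.softmax(pair, dim=0)[0].item()
```

Candidates from the keyword stage are then sorted by this score in descending order, matching the inference step above.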

Challenges in Results Evaluation

  • The authors rightfully mention that the individual components comprising such a system are evaluated against various test datasets. But since the system is built on an evolving dataset like CORD-19, no such existing test collections are available.

  • Ranking quality is not necessarily the most important property of such an end-to-end system. We have to switch to outcome-based measures rather than single-output, batch-retrieval measures (MRR, nDCG). For example: "Did the researchers and practitioners get their questions answered?" How many of them are not finding answers? So involving humans in the loop to qualitatively evaluate the results is essential to know whether the system is really contributing to the efforts fighting the pandemic.

  • What if exploratory users do not know the right keywords to use? In that case, ranking is the wrong goal to pursue.

  • A current challenge is that all the targeted users are working on the front line, and it is hard for them to provide qualitative feedback about the search experience. So the authors ask for more hallway usability testing to gather insights from users.

Author Reflections

An end-to-end system like Covidex would not be possible without the power of the current open-source software ecosystem, the open culture of curating, cleaning, and sharing data with the community (thanks to CORD-19 by the Allen Institute for AI), and pre-trained language models together with relevance datasets like MS MARCO.

Good software engineering practices are the foundation for a team and ensure that the underlying software components can be replicated and reused to provide such a system. This is essential for rapidly exploring and experimenting with new ideas.

Building a strong research culture that releases results as open-source software artifacts helps the community reproduce the results and build on top of them.

The paper is also a reminder of the mismatch between producing research code for a conference and building a system for real users. For example, concerns like the latency of search requests, throughput across many users, deploying and managing a system in production, and user experience do not arise in a research setting.

Insights & Takeaways from this paper

  • Lack of proper training data and human annotators is a common challenge. Leveraging models trained on MS MARCO is critical for ranking tasks in this type of situation.
  • An experimentation mindset needs to be adopted, and one needs interactive computing tools like Pyserini to experiment with the search index. This allows search practitioners to constantly iterate and learn from these experiments (a small example follows this list).
  • The adoption of openness, not only in source code but in science, data, and the way we work, is truly inspiring.
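
For instance, here is a hedged sketch of the kind of interactive index exploration Pyserini enables, say inside a notebook (the prebuilt index name is again an assumption):

```python
# Interactive experimentation with a search index via Pyserini,
# e.g. inside a Jupyter notebook. Index name is an assumption.
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('cord19-abstract')

# tweak BM25 parameters and re-run a query to see how results shift
searcher.set_bm25(k1=0.9, b=0.4)
hits = searcher.search('covid-19 incubation period')

# inspect the raw document behind the top hit to sanity-check the index
print(searcher.doc(hits[0].docid).raw()[:300])
```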