Large Scale Biomedical Literature Mining in the Cloud

Ahmad Pahlavan Tafti
University of Wisconsin

Ahmad Pahlavan Tafti, Max He
Center for Human Genetics

Research area:  Genetics, Biomedical Informatics

Background: Literature mining is a specialized data mining method that extracts information (e.g., facts or data) from text, such as scientific literature. It can generate new hypotheses by systematically scrutinizing large numbers of abstracts or full-text versions of scientific articles. This project focused on developing practical data mining methods for extracting information from published biomedical articles, and on developing prediction models to classify the extracted information.

Methods: We used natural language processing (NLP), machine learning strategies, and Big Data infrastructure to design and develop a scalable framework for extracting information (e.g., the cancer type discussed: breast, prostate, and/or lung cancer) from published scientific articles. First, we converted the articles into a format suitable for the machine learning algorithms. The bag-of-words vectors and the term frequency-inverse document frequency (TF-IDF) scores were then used to train a classifier. We employed three classification algorithms, Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression, to build prediction models from 15,983 abstracts and 11,017 full-text articles published between 2012 and 2014 in PubMed Central. The proposed framework was developed on a Big Data infrastructure comprising an Apache Hadoop cluster together with Apache Spark and an Apache Cassandra database. The program was implemented in Java SE 8.
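As an illustration of the feature-extraction step described above, the following is a minimal sketch of bag-of-words counting and TF-IDF weighting in plain Java SE 8. It is not the framework's actual code: the class name TfIdfSketch, the simple tokenizer, the three toy documents, and the particular weighting (tf = count / document length, idf = ln(N / df)) are all assumptions chosen for clarity, and the sketch runs on a single machine rather than on Hadoop or Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical single-machine sketch; not the authors' implementation.
public class TfIdfSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Bag-of-words: raw count of each term in one document.
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    // TF-IDF per document, assuming tf = count / docLength and idf = ln(N / df).
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> bags = new ArrayList<>();
        List<Integer> lengths = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            Map<String, Integer> bag = termCounts(tokens);
            bags.add(bag);
            lengths.add(tokens.size());
            // Each distinct term counts once per document toward df.
            for (String term : bag.keySet()) docFreq.merge(term, 1, Integer::sum);
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (int i = 0; i < bags.size(); i++) {
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : bags.get(i).entrySet()) {
                double tf = (double) e.getValue() / lengths.get(i);
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy documents standing in for article abstracts (illustrative only).
        List<String> docs = Arrays.asList(
            "Breast cancer screening study",
            "Prostate cancer biomarker study",
            "Lung cancer imaging");
        List<Map<String, Double>> vectors = tfIdf(docs);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println(docs.get(i) + " -> " + vectors.get(i));
        }
    }
}
```

In this toy corpus the term "cancer" occurs in every document, so its weight is zero, while terms such as "breast" or "lung" that occur in only one document receive the highest weights. Sparse vectors of this kind are what a classifier such as Naïve Bayes, SVM, or Logistic Regression would take as input.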

Results: On average, the accuracy of predicting a cancer type was 96.6% using abstracts and 94.8% using full-text articles. Processing the 11,017 full-text articles took about 9 minutes on the Big Data platform, compared with 8.5 hours without any Big Data infrastructure.

Conclusions: The accuracy and time efficiency of our proposed framework were promising. Running large-scale biomedical text mining on a Big Data infrastructure substantially speeds up biomedical literature mining. It has the potential to provide solid benefits to the biomedical community and clinical research.