Development of an Oral Cancer Risk Assessment Tool Using Machine Learning Algorithms

Reihaneh Rostami
University of Wisconsin

Reihaneh Rostami1, Harshad Hedge2, Neel A. Shimpi1, Gary Pack1, Amit Acharya1
1Institute for Oral and Systemic Health, 2Biomedical Informatics Research Center

Research area: Institute for Oral and Systemic Health

Background: Oral cancers are often diagnosed at a late stage, so assessing risk in advance could help reduce the high mortality rate. Evaluating that risk requires considering multiple causative factors simultaneously. The literature suggests that machine learning (ML) can be effective for estimating future oral cancer risk by capturing the complex relationships among these factors. We investigated four ML algorithms to develop an oral cancer risk assessment tool.

Methods: The dataset was mined from the Marshfield Clinic data warehouse (1979 to 2015). After preprocessing, the final dataset included 526 cases, 526 controls, and 15 features (etiological factors collected as structured data). Fourteen different models were generated using feature selection and dimensionality reduction. The performance of four ML algorithms, namely Multi-Layer Perceptron (MLP), K-Nearest Neighbors (KNN), Decision Tree (DT), and AdaBoost, was compared using 10-fold cross-validation. A sensitivity analysis was also performed to check the reliability of the results. The classifiers were implemented in Java with the Weka library. A prototype web application was also developed to interact with the implemented tool.
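The 10-fold cross-validation split described above can be illustrated with a minimal plain-Java sketch. It is not the Weka code used in the study: the class name StratifiedFolds, the method assign, the seed, and the choice to stratify by class are all assumptions made for illustration. It only shows how 526 cases and 526 controls could be dealt into ten held-out folds of similar composition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the original tool.
final class StratifiedFolds {

    /**
     * Assigns each sample to one of k folds so that every fold holds a
     * similar share of each class. Labels are assumed to be 0 (control)
     * or 1 (case).
     */
    static int[] assign(int[] labels, int k, long seed) {
        int[] fold = new int[labels.length];
        Random rng = new Random(seed);
        // Handle each class separately so both are spread evenly over the folds.
        for (int cls = 0; cls <= 1; cls++) {
            List<Integer> indices = new ArrayList<>();
            for (int i = 0; i < labels.length; i++) {
                if (labels[i] == cls) {
                    indices.add(i);
                }
            }
            Collections.shuffle(indices, rng);
            // Deal the shuffled indices round-robin into the k folds.
            for (int j = 0; j < indices.size(); j++) {
                fold[indices.get(j)] = j % k;
            }
        }
        return fold;
    }

    public static void main(String[] args) {
        // 526 cases followed by 526 controls, matching the dataset size in the abstract.
        int[] labels = new int[1052];
        for (int i = 0; i < 526; i++) {
            labels[i] = 1;
        }
        int[] fold = assign(labels, 10, 42L); // seed 42 is arbitrary
        for (int f = 0; f < 10; f++) {
            int cases = 0;
            int controls = 0;
            for (int i = 0; i < labels.length; i++) {
                if (fold[i] == f) {
                    if (labels[i] == 1) cases++; else controls++;
                }
            }
            System.out.printf("fold %d held out: %d cases, %d controls%n", f, cases, controls);
        }
    }
}
```

In each of the ten rounds, one fold would serve as the test set and the remaining nine as training data, so every record is used for testing exactly once.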

Results: Accuracy, recall, precision, and specificity for AdaBoost were 77%, 63%, 88%, and 91%, respectively. MLP achieved the same values except for specificity, which was 92%. For DT and KNN, specificity was the highest metric and recall the lowest: 94% and 40% for DT, and 95% and 36% for KNN.
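For readers less familiar with these four metrics, the sketch below shows how each follows from a binary confusion matrix. The counts in it are hypothetical: the abstract does not report a confusion matrix, and the values (331 true positives, 195 false negatives, 479 true negatives, 47 false positives) were chosen only because they sum to 526 cases and 526 controls and round to the AdaBoost percentages above. The class name ConfusionMetrics is likewise an assumption.

```java
// Hypothetical helper, not part of the original tool.
final class ConfusionMetrics {

    /** Fraction of all records classified correctly. */
    static double accuracy(int tp, int fp, int tn, int fn) {
        return (double) (tp + tn) / (tp + fp + tn + fn);
    }

    /** Fraction of true cases that the classifier flags (sensitivity). */
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    /** Fraction of flagged records that are true cases. */
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    /** Fraction of true controls that the classifier clears. */
    static double specificity(int tn, int fp) {
        return (double) tn / (tn + fp);
    }

    public static void main(String[] args) {
        // Illustrative counts only; NOT reported in the study.
        int tp = 331, fn = 195; // 526 cases
        int tn = 479, fp = 47;  // 526 controls
        System.out.printf("accuracy    %.1f%%%n", 100 * accuracy(tp, fp, tn, fn));
        System.out.printf("recall      %.1f%%%n", 100 * recall(tp, fn));
        System.out.printf("precision   %.1f%%%n", 100 * precision(tp, fp));
        System.out.printf("specificity %.1f%%%n", 100 * specificity(tn, fp));
    }
}
```

The pattern of high specificity with low recall means such a classifier rarely flags a control as at risk but misses a sizable share of true cases.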

Conclusion: The MLP outperformed the other classifiers on the 11-feature model. Because the extracted dataset lacked detail on some risk factors, such as tobacco and alcohol use, accuracy and recall remained limited. Mining key features from clinical narrative documents in the Electronic Health Record (EHR) and evaluating them with a broader range of ML algorithms may improve the results.