Machine Learning Application on Prediction of Male Breast Cancer with PLCO Dataset

Authors

  • Juntao Li, Cannon High School
  • Dr. Ganesh Mani (Mentor), Carnegie Mellon University

DOI:

https://doi.org/10.47611/jsrhs.v10i3.2199

Keywords:

Machine Learning, Male Breast Cancer, PLCO

Abstract

This paper examines the applicability of machine learning models to predicting male breast cancer (MBC) using the PLCO (Prostate, Lung, Colorectal, and Ovarian) Cancer Screening Trial dataset. People who are unaware of their risk of breast cancer, as many men are, are unlikely to seek screening in advance. This research therefore uses the PLCO trial dataset from the National Cancer Institute, which includes features such as age, prostate status, and marital status, for detection. The main purpose of using PLCO data is to identify the potential risk of MBC as early as possible from information that is inexpensive and easy to collect. The rarity of MBC is itself what makes it a threat to men who are unaware of the danger. To find the models best suited to detecting MBC from this non-traditional dataset, several existing models, including decision tree, random forest, DBSCAN, and One-Class SVM, among others, were fit to the data. Because the dataset is extremely imbalanced, the models were evaluated using a combination of standard accuracy and the Area Under the Receiver Operating Characteristic curve (AUROC). K-means and logistic regression performed best, with AUC scores of 0.62 and 0.67, respectively. The results suggest that further study will require either more efficient approaches to routine male breast cancer diagnosis or more advanced models and algorithms.
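
The abstract's point about pairing accuracy with AUROC on extremely imbalanced data can be illustrated with a minimal sketch. The cohort below is synthetic and hypothetical (2 positives among 1000 records), not PLCO data, and the helper names `accuracy` and `auroc` are assumptions introduced for this illustration rather than the authors' code. AUROC is computed here in its pairwise (Mann-Whitney) form.

```python
# Illustrative only: synthetic labels and scores, not PLCO data.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy cohort: 2 positive cases among 1000 people.
y_true = [1, 1] + [0] * 998

# A trivial "model" that always predicts the majority (negative) class
# and therefore assigns every record the same score.
always_negative = [0] * 1000
constant_scores = [0.0] * 1000

acc = accuracy(y_true, always_negative)
auc = auroc(y_true, constant_scores)
print(f"accuracy={acc:.3f}, AUROC={auc:.2f}")
```

In this toy case the trivial model reaches an accuracy of 0.998 while its AUROC is 0.5, the chance level, which is why accuracy alone says little when positives are rare.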

Published

11-20-2021

How to Cite

Li, J., & Mani, G. (2021). Machine Learning Application on Prediction of Male Breast Cancer with PLCO Dataset. Journal of Student Research, 10(3). https://doi.org/10.47611/jsrhs.v10i3.2199

Section

HS Research Articles