Machine Learning Application on Prediction of Male Breast Cancer with PLCO Dataset


  • Juntao Li, Cannon High School
  • Dr. Ganesh Mani (Mentor), Carnegie Mellon University



Keywords: Machine Learning, Male Breast Cancer, PLCO


This paper examines how well machine learning models can predict male breast cancer (MBC) from the Prostate, Lung, Colorectal, and Ovarian (PLCO) trial dataset. Because MBC is rare, many men are unaware that they can develop breast cancer and are therefore unlikely to seek screening in advance. The study uses the PLCO trial dataset from the National Cancer Institute, which records features such as age, prostate status, and marital status. The purpose of using PLCO data is to identify the risk of MBC as early as possible from information that is inexpensive and easy to collect. To find the models best suited to detecting MBC from this non-traditional data source, several existing models were fitted to the data, including decision tree, random forest, DBSCAN, and One-Class SVM. Because the dataset is extremely imbalanced, the models were evaluated using both standard accuracy and the Area Under the Receiver Operating Characteristic curve (AUROC). K-means and logistic regression performed best, with AUC scores of 0.62 and 0.67, respectively. These results suggest that further study needs either more efficient approaches to diagnosing male breast cancer or more advanced models and algorithms.
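The reason for pairing accuracy with AUROC on an extremely imbalanced dataset can be shown with a minimal sketch. It uses synthetic labels (5 positives among 1,000 samples), not PLCO data, and the helper names `accuracy` and `auroc` are hypothetical illustrations rather than the authors' code. AUROC is computed here from its rank interpretation: the probability that a randomly chosen positive case is scored above a randomly chosen negative case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Synthetic, illustrative labels only (assumption: 0.5% positive rate).
y_true = [1] * 5 + [0] * 995

# A "model" that always predicts the majority (negative) class.
always_negative = [0] * len(y_true)
constant_scores = [0.0] * len(y_true)

majority_accuracy = accuracy(y_true, always_negative)
majority_auroc = auroc(y_true, constant_scores)

print(f"accuracy: {majority_accuracy:.3f}, AUROC: {majority_auroc:.3f}")
```

In this sketch the majority-class predictor reaches very high accuracy while its AUROC stays at chance level, because it never ranks a positive case above a negative one. This is why accuracy alone says little about a rare-event detector.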






How to Cite

Li, J., & Mani, G. (2021). Machine Learning Application on Prediction of Male Breast Cancer with PLCO Dataset. Journal of Student Research, 10(3).


