An Analysis of the k-Nearest Neighbor Classifier to Predict Benign and Malignant Breast Cancer Tumors

Authors

  • Sahasra Chatakondu, Metea Valley High School
  • Kevin Zhai

DOI:

https://doi.org/10.47611/jsrhs.v12i4.5577

Keywords:

k-Nearest Neighbor, k-Folds Cross Validation, Breast Cancer, Machine Learning, Accuracy

Abstract

Breast cancer has a high mortality rate and is a leading cause of death among women worldwide, so machine learning (ML) algorithms that can effectively detect early signs of benign and malignant tumors have gained importance. ML classifiers make the evaluation of mammographic results more efficient and can outpace radiologists who classify extensive patient data by hand. This study evaluates how effectively the k-Nearest Neighbor (kNN) classifier characterizes tumors as benign or malignant based on concavity, texture, area, perimeter, and smoothness. We use scatterplots to differentiate the benign and malignant classes in the Breast Cancer Wisconsin Dataset (WBCD) from the University of California at Irvine Machine Learning Repository. Using the k-Fold Cross Validation (k-FCV) technique, we determine the optimal value of the kNN hyperparameter k for assigning previously unseen data to their respective categories. The analysis finds that the most favorable value of k is 12, which yields a highly effective diagnostic outcome across four distinct tests. Because k has no predefined value, guessing it can lead to accuracy errors and misdiagnosis; k-FCV therefore provides a more precise way to determine the class of tumors whose attributes have not been labeled. We also preprocess the dataset and measure how different data splits affect accuracy, in order to organize the data effectively and achieve reliable results. Because early detection is essential to preventing breast cancer deaths, ML techniques such as kNN could help reduce the mortality associated with the disease.
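The selection of k by cross validation described in the abstract can be sketched in a few lines. The following is a minimal illustration and not the authors' implementation: it uses only the Python standard library, runs on synthetic two-feature data rather than the WBCD, and the function names (knn_predict, cross_val_accuracy, best_k, make_toy_data), the labels "B" and "M", the five folds, and the candidate range 1 to 20 are all assumptions made for the example.

```python
import random
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    neighbors = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(data, k, n_folds=5):
    """Mean kNN accuracy over n_folds folds, each held out once."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / n_folds


def best_k(data, candidates, n_folds=5):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cross_val_accuracy(data, k, n_folds))


def make_toy_data(n_per_class=60, seed=0):
    """Synthetic stand-in for the real dataset: two Gaussian clusters."""
    rng = random.Random(seed)
    data = []
    for label, center in (("B", 0.0), ("M", 3.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((point, label))
    rng.shuffle(data)  # mix the classes before slicing into folds
    return data


data = make_toy_data()
candidates = range(1, 21)  # assumed search range for k
chosen_k = best_k(data, candidates)
accuracy = cross_val_accuracy(data, chosen_k)
print(f"chosen k = {chosen_k}, mean CV accuracy = {accuracy:.3f}")
```

The k chosen on this toy data has no bearing on the value of 12 reported in the study. On real measurements the features would typically be normalized first, since kNN distances are sensitive to feature scale.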

Published

11-30-2023

How to Cite

Chatakondu, S., & Zhai, K. (2023). An Analysis of the k-Nearest Neighbor Classifier to Predict Benign and Malignant Breast Cancer Tumors. Journal of Student Research, 12(4). https://doi.org/10.47611/jsrhs.v12i4.5577

Section

HS Review Articles