Machine Learning based on Probabilistic Models Applied to Medical Data: The Case of Prostate Cancer
Abstract
The growing volume of data held by organizations makes it difficult for analysts to extract hidden knowledge from it. Many existing models rely on the notion of distance and ignore conditional probability density. This study focuses on segmentation with mixture models and Bayesian networks for medical data mining. As enterprise data grows, data mining methods, and classification methods in particular, offer a way to make sense of it. We designed several models with different architectures, implemented the algorithms, and applied them to a real medical database. The objective is to classify individuals according to the conditional probability density of random variables and to identify causal relationships between traits using tests of conditional independence and a correlation measure, both based on χ². After a brief illustration of several models (decision tree, SVM, K-means, Bayes), we applied our method to data from an epidemiological case-control study of prostate cancer conducted at the University Clinics of the University of Kinshasa. After interpreting and discussing the results, we found that our model classifies a new individual with an accuracy of 96%.
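The χ²-based steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `chi_square` and `conditional_chi_square`, the choice of Cramér's V as the correlation measure, the stratified form of the conditional test, and the toy counts are all assumptions made for illustration. None of the data comes from the study.

```python
from collections import Counter
from math import sqrt


def chi_square(pairs):
    """Return (statistic, degrees of freedom, Cramér's V) for (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(x for x, _ in pairs)
    col = Counter(y for _, y in pairs)
    stat = 0.0
    for x in row:
        for y in col:
            # Expected count under independence of the two traits.
            expected = row[x] * col[y] / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    k = min(len(row), len(col)) - 1
    # Cramér's V rescales the statistic to the range [0, 1].
    v = sqrt(stat / (n * k)) if k > 0 else 0.0
    return stat, dof, v


def conditional_chi_square(triples):
    """Test X independent of Y given Z by summing per-stratum statistics."""
    strata = {}
    for x, y, z in triples:
        strata.setdefault(z, []).append((x, y))
    total_stat, total_dof = 0.0, 0
    for pairs in strata.values():
        stat, dof, _ = chi_square(pairs)
        total_stat += stat
        total_dof += dof
    return total_stat, total_dof


# Hypothetical toy counts: (trait present?, case or control).
toy = (
    [("yes", "case")] * 30
    + [("yes", "control")] * 10
    + [("no", "case")] * 10
    + [("no", "control")] * 30
)
stat, dof, v = chi_square(toy)
print(f"chi2={stat:.2f}, dof={dof}, Cramér's V={v:.2f}")
```

The statistic is compared against a χ² distribution with the returned degrees of freedom to decide whether to keep or drop an edge between two traits. The conditional variant repeats the test within each level of a third variable and adds up the statistics and degrees of freedom.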
Copyright (c) 2023 Remy Mutapay Tshimona
This work is licensed under a Creative Commons Attribution 4.0 International License.