Machine Learning based on Probabilistic Models Applied to Medical Data: The Case of Prostate Cancer

The growth in the amount of data held by companies makes it difficult for analysts to extract hidden knowledge from it. Several models have emerged that focus on the notion of distance while ignoring the notion of conditional probability density. This research study focuses on segmentation using mixture models and Bayesian networks for medical data mining. As enterprise data grows, data mining methods, and classification methods in particular, can be applied to make sense of it. We designed different models with different architectures and then applied these models to the medical database. The algorithms were implemented on real data. The objective is to classify individuals according to the conditional probability density of random variables, and to identify causalities between traits using tests of conditional independence and a correlation measure, both based on χ². After a quick illustration of several models (decision tree, SVM, K-means, Bayes), we applied our method to data from an epidemiological case-control study of prostate cancer (carried out at the University Clinics of the University of Kinshasa). After interpreting and discussing the results, we found that our model allows us to classify a new individual with an accuracy of 96%.


INTRODUCTION
In recent years, companies have generated an increasing amount of data: medical data, such as records of patients seen in a hospital and their pathologies; banking data, such as credit card transactions; industrial data, such as measurements from sensors on a production line; or any other type of data imaginable. In each case, the analyst seeks to extract hidden information from the data, but the sheer volume of data does not in itself make this possible. This gave rise to the idea of classifying the data so that the most similar items are grouped in a single class. This technique is called automatic classification [1], [2]. Automatic data classification is the algorithmic categorization of objects. It consists of assigning a class or category to each object (or individual) to be classified, based on statistical data.

For a long time, this mathematical technique was based on the notion of distance, i.e. two elements were considered close according to the distance separating them, as presented by E. Lincker, C. Guinaudeau, O. Pons, and J. Dupire in [2] and by C. Mélina and L. Benoît in [3]. The problem was then to know which metric to choose for a good classification. Moreover, the choice of the number of classes was crucial and sometimes varied from one analyst to another. Another problem was the sensitivity to atypical data, i.e. data that deviated a little from the rest. A further problem was how to define the probability of an individual belonging to one class of data rather than another. This raised the following question: how can individuals be classified according to the conditional probability density of the random variables?

JINITA Vol. 5, No. 2, December 2023. DOI: doi.org/10.35970/jinita.v5i2.1879
In an attempt to address some of the above problems, probabilistic models have emerged. Our study therefore focuses on the probabilistic approach [4], which has largely corrected the problems described above. This approach uses the notion of conditional probability density, in the form of a Gaussian or a mixture of Gaussians, for classification. Here, it is statistical parameters such as the mean and the variance that allow us to say that two elements are close, according to their proximity to the class mean. These methods will be applied to prostate cancer, which is the most common cancer in men over 50 in industrialized countries. As diagnostic practices and treatment options continue to evolve, small tumors can be detected, and targeted treatments can be guided to minimize the morbidity of therapy [5], [6]. This set of methods will be crucial for the automated processing of medical data, to assist and guide the practitioner in diagnostic decision-making and in the therapeutic procedure for specific treatments of prostate cancer.

This research study focuses on segmentation using mixture models and Bayesian networks for medical data mining. The aim is to classify individuals according to the conditional probability density of random variables, and to identify causalities between traits using conditional independence tests and a correlation measure, both based on the χ² statistic. This gives a reliability of 96%, which shows the advantage of our method over the algorithms developed in [1], [2], [3], which fall short because they restrict their classification to elementary statistical parameters such as distance and variance. Our contribution is twofold. On the scientific side, we present a probabilistic approach applied to automatic classification and the benefits it brings. On the medical side, we have created a decision-support tool for the segmentation of prostate cancer data.

METHODOLOGY
In our research, we have opted for algorithms based on the mixture model [7]. Here we describe the motivation, give a detailed overview of the tasks, and present the proposed approach. Today, prostate cancer has serious effects on men from the age of 40 onwards, and can even lead to death. Our approach addresses methods that help to understand and extract knowledge from large volumes of data on prostate cancer [8], which can enable a province or a country to understand the harm caused by this disease in light of the other factors affecting it and their probable influence on its evolution.

Segmentation by Mixture Model
The mixture model is, by definition, a statistical model used to parametrically estimate the distribution (density function) of random variables by modeling them as a sum of several simpler distributions, as presented in [7], [8]. This definition means that the whole population is assumed to follow a probability distribution that is a mixture of C probability distributions associated with the classes. The main objective is to identify the C distributions by estimating their parameters. This identification consists of assuming that the observed data are realizations of a random vector X of unknown distribution P; the objective is to reconstruct P from its realizations. The probability density $f(x, \Theta)$ at a point is unknown. The principle of the model is to decompose this density into a sum of C components $f_k(x, \theta_k)$ ($k = 1, \ldots, C$) corresponding to the C classes, whose parameters $\theta_k$ will be estimated from a sample X. Mathematically, a finite mixture on a space X is a convex combination of probability laws $f_k(x, \theta_k)$ (called the mixing components, as explained in [7], [8]), defined on X by

$$f(x, \Theta) = \sum_{k=1}^{C} \pi_k \, f_k(x, \theta_k) \qquad (1)$$

where $\pi_k$ is the a priori probability of component k. The $\pi_k$ satisfy the probability conditions $\sum_{k=1}^{C} \pi_k = 1$ and $0 \le \pi_k \le 1$. $\Theta$ and $\theta_k$ are respectively the parameters of the model $f(x, \Theta)$ and of component k. If the mixing distribution is continuous, i.e. we have an infinite number of components, then (1) is written

$$f(x, \Theta) = \int \pi(\theta) \, f(x, \theta) \, d\theta \qquad (2)$$

The probability densities $f_k(x, \theta_k)$ can be taken from the large family of parametric statistical distributions (Gaussian, Student, etc.). In what follows, we detail the Gaussian mixture because it has some advantages over other distributions. In particular, in practice, the specification of the components of a mixture is not always guided by the particularities of the data to be modeled. Gaussian mixtures, which assume conditional populations distributed according to a normal distribution, are of great interest because of their flexibility, their ability to approximate a wide variety of densities, their mathematical simplicity, and the generality of the normal distribution as shown by the central limit theorem.

Gaussian Mixture Model
To describe the Gaussian mixture, we first define the density function of the Gaussian distribution. We then introduce the notion of a Gaussian mixture [8].

Gaussian distribution
The Gaussian distribution, also known as the normal distribution, is the best known of the probability distributions. It has been widely used to model the distribution of continuous random variables [9]. In the case of a single random variable X, the density function of the Gaussian distribution can be written as follows:

$$\mathcal{N}(x, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (3)$$

where $\mu$ is the mean and $\sigma^2$ is the variance. If the distribution is multidimensional, the density function of the multivariate Gaussian distribution takes the following form:

$$\mathcal{N}(x, \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^{\top} \Sigma_k^{-1} (x-\mu_k)\right) \qquad (4)$$

where $\mu_k$ is the vector of means, $\Sigma_k$ the variance-covariance matrix, and $d$ the dimension.

The accompanying figure shows two Gaussian components; the continuous curve shows the mixture density of these two Gaussians. After this brief overview of Gaussian mixtures, we now explain how the parameters of the Gaussian mixture model can be estimated, using the best-known technique in this field: the EM (Expectation-Maximization) algorithm.
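As a minimal sketch of the univariate Gaussian density and of a two-component mixture built from it, the following Python code evaluates $\sum_k \pi_k \, \mathcal{N}(x, \mu_k, \sigma_k^2)$. The function names and all numerical parameters are illustrative assumptions and do not come from the study data.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def mixture_pdf(x, weights, means, variances):
    """Density of a finite Gaussian mixture: sum_k pi_k * N(x, mu_k, sigma2_k)."""
    # The weights must form a convex combination (non-negative, summing to 1).
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Illustrative (made-up) parameters for two components.
weights = [0.4, 0.6]
means = [0.0, 5.0]
variances = [1.0, 2.0]

density_at_1 = mixture_pdf(1.0, weights, means, variances)
```

Because the weights form a convex combination, the mixture is itself a probability density: it is non-negative and integrates to 1.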

Estimation of the parameters of mixture models
Several estimation methods have been developed in inferential statistics to estimate the parameters of the probability distribution of a sample, including point estimation and confidence interval estimation, but the best known is the maximum likelihood method, which is defined as follows. Let $X = \{x_1, \ldots, x_n\}$ be a sample in which $x_1, \ldots, x_n$ are independent realizations of a random vector X. The likelihood of the data for the model with parameter $\Theta$ is written:

$$L(\Theta) = \prod_{i=1}^{n} f(x_i, \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{C} \pi_k \, f_k(x_i, \theta_k) \qquad (6)$$

It is easier to maximize the log-likelihood than the likelihood function itself. If $\Theta$ maximizes $\ln L(\Theta)$, then it also maximizes $L(\Theta)$, because the logarithm is a monotonically increasing function. The log-likelihood function is written as:

$$\ln L(\Theta) = \sum_{i=1}^{n} \ln \left( \sum_{k=1}^{C} \pi_k \, f_k(x_i, \theta_k) \right) \qquad (7)$$

In the case of Gaussian mixtures, this method is carried out by the EM algorithm.
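As a small illustration of the log-likelihood of a mixture, the sketch below computes $\ln L(\Theta) = \sum_i \ln \sum_k \pi_k \, \mathcal{N}(x_i, \mu_k, \sigma_k^2)$ for a one-dimensional Gaussian mixture and compares two candidate parameter sets. The sample and the parameter values are made up for the example, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def log_likelihood(sample, weights, means, variances):
    """Log-likelihood of a 1-D Gaussian mixture for independent observations."""
    total = 0.0
    for x in sample:
        # Mixture density at x, then its logarithm (sum over observations).
        density = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total


# Made-up sample with two visible groups, around 0 and around 5.
sample = [-0.3, 0.1, 0.4, 4.6, 5.2, 5.5]

# Parameters centred on the two groups versus parameters placed between them.
good = log_likelihood(sample, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
poor = log_likelihood(sample, [0.5, 0.5], [2.0, 3.0], [1.0, 1.0])
```

The parameter set whose means sit on the two groups yields the larger log-likelihood, which is the quantity maximum likelihood estimation seeks to maximize. For observations extremely far from every component the density can underflow to zero; a production implementation would work in log space.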

EM algorithm
As presented by D. A. Boiko and A. S. Kashin in [10], [11], the EM algorithm allows structural learning to evaluate the parameters and the score of all the networks in the neighborhood of the current network, and to choose the best one for the next iteration in order to achieve a good classification.

It is represented as follows:

Input: C, the number of components in the mixture; X, a set of observations; $\Theta^{(0)}$, an initial value of the model parameters.
s ← 0
Repeat
    E-step: for each observation $x_i$ and each class k, compute the posterior probability $\gamma(z_{ik}) = \pi_k \, \mathcal{N}(x_i, \mu_k, \Sigma_k) \, / \, \sum_{j=1}^{C} \pi_j \, \mathcal{N}(x_i, \mu_j, \Sigma_j)$
    M-step: for each class k
        Estimation of the weight $\pi_k$ of class k
        Estimation of the mean $\mu_k$ of class k
        Estimation of the covariance matrix $\Sigma_k$ of class k
    Next
    s ← s + 1
Until convergence

Convergence of the algorithm is reached in one of two cases: either the difference in the likelihood function between two consecutive steps is negligible, or the newly estimated parameters no longer change relative to the previous step [10], [12]. Each individual is assigned to the class to which he or she is most likely to belong, as determined by $\gamma(z_{ik})$. The EM algorithm converges to a local optimum of the likelihood. This convergence depends directly on the initialization phase of the algorithm. A bad initialization can lead to a bad model, while a good initialization is more likely to lead to convergence towards a global optimum of the likelihood. To overcome the problems associated with the EM algorithm, a few variants have been proposed, including the SEM and CEM algorithms [8], [11], [12], which will be developed in our future articles.
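The E-step and M-step described above can be sketched in Python for the simplest case, a one-dimensional Gaussian mixture. This is an illustration under stated assumptions and not the implementation used in the study: the function name `em_gmm_1d`, the variance floor, the tolerance, and the toy data are all hypothetical, and the multivariate case would replace variances with covariance matrices.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def em_gmm_1d(data, weights, means, variances, tol=1e-8, max_iter=500):
    """EM for a 1-D Gaussian mixture, starting from the given initial parameters.

    Returns (weights, means, variances, gamma, loglik), where gamma and loglik
    come from the last E-step that was run.
    """
    n, c = len(data), len(weights)
    weights, means, variances = list(weights), list(means), list(variances)
    previous = -math.inf
    gamma, loglik = [], previous
    for _ in range(max_iter):
        # E-step: gamma[i][k] is the posterior probability of class k given x_i.
        gamma, loglik = [], 0.0
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(c)]
            total = sum(joint)
            loglik += math.log(total)
            gamma.append([j / total for j in joint])
        # Stop when the log-likelihood no longer changes noticeably.
        if abs(loglik - previous) < tol:
            break
        previous = loglik
        # M-step: re-estimate weight, mean and variance of each class.
        for k in range(c):
            n_k = sum(g[k] for g in gamma)
            weights[k] = n_k / n
            means[k] = sum(g[k] * x for g, x in zip(gamma, data)) / n_k
            var = sum(g[k] * (x - means[k]) ** 2 for g, x in zip(gamma, data)) / n_k
            variances[k] = max(var, 1e-6)  # assumed floor to avoid a collapsing component
    return weights, means, variances, gamma, loglik


# Made-up observations forming two groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
w, m, v, gamma, loglik = em_gmm_1d(data, [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])

# Each individual is assigned to the class with the largest posterior probability.
labels = [max(range(len(g)), key=lambda k: g[k]) for g in gamma]
```

Running the function from several different initial values and keeping the run with the highest log-likelihood is a common way to reduce the sensitivity to initialization discussed above.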

RESULTS AND INTERPRETATIONS
In our application, the objective is to classify people living with prostate cancer, to determine the level of intervention and follow-up and the correlation between the different variables taken into account, and to predict the class to be assigned to a new individual. Based on the dataset we received, we ran several machine learning algorithms developed in Python and then interpreted the results.

Data Source
The research sample was drawn from the database of the University Clinics of the University of Kinshasa. The data collection period was from January 2017 to December 2021. The first step was to collect data by selecting all patients diagnosed with prostate disease. Disease types were classified according to the International Statistical Classification of Diseases and Related Health Problems, tenth revision [13], [14], a system of categories in which diseases are grouped according to predetermined criteria.

The correlation matrix
The correlation matrix is represented as follows:

Figure 2. Correlation matrix

The main objective of this matrix is to eliminate some irrelevant variables, in effect those that are most correlated with our class of interest, diagnosis_result. We note here that all the variables have a low correlation with the class, hence the interest in using them in what follows.
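A correlation matrix of this kind can be sketched as follows, assuming the Pearson coefficient (the text does not state which coefficient was used). Only the column name diagnosis_result comes from the paper; `feature_a`, `feature_b`, and all values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # A constant column has zero variance and would raise ZeroDivisionError here.
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy table; diagnosis_result is coded 1 (positive) / 0 (negative).
columns = {
    "feature_a": [4.1, 3.9, 5.0, 4.7, 4.0, 4.4],
    "feature_b": [10.0, 12.5, 11.0, 9.5, 12.0, 10.5],
    "diagnosis_result": [1, 0, 1, 1, 0, 0],
}

matrix = correlation_matrix(columns)

# Correlation of each explanatory variable with the class of interest.
class_correlations = {
    name: matrix[name]["diagnosis_result"]
    for name in columns
    if name != "diagnosis_result"
}
```

The last dictionary isolates the column of the matrix that relates each variable to the class, which is the part of the matrix the paragraph above discusses.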

Visualisation of training and test data
The training and test data are visualized in Figure 3, and the application of the models to the data in Figure 4. Here, we used a set of well-known supervised learning algorithms [15], namely SVM, decision tree [16], KNN, and Naive Bayes [9], each used to predict whether a new individual has prostate cancer. All these algorithms were implemented in Python. In a comparative study, we note an improvement in the curve for the Bayes classifier, which reaches an accuracy of 96%, better than the other models and unlike the methods developed in [2], [3], which remained at 76%.

Decision tree
This algorithm is, as its name suggests, a decision support tool that allows a population of individuals to be divided into homogeneous groups according to discriminating attributes, with respect to a fixed and known objective. It allows predictions to be made from known data on the problem by reducing, level by level, the domain of solutions [15], [17], [18].
Each internal node of the decision tree corresponds to a discriminant attribute of the elements to be classified, which makes it possible to distribute these elements homogeneously among the children of this node. The branches linking a node to its children represent the discriminant values of the node's attribute. Finally, the leaves of a decision tree are its predictions for the data to be classified.
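To make the idea of a discriminant attribute concrete, the sketch below selects the threshold on one numeric attribute that gives the most homogeneous pair of child nodes. It assumes Gini impurity as the homogeneity measure, which is one common choice and is not stated in the text; the function names and the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a group of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Threshold on one numeric attribute minimising the weighted Gini impurity.

    Returns (threshold, impurity); candidate thresholds are midpoints between
    consecutive distinct attribute values.
    """
    best = (None, float("inf"))
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2.0
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Made-up attribute values and classes for six individuals.
values = [2.0, 3.0, 4.0, 10.0, 11.0, 12.0]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

threshold, impurity = best_threshold(values, labels)
```

A full tree repeats this selection recursively on each child node, over all attributes, until the leaves are sufficiently homogeneous.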

Support Vector Machine (SVM)
The SVM is one of the most powerful supervised learning algorithms. It is based on the concept of a hyperplane and, in its simplest form, is known as the maximum margin classifier. For a given training sample, this algorithm generates an optimal hyperplane that maximizes the margin between data points of different classes. For a two-class data set, the closest objects in each class must be well separated from the decision boundary [17], [19], [20], [21].

K-nearest neighbor (KNN)
KNN is one of the most widely used supervised learning algorithms because of its simplicity, and it is used much more for classification problems than for regression. Its weakness is that it uses all the training data when making predictions [17], [19], [22].
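A minimal sketch of the K-nearest-neighbor rule is given below, assuming Euclidean distance and a simple majority vote. The function name, the value k=3, and the toy data are assumptions made for the example. The loop over the whole training set at prediction time is the weakness mentioned above.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every training point (the whole set is scanned).
    neighbors = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Made-up two-dimensional training points and their classes.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]

prediction_near_first_group = knn_predict(train_x, train_y, (1.1, 1.0))
prediction_near_second_group = knn_predict(train_x, train_y, (4.0, 4.0))
```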

Naive Bayes (NB)
Naive Bayes is a probabilistic classifier based on Bayes' theorem. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is not related to the presence of any other feature. In other words, each feature makes an independent and equal contribution to the result [17], [20].

We can say here that our model performs well based on the results of our confusion matrix: of 4 images of patients with prostate cancer, the model correctly identified all 4, and of 16 scanned images of patients without pain, the model correctly identified 12 and got 4 wrong.
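The Naive Bayes rule described in this section can be sketched as follows, assuming numeric features modeled by one Gaussian per feature and per class (Gaussian Naive Bayes). The text does not specify which variant was used, so this choice, the function names, the variance floor, and the toy data are all assumptions.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate, for each class, its prior and the mean and variance of each feature."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Assumed small floor so that a constant feature does not give zero variance.
        variances = [
            max(sum((value - mean) ** 2 for value in col) / n, 1e-9)
            for col, mean in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior under feature independence."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        # Independence assumption: the per-feature log-densities simply add up.
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)
        return score
    return max(model, key=log_posterior)


# Made-up two-feature records and their classes.
rows = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (3.0, 4.1), (3.2, 3.9), (2.9, 4.0)]
labels = ["negative", "negative", "negative", "positive", "positive", "positive"]

model = fit_gaussian_nb(rows, labels)
predicted_low = predict_gaussian_nb(model, (1.1, 2.0))
predicted_high = predict_gaussian_nb(model, (3.1, 4.0))
```

Working with log-probabilities avoids numerical underflow when many features are multiplied together.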

Validation of the model with the ROC Curve
After training the models on the data, we found the results satisfactory, at 96%. This is shown by the ROC curve in the figure below. From this figure, we can say that our models have learned well and that they can classify a new individual (prostate cancer or not) with good accuracy.
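For readers unfamiliar with the ROC curve, the sketch below shows how its points and the area under it (AUC) can be computed from class labels and classifier scores. The labels and scores are made up for the example and are not the study's results; the function names are hypothetical.

```python
def roc_curve(labels, scores):
    """ROC points (false positive rate, true positive rate) over all thresholds.

    `labels` holds 1 for positive and 0 for negative cases; a higher score
    means the classifier is more confident that the case is positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last case sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up labels and scores for eight cases.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.1]

points = roc_curve(labels, scores)
area = auc(points)
```

An area close to 1 means that positive cases are almost always scored above negative ones, while an area of 0.5 corresponds to random ranking.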

CONCLUSION
The growing volume of medical data is putting analysts in a difficult position when it comes to extracting hidden knowledge from data. Several models have emerged that focus on the notion of distance while ignoring the notion of conditional probability density. Our work has focused on segmentation using mixture models and Bayesian networks for medical data mining. We designed different models with different architectures and then applied these models to the medical database. The algorithms were implemented on real data. The aim was to classify individuals according to the conditional probability density of random variables, and to identify causalities between traits using tests of conditional independence and a correlation measure, both based on χ². After a brief illustration of several models (decision tree, SVM, K-means, Bayes), we applied our method to data from an epidemiological case-control study of prostate cancer (carried out at the University of Kinshasa's university clinics). After interpretation and discussion, we found that our model enabled us to classify a new individual with 96% accuracy, unlike non-probabilistic models, which classify with 78% accuracy. These results show that the methods we have developed are sufficiently accurate, fast, and robust to be used in a clinical context. These tools demonstrate their ability to improve the turnaround time and reproducibility of prostate diagnostic and therapeutic decisions. Finally, we noted the slowness of the EM algorithm when the parameters $\mu_k$, $\Sigma_k$, $\nu_k$ are estimated jointly. In our future research, we therefore plan to separate the estimation of the means and covariance matrices from the estimation of the degrees of freedom, using the ECM algorithm to speed up the computation.
In the multivariate Gaussian density (4), $\Sigma_k$ is the variance-covariance matrix, $\mu_k$ the vector of means, and $d$ the dimension of the space of individuals. The Gaussian mixture has long been used in statistical learning. It can model any numerical data set with arbitrary precision. The Gaussian mixture model is a convex combination of several Gaussian components. It is particularly used in cases where the data under study cannot be modeled by a single Gaussian. In other words, if the data structure is naturally composed of several groups, it is necessary to represent them by a Gaussian mixture model rather than a single Gaussian distribution [9]. To formalize this, simply replace each $f_k(x, \theta_k)$ with the density function of the Gaussian distribution. The density function of the Gaussian mixture is then written as follows:

$$f(x, \Theta) = \sum_{k=1}^{C} \pi_k \, \mathcal{N}(x, \mu_k, \Sigma_k) \qquad (5)$$

where $\mathcal{N}(x, \mu_k, \Sigma_k)$ is the k-th component of the Gaussian mixture. In this context, we denote the set of weights by $\pi = \{\pi_k\}$, the set of means by $\mu = \{\mu_k\}$, and the set of class covariance matrices by $\Sigma = \{\Sigma_k\}$. We also define $\Theta = \{\theta_k\}$ and $\theta_k = \{\pi_k, \mu_k, \Sigma_k\}$.

Figure 3. Training and test data

Figure 4. Application of the models to the data

Figure 5. Segmentation with the Gaussian Mixture Model

Our data are initially classified by our model into two classes: positive and negative. This leads us to a correlation matrix (Figure 6), the main objective of which is to eliminate certain irrelevant variables, in effect those that are most correlated with our class of interest, diagnosis_result. We note here that all the variables have a low correlation with the class, hence the interest in using them in what follows.