From Text to Insights: NLP-Driven Classification of Infectious Diseases Based on Ecological Risk Factors

The Framework-Based Method


INTRODUCTION
Numerous factors can affect the emergence of infectious diseases. While many result from natural processes, such as the gradual emergence of viruses over time, others result from human activity [1]. Several factors have contributed to these changes, including population growth, urbanization of rural areas, international air travel, global poverty, armed conflicts, and unfavorable environmental changes brought on by economic growth and land use. Furthermore, public health concerns are growing and becoming more complex, in part because of the social and environmental risks posed by worldwide environmental alterations brought on by rapid industrialization, population growth, excessive exploitation of natural resources, and improper application of technology [4]. The life-sustaining resources of the environment are being used unsustainably and in significant amounts. According to the Millennium Ecosystem Assessment, these disruptions currently affect the well-being of individuals and might get worse over the next 50 years [2]. Human health and illness are shaped by numerous complex factors. Infectious diseases transmitted between humans and animals through direct interaction, contaminated environments, contaminated food, and contaminated water pose risks to public health. Since health is dynamic, varies over time, and has many diverse elements, ecological perspectives on food and the environment, which include aquaculture, agriculture, and food systems as a whole, are under considerable strain [3].
JINITA Vol. 5, No. 2, December 2023 DOI: doi.org/10.35970/jinita.v5i2.2084
In 2013, infectious diseases caused over 9 million fatalities and over 45 million years of lost productivity due to disability [5]. Healthcare systems and their surrounding areas are home to a wide variety of illnesses and microorganisms. Although viruses and bacteria vary greatly, the way germs spread from one individual to another within an environment is consistent. Moreover, exposure to environmental pollutants has been linked to a variety of human diseases and conditions, including infectious diseases. Identifying diseases that may be related to environmental contaminants, and determining which data sources are already available for them, are essential steps toward characterizing more precisely the links between environmental exposures and adverse health consequences [6].
Against this background, this study investigates the diverse factors influencing the development of infectious diseases, distinguishing between natural and human-induced processes. It examines the ecological aspect of human activities to understand their role in encouraging disease transmission within the ecosystem. It also develops and applies a Framework-Based Method (FBM) for the structured and reproducible classification of infectious diseases, encompassing data collection, preprocessing, and model training. Finally, it conducts a comparative analysis of classification models, evaluating their performance, and integrates the deep learning model BERT with the best-performing classification model to create an interactive interface that enhances user experience and accessibility in infectious disease classification.

RELATED LITERATURE
Many studies have been carried out on infectious diseases. We review their key findings and broader implications for infectious disease research and public health. [7] examined Japan's infectious disease surveillance system and found that legal amendments played a significant role in reducing illness rates. Although the timeframe-specific nature of that research limits the applicability of its findings, it underscores the significance of policy adjustments in disease control. [8] studied global trends in emerging infectious diseases (EIDs), highlighting the influence of socioeconomic, environmental, and ecological factors on their emergence; the work points to the necessity of a multidisciplinary approach to understanding and mitigating EID risks. [9] proposed a theoretical framework for combining data and models in infectious disease research, emphasizing data gathering to enhance disease modeling and the importance of comprehensive data for precise predictions. An assessment by [10] of China's capability to manage infectious diseases highlights the need for comprehensive prevention and response strategies and the urgency of proactive measures against future disease threats. [11] investigated the utility of mathematical models in understanding the dynamics of infectious diseases on a global scale, showing the interconnectedness of infectious diseases and the potential for regional and global repercussions if interventions fall short. [12] suggested leveraging mobile phone data to connect movement patterns to infectious diseases, introducing new possibilities for characterizing population behavior and predicting disease outbreaks; this approach may change disease tracking and response strategies. [13] explored the interplay between climate and infectious diseases, suggesting that interdisciplinary cooperation between biology and climate research could yield deeper insights into disease dynamics. Identifying such dynamics can help reveal patterns in infectious disease occurrence: [14] focused on identifying patterns in the occurrence of infectious disease syndromes in Mongolia, and their application of predictive models to detect rising disease rates shows the potential of syndrome-based assessments in forecasting disease trends. [15] offers an overview of modeling infectious disease transmission; their discussion of incorporating intricate data and advanced inference techniques underscores the significance of adapting modeling approaches to real-world disease spread. Given the evolving landscape of global changes and their possible ramifications for infectious diseases, there is a call for research to adapt in order to anticipate and address emerging challenges in disease prevention and control [16]. Finally, [17] provides the basis for the classification of infectious diseases using semantic natural language processing, and it forms the basis of this research work. That study was limited to a single machine learning algorithm for the semantic classification. This research therefore expands on it by introducing more algorithms for the comparative classification of infectious diseases using Natural Language Processing through ecological risk factors.

METHOD
In this research, we used a Framework-Based Method (FBM) to classify infectious diseases. An FBM is a widely used strategy in computer science research, in which a structured system of concepts is employed to guide and enhance research studies [18]. Figure 1 presents the framework used in this research. It shows the components of the process carried out in this research: data collection, creation of the infectious disease corpus, data preprocessing, creation of the document term matrix, text analysis and visualization, training and testing, comparative analysis, performance evaluation, and deployment of the model. Each component of the framework is discussed below.

Data collection
Research is the process of gathering experiences or observations. Based on the data acquired, a researcher may assess their hypothesis. Data collection in this part was split into the following processes:

Epidemiology:
Epidemiology examines the distribution (who, when, and where) and trends of health and illness situations in a certain population. As a data source for this study, we use the epidemiological background of each transmissible disease.

PRISMA Flow:
A PRISMA flow diagram was used to represent the literature selected from the epidemiology of infectious diseases, as presented in [17].

Ecological Factors:
Ecological factors comprise the biotic and abiotic variables of an environment. Anything that affects the natural world is considered to be an environmental or ecological factor. Environmental factors include water, air, soil, climate, local vegetation, and landforms. Environmental damage, forest loss, sewage contamination, warming temperatures, and climate change are the top five ecological problems affecting the well-being and health of humans. Therefore, we identify all the ecological components that act as causative factors of each infectious disease, given the epidemiology extracted for the chosen infectious diseases from journal articles.

Keyword Annotation:
In this study, we extract and annotate all the keywords that represent ecological aspects of each chosen infectious disease. We accomplish this by using headers and terms in bold and by searching for the most significant points, arguments, and supporting evidence.

Corpus
Corpus construction entails creating a machine-readable text compilation that mirrors a specific language or field. Such a corpus serves as a valuable resource for advancing and assessing natural language processing (NLP) algorithms and applications. In this research, we form bags of words from the epidemiology data for use in the NLP task.

Data Preprocessing
Data preprocessing is a strategy commonly used to turn data into a useful and efficient form. In this research, the preprocessing phase consists of removing tags, removing stop words, removing punctuation, removing whitespace, and stemming, which together form a standard preprocessing phase in NLP tasks.
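The five preprocessing steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used in the study: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a simplified suffix stripper standing in for a real stemmer such as Porter's. Stop words are removed after punctuation and whitespace so that they are matched against clean tokens.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to", "with"}

# Suffixes tried in order by the simplified stemmer (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Remove tags, punctuation, whitespace, and stop words, then stem each token."""
    text = re.sub(r"<[^>]+>", " ", text)  # remove markup tags
    text = text.lower()
    # Replace every punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    tokens = text.split()  # split() also collapses repeated whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in tokens]  # stemming


tokens = preprocess("<p>Flooding and contaminated water in the  village.</p>")
print(tokens)
```

Because the stemmer is so crude, stems such as "contaminat" are not dictionary words; that is also typical of real stemmers and is harmless as long as the same function is applied to every document.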

Document Term Matrix
A document-term matrix, or term-document matrix, is a mathematical matrix that describes the frequency of terms that appear throughout a collection of documents. In a document-term matrix, the documents are denoted by rows and the terms are represented by columns. The document-term matrix is essentially a matrix that lists the frequency of each term over the whole corpus of written texts. In this research, we transform the corpus (the bags of words) so that each term in the matrix is represented by its binary equivalent.
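As a minimal sketch of this transformation, the function below builds a binary document-term matrix in which rows are documents, columns are terms, and an entry is 1 when the term occurs in the document. The function name `build_binary_dtm` and the two example documents are hypothetical and are not taken from the study's corpus.

```python
def build_binary_dtm(documents):
    """Return (vocabulary, matrix); matrix[i][j] is 1 if term j occurs in document i."""
    vocabulary = sorted({term for doc in documents for term in doc})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term in doc:
            row[column[term]] = 1  # presence only, not a count
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical tokenized ecological sentences.
docs = [
    ["stagnant", "water", "mosquito"],
    ["contaminated", "water", "sewage"],
]
vocab, dtm = build_binary_dtm(docs)
for row in dtm:
    print(row)
```

Sorting the vocabulary fixes the column order, so the same term always maps to the same column when new documents are encoded against an existing vocabulary.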

Text Analysis
Text analysis is the technique of analyzing and understanding human-written text using computer technologies to generate insights. Text analysis software can automatically classify, sort, and extract data from texts to discover patterns, connections, emotions, and other important information. In this study, we employ word frequency, which involves measuring the number of times a specific word is found in a text or collection of texts. Term frequency is used to evaluate the relevance of each word.
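Word frequency as described above can be illustrated with a short sketch based on `collections.Counter` from the Python standard library. The sample tokens and the helper names are hypothetical; the relative frequency shown is one simple way to express term frequency and is an assumption rather than the study's exact definition.

```python
from collections import Counter


def term_frequencies(documents):
    """Count how often each term occurs across all tokenized documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    return counts


def relative_frequencies(counts):
    """Divide each count by the total number of tokens."""
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


# Hypothetical tokenized sentences.
sample = [["flood", "water", "water"], ["water", "deforestation"]]
freq = term_frequencies(sample)
top = freq.most_common(2)
rel = relative_frequencies(freq)
print(top)
```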

Text Visualization
Text visualization is a means of visually presenting textual data using graphs, charts, or word clouds. It summarizes the content, reveals trends and patterns across documents, and provides quick access to the most significant terms in a text. Word clouds are a good starting point for displaying qualitative data. They can be used in exploratory research to identify what a data set contains, to develop labeling requirements for more extensive text analysis and visualization, and to provide quick basic insights.

Training and Testing
The main difference between training data and testing data is that the former is a subset of the original data used to train the machine learning model, whereas the latter is used to evaluate the model's correctness. The training dataset is usually larger than the testing dataset. Train and test datasets are often divided 80:20, 70:30, or 90:10. In this research, we trained all eight machine learning models in Figure 1 on the document term matrix created earlier.
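A minimal sketch of an 80:20 split is given below, using only the standard library. The 342 placeholder rows match the number of observations reported in the results; the seed, the rounding rule, and the function name are assumptions made for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)  # 0.8 * 342 = 273.6 rounds to 274
    return shuffled[:cut], shuffled[cut:]


rows = list(range(342))  # placeholder row identifiers
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting matters when the source data are ordered by disease or by year, because an unshuffled split could leave whole classes out of the training set.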

Classification
Classification is one of the most common machine learning techniques; it builds a model from a set of pre-classified instances so that the model can classify the whole set of records. This technique is especially well suited for applications such as medical records and sickness risk analysis. In this research, we employed eight machine learning algorithms: XGBoost, Support Vector Machine, Random Forest, Artificial Neural Network, Decision Tree, Gradient Boosting Machine, Linear Discriminant Analysis, and K-Nearest Neighbor. All of them were trained on the document term matrix.
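To make the idea of classifying rows of a document term matrix concrete, the sketch below implements K-Nearest Neighbor, one of the eight algorithms listed, in plain Python. It is an illustration under stated assumptions, not the study's implementation: Hamming distance on binary rows, k = 3, and the toy rows and labels are all hypothetical choices.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length binary rows differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_rows, train_labels, query, k=3):
    """Return the majority label among the k training rows closest to the query."""
    ranked = sorted(range(len(train_rows)), key=lambda i: hamming(train_rows[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical binary document-term rows and their disease labels.
train_rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
train_labels = ["Malaria", "Malaria", "Cholera", "Cholera"]
prediction = knn_predict(train_rows, train_labels, [1, 1, 0, 1])
print(prediction)
```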

Performance Evaluation
In machine learning, the effectiveness of models is determined using performance assessment measures, or metrics. These help us determine how well a machine learning model will perform on a dataset it has not seen before, which makes performance evaluation criteria critical when analyzing models on new datasets. A model will very likely perform better on the dataset it was trained on than on unseen data. In this research, machine learning evaluation metrics such as the confusion matrix and the kappa statistic were adopted to evaluate the performance of the trained models.
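The two metrics named above can be computed as in the following sketch. It assumes the kappa statistic is Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the agreement expected by chance from the row and column totals of the confusion matrix. The labels and predictions are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy_and_kappa(matrix):
    """Return (observed accuracy, Cohen's kappa) for a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    # Undefined when expected == 1 (every prediction and label in a single class).
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


actual = ["Malaria", "Malaria", "Cholera", "Cholera"]
predicted = ["Malaria", "Cholera", "Cholera", "Cholera"]
cm = confusion_matrix(actual, predicted, ["Malaria", "Cholera"])
acc, kappa = accuracy_and_kappa(cm)
print(cm, acc, kappa)
```

Kappa is lower than accuracy here because part of the agreement would be expected by chance alone, which is why it is a useful complement to accuracy when classes are imbalanced.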

Model Deployment
In this section, we deployed an interface that uses the DistilBERT pre-trained model alongside our trained model, where a user can easily supply ecological sample text and classify infectious diseases.

RESULTS AND DISCUSSION

Exploratory Data Analysis
Exploratory Data Analysis (EDA) is of great importance in machine learning research, as it involves in-depth exploration and comprehension of a dataset before any modeling techniques are applied [21]. It helps researchers acquire insights, recognize patterns, and detect anomalies within the dataset. EDA was conducted on the data gathered in this research [17] to gain more insight into it. The data comprise a total of 342 epidemiological articles, published from 1998 to 2022, that report ecological factors affecting the 9 selected infectious diseases (including Malaria, Tuberculosis, Measles, Polio, Avian_Influenza, and Cholera). Figure 2 represents the ecological factor counts for the 9 diseases selected in this research. Furthermore, when the ecological factors reported by the articles were grouped by year of publication, deforestation was observed to be the most frequently reported factor across the selected articles from 1998 to 2022. Again, the yearly total of journals reporting ecological factors that affect infectious disease shows that 2016 to 2017, with over 112 journals, recorded the highest number of reported ecological factors for the 9 selected diseases.

Text Analysis and Visualizations
Text analysis and visualization are vital components of natural language processing (NLP) that play a crucial role in extracting insights and understanding from unstructured text data. They provide essential techniques for processing and interpreting large amounts of textual information, enabling businesses and researchers to uncover patterns, trends, sentiments, and connections within the data. We present below the results of the text analysis using word frequency, word clouds, histograms, and related tools. Figure 3 represents word frequencies from the ecological sentences, computed from a document term matrix.

Classification
Machine learning, a subset of computer science, studies algorithms that acquire knowledge from data examples. Classification, a key process within this field, involves applying machine learning techniques to learn how to assign class labels to instances within a given problem domain [19]. To illustrate, consider the straightforward task of classifying traffic as either "Yes" or "No" for a specific route. Machine learning covers many kinds of classification tasks, each demanding distinct modeling approaches. However, there is no universal mapping from problems to models that identifies the most suitable approach for a given classification task. Instead, practitioners are encouraged to conduct controlled experiments to determine which algorithm and configuration yield the best results. Classification predictive modeling algorithms are assessed by examining their outcomes. Classification accuracy, which is computed from the predicted class labels, is a commonly used metric for assessing model performance. We present the classification process in this section below.

Model Training
Model training in machine learning is the procedure of fitting a mathematical model to a given dataset so that it can make precise predictions or decisions [20]. This training process involves exposing the model to labeled examples so that it can grasp the underlying patterns and relationships within the data. Throughout training, the model adjusts its internal parameters to minimize the disparity between its predictions and the actual labels. The dataset comprised a total of 342 observations with 2 variables, of which 80% was used as the training set (0.8 × 342 = 273.6, in practice rounded to a whole number of observations). Before training our model to classify ecological text, we preprocessed the text data by removing stopwords, and we also examined the term sparsity of the document term matrix, since sparsity can yield a statistical benefit for the model and makes it easier for humans to interpret. Furthermore, we present a cross-section of the document term matrix used to transform the ecological sentences in the corpus; Figure 6 represents the document term matrix. The architecture of the neural network trained on the ecological sample text is presented in Figure 10.
Figure 10. Architecture of the trained neural network

Furthermore, the overall accuracy statistics from the confusion matrix are presented below in Figure 11. This research also included a Linear Discriminant Analysis classifier in the comparative analysis of algorithm performance in the ecological classification of infectious diseases; its performance is presented in Figure 14.

Comparative Analysis of Our Model
Additionally, a tabulated comparative analysis was carried out to compare the classification algorithms used in this study and to observe how their classification accuracy varies; the results are presented in Table 1. Table 1 shows that, based on cross-validation accuracy, XGBoost has the highest percentage accuracy compared with all the other classification models.
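The cross-validation accuracy used for this comparison can be sketched generically as below. The source does not state the number of folds, so k = 5 is an assumption, and the majority-class baseline stands in for any of the eight classifiers. The `fit` and `predict` callables, the helper names, and the toy data are all hypothetical; folds are contiguous, so real data should be shuffled first.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(fit, predict, rows, labels, k=5):
    """Mean accuracy over k folds for any model given as fit/predict callables."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority(rows, labels):
    """Baseline 'model': remember the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    """Always predict the remembered label, ignoring the row."""
    return model


rows = [[0]] * 10
labels = ["Malaria"] * 7 + ["Cholera"] * 3
score = cross_val_accuracy(fit_majority, predict_majority, rows, labels, k=5)
print(score)
```

Running the same function with each classifier's `fit` and `predict` on identical folds gives accuracies that can be compared directly in a table.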

Figure 1. Framework for NLP-Driven Classification of Infectious Diseases

Figure 2. Ecological factors count based on disease.

Figure 3. Word frequency based on ecological sentences

Figure 4. CM of infectious disease

Correlation analysis was performed on the text-based classes of infectious disease in order to determine the level of correlation among the various disease classes, using the correlation matrix (CM) depicted in Figure 4. Furthermore, text visualization was carried out using word clouds to detect the most frequent ecological words present in the sentences. Figure 5 represents ecological factors from sentences that describe infectious diseases.

Figure 5. Word cloud for ecological factors in infectious disease.

Figure 6. Document Matrix.

Machine Learning Trained Infectious Disease Classifiers (XGBoost, Random Forest, Support Vector Machine, Artificial Neural Network, Decision Tree, Gradient Boosting Algorithm, Linear Discriminant Analysis, K-Nearest Neighbor)
In this research, we carried out a comparative analysis of eight machine learning classifiers trained on document term matrices built from the NLP data. We present the comparative performance below. The result of training the data with the XGBoost algorithm is presented in Figure 7.

Figure 7. XGBoost Training Results

The result of training the data with the Random Forest algorithm is presented in Figure 8.

Figure 8. Random Forest Training Results

Figure 11. Neural Network Accuracy

Furthermore, this research included a decision tree classifier in the comparative analysis of algorithm performance in the ecological classification of infectious diseases; its performance is presented in Figure 12.

Figure 14. Linear Discriminant Analysis Accuracy

This research also included a KNN classifier in the comparative analysis of algorithm performance in the ecological classification of infectious diseases; its performance is presented in Figure 15.

Model Deployment
With the aid of the DistilBERT model and the XGBoost algorithm, we deployed the model for real-time accessibility and classification of infectious diseases. We present the final Application Programming Interface for ecological infectious disease classification, where a user can describe the ecological factors within his or her environment and click the classifier button, which then classifies the disease with a confidence interval among all the disease classes. Figure 16 shows this Ecological Classification Interface, which indicates the kind of infectious disease to which the described ecological risk factors belong.

Figure 16. Ecological Classification Interface

Furthermore, we present the model interpretability results for the classification of infectious diseases, showing how each decision was made and the criteria used in the decision making, as depicted in Figure 17.