Big Data Architectures and Concepts

INTRODUCTION
The use of new devices and tools (hardware and software) by companies and individuals generates data whose volume, according to statisticians, was projected to amount to 47 zettabytes in 2020 and could reach 150 zettabytes by 2025. Every day we generate vast amounts of data through the messages we send or receive, the videos we post, posts and comments on social networks, GPS signals, climate information, and many other sources. This changes the way businesses deal with data, because the data to be processed has become voluminous [19]. Every company aims to maximize its profits, and to achieve its objectives it must manage the data at its disposal well. That is why companies need to put in place and maintain a solid architecture that allows them to do so. Because this explosion of data grows year on year, processing it effectively requires appropriate processing and analysis tools. This data comes from devices connected to fixed and mobile computer networks [16], such as tablets, smartphones, and computers, and can help a company obtain information about users' locations, movements, interests, and consumption habits.

Since this work aims to process big data quickly, we cannot ignore the importance of data in the life of a company or an individual, given that its volume is increasing all the time. As Big Data is a new field, understanding its concepts required a lot of reading time, and we also had to spend a lot of time learning to use existing big data processing software such as the Hadoop platform.

Big Data is a revolution in the field of digital information processing. The term Big Data designates a significant volume of structured or unstructured data [10]. Big Data draws on internal company data, analysis of customers' data, information from online services, and consumer opinions posted on social networks. It is also linked to the development of technology, which has led to an explosion in the amount of data, making it necessary to develop the means to store and manage this huge amount of data. It is therefore defined in terms of how large masses of data can be processed and exploited optimally.

The concept of Big Data is characterized by several aspects, including the volume of data to be managed, the variety of that data (which can be structured, semi-structured, or unstructured), and the time taken to process it. Many IT managers and authorities in the sector tend to define Big Data in terms of three main characteristics: Volume, Speed, and Variety [12]. Big Data offers an opportunity to exploit immense datasets by storing and using them on distributed systems, in which the different parts of the data are stored in different places but brought together by software; in our work, that software is Hadoop. In Big Data, speed refers to the rate at which data is generated, captured, shared, and updated. Evolving technologies mean that businesses and consumers alike generate data in a short space of time, and data and results are often available in real time. To address this requirement for fast data processing, we used six virtual machines to develop our work.

JINITA Vol. 5, No. 2, December 2023 DOI: doi.org/10.35970/jinita.v5i2.1876
Regarding the volume of data to be stored, we used HDFS (Hadoop Distributed File System), one of the main components of Hadoop, which operates on the master/slave principle in a cluster where the data and services are stored on several different machines [14]. The Hadoop distributed file system is made up of [17], [18]:
• A single NameNode, which plays the role of the master, managing client access to files and performing operations such as opening, closing, and renaming files. The NameNode holds information about the data stored on the various nodes (the metadata). The application contacts the NameNode first; the NameNode indicates which nodes hold the requested data, and the data itself is then read from those nodes.
• One or more DataNodes, which act as slaves, storing data and serving read and write requests from clients, as well as creating, deleting, and replicating blocks when instructed by the NameNode.
• A Secondary NameNode, which periodically checkpoints the NameNode's metadata so that the state of the file system can be restored in the event of a NameNode failure.
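The division of roles between the NameNode (metadata) and the DataNodes (blocks) can be illustrated with a small sketch. It is a toy model, not Hadoop code: the function name `plan_blocks` and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware), and the 128 MB block size and replication factor of 3 are the usual Hadoop defaults, not settings reported in this article.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed: common HDFS default block size (128 MB)
REPLICATION = 3                 # assumed: common HDFS default replication factor


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return toy NameNode metadata: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("replication factor cannot exceed the number of DataNodes")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        # Simplified round-robin placement (hypothetical policy): each replica
        # of a block lands on a different, consecutive DataNode.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical node names
    for block, holders in plan_blocks(300 * 1024 * 1024, nodes).items():
        print(f"block {block}: {holders}")
```

With four data nodes and three replicas per block, any single DataNode can fail without a block becoming unavailable, which is the property the master/slave design relies on.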
Datasets stored in the Hadoop Distributed File System (HDFS) are processed by MapReduce. It automatically slices a dataset into data fragments of the same size [14] and then applies an algorithm to these fragments to process them at the same time on available nodes in the cluster. It provides fault tolerance in that the faulty node can be restarted or the task can be assigned to another node.
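The slicing, parallel processing, and task reassignment described above can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the Hadoop scheduler: threads stand in for cluster nodes, a `RuntimeError` stands in for a node failure, and the helper names (`split`, `run_with_retry`, `process_in_parallel`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, fragment_size):
    """Slice records into fragments of equal size (the last one may be shorter)."""
    return [records[i:i + fragment_size] for i in range(0, len(records), fragment_size)]


def run_with_retry(task, fragment, max_attempts=3):
    """Run task(fragment); after a failure, run it again, as if reassigned to another node."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(fragment)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up once every attempt has failed


def process_in_parallel(records, task, fragment_size, workers=4):
    """Apply task to every fragment concurrently and return the results in order."""
    fragments = split(records, fragment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: run_with_retry(task, f), fragments))


if __name__ == "__main__":
    print(process_in_parallel(list(range(10)), sum, fragment_size=5))
```

Retrying the same function on the same fragment is only safe because the task has no side effects; this is the reason map tasks are written as pure functions of their input.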

RELATED WORKS
Many research projects deal with Big Data, in particular those by Boumraou Kahina and Kedjar Hakim [7], HADJARI Imane, Benbachir Meriem, and Boukhatem Fatima [20], and Shravya Nethula [21]. Boumraou Kahina and Kedjar Hakim [7] implemented an interface that manages the communication and connectivity of the cluster nodes; they did not use a backup master server (Secondary NameNode) to guard against a failure of the master server, and their study of execution time involved 3 and 9 nodes. Shravya Nethula [21] used four nodes to compare the performance of the MapReduce algorithm on Compuverde shared storage (Compuverde File System, CVFS) with that of the MapReduce algorithm on HDFS. By way of comparison, the purpose of our article is to implement Hadoop in a virtual environment, with an additional backup master server (Secondary NameNode), and to analyze the improvement in performance using the MapReduce algorithm fed by two datasets of different sizes, successively with one, two, three, and then four data nodes.

BIG DATA ARCHITECTURES
Traditional database systems do not meet the requirements of Big Data processing, because they are not capable of handling massive data of various kinds in real time. To take advantage of the benefits of Big Data in business, it is necessary to push back the limits of these systems, particularly in terms of the volume of data to be analyzed, the processing speed, and the variety of data to be managed [2]. Implementing a big data architecture within an enterprise allows data sources to be processed in batch or in real time, making it possible to exploit voluminous data, to transform unstructured data into structured data that is easier to exploit, and to centralize existing data together with data from different sources in different formats. This in turn supports predictive analysis and tasks based on machine learning and artificial intelligence technologies [1]. Previously, the use of such an architecture was reserved for the major web players such as Google, Facebook, LinkedIn, and Yahoo, as it was very expensive and required the company to have a large number of data scientists, analysts, and architects [3]. It is thanks to the work carried out by Doug Cutting and his colleagues, with the support of Yahoo, that Big Data technologies were opened to all. They worked on a project called Hadoop, which was eventually adopted by a large number of operators who made it the reference platform for Big Data.

Types of Big Data Architecture

Lambda architecture
The Lambda architecture was devised by Nathan Marz to designate a generic, scalable, and fault-tolerant data processing architecture, based on his experience working on the BackType and Twitter distributed data processing systems [3], [5]. It is the architecture most commonly used for processing and managing large data in real-time and batch mode simultaneously. It allows functional separation of storage, consumption, and complex real-time processing, with the ability to store and process large volumes of data (batch) while integrating the most recent data into the results [2], [6]. It is widely used because of the properties it offers for successful processing, such as resistance to failures, fast response time during processing, scalability, and the possibility of merging block data processing (batch) with new data input (real time). The idea of this architecture is to model a real-time data processing system as a series of three layers, a batch layer, a speed layer, and a serving layer, which together give a complete view of the data [6]. The purpose of a Lambda architecture is not only to store data but also to make it available to other applications so that they can exploit it and extract value from it.
a. Batch layer
This layer takes care of storing all the data. As information keeps coming into the data system, the incoming data is stored as it is, without any derivation or transformation, i.e. in its raw form. Any new data that arrives at the batch layer is processed using MapReduce or machine learning, and the result of this processing is stored as a batch view [1], [2].

b. Speed layer (Real Time)
This layer processes only recent data and incrementally computes up-to-date views that complement the batch views; it also removes real-time views once they have become obsolete (after batch processing) [6], [15]. It supports the serving layer in reducing the latency of responses to requests. As its name suggests, the speed layer has low latency because it only processes real-time data and therefore has a lower computational load.

c. Serving Layer
This layer is used to store the views created by the batch and real-time layers and to present them to clients [1]. Tools that can be used in this layer include Apache Cassandra, MongoDB, Elasticsearch, and Couchbase.
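How the three layers cooperate can be made concrete with a minimal in-memory sketch. It assumes a simple counting workload and uses hypothetical names (`LambdaSketch`, `ingest`, `run_batch`, `query`); a real deployment would place a distributed store and processing engine behind each layer.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, speed, and serving layers for an event-counting view."""

    def __init__(self):
        self.master_dataset = []        # batch layer: raw, append-only events
        self.batch_view = Counter()     # view recomputed from the whole master dataset
        self.realtime_view = Counter()  # speed layer: events seen since the last batch run

    def ingest(self, event):
        """New data is stored raw in the batch layer and counted in the speed layer."""
        self.master_dataset.append(event)
        self.realtime_view[event] += 1

    def run_batch(self):
        """Recompute the batch view from all raw data, then drop the obsolete real-time view."""
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, key):
        """Serving layer: answer by merging the batch view with the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    system = LambdaSketch()
    for event in ["click", "view", "click"]:
        system.ingest(event)
    system.run_batch()
    system.ingest("click")  # arrives after the batch run
    print(system.query("click"))
```

The query result stays correct both before and after a batch run because each event is counted in exactly one of the two views at any time.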

Kappa architecture
The Kappa architecture is based on the principle of merging the real-time and batch layers, with all data passing through a single path using a stream processing system. It is built on a streaming architecture in which a series of incoming data is first stored in a messaging engine such as Apache Kafka. From there, a streaming engine reads the data, transforms it into an analyzable format, and stores it in an analytical database that end users can query [1], [2]. Kappa is a simplified, dedicated data processing architecture used in streaming deployment models where data sources are both batch and real-time and where end-to-end latency requirements are very strict.
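The single processing path can be sketched as follows. The sketch assumes that a Python list can stand in for the messaging engine and a `Counter` for the analytical database; the class and method names are hypothetical and no Kafka API is used.

```python
from collections import Counter


class KappaSketch:
    """Toy model of a Kappa pipeline: one append-only log, one stream job, one view."""

    def __init__(self):
        self.log = []          # append-only event log (stand-in for a message broker)
        self.view = Counter()  # analytical store queried by end users
        self.offset = 0        # index of the next unprocessed event in the log

    def publish(self, event):
        """Producers only ever append to the log."""
        self.log.append(event)

    def process(self):
        """Stream job: consume every event after the current offset and update the view."""
        for event in self.log[self.offset:]:
            self.view[event] += 1
        self.offset = len(self.log)

    def replay(self):
        """Reprocess history by re-reading the log from the start with the same code."""
        self.view = Counter()
        self.offset = 0
        self.process()


if __name__ == "__main__":
    pipeline = KappaSketch()
    for event in ["x", "y", "x"]:
        pipeline.publish(event)
    pipeline.process()
    print(dict(pipeline.view))
```

Unlike the Lambda design, there is no separate batch code path: historical recomputation is a replay of the same log through the same stream job.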

Big Data Architecture Implementation Technologies
The figure below illustrates some of the implementation technologies of the Big Data architecture. For this article, we rely on open-source technology, and in particular on Hadoop [14], [17], [18]. To implement the basic components of Hadoop while respecting the master and slave node concept of HDFS, we installed Hadoop in an Ubuntu virtual machine that we configured as the master. We then administered it remotely with ssh (Secure Shell) and copied the same version of Hadoop and all the configuration files to the five other machines using scp, a classic tool for copying files in an encrypted manner between remote computers. We stored two datasets [8], [9] in HDFS and applied parallel processing to them with MapReduce, one of the main components of Hadoop, whose main role is to retrieve large data from HDFS and then carry out parallel processing on it. It has two main functions, Map and Reduce: Map is used to decompose and map the data, while Reduce is used to combine and compute the results [22]. We used WordCount to process our datasets [8], [9]. The WordCount MapReduce algorithm counts the number of occurrences of each observation and groups similar observations from each dataset. Next, using the experimental method, we carry out a comparative study, with graphs, of the MapReduce processing times obtained successively with one, two, three, and four data nodes in our cluster for each dataset.
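The Map and Reduce roles in WordCount can be illustrated with a short single-process sketch. It is not the Hadoop WordCount job used in the experiments: the function names are hypothetical, splitting on whitespace and lowercasing are assumptions, and the `shuffle` step imitates in memory what the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, between the Map and Reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence and return a word -> count dictionary."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, both phases can be distributed over the data nodes without coordination, which is what makes the processing time sensitive to the number of nodes.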

RESULT
The figures below show the launch of the Hadoop cluster when executing a MapReduce job and using HDFS (Hadoop Distributed File System) on distributed nodes.

Processing time graphs
❖ For the first dataset, we noted the following: the results were obtained using dataset [8] with 1, 2, 3, and 4 data nodes in the cluster. We can see that as the number of data nodes increases, execution time decreases.
❖ For the second dataset, we noted the following (Figure 7): the results obtained using dataset [9] show that the execution time is smaller than the execution time for dataset [8], given the difference in size of the two datasets.
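One way to quantify the observation that execution time decreases as data nodes are added is to compute the speedup and the parallel efficiency relative to the single-node run. The sketch below is illustrative only: the function name `speedup_table` is hypothetical, and the timings in the example are placeholders, not the measurements reported in this article.

```python
def speedup_table(times_by_nodes):
    """Map node count -> (speedup, efficiency) relative to the single-node run."""
    baseline = times_by_nodes[1]
    table = {}
    for nodes, minutes in sorted(times_by_nodes.items()):
        speedup = baseline / minutes   # how many times faster than one node
        efficiency = speedup / nodes   # 1.0 would mean perfectly linear scaling
        table[nodes] = (speedup, efficiency)
    return table


if __name__ == "__main__":
    # Hypothetical timings in minutes, NOT the results of this study.
    example = {1: 12.0, 2: 8.0, 3: 6.0, 4: 5.0}
    for nodes, (speedup, efficiency) in speedup_table(example).items():
        print(f"{nodes} node(s): speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency that falls as nodes are added would indicate that coordination and data transfer overheads are absorbing part of the gain from parallelism.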

CONCLUSION
Big data architectures are based on distributed architectures, which make it possible to divide the storage and processing load of a single machine between several machines to improve speed, responsiveness, and performance. In this article, we deployed Hadoop in a Lambda architecture, focusing on the batch layer. We worked with 6 virtual machines in a virtualized environment: one played the role of NameNode, four others served as data nodes, and one acted as Secondary NameNode. For the two datasets [8], [9], we applied MapReduce processing and compared the processing times in a cluster made up of one, two, three, and then four data nodes.

Figure 3: Configuration of the hdfs-site.xml file

The NameNode communicates with the 4 data nodes and a Secondary NameNode to execute and analyze various processes that make use of MapReduce, HDFS (Hadoop Distributed File System), and YARN (Yet Another Resource Negotiator), which are core components of Hadoop.

Figure 5: Processing time in minutes for 1, 2, 3, and 4 data nodes in the cluster using the movies.csv dataset

Figure 7: Processing time in minutes for 1, 2, 3, and 4 data nodes in the cluster using the taxi.csv dataset