Big Data and Hadoop Tutorial 2: Learning about Hadoop
In our previous article in this big data and Hadoop series, we discussed big data. In this article we will learn about Hadoop, a tool used to manage large amounts of data. First, let's define Hadoop. Hadoop is an open-source framework that allows big data to be processed in a distributed manner on clusters of commodity hardware. (We will learn what a cluster is later in this article.) Being open source means it is free to use. Hadoop can also be defined as a data management tool that uses scale-out storage: software with which you can easily store, edit, manipulate, and analyse data. We will also cover scale-out storage later on.
A Hadoop cluster describes how data is stored and managed in a systemized environment. One thing to note is that the size of the data matters when defining a Hadoop cluster. Before managing the data, you first need to estimate how much data you will analyse over the next few months. For example, suppose an organization expects to deal with 10 TB of data over the next 3 months. It might arrange 5 systems, with each system managing 2 TB of data individually. Hadoop generally works in a master-slave architecture: these 5 systems each analyse and manage their share of the data and report to a master system above them. The master decides the flow of data and makes all the decisions. Now suppose the estimate turns out to be wrong and 2 TB more data arrives, so the organization now has 12 TB in total to manage. Because Hadoop takes a distributed approach, it can easily handle this situation: a new system is simply added to the architecture as a slave, and it manages the extra 2 TB of data.
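The scale-out idea above can be sketched in a few lines of Python. This is only an illustration of how a master might place data on slaves and add a node when capacity runs out; the class and method names here are hypothetical, not part of Hadoop.

```python
# Illustrative sketch of scale-out storage, not a Hadoop API.
# A master places data on slave nodes and scales out when they fill up.

class Cluster:
    def __init__(self, capacity_per_node_tb):
        self.capacity = capacity_per_node_tb
        self.slaves = []  # each entry is the TB currently stored on that slave

    def add_slave(self):
        self.slaves.append(0)

    def store(self, tb):
        """Master decides placement: fill existing slaves, add nodes as needed."""
        while tb > 0:
            # find a slave with free space, or scale out with a new one
            target = next((i for i, used in enumerate(self.slaves)
                           if used < self.capacity), None)
            if target is None:
                self.add_slave()
                target = len(self.slaves) - 1
            free = self.capacity - self.slaves[target]
            chunk = min(tb, free)
            self.slaves[target] += chunk
            tb -= chunk

cluster = Cluster(capacity_per_node_tb=2)
cluster.store(10)   # the planned 10 TB fills 5 slaves of 2 TB each
assert len(cluster.slaves) == 5
cluster.store(2)    # the unexpected extra 2 TB simply adds a 6th slave
assert len(cluster.slaves) == 6
```

The point of the sketch is the last two lines: growing the cluster is just adding another slave, with no change to the data that is already placed.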
Two main versions of Hadoop have been engineered so far: Hadoop 1 and Hadoop 2. Now let's learn about the components of each. Hadoop 1 has two components, HDFS and MapReduce. HDFS stands for Hadoop Distributed File System and is used to store data. MapReduce is the framework that manages the processing of that data. Hadoop 2 also includes HDFS, which handles the data storage function as discussed above. Its second component is MRv2, which stands for MapReduce version 2 and is used to process the data collected and stored by HDFS. Hadoop 2 also has an extra component named YARN, which stands for Yet Another Resource Negotiator; we will discuss it in detail later.
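To get a feel for what MapReduce does, here is a toy word count written in plain Python. A real MapReduce job runs the map and reduce phases distributed across the cluster's slave nodes; this single-machine sketch only mimics the two phases.

```python
# Toy word count illustrating the MapReduce idea on a single machine.
from collections import defaultdict

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the input split
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # reduce: sum up the counts for each key (word)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "data node data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(intermediate)
# result == {"big": 2, "data": 3, "cluster": 1, "node": 1}
```

In a real cluster, each slave would run `map_phase` on its own slice of the data, and the framework would group the intermediate pairs by key before the reduce step.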
Now let's learn about the daemons of Hadoop, which vary from version to version. In Hadoop 1 we have the daemons name node, data node, secondary name node, job tracker and task tracker. Here the name node and data node handle the processes for storing big data, while the job tracker and task trackers perform the processing part. Hadoop 2 has the daemons name node, data node, secondary name node, resource manager and node manager. Here too the name node and data node are used for data storage, while the resource manager and node manager do the work of processing the stored data. The name node acts as the master in HDFS and the data nodes act as slaves. In the same way, the resource manager acts as the master and the node managers act as slaves in MapReduce or YARN. The daemons that run on the master system are called master daemons, and the daemons that run on the slave machines are called slave daemons. So the resource manager and name node run on the master machine and are master daemons, whereas the node managers and data nodes run on the slave machines and are therefore termed slave daemons.
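The daemon names above are easy to mix up between versions, so here is the same information condensed into a small lookup table (the dictionary layout is just for this article, not anything from Hadoop itself):

```python
# Summary of the Hadoop daemons described above, by version and role.
daemons = {
    "Hadoop 1": {
        "storage":    {"master": "NameNode",        "slave": "DataNode"},
        "processing": {"master": "JobTracker",      "slave": "TaskTracker"},
    },
    "Hadoop 2": {
        "storage":    {"master": "NameNode",        "slave": "DataNode"},
        "processing": {"master": "ResourceManager", "slave": "NodeManager"},
    },
}

# The storage side is unchanged between versions; only the processing
# daemons were replaced in Hadoop 2.
assert daemons["Hadoop 1"]["storage"] == daemons["Hadoop 2"]["storage"]
```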
The secondary name node is not a backup of the name node, as you might think. You might expect that if the name node crashes, the secondary name node will take over, but that is not the correct scenario. The function of the secondary name node is to periodically (typically every hour) merge the name node's log of recent changes, the edit log, into the checkpoint file named fsimage. The secondary name node runs on the master system and so is a master daemon.
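The checkpointing job of the secondary name node can be pictured with a small sketch. This is only an analogy: real HDFS stores the fsimage and edit log as binary files on disk, and the function below is hypothetical.

```python
# Illustrative sketch of checkpointing: periodically fold the queued
# namespace edits (edit log) into the fsimage snapshot, then clear the log.

def checkpoint(fsimage, edit_log):
    """Apply each queued edit to the fsimage and empty the log."""
    for op, path in edit_log:
        if op == "create":
            fsimage.add(path)
        elif op == "delete":
            fsimage.discard(path)
    edit_log.clear()
    return fsimage

fsimage = {"/data/a.txt"}                       # last snapshot of the namespace
edit_log = [("create", "/data/b.txt"),          # changes since that snapshot
            ("delete", "/data/a.txt")]
fsimage = checkpoint(fsimage, edit_log)
assert fsimage == {"/data/b.txt"}
assert edit_log == []
```

Keeping the edit log short this way is what makes a name node restart fast; it is housekeeping, not failover.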
The last topic is the modes of operation in Hadoop. There are basically three: standalone mode, pseudo-distributed mode and fully distributed mode. In standalone mode, no daemons run in the background. In pseudo-distributed mode, a single PC acts as both master and slave and performs every task. In fully distributed mode, separate systems handle the master and slave functions. Here we end this article; we will continue with further tutorials. If you have any doubt, please comment below.