Big Data and Hadoop Tutorial 3: Hadoop1 vs Hadoop2
In the last two tutorials we covered some basic facts about big data and an introduction to Hadoop. In the previous tutorial I mentioned two versions of Hadoop, Hadoop1 and Hadoop2, but we didn’t look at them in detail. In this article we will compare Hadoop1 and Hadoop2, because understanding the difference is necessary for the rest of this tutorial series.
The first factor on which we can compare them is their components. Hadoop1 consists of HDFS and MapReduce. Hadoop2 consists of HDFS and YARN/MRv2. HDFS is the Hadoop Distributed File System, which is used for storage. MapReduce is a programming framework for processing data. YARN stands for Yet Another Resource Negotiator. The components of Hadoop1 and Hadoop2 work differently, and we will learn more about that later in this article.
The next factor of comparison is the daemons. Hadoop1 has the daemons NameNode, DataNode, Secondary NameNode, JobTracker and TaskTracker. Hadoop2 has the daemons NameNode, DataNode, Secondary NameNode, ResourceManager and NodeManager. Comparing the two lists, you can see that two daemons of Hadoop1, JobTracker and TaskTracker, are replaced in Hadoop2 by ResourceManager and NodeManager. We will learn about these daemons in the next article.
You might be wondering why we are learning this theory instead of jumping directly into practical work. The reason is that you need to know how the different Hadoop versions function before you work with Hadoop in practice.
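For a quick way to see which daemons changed, the two daemon lists above can be compared as sets. This is only a small Python sketch: the names are written in their usual one-word form (JobTracker, TaskTracker, ResourceManager, NodeManager), and the helper `compare_daemons` is a hypothetical name made up for this illustration.

```python
# Daemon names as listed in this tutorial for each Hadoop version.
HADOOP1_DAEMONS = {"NameNode", "DataNode", "Secondary NameNode",
                   "JobTracker", "TaskTracker"}
HADOOP2_DAEMONS = {"NameNode", "DataNode", "Secondary NameNode",
                   "ResourceManager", "NodeManager"}


def compare_daemons(old, new):
    """Return (unchanged, removed, added) daemon names, each as a sorted list."""
    unchanged = sorted(old & new)  # present in both versions
    removed = sorted(old - new)    # present only in the older version
    added = sorted(new - old)      # present only in the newer version
    return unchanged, removed, added


unchanged, removed, added = compare_daemons(HADOOP1_DAEMONS, HADOOP2_DAEMONS)
print("Unchanged:", unchanged)
print("Only in Hadoop1:", removed)
print("Only in Hadoop2:", added)
```

The storage daemons stay the same in both versions, while the two processing daemons are the ones that differ.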
Let’s discuss how they work. In Hadoop1, HDFS stores the data, and MapReduce handles both resource management and data processing. In Hadoop2, HDFS still handles storage, but resource management is performed by YARN. What is resource management? When we run a task, it needs resources such as memory and CPU. Allocating these resources to tasks and releasing them afterwards is the job of YARN. In Hadoop2, MapReduce runs on top of both of them and handles only data processing.
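To make the idea of allocating and deallocating resources more concrete, here is a minimal Python sketch. It is not the YARN API: the class `ToyResourceManager`, its methods and the memory figures are all hypothetical, and it tracks only memory to keep the example short.

```python
class ToyResourceManager:
    """Hypothetical, simplified stand-in for a resource manager (not the YARN API)."""

    def __init__(self, total_memory_mb):
        self.total_memory_mb = total_memory_mb
        self.allocations = {}  # task name -> memory currently reserved for it

    def available_mb(self):
        """Memory that is not reserved by any running task."""
        return self.total_memory_mb - sum(self.allocations.values())

    def allocate(self, task, memory_mb):
        """Reserve memory for a task; return False if it does not fit right now."""
        if task in self.allocations or memory_mb > self.available_mb():
            return False
        self.allocations[task] = memory_mb
        return True

    def release(self, task):
        """Give a finished task's memory back to the pool."""
        self.allocations.pop(task, None)


rm = ToyResourceManager(total_memory_mb=4096)
print(rm.allocate("task-a", 3072))  # fits into the free 4096 MB
print(rm.allocate("task-b", 2048))  # does not fit while task-a holds 3072 MB
rm.release("task-a")
print(rm.allocate("task-b", 2048))  # fits once task-a has released its memory
```

The point of the sketch is the separation of concerns: the manager only keeps track of who holds which resources, and the tasks themselves do the data processing.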
Now let’s look at the limitations of Hadoop1. Compared to Hadoop2, Hadoop1 has a single point of failure, lower resource utilization and less scalability. In resource utilization, Hadoop1 lags behind Hadoop2. Less scalability means that you can build large clusters with Hadoop2, whereas Hadoop1 is practical only for a smaller number of nodes. For example, the largest Hadoop1 cluster was a 4200-node cluster built and used by Yahoo. Hadoop2 can scale to larger clusters easily.
Hadoop uses a master-slave architecture. In Hadoop1, if the master crashes then the whole system fails: the data in the cluster cannot be accessed, because the master controls all of the slaves. In Hadoop2, a standby master node runs alongside the active master to handle a crash of the master node. Depending on the setup, more than one standby can be configured. If the active master crashes, the standby takes over and serves the clients, so no data is lost.
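The difference between a single master and a master with a standby can be sketched in a few lines of Python. This is only an illustration: `MasterNode` and `send_request` are hypothetical names, and a real failover also has to keep the standby's state in sync with the active master, which this sketch ignores.

```python
class MasterNode:
    """Hypothetical master node that can be marked as crashed."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def send_request(masters, request):
    """Try each master in order and return the first successful response."""
    for master in masters:
        try:
            return master.handle(request)
        except ConnectionError:
            continue  # this master is down, try the next one
    raise RuntimeError("no master available: the cluster cannot serve requests")


active = MasterNode("active-master")
standby = MasterNode("standby-master")
active.alive = False  # simulate a crash of the active master

# Hadoop1-style setup: a single master, so its crash stops everything.
try:
    send_request([active], "read /data/file")
except RuntimeError as err:
    print("Single master:", err)

# Hadoop2-style setup: the standby takes over the client's request.
print("With standby:", send_request([active, standby], "read /data/file"))
```

With only one master in the list the request fails outright, while with a standby in the list the client still gets an answer.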
The Hadoop1 stack has Oozie at the top, which is a workflow scheduler: it decides which job to execute and when. Below it come Pig, Hive and Mahout, which are tools that run on top of Hadoop. They basically serve as the user interface through which you interact with the Hadoop system. Next comes MapReduce, in which you can write Java code to process data. Sqoop comes after it and is used to import and export structured data. Flume is used to import unstructured and streaming data into HDFS. In Hadoop2 the only structural difference is in the layer you interact with, namely YARN, which also supports other frameworks such as MPI and Giraph.
That is all for this article, and we will continue in the next tutorial. If you have any questions, please comment below.