Thursday, July 9, 2020
Hadoop 3
What's New in Hadoop 3.0 – Enhancements in Apache Hadoop 3
Last updated on May 22, 2019 | 29.4K Views
By Shubham Sinha, a Big Data and Hadoop expert working as a Research Analyst at Edureka.
This What's New in Hadoop 3.0 blog focuses on the changes expected in Hadoop 3, as it is still in the alpha phase. The Apache community has incorporated many changes and is still working on some of them, so we will take a broader look at the expected changes. The major changes that we will discuss are:

1. Minimum required Java version raised from 7 to 8
2. Support for erasure coding in HDFS
3. YARN Timeline Service v.2
4. Shell script rewrite
5. Shaded client jars
6. Support for opportunistic containers
7. MapReduce task-level native optimization
8. Support for more than 2 NameNodes
9. Changed default ports for multiple services
10. New filesystem connectors
11. Intra-DataNode balancer
12. Reworked daemon and task heap management

Apache Hadoop 3 is going to incorporate a number of enhancements over the Hadoop 2.x line. So, let us move ahead and look at each of the enhancements.

1. Minimum Required Java Version in Hadoop 3 is Increased from 7 to 8

In Hadoop 3, all Hadoop JARs are compiled targeting a runtime version of Java 8. Users who are still on Java 7 or below will have to upgrade to Java 8 when they start working with Hadoop 3.

Now let us discuss one of the important enhancements of Hadoop 3, i.e. erasure coding, which reduces storage overhead while providing the same level of fault tolerance as before.

2. Support for Erasure Coding in HDFS

So let us first understand what erasure coding is.
Generally, in storage systems, erasure coding is most commonly used in Redundant Array of Inexpensive Disks (RAID). RAID implements EC through striping, in which logically sequential data (such as a file) is divided into smaller units (such as bits, bytes, or blocks) and consecutive units are stored on different disks. Then, for each stripe of original data cells, a certain number of parity cells are calculated and stored; this process is called encoding. An error in any stripe cell can be recovered through a decoding calculation based on the surviving data and parity cells.

Now that we have an idea of erasure coding, let us go through the earlier replication scenario in Hadoop 2.x. The default replication factor in HDFS is 3: one original data block plus two replicas, each of which adds 100% storage overhead. That makes 200% storage overhead, and the replicas also consume other resources such as network bandwidth. Moreover, the replicas of cold datasets with low I/O activity are rarely accessed during normal operations, yet they consume the same amount of resources as the original dataset.

Erasure coding stores the data and provides fault tolerance with less space overhead than HDFS replication. Erasure Coding (EC) can be used in place of replication, providing the same level of fault tolerance with less storage overhead, so integrating EC with HDFS maintains fault tolerance while improving storage efficiency. As an example, a 3x-replicated file with 6 blocks consumes 6 × 3 = 18 blocks of disk space, but with an EC (6 data, 3 parity) deployment it consumes only 9 blocks (6 data blocks + 3 parity blocks), a storage overhead of only 50%. Since erasure coding incurs additional overhead during reconstruction of the data, due to the remote reads it performs, it is generally used for storing less frequently accessed data.
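The parity idea and the 18-vs-9-block arithmetic above can be sketched in a few lines. This is an illustration only: it uses a single XOR parity cell per stripe (RAID-5 style), whereas HDFS EC uses Reed-Solomon codes such as RS(6,3) that tolerate multiple simultaneous failures, and the function names here are ours, not HDFS APIs.

```python
# Toy erasure coding with one XOR parity cell (RAID-5 style).
from functools import reduce

def encode(data_cells):
    """Compute a single parity cell as the XOR of all data cells."""
    return reduce(lambda a, b: a ^ b, data_cells)

def recover(surviving_cells, parity):
    """Rebuild the one missing data cell from the survivors + parity."""
    return reduce(lambda a, b: a ^ b, surviving_cells, parity)

stripe = [0b1010, 0b0111, 0b1100]   # three data cells of one stripe
parity = encode(stripe)             # stored alongside the data cells

# Lose the middle cell, then reconstruct it by decoding.
lost = stripe[1]
survivors = [stripe[0], stripe[2]]
assert recover(survivors, parity) == lost

# Storage-overhead comparison from the text: a 6-block file under
# 3x replication vs. an EC (6 data, 3 parity) layout.
blocks = 6
replicated = blocks * 3             # 18 blocks on disk (200% overhead)
ec = blocks + 3                     # 9 blocks on disk (50% overhead)
print(replicated, ec)               # 18 9
```

The same decoding calculation runs whichever cell is lost, which is why any single striping-cell error is recoverable here; Reed-Solomon generalizes this to several parity cells.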
Before deploying erasure coding, users should consider all of its overheads: storage, network, and CPU. To support erasure coding effectively, HDFS made some changes to its architecture. Let us take a look at them.

HDFS Erasure Coding: Architecture

NameNode Extensions: HDFS files are striped into block groups, each containing a certain number of internal blocks. To reduce the NameNode memory consumed by these additional blocks, a new hierarchical block naming protocol was introduced: the ID of a block group can be deduced from the ID of any of its internal blocks. This allows management at the level of the block group rather than the individual block.

Client Extensions: Since the NameNode now works at the block-group level, the client read and write paths were enhanced to work on multiple internal blocks in a block group in parallel. On the output/write path, DFSStripedOutputStream manages a set of data streamers, one for each DataNode storing an internal block of the current block group, while a coordinator takes charge of operations on the entire block group, including ending the current block group, allocating a new block group, and so on. On the input/read path, DFSStripedInputStream translates a requested logical byte range into ranges over the internal blocks stored on DataNodes, then issues the read requests in parallel. Upon failure, it issues additional read requests for decoding.

DataNode Extensions: The DataNode runs an additional ErasureCodingWorker (ECWorker) task for background recovery of failed erasure-coded blocks. Failed EC blocks are detected by the NameNode, which then chooses a DataNode to do the recovery work. Reconstruction performs three key tasks. First, the data is read from source nodes, fetching only the minimum number of input data and parity blocks needed for reconstruction. Second, new data and parity blocks are decoded from the input data.
All missing data and parity blocks are decoded together. Third, once decoding is finished, the recovered blocks are transferred to the target DataNodes.

Erasure Coding Policies: To accommodate heterogeneous workloads, files and directories in an HDFS cluster are allowed to have different replication and EC policies. Information about how to encode and decode a file is encapsulated in an ErasureCodingPolicy class, which contains two pieces of information: the ECSchema and the size of a striping cell.

The next important enhancement in Hadoop 3 is YARN Timeline Service version 2, replacing version 1 from Hadoop 2.x; it brings many welcome changes.

3. YARN Timeline Service v.2

Hadoop is introducing a major revision of the YARN Timeline Service, v.2, developed to address two major challenges: improving the scalability and reliability of the Timeline Service, and enhancing usability by introducing flows and aggregation. Developers can try YARN Timeline Service v.2 and provide feedback and suggestions, but it should be used only in a test capacity; note that security is not yet enabled in it. So let us first discuss scalability, and then flows and aggregation.

YARN Timeline Service v.2: Scalability

Version 1 is limited to a single instance of writer/reader and does not scale well beyond small clusters. Version 2 uses a more scalable distributed writer architecture and a scalable backend storage. It separates the collection (writes) of data from the serving (reads) of data, using distributed collectors, essentially one collector for each YARN application.
The readers are separate instances dedicated to serving queries via a REST API. YARN Timeline Service v.2 chooses Apache HBase as the primary backing storage, as HBase scales well to large sizes while maintaining good response times for reads and writes.

YARN Timeline Service v.2: Usability Improvements

In many cases, users are interested in information at the level of flows, i.e. logical groups of YARN applications, since it is common to launch a set or series of YARN applications to complete one logical application. Timeline Service v.2 supports the notion of flows explicitly, and it also supports aggregating metrics at the flow level. Now let us look at how YARN Timeline Service v.2 works at the architectural level.

YARN Timeline Service v.2: Architecture

YARN Timeline Service v.2 uses a set of collectors (writers) to write data to the backend storage. The collectors are distributed and co-located with the application masters to which they are dedicated. All data belonging to an application is sent to its application-level timeline collector, with the exception of the data written by the resource manager's own timeline collector. For a given application, the application master writes the application's data to the co-located timeline collector, and node managers on other nodes that are running the application's containers also write data to the timeline collector on the node running the application master. The resource manager maintains its own timeline collector as well, emitting only YARN-generic life-cycle events to keep its volume of writes reasonable. The timeline readers are daemons separate from the timeline collectors, and they are dedicated to serving queries via the REST API.

4. Shell Script Rewrite

The Hadoop shell scripts have been rewritten to fix many bugs, resolve compatibility issues, and change some long-standing installation behavior.
They also incorporate some new features; I will list some of the important ones:

All Hadoop shell script subsystems now execute hadoop-env.sh, which allows all of the environment variables to be kept in one location.

Daemonization has been moved from the *-daemon.sh scripts into the bin commands via the --daemon option. In Hadoop 3 we can simply use --daemon start to start a daemon, --daemon stop to stop one, and --daemon status to set $? to the daemon's status. For example: hdfs --daemon start namenode.

Operations that trigger ssh connections can now use pdsh, if it is installed.

${HADOOP_CONF_DIR} is now properly honored everywhere, without requiring symlinking and other such tricks.

Scripts now test the various states of the log and pid directories on daemon startup and report better error messages; previously, unprotected shell errors would be displayed to the user.

There are many more features that you will see once Hadoop 3 reaches the beta phase. Now let us discuss the shaded client jars and their benefits.

5. Shaded Client Jars

The hadoop-client artifact available in Hadoop 2.x releases pulls Hadoop's transitive dependencies onto a Hadoop application's classpath. This can create a problem if the versions of these transitive dependencies conflict with the versions used by the application. So Hadoop 3 introduces new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop's dependencies into a single jar. hadoop-client-api is compile scope, while hadoop-client-runtime is runtime scope and contains the relocated third-party dependencies from hadoop-client. This lets you bundle the dependencies into a jar and test the whole jar for version conflicts, and it avoids leaking Hadoop's dependencies onto the application's classpath. For example, HBase can use these artifacts to talk to a Hadoop cluster without seeing any of Hadoop's implementation dependencies.
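As an illustration of how a build might consume the shaded artifacts just described, here is a Maven dependency sketch; the artifact IDs and scopes follow the description above, but the version shown is only an example and should be replaced with your Hadoop 3 release.

```xml
<!-- Compile against Hadoop's public API only -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.0.0</version>
  <scope>compile</scope>
</dependency>
<!-- Shaded, relocated third-party dependencies, needed only at runtime -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.0.0</version>
  <scope>runtime</scope>
</dependency>
```

Because the runtime jar relocates third-party classes, the application's own dependency versions no longer clash with Hadoop's.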
6. Support for Opportunistic Containers and Distributed Scheduling

A new ExecutionType has been introduced: opportunistic containers. These containers can be dispatched for execution at a NodeManager even if there are no resources available at the moment of scheduling; in such a case, they are queued at the NM, waiting for resources to become available. Opportunistic containers have lower priority than the default guaranteed containers and are therefore preempted, if needed, to make room for guaranteed containers. This should improve cluster utilization.

Guaranteed containers correspond to the existing YARN containers. They are allocated by the Capacity Scheduler, and once dispatched to a node, it is guaranteed that resources are available for their execution to start immediately. Moreover, these containers run to completion as long as there are no failures.

Opportunistic containers are allocated by the central RM by default, but support has also been added for allocating them through a distributed scheduler, which is implemented as an AMRMProtocol interceptor. Now, moving ahead, let us take a look at how MapReduce performance has been optimized.

7. MapReduce Task-Level Native Optimization

In Hadoop 3, a native implementation of the map output collector has been added to MapReduce. For shuffle-intensive jobs, this can improve performance by 30% or more. The work adds native optimization for MapTask based on JNI: the basic idea is to add a NativeMapOutputCollector to handle the key/value pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code. Work on the merge code is still in progress.
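A job would typically opt into the native collector through its configuration; a sketch follows. The property name and delegator class come from Hadoop's native-task module (MAPREDUCE-2841), but verify them against your distribution before relying on them.

```xml
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator</value>
</property>
```

Jobs whose key/value types are supported by the native code get the native sort/spill path; others fall back to the Java collector.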
8. Support for More than 2 NameNodes

In Hadoop 2.x, the HDFS NameNode high-availability architecture has a single active NameNode and a single standby NameNode. By replicating edits to a quorum of three JournalNodes, this architecture can tolerate the failure of any one NameNode. However, business-critical deployments require higher degrees of fault tolerance, so Hadoop 3 allows users to run multiple standby NameNodes. For instance, by configuring three NameNodes (one active and two standby) and five JournalNodes, the cluster can tolerate the failure of two NameNodes.

Next, we will look at the default ports of Hadoop services that have changed in Hadoop 3.

9. Default Ports of Multiple Services have been Changed

Earlier, the default ports of multiple Hadoop services fell in the Linux ephemeral port range (32768-61000), the range from which the OS assigns a port when a client program does not explicitly request a specific port number. As a result, at startup a service would sometimes fail to bind to its port due to a conflict with another application. In Hadoop 3, these conflicting default ports have been moved out of the ephemeral range, affecting the port numbers of multiple services, including the NameNode, Secondary NameNode, and DataNode. A few more changes are expected. Now, moving on, let us look at the new Hadoop 3 filesystem connectors.

10. Support for More Filesystem Connectors

Hadoop now supports integration with Microsoft Azure Data Lake and the Aliyun Object Storage System, which can be used as alternative Hadoop-compatible filesystems. Microsoft Azure Data Lake support was added first, Aliyun Object Storage System support followed, and more connectors may be expected.

Let us now understand how the balancer has been improved to work across the multiple disks within a DataNode.

11. Intra-DataNode Balancer

A single DataNode manages multiple disks. During normal write operations, data is divided evenly among the disks, so they fill up evenly.
But adding or replacing disks leads to skew within a DataNode, a situation the existing HDFS balancer, which only balances across DataNodes, did not handle. Hadoop 3 handles this intra-DataNode skew with new balancing functionality, which is invoked via the hdfs diskbalancer CLI.

Now let us take a look at the changes to heap management.

12. Reworked Daemon and Task Heap Management

A series of changes has been made to heap management for Hadoop daemons as well as MapReduce tasks:

There are new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated; in its place, HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN have been introduced to set Xmx and Xms, respectively.

All global and daemon-specific heap size variables now support units. If a variable is only a number, the size is assumed to be in megabytes.

The configuration of map and reduce task heap sizes has been simplified, so the desired heap size no longer needs to be specified both in the task configuration and as a Java option. Existing configs that already specify both are not affected by this change.

I hope this blog was informative and added value for you. The Apache community is still working on multiple enhancements which might land before the beta phase. We will keep you updated and come up with more blogs and videos on Hadoop 3.

Now that you know the expected changes in Hadoop 3, check out the Hadoop training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. The Edureka Big Data Hadoop Certification Training course helps learners become experts in HDFS, YARN, MapReduce, Pig, Hive, HBase, Oozie, Flume, and Sqoop, using real-time use cases from the retail, social media, aviation, tourism, and finance domains. Got a question for us?
Please mention it in the comments section and we will get back to you.