Hadoop: A Framework for Data-Intensive Distributed Computing

Hadoop Infrastructure
- Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- Hadoop is an open-source implementation of Google MapReduce
- Hadoop is based on a simple programming model called MapReduce
- Hadoop is based on a simple data model: any data will fit
- The Hadoop framework consists of two main layers:
  - Distributed file system (HDFS)
  - Execution engine (MapReduce)

Hadoop Infrastructure (cont'd)
- Hadoop is a distributed system, like distributed databases
- However, there are several key differences between the two infrastructures:
  - Data model
  - Computing model
  - Cost model
  - Design objectives

How Data Model is Different?

Distributed Databases
- Deal with tables and relations
- Must have a schema for the data
- Data fragmentation & partitioning

Hadoop
- Deals with flat files in any format
- No schema for the data
- Files are divided automatically into blocks

How Computing Model is Different?

Distributed Databases
- Notion of a transaction
- Transaction properties: ACID
- Distributed transactions

Hadoop
- Notion of a job divided into tasks
- Map-Reduce computing model
- Every task is either a map or a reduce

Hadoop: Big Picture
(Figure: layered architecture — distributed file system, execution engine, high-level languages, distributed light-weight DB, and a centralized tool for coordination)
- HDFS + MapReduce are enough to have things working

What is Next?
- Hadoop Distributed File System (HDFS)
- MapReduce Layer
- Examples: Word Count, Join
- Fault Tolerance in Hadoop
HDFS: Hadoop Distributed File System
- Single namenode and many datanodes
- The namenode maintains the file system metadata
- Files are split into fixed-size blocks and stored on datanodes (default 64 MB)
- Data blocks are replicated for fault tolerance and fast access (default replication is 3)
- Datanodes periodically send heartbeats to the namenode
- HDFS is a master-slave architecture
  - Master: namenode
  - Slaves: datanodes (100s or 1000s of nodes)

HDFS: Data Placement and Replication
- Datanodes can be organized into racks
- Default placement policy: where to put a given block?
  - The first copy is written to the node creating the file (write affinity)
  - The second copy is written to a datanode within the same rack
  - The third copy is written to a datanode in a different rack
- Objectives: load balancing, fast access, fault tolerance
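The namenode's role as the metadata service can be seen from a client program: to read a file, a client first asks the namenode which datanodes hold each block. Below is a minimal sketch of that lookup through Hadoop's Java FileSystem API; it assumes a reachable cluster configured via core-site.xml/hdfs-site.xml, and the path /data/users.txt is purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: ask the namenode where the blocks of one HDFS file live.
public class BlockReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // client handle to HDFS
    Path file = new Path("/data/users.txt");    // illustrative path

    FileStatus status = fs.getFileStatus(file);
    // One BlockLocation per block; each lists the datanodes holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}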
What is Next?
- Hadoop Distributed File System (HDFS)
- MapReduce Layer
- Examples: Word Count, Join
- Fault Tolerance in Hadoop

MapReduce: Hadoop Execution Layer
- MapReduce is a master-slave architecture
  - Master: JobTracker
  - Slaves: TaskTrackers (100s or 1000s of tasktrackers)
- Every datanode runs a tasktracker
- The JobTracker knows everything about the submitted jobs
  - Divides jobs into tasks and decides where to run each task
  - Continuously communicates with the tasktrackers
- TaskTrackers execute tasks (multiple per node)
  - Monitor the execution of each task
  - Continuously send feedback to the JobTracker
Hadoop Computing Model
- Two main phases: Map and Reduce
- Any job is converted into map and reduce tasks
- Developers need ONLY to implement the Map and Reduce classes
- Data flow (figure): blocks of the input file in HDFS -> map tasks (one for each block) -> shuffling and sorting -> reduce tasks -> output written to HDFS

Hadoop Computing Model (Cont'd)
- Mappers and Reducers consume and produce (Key, Value) pairs
- Users define the data types of the Key and the Value
- Shuffling & Sorting phase:
  - Map output is shuffled such that all same-key records go to the same reducer
  - Each reducer may receive multiple key sets
  - Each reducer sorts its records to group similar keys, then processes each group
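As a minimal sketch of this (Key, Value) contract (assuming the org.apache.hadoop.mapreduce Java API), the user-chosen types appear directly as the generic parameters of the Mapper and Reducer classes; the concrete Writable types below are only an example, and the concrete logic is filled in by the word-count example later.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Type parameters read <input key, input value, output key, output value>.
// The mapper's output types must match the reducer's input types, because the
// shuffle delivers all map output that shares a key to the same reduce call.
class SkeletonMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // parse `value` and emit zero or more (Text, IntWritable) pairs
    // via context.write(outputKey, outputValue)
  }
}

class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // all values that share `key` arrive together, already grouped by the
    // shuffling & sorting phase; aggregate them and write the result
  }
}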
What is Next?
- Hadoop Distributed File System (HDFS)
- MapReduce Layer
- Examples: Word Count, Join
- Fault Tolerance in Hadoop

Word Count Job: Count the occurrences of each word in a dataset
(Figure: input blocks feed the map tasks; the shuffled output feeds the reduce tasks)
- The reduce phase is optional: jobs can be map-only
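A compact word-count implementation matching the picture above, in the style of the standard Hadoop example (class and variable names here are illustrative, not taken from the original slides):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: called once per input record (here, one line of text); emits (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // shuffled so that equal words meet at one reducer
    }
  }
}

// Reduce: called once per distinct word with all of its 1s; sums them up.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}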
Joining Two Large Datasets
(Figure: blocks of Dataset A and Dataset B in HDFS, with different join keys, feed Mapper 1 ... Mapper M+N; replicas are not shown)
- Each mapper processes one block (split)
- Each mapper produces the join key and the record pairs
- Shuffling and sorting over the network
- Reducers 1..N perform the actual join
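A sketch of that reduce-side join, assuming both datasets are tab-separated text files whose first field is the join key and that they live under the hypothetical HDFS directories /data/A and /data/B; tagging each record with its source is one common way to let the reducer tell the two inputs apart.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map side: emit (join key, tagged record). The tag says which dataset the
// record came from, so the reducer can pair A-records with B-records.
class ReduceSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", 2);
    String joinKey = fields[0];
    String rest = fields.length > 1 ? fields[1] : "";
    // Derive the tag from the input path; /data/A and /data/B are hypothetical.
    String path = ((FileSplit) context.getInputSplit()).getPath().toString();
    String tag = path.contains("/data/A/") ? "A" : "B";
    context.write(new Text(joinKey), new Text(tag + "|" + rest));
  }
}

// Reduce side: all records sharing a join key arrive at the same reducer;
// separate them by tag and emit their combinations, i.e. the actual join.
class ReduceSideJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    List<String> fromA = new ArrayList<String>();
    List<String> fromB = new ArrayList<String>();
    for (Text r : records) {
      String s = r.toString();
      if (s.startsWith("A|")) { fromA.add(s.substring(2)); }
      else                    { fromB.add(s.substring(2)); }
    }
    for (String a : fromA) {
      for (String b : fromB) {
        context.write(key, new Text(a + "\t" + b));
      }
    }
  }
}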
Joining Large Dataset (A) with Small Dataset (B)
(Figure: blocks of Dataset A in HDFS, with different join keys, feed Mapper 1 ... Mapper N; every mapper also reads the entire Dataset B; replicas are not shown)
- Every map task processes one block from A and the entire B
- Every map task performs the join (map-only job)
- Avoids the expensive shuffling and reduce phases
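One common way to realize this map-only join is to have every mapper load the small dataset B into memory during setup() and probe it while scanning its block of A. The sketch below reads B straight from HDFS; the paths and the two-column record format are assumptions, and in practice B is often shipped to the mappers via the distributed cache instead.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only join: every map task reads one block of the large dataset A and
// joins it against the entire small dataset B, which it loads into memory
// once in setup(). No shuffle and no reduce phase are needed.
class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> small = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    Path bPath = new Path("/data/B/part-00000");   // location of B is an assumption
    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(bPath)))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split("\t", 2);          // (join key, rest of record)
        small.put(f[0], f.length > 1 ? f[1] : "");
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] f = line.toString().split("\t", 2);   // record from the large dataset A
    String match = small.get(f[0]);
    if (match != null) {                           // inner join on the first field
      context.write(new Text(f[0]), new Text((f.length > 1 ? f[1] : "") + "\t" + match));
    }
  }
}

In the driver, the reduce phase is switched off with job.setNumReduceTasks(0), so each mapper writes its joined records directly to HDFS.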
What is Next?
- Hadoop Distributed File System (HDFS)
- MapReduce Layer
- Examples: Word Count, Join
- Fault Tolerance in Hadoop
Hadoop Fault Tolerance
- Intermediate data between the mappers and reducers are materialized, which allows for simple and straightforward fault tolerance
- What if a task fails (map or reduce)?
  - The tasktracker detects the failure
  - It sends a message to the jobtracker
  - The jobtracker re-schedules the task
- What if a datanode fails?
  - Both the namenode and the jobtracker detect the failure
  - All tasks on the failed node are re-scheduled
  - The namenode replicates the user's data to another node
- What if the namenode or the jobtracker fails?
  - The entire cluster is down
(Figure: intermediate data materialized between the map and reduce phases)

Reading/Writing Files
- Recall: any data will fit in Hadoop, so how does Hadoop understand/read the data?
- User-pluggable class: "Input Format"
  - Input formats know how to parse and read the data (convert a byte stream into records)
  - Each record is then passed to the mapper for processing
- Hadoop provides built-in input formats for reading text and sequence files
- The same applies for writing: "Output Formats"
- Input formats can do a lot of magic to change the job behavior
(Figure: the Input Format sits between the HDFS data and the map code)
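In the Java API the pluggable formats are set on the job object; a minimal sketch (assuming the Hadoop 2.x mapreduce API and the built-in text and sequence-file formats mentioned above):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Selecting how bytes in HDFS are parsed into records. TextInputFormat (the
// default) turns each line of a text file into a (byte offset, line) record;
// SequenceFileInputFormat reads Hadoop's binary (key, value) container files.
// Output formats are the mirror image: how (key, value) pairs are written back.
class FormatConfig {
  static void configure(Job job, boolean binaryInput) {
    if (binaryInput) {
      job.setInputFormatClass(SequenceFileInputFormat.class);
    } else {
      job.setInputFormatClass(TextInputFormat.class);
    }
    job.setOutputFormatClass(TextOutputFormat.class);
  }
}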
Back to Joining Large & Small Datasets
(Figure: blocks of Dataset A and Dataset B in HDFS, with different join keys, feed Mapper 1 ... Mapper N; replicas are not shown)
- Every map task processes one block from A and the entire B
- How does a single mapper read multiple splits (from different datasets)? Customized input formats
Using Hadoop
- Java language
- High-level languages:
  - Hive (Facebook)
  - Pig (Yahoo)
  - Jaql (IBM)

Java Code Example
(Figure: a screenshot of the Java code, annotated with "Import Hadoop libs", "Map class", "Reduce class", and "Job configuration")
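The code in the screenshot is not recoverable here, so as a stand-in the sketch below shows what such a driver typically looks like for the word-count mapper and reducer from earlier, assuming the Hadoop 2.x mapreduce API; input and output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Job configuration ("driver"): wires the Mapper and Reducer classes together,
// declares the output (key, value) types, and points the job at HDFS paths.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);      // from the word-count example above
    job.setReducerClass(WordCountReducer.class);
    job.setCombinerClass(WordCountReducer.class);   // optional local pre-aggregation

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}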
Hive Language
- High-level language on top of Hadoop
- Like SQL on top of DBMSs
- Supports structured data, e.g., creating tables, as well as extensibility for unstructured data

CREATE TABLE user (userID INT, age INT, gender CHAR)
ROW FORMAT DELIMITED FIELDS;

LOAD DATA LOCAL INPATH '/user/local/users.txt' INTO TABLE user;

From Hive To MapReduce
(Figure: how a Hive query is compiled into map and reduce tasks)

Hive: Group By
(Figure: the MapReduce plan for a group-by query)
Summary
- Hadoop is a distributed system for processing large-scale datasets
- Scales to thousands of nodes and petabytes of data
- Two main layers:
  - HDFS: distributed file system (the NameNode is centralized)
  - MapReduce: execution engine (the JobTracker is centralized)
- Simple data model: any format will fit
  - At query time, specify how to read (write) the data using input (output) formats
- Simple computation model based on the Map-Reduce phases
  - Very efficient in aggregation and joins
- Higher-level languages on top of Hadoop: Hive, Jaql, Pig

Summary: Hadoop vs. Other Systems

Distributed Databases
- Computing model: notion of transactions; the transaction is the unit of work; ACID properties; concurrency control
- Data model: structured data with a known schema; read/write mode
- Cost model: expensive servers
- Fault tolerance: failures are rare; recovery mechanisms
- Key characteristics: efficiency, optimizations, fine-tuning

Hadoop
- Computing model: notion of jobs; the job is the unit of work; no concurrency control
- Data model: any data will fit, in any format: (un)structured, semi-structured
- Cost model: cheap commodity machines
- Fault tolerance: failures are common over thousands of machines
- Key characteristics: scalability, flexibility, fault tolerance

Cloud Computing
- A computing model where any computing infrastructure can run on the cloud
- Hardware & software are provided as remote services
- Elastic: grows and shrinks based on the user's demand
- Example: Amazon EC2