Hadoop Ecosystem Tutorial
Introduction to Hadoop Ecosystem
The Hadoop Ecosystem is a suite of tools and frameworks that help in processing and managing large datasets. The core of the ecosystem is the Hadoop framework, which consists of two main components:
- Hadoop Distributed File System (HDFS) - A distributed file system that stores data across multiple machines.
- MapReduce - A programming model for processing large datasets in parallel.
Key Components of Hadoop Ecosystem
Here are some of the key components of the Hadoop Ecosystem:
- HDFS - A distributed file system that provides high-throughput access to application data.
- YARN - A resource-management platform that manages compute resources in the cluster and schedules users' applications.
- MapReduce - A YARN-based system for parallel processing of large datasets.
- Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig - A high-level platform for creating MapReduce programs used with Hadoop.
- HBase - A distributed, scalable, big data store that provides random, real-time read/write access to data.
- Spark - An open-source cluster-computing framework for fast, in-memory processing of large datasets, supporting batch, interactive, and streaming workloads.
- Flume - A service for efficiently collecting, aggregating, and moving large amounts of log data.
- Sqoop - A tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
- Oozie - A workflow scheduler system to manage Hadoop jobs.
- ZooKeeper - A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HDFS (Hadoop Distributed File System)
HDFS is the primary storage system used by Hadoop applications. It provides high throughput access to application data and is suitable for applications that have large datasets.
Example: Basic HDFS Commands
Here are some basic HDFS commands:
hdfs dfs -ls /
hdfs dfs -mkdir /user/hadoop
hdfs dfs -put localfile /user/hadoop
hdfs dfs -get /user/hadoop/remote_file local_file
$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2023-01-01 12:00 /user
drwxr-xr-x   - hadoop supergroup          0 2023-01-01 12:00 /tmp
drwxr-xr-x   - hadoop supergroup          0 2023-01-01 12:00 /data
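Example: HDFS Java API
The same operations can also be performed from Java through the Hadoop FileSystem API. The sketch below assumes the cluster configuration files are on the classpath; the paths mirror the commands above.
// Minimal HDFS client sketch: list a directory and copy a local file into HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // List the contents of the root directory (like "hdfs dfs -ls /")
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // Copy a local file into HDFS (like "hdfs dfs -put localfile /user/hadoop")
        fs.copyFromLocalFile(new Path("localfile"), new Path("/user/hadoop"));

        fs.close();
    }
}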
MapReduce
MapReduce is a programming model for processing large datasets with a distributed algorithm on a cluster. It consists of a Map function that processes key-value pairs to generate a set of intermediate key-value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key.
Example: Word Count in MapReduce
Here is a simple word count example. The packaged WordCount job can be run directly from the examples jar, and the mapper source below shows how the counting works:
hadoop jar hadoop-mapreduce-examples.jar wordcount input output
// Mapper: emits a (word, 1) pair for every token in the input line
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
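The matching reducer and driver from the standard Hadoop WordCount example are sketched below; both are assumed to be nested members of an enclosing WordCount class, and the usual org.apache.hadoop.io, org.apache.hadoop.mapreduce, and mapreduce.lib imports are required.
// Reducer: sums the per-word counts emitted by the mapper
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

// Driver: configures and submits the job (expects input and output paths as arguments)
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}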
Hive
Hive is a data warehouse infrastructure built on top of Hadoop. It provides data summarization, query, and analysis capabilities. Hive uses a SQL-like language called HiveQL for querying data stored in various databases and file systems that integrate with Hadoop.
Example: Basic Hive Commands
Here are some basic Hive commands:
CREATE TABLE employee (id INT, name STRING, age INT, salary FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hive/warehouse/employees.csv' INTO TABLE employee;
SELECT * FROM employee;
hive> SELECT * FROM employee;
OK
1       John    30      4000.0
2       Jane    25      3500.0
3       Dave    35      4500.0
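Example: Querying Hive over JDBC
Hive can also be queried from Java through the HiveServer2 JDBC driver. The sketch below assumes a HiveServer2 instance listening on the default port 10000 with no authentication; the connection URL and empty credentials are placeholders to adjust for a real cluster.
// Minimal Hive JDBC sketch: query the 'employee' table through HiveServer2.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, age, salary FROM employee")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + "\t" + rs.getString("name")
                        + "\t" + rs.getInt("age") + "\t" + rs.getFloat("salary"));
            }
        }
    }
}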
Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for Pig is called Pig Latin, which is a data flow language that supports nested data types such as bags, tuples, and maps.
Example: Pig Script
Here is an example of a Pig script that loads data, filters it, and stores the results:
pig -x local
-- Pig script: load the data, keep rows with age > 20, and store the result
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
filtered_data = FILTER data BY age > 20;
STORE filtered_data INTO 'output.txt' USING PigStorage(',');
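The same statements can also be run from Java by embedding Pig through the PigServer class. The following is a rough sketch that assumes local execution mode, the same input file, and the Pig libraries on the classpath.
// Rough sketch: run the Pig Latin statements above from Java via PigServer (local mode).
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);");
        pig.registerQuery("filtered_data = FILTER data BY age > 20;");
        pig.store("filtered_data", "output.txt");  // triggers execution, like the STORE statement above
        pig.shutdown();
    }
}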
HBase
HBase is a distributed, scalable, big data store that provides random, real-time read/write access to data. It is designed to handle large amounts of data across many commodity servers without a single point of failure.
Example: HBase Shell Commands
Here are some basic HBase shell commands:
create 'employee', 'personal_data', 'professional_data'
put 'employee', '1', 'personal_data:name', 'John'
put 'employee', '1', 'professional_data:role', 'Developer'
scan 'employee'
hbase(main):001:0> scan 'employee'
ROW                          COLUMN+CELL
 1                           column=personal_data:name, timestamp=1234567890, value=John
 1                           column=professional_data:role, timestamp=1234567890, value=Developer
1 row(s) in 0.1230 seconds
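Example: HBase Java Client
The same row can be written and scanned from Java using the HBase client API. The sketch below assumes the employee table created above and an hbase-site.xml on the classpath.
// Minimal HBase client sketch: write to and scan the 'employee' table created above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Equivalent of: put 'employee', '1', 'personal_data:name', 'John'
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal_data"), Bytes.toBytes("name"), Bytes.toBytes("John"));
            table.put(put);

            // Equivalent of: scan 'employee'
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result result : scanner) {
                    System.out.println(result);
                }
            }
        }
    }
}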
Spark
Spark is an open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed for fast computation and supports various data sources.
Example: Spark Word Count
Here is a simple example of a Spark program that counts the number of occurrences of each word in a text file:
spark-submit --class org.apache.spark.examples.JavaWordCount --master local[2] spark-examples_2.11-2.4.5.jar input.txt
// Spark Word Count Example
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

public class JavaWordCount {
    public static void main(String[] args) {
        String inputFile = args[0];
        SparkConf conf = new SparkConf().setAppName("wordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file and split each line into words
        JavaRDD<String> input = sc.textFile(inputFile);
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) {
                return Arrays.asList(s.split(" ")).iterator();
            }
        });

        // Pair each word with a count of 1, then sum the counts per word
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });

        counts.saveAsTextFile("output");
        sc.stop();
    }
}
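On Java 8 and later, the same job can be written more compactly with lambdas. The following sketch uses the same Spark Java API; the input and output paths are taken from the command line.
// Word count with Java 8 lambdas (equivalent to the example above)
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class LambdaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lambdaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((x, y) -> x + y);

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}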
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
Example: Flume Configuration
Here is an example of a Flume configuration file:
# Define the source, sink, and channel
agent.sources = r1
agent.sinks = k1
agent.channels = c1

# Describe/configure the source
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444

# Describe the sink
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events
agent.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
agent.sources.r1.channels = c1
agent.sinks.k1.channel = c1
Sqoop
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. It can be used to import data from external datastores into Hadoop Distributed File System (HDFS) or related systems like Hive and HBase.
Example: Sqoop Import Command
Here is an example of a Sqoop command to import data from MySQL to HDFS:
sqoop import --connect jdbc:mysql://localhost/employees --username root --table employees --m 1 --target-dir /user/hadoop/employees
$ sqoop import --connect jdbc:mysql://localhost/employees --username root --table employees --m 1 --target-dir /user/hadoop/employees
...
INFO mapreduce.ImportJobBase: Transferred 1.234 KB in 12.34 seconds (100.01 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 100 records.
Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie workflows combine multiple jobs sequentially into one logical unit of work. It supports different types of Hadoop jobs like MapReduce, Pig, Hive, and Sqoop.
Example: Oozie Workflow
Here is an example of a minimal Oozie workflow XML file with a single MapReduce action; if the action fails, the workflow transitions to a kill node that reports the error:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mapreduce-node"/>
    <action name="mapreduce-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>default</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It is used by Hadoop to keep track of distributed data and services.
Example: Basic ZooKeeper Commands
Here are some basic ZooKeeper commands:
create /my_path my_data
get /my_path
set /my_path new_data
delete /my_path
[zk: localhost:2181(CONNECTED) 0] create /my_path my_data
Created /my_path
[zk: localhost:2181(CONNECTED) 1] get /my_path
my_data
[zk: localhost:2181(CONNECTED) 2] set /my_path new_data
[zk: localhost:2181(CONNECTED) 3] get /my_path
new_data
[zk: localhost:2181(CONNECTED) 4] delete /my_path
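Example: ZooKeeper Java Client
The same operations are available from the ZooKeeper Java client. The sketch below assumes a ZooKeeper server reachable at localhost:2181; the znode path and data mirror the shell commands above.
// Minimal ZooKeeper client sketch: create, read, update, and delete a znode.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { /* ignore watch events */ });

        zk.create("/my_path", "my_data".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);          // create /my_path my_data
        System.out.println(new String(zk.getData("/my_path", false, null)));  // get /my_path
        zk.setData("/my_path", "new_data".getBytes(), -1);                    // set /my_path new_data
        zk.delete("/my_path", -1);                                            // delete /my_path

        zk.close();
    }
}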