Hadoop Ecosystem Tutorial
Introduction to Hadoop Ecosystem
The Hadoop Ecosystem is a suite of tools and frameworks that help in processing and managing large datasets. The core of the ecosystem is the Hadoop framework, which consists of two main components:
- Hadoop Distributed File System (HDFS) - A distributed file system that stores data across multiple machines.
- MapReduce - A programming model for processing large datasets in parallel.
Key Components of Hadoop Ecosystem
Here are some of the key components of the Hadoop Ecosystem:
- HDFS - A distributed file system that provides high-throughput access to application data.
- YARN - A resource-management platform that manages compute resources in the cluster and schedules users' applications.
- MapReduce - A YARN-based system for parallel processing of large datasets.
- Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Pig - A high-level platform for creating MapReduce programs used with Hadoop.
- HBase - A distributed, scalable, big data store that provides random, real-time read/write access to data.
- Spark - An open-source cluster-computing framework for fast, in-memory processing of large datasets, supporting batch, interactive, and streaming workloads.
- Flume - A service for efficiently collecting, aggregating, and moving large amounts of log data.
- Sqoop - A tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
- Oozie - A workflow scheduler system to manage Hadoop jobs.
- ZooKeeper - A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HDFS (Hadoop Distributed File System)
HDFS is the primary storage system used by Hadoop applications. It provides high throughput access to application data and is suitable for applications that have large datasets.
Example: Basic HDFS Commands
Here are some basic HDFS commands:
hdfs dfs -ls /
hdfs dfs -mkdir /user/hadoop
hdfs dfs -put localfile /user/hadoop
hdfs dfs -get /user/hadoop/remote_file local_file
$ hdfs dfs -ls /
Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2023-01-01 12:00 /user
drwxr-xr-x   - hadoop supergroup          0 2023-01-01 12:00 /tmp
drwxr-xr-x   - hadoop supergroup          0 2023-01-01 12:00 /data
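Example: HDFS Java API
The same operations can also be performed from Java through the Hadoop FileSystem API. The sketch below assumes the cluster configuration files are on the classpath; the paths mirror the commands above.
// Minimal HDFS client sketch: list a directory and copy a local file into HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // List the contents of the root directory (like "hdfs dfs -ls /")
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // Copy a local file into HDFS (like "hdfs dfs -put localfile /user/hadoop")
        fs.copyFromLocalFile(new Path("localfile"), new Path("/user/hadoop"));

        fs.close();
    }
}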
MapReduce
MapReduce is a programming model for processing large datasets with a distributed algorithm on a cluster. It consists of a Map function that processes key-value pairs to generate a set of intermediate key-value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key.
Example: Word Count in MapReduce
Here is a simple word count example. The packaged WordCount job can be run directly from the examples jar, and the mapper source below shows how the counting works:
hadoop jar hadoop-mapreduce-examples.jar wordcount input output
// Mapper: emits a (word, 1) pair for every token in the input line
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
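The matching reducer and driver from the standard Hadoop WordCount example are sketched below; both are assumed to be nested members of an enclosing WordCount class, and the usual org.apache.hadoop.io, org.apache.hadoop.mapreduce, and mapreduce.lib imports are required.
// Reducer: sums the per-word counts emitted by the mapper
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

// Driver: configures and submits the job (expects input and output paths as arguments)
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}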
Hive
Hive is a data warehouse infrastructure built on top of Hadoop. It provides data summarization, query, and analysis capabilities. Hive uses a SQL-like language called HiveQL for querying data stored in various databases and file systems that integrate with Hadoop.
Example: Basic Hive Commands
Here are some basic Hive commands:
CREATE TABLE employee (id INT, name STRING, age INT, salary FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/hive/warehouse/employees.csv' INTO TABLE employee;
SELECT * FROM employee;
hive> SELECT * FROM employee;
OK
1       John    30      4000.0
2       Jane    25      3500.0
3       Dave    35      4500.0
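Example: Querying Hive over JDBC
Hive can also be queried from Java through the HiveServer2 JDBC driver. The sketch below assumes a HiveServer2 instance listening on the default port 10000 with no authentication; the connection URL and empty credentials are placeholders to adjust for a real cluster.
// Minimal Hive JDBC sketch: query the 'employee' table through HiveServer2.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, age, salary FROM employee")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + "\t" + rs.getString("name")
                        + "\t" + rs.getInt("age") + "\t" + rs.getFloat("salary"));
            }
        }
    }
}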
Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for Pig is called Pig Latin, which is a data flow language that supports nested data types such as bags, tuples, and maps.
Example: Pig Script
Here is an example of a Pig script that loads data, filters it, and stores the results:
pig -x local
-- Pig script: load the data, keep rows with age > 20, and store the result
data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
filtered_data = FILTER data BY age > 20;
STORE filtered_data INTO 'output.txt' USING PigStorage(',');
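The same statements can also be run from Java by embedding Pig through the PigServer class. The following is a rough sketch that assumes local execution mode, the same input file, and the Pig libraries on the classpath.
// Rough sketch: run the Pig Latin statements above from Java via PigServer (local mode).
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("data = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);");
        pig.registerQuery("filtered_data = FILTER data BY age > 20;");
        pig.store("filtered_data", "output.txt");  // triggers execution, like the STORE statement above
        pig.shutdown();
    }
}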
HBase
HBase is a distributed, scalable, big data store that provides random, real-time read/write access to data. It is designed to handle large amounts of data across many commodity servers without a single point of failure.
Example: HBase Shell Commands
Here are some basic HBase shell commands:
create 'employee', 'personal_data', 'professional_data'
put 'employee', '1', 'personal_data:name', 'John'
put 'employee', '1', 'professional_data:role', 'Developer'
scan 'employee'
hbase(main):001:0> scan 'employee'
ROW                          COLUMN+CELL
 1                           column=personal_data:name, timestamp=1234567890, value=John
 1                           column=professional_data:role, timestamp=1234567890, value=Developer
1 row(s) in 0.1230 seconds
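Example: HBase Java Client
The same row can be written and scanned from Java using the HBase client API. The sketch below assumes the employee table created above and an hbase-site.xml on the classpath.
// Minimal HBase client sketch: write to and scan the 'employee' table created above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Equivalent of: put 'employee', '1', 'personal_data:name', 'John'
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal_data"), Bytes.toBytes("name"), Bytes.toBytes("John"));
            table.put(put);

            // Equivalent of: scan 'employee'
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result result : scanner) {
                    System.out.println(result);
                }
            }
        }
    }
}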
Spark
Spark is an open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed for fast computation and supports various data sources.
Example: Spark Word Count
Here is a simple example of a Spark program that counts the number of occurrences of each word in a text file:
spark-submit --class org.apache.spark.examples.JavaWordCount --master local[2] spark-examples_2.11-2.4.5.jar input.txt
// Spark Word Count Example
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

public class JavaWordCount {
    public static void main(String[] args) {
        String inputFile = args[0];
        SparkConf conf = new SparkConf().setAppName("wordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file and split each line into words
        JavaRDD<String> input = sc.textFile(inputFile);
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) {
                return Arrays.asList(s.split(" ")).iterator();
            }
        });

        // Pair each word with a count of 1, then sum the counts per word
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });

        counts.saveAsTextFile("output");
        sc.stop();
    }
}
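On Java 8 and later, the same job can be written more compactly with lambdas. The following sketch uses the same Spark Java API; the input and output paths are taken from the command line.
// Word count with Java 8 lambdas (equivalent to the example above)
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class LambdaWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lambdaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((x, y) -> x + y);

        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}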
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
Example: Flume Configuration
Here is an example of a Flume configuration file:
# Define the source, sink, and channel
agent.sources = r1
agent.sinks = k1
agent.channels = c1

# Describe/configure the source
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444

# Describe the sink
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events
agent.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
agent.sources.r1.channels = c1
agent.sinks.k1.channel = c1
Sqoop
Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. It can be used to import data from external datastores into Hadoop Distributed File System (HDFS) or related systems like Hive and HBase.
Example: Sqoop Import Command
Here is an example of a Sqoop command to import data from MySQL to HDFS:
sqoop import --connect jdbc:mysql://localhost/employees --username root --table employees --m 1 --target-dir /user/hadoop/employees
$ sqoop import --connect jdbc:mysql://localhost/employees --username root --table employees --m 1 --target-dir /user/hadoop/employees
...
INFO mapreduce.ImportJobBase: Transferred 1.234 KB in 12.34 seconds (100.01 KB/sec)
INFO mapreduce.ImportJobBase: Retrieved 100 records.
Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie workflows combine multiple jobs sequentially into one logical unit of work. It supports different types of Hadoop jobs like MapReduce, Pig, Hive, and Sqoop.
Example: Oozie Workflow
Here is an example of a minimal Oozie workflow XML file with a single MapReduce action; if the action fails, the workflow transitions to a kill node that reports the error:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mapreduce-node"/>
    <action name="mapreduce-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>default</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It is used by Hadoop to keep track of distributed data and services.
Example: Basic ZooKeeper Commands
Here are some basic ZooKeeper commands:
create /my_path my_data
get /my_path
set /my_path new_data
delete /my_path
[zk: localhost:2181(CONNECTED) 0] create /my_path my_data
Created /my_path
[zk: localhost:2181(CONNECTED) 1] get /my_path
my_data
[zk: localhost:2181(CONNECTED) 2] set /my_path new_data
[zk: localhost:2181(CONNECTED) 3] get /my_path
new_data
[zk: localhost:2181(CONNECTED) 4] delete /my_path
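Example: ZooKeeper Java Client
The same operations are available from the ZooKeeper Java client. The sketch below assumes a ZooKeeper server reachable at localhost:2181; the znode path and data mirror the shell commands above.
// Minimal ZooKeeper client sketch: create, read, update, and delete a znode.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { /* ignore watch events */ });

        zk.create("/my_path", "my_data".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);          // create /my_path my_data
        System.out.println(new String(zk.getData("/my_path", false, null)));  // get /my_path
        zk.setData("/my_path", "new_data".getBytes(), -1);                    // set /my_path new_data
        zk.delete("/my_path", -1);                                            // delete /my_path

        zk.close();
    }
}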