# Hadoop

Hadoop is a scalable, distributed computing framework from the Apache Software Foundation. Like a queuing system, it spreads the processing of large data sets across multiple nodes.

# Workflow

# Installing Hadoop Manually to Shared Filesystem

  • Install dependencies for Hadoop (press 'y' to confirm the installation when prompted):
[flight@chead1 (mycluster1) ~]$ sudo yum install java-1.8.0-openjdk.x86_64 java-1.8.0-openjdk-devel.x86_64
  • Download Hadoop v3.2.1:
[flight@chead1 (mycluster1) ~]$ wget -O /tmp/hadoop.tgz http://tiny.cc/hadoop321
  • Decompress the Hadoop installation to shared storage:
[flight@chead1 (mycluster1) ~]$ cd /opt/apps
[flight@chead1 (mycluster1) ~]$ tar xzf /tmp/hadoop.tgz
  • Edit line 54 in /opt/apps/hadoop-3.2.1/etc/hadoop/hadoop-env.sh to point to the Java installation as follows:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre
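
The JAVA_HOME path above is tied to one specific OpenJDK build (1.8.0.242.b08), so it may differ on a system that installed a newer package. The following is a minimal sketch for deriving the path from whichever `java` binary is on the PATH. `resolve_java_home` is a hypothetical helper name (not part of Hadoop), and the sketch assumes `readlink -f` is available, as it is with GNU coreutils.

```shell
# Hypothetical helper (not part of Hadoop): print the installation root
# for a given java binary by resolving symlinks and dropping "/bin/java".
resolve_java_home() {
    java_real_path="$(readlink -f "$1")" || return 1
    dirname "$(dirname "$java_real_path")"
}

# Print a candidate line for hadoop-env.sh if java is installed.
if command -v java >/dev/null 2>&1; then
    echo "export JAVA_HOME=$(resolve_java_home "$(command -v java)")"
fi
```

The printed line can be compared with the value set in hadoop-env.sh before starting any Hadoop services.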

# Downloading the Hadoop Job

These steps set up the Hadoop environment and download a spreadsheet of data, which Hadoop will sort into sales units per region.

  • Download and source Hadoop environment variables:
[flight@chead1 (mycluster1) ~]$ wget https://tinyurl.com/hadoopenv
[flight@chead1 (mycluster1) ~]$ source hadoopenv
  • Create job directory:
[flight@chead1 (mycluster1) ~]$ mkdir MapReduceTutorial
[flight@chead1 (mycluster1) ~]$ chmod 777 MapReduceTutorial
  • Download job data:
[flight@chead1 (mycluster1) ~]$ cd MapReduceTutorial
[flight@chead1 (mycluster1) MapReduceTutorial]$ wget -O hdfiles.zip https://tinyurl.com/hdinput1
[flight@chead1 (mycluster1) MapReduceTutorial]$ unzip -j hdfiles.zip
  • Check that job data files are present:
[flight@chead1 (mycluster1) MapReduceTutorial]$ ls
desktop.ini  hdfiles.zip  SalesCountryDriver.java  SalesCountryReducer.java  SalesJan2009.csv  SalesMapper.java
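
The listing above can also be checked mechanically, which may help when the steps are scripted. This is a sketch only: `check_job_files` is a hypothetical helper, and the file names are the ones shown in the listing (the `desktop.ini` and `hdfiles.zip` entries are not needed by the job, so they are not checked).

```shell
# Hypothetical helper: report any tutorial file missing from a directory.
# The file names are taken from the directory listing in this guide.
check_job_files() {
    job_dir="${1:-.}"
    missing=0
    for f in SalesCountryDriver.java SalesCountryReducer.java \
             SalesMapper.java SalesJan2009.csv; do
        if [ ! -f "$job_dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise
    return "$missing"
}

# Usage from inside MapReduceTutorial:
#   check_job_files . && echo "all job files present"
```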

# Preparing the Hadoop Job

  • Compile the Java source files for the job:
[flight@chead1 (mycluster1) MapReduceTutorial]$ javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java
  • Create a manifest file:
[flight@chead1 (mycluster1) MapReduceTutorial]$ echo "Main-Class: SalesCountry.SalesCountryDriver" >> Manifest.txt
  • Package the compiled classes and the manifest into a JAR file for the job:
[flight@chead1 (mycluster1) MapReduceTutorial]$ jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class
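
Because the manifest step appends with `>>`, running it a second time leaves a duplicate `Main-Class` line in Manifest.txt. A small sketch that overwrites the file instead is shown here. `write_manifest` is a hypothetical helper name, and the class name is the one used in the step above.

```shell
# Hypothetical helper: (re)create a one-line manifest, overwriting any
# previous copy so repeated runs do not accumulate duplicate entries.
write_manifest() {
    manifest_path="${1:-Manifest.txt}"
    printf 'Main-Class: %s\n' "SalesCountry.SalesCountryDriver" > "$manifest_path"
}

# Usage from inside MapReduceTutorial:
#   write_manifest Manifest.txt
#   jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class
#   jar tf ProductSalePerCountry.jar    # list the archive contents
```

Listing the archive with `jar tf` is a quick way to confirm that the three compiled classes were packaged.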

# Starting the Hadoop Environment

  • Start the Hadoop distributed file system service:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-dfs.sh
  • Start the YARN resource manager and node manager services:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-yarn.sh
  • Create a directory for the input data and copy the sales data into it:
[flight@chead1 (mycluster1) MapReduceTutorial]$ mkdir ~/inputMapReduce
[flight@chead1 (mycluster1) MapReduceTutorial]$ cp SalesJan2009.csv ~/inputMapReduce/
  • List the input directory through the Hadoop file system client to confirm the data is visible to Hadoop:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -ls ~/inputMapReduce
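
Before submitting the job, it may be worth confirming that the daemons launched by start-dfs.sh and start-yarn.sh are actually running. The sketch below parses the output of `jps` (the process lister shipped with the JDK) for the daemon names that a default single-node setup normally starts. `missing_daemons` is a hypothetical helper, and the expected list is an assumption that may not hold on a multi-node cluster, where some daemons run on other hosts.

```shell
# Hypothetical helper: read `jps` output on stdin and print the name of
# each expected Hadoop daemon that does not appear in it.
# The daemon list assumes a default single-node setup.
missing_daemons() {
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # jps prints "<pid> <class name>"; match the name as a whole word
        # so that "NameNode" does not match inside "SecondaryNameNode"
        if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
            echo "$daemon"
        fi
    done
}

# Usage on the node where the services were started:
#   jps | missing_daemons
```

No output means every expected daemon was found. Any name printed points at a service whose log is worth checking before the job is run.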

# Running the Hadoop Job

  • Execute the MapReduce job:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar ~/inputMapReduce ~/mapreduce_output_sales
  • View the job results:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -cat ~/mapreduce_output_sales/part-00000 | more
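
To see which regions sold the most units, the result file can be sorted by its count column. This sketch assumes each output line has the form `<country><TAB><count>`, which is the default text output layout for MapReduce jobs but has not been confirmed for this particular reducer. `top_countries` and `results.tsv` are hypothetical names.

```shell
# Hypothetical helper: print the N largest rows of a "<key><TAB><count>"
# file, sorted by count in descending order.
top_countries() {
    results_file="$1"
    limit="${2:-10}"
    tab="$(printf '\t')"
    # Split on tabs only, so keys that contain spaces stay intact
    sort -t "$tab" -k2,2nr "$results_file" | head -n "$limit"
}

# Usage, after copying the result out of the Hadoop file system:
#   $HADOOP_HOME/bin/hdfs dfs -cat ~/mapreduce_output_sales/part-00000 > results.tsv
#   top_countries results.tsv 5
```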