# Hadoop
Hadoop is a scalable, distributed computing solution provided by Apache. Similar to queuing systems, Hadoop allows for distributed processing of large data sets.
## Workflow
### Installing Hadoop Manually to Shared Filesystem
The flight environment will need to be activated before the following steps, so be sure to run `flight start` or configure your environment to activate the flight environment automatically.
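For example, to activate it for the current session:

```bash
# Activate the flight environment for this session
flight start
```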
- Install dependencies for Hadoop (press 'y' to confirm the installation when prompted):
[flight@chead1 (mycluster1) ~]$ sudo yum install java-1.8.0-openjdk.x86_64 java-1.8.0-openjdk-devel.x86_64
- Download Hadoop v3.2.1:
[flight@chead1 (mycluster1) ~]$ flight silo file pull openflight:hadoop/hadoop-3.2.1.tar.gz /tmp/
- Decompress the Hadoop installation to shared storage:
[flight@chead1 (mycluster1) ~]$ mkdir apps
[flight@chead1 (mycluster1) ~]$ cd apps
[flight@chead1 (mycluster1) apps]$ tar xzf /tmp/hadoop-3.2.1.tar.gz
- Edit line 54 in `apps/hadoop-3.2.1/etc/hadoop/hadoop-env.sh` to point to the Java installation as follows (a non-interactive way to make this edit is sketched after this list):
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b09-2.el8_7.x86_64/jre
- Return to the home directory for the next steps:
[flight@chead1 (mycluster1) ~]$ cd ~
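If you prefer to make the `hadoop-env.sh` change non-interactively, the following is a minimal sketch; it assumes line 54 of the stock file is the commented-out `JAVA_HOME` setting and that the JVM path above matches the package installed by yum (check `/usr/lib/jvm/` for the exact version on your system):

```bash
# Overwrite line 54 of hadoop-env.sh with the JAVA_HOME export
# (the JVM path is an example; confirm it with: ls /usr/lib/jvm/)
sed -i '54s|.*|export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b09-2.el8_7.x86_64/jre|' \
    ~/apps/hadoop-3.2.1/etc/hadoop/hadoop-env.sh

# Confirm the change took effect
grep -n '^export JAVA_HOME' ~/apps/hadoop-3.2.1/etc/hadoop/hadoop-env.sh
```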
### Downloading the Hadoop Job
These steps help set up the Hadoop environment and download a spreadsheet of data which Hadoop will sort into sales units per region.
- Download and source Hadoop environment variables:
[flight@chead1 (mycluster1) ~]$ flight silo file pull openflight:hadoop/hadoopenv
[flight@chead1 (mycluster1) ~]$ source hadoopenv
Be sure to update line 1 in `hadoopenv` if you are setting this up in a different location (a sketch of the variables this file sets is shown after this list).
- Create job directory:
[flight@chead1 (mycluster1) ~]$ mkdir MapReduceTutorial
[flight@chead1 (mycluster1) ~]$ chmod 777 MapReduceTutorial
- Download job data:
[flight@chead1 (mycluster1) ~]$ cd MapReduceTutorial
[flight@chead1 (mycluster1) MapReduceTutorial]$ flight silo file pull openflight:hadoop/hdfiles.zip
[flight@chead1 (mycluster1) MapReduceTutorial]$ unzip -j hdfiles.zip
- Check that job data files are present:
[flight@chead1 (mycluster1) MapReduceTutorial]$ ls
desktop.ini hdfiles.zip SalesCountryDriver.java SalesCountryReducer.java SalesJan2009.csv SalesMapper.java
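For reference, the sort of variables `hadoopenv` sets looks roughly like the sketch below; this is an assumption for illustration, the actual file pulled from the silo may differ, and the paths reflect the installation location used above:

```bash
# Hypothetical contents of hadoopenv (the real file may differ)
export HADOOP_HOME=~/apps/hadoop-3.2.1    # line 1: update if Hadoop lives elsewhere
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.362.b09-2.el8_7.x86_64/jre
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
```

After sourcing the file, `echo $HADOOP_HOME` or `$HADOOP_HOME/bin/hadoop version` are quick ways to confirm the variables took effect.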
### Preparing the Hadoop Job
- Compile the Java source files for the job:
[flight@chead1 (mycluster1) MapReduceTutorial]$ javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java
- Create a manifest file:
[flight@chead1 (mycluster1) MapReduceTutorial]$ echo "Main-Class: SalesCountry.SalesCountryDriver" >> Manifest.txt
- Package the compiled classes and manifest into the final jar for the job:
[flight@chead1 (mycluster1) MapReduceTutorial]$ jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class
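As an optional sanity check, the compiled classes and the packaged jar can be inspected; the class names mentioned in the comments assume the compile step above placed them under `SalesCountry/`, as the `-d .` flag and the package name imply:

```bash
# List the compiled classes (expect SalesMapper, SalesCountryReducer, SalesCountryDriver)
ls SalesCountry/

# List the jar contents and show the manifest's Main-Class entry
jar tf ProductSalePerCountry.jar
unzip -p ProductSalePerCountry.jar META-INF/MANIFEST.MF
```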
### Starting the Hadoop Environment
- Start the Hadoop distributed file system service:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-dfs.sh
- Start the resource manager, node manager and application manager services (a quick check that the daemons are running is shown after this list):
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/sbin/start-yarn.sh
- Create directory for processing data and copy sales results in:
[flight@chead1 (mycluster1) MapReduceTutorial]$ mkdir ~/inputMapReduce
[flight@chead1 (mycluster1) MapReduceTutorial]$ cp SalesJan2009.csv ~/inputMapReduce/
- List the data in the distributed file system to confirm it is available to the job:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -ls ~/inputMapReduce
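With both start scripts run, a quick way to confirm the Hadoop daemons are up is the JDK's `jps` tool (installed with the openjdk-devel package earlier); exactly which processes appear depends on the configuration shipped with `hadoopenv`:

```bash
# List running Java processes; entries such as NameNode, DataNode,
# ResourceManager and NodeManager indicate the services started cleanly
jps
```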
### Running the Hadoop Job
- Execute the MapReduce job:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar ~/inputMapReduce ~/mapreduce_output_sales
- View the job results:
[flight@chead1 (mycluster1) MapReduceTutorial]$ $HADOOP_HOME/bin/hdfs dfs -cat ~/mapreduce_output_sales/part-00000 | more
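Two follow-up points worth noting: Hadoop will refuse to start a job whose output directory already exists, so remove it before re-running, and the services started earlier have matching stop scripts for when you are finished. A minimal sketch:

```bash
# Remove the previous output directory before re-running the job
$HADOOP_HOME/bin/hdfs dfs -rm -r ~/mapreduce_output_sales

# Shut down the YARN and HDFS services when done
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
```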