A basic program that performs distributed training with Spark for a wine classification problem.
TO DEBUG: distributed training currently does not work when the number of nodes is greater than 2.
- Setup on Master Node
  - Download JDK
  - Setup Scala
  - Setup Hadoop/Spark
  - Setup cluster information on the master node
  - Starting Spark
  - Install Maven
- Training on Node
  - Code from local
  - Code from GitHub
  - Compile using Maven
  - Run the training
  - Model save
- Standalone Prediction
  - Renaming pom.xml for prediction
  - Compile using Maven
  - Run the prediction
- Docker Prediction
  - Creating the Docker image
  - Running the prediction in Docker
Download the JDK using the following command:

wget https://download.oracle.com/java/17/archive/jdk-17.0.6_linux-x64_bin.tar.gz

Extract it to the /usr/lib/jvm folder using:

tar -xvf jdk-17.0.6_linux-x64_bin.tar.gz -C /usr/lib/jvm

Set up the JAVA_HOME and PATH variables:

export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6/
export PATH=$PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin
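
To verify the JDK is visible on the PATH, run:

java -version

It should report the 17.0.6 version installed above.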
Install Scala using:

sudo apt-get install scala
Install Spark (prebuilt with Hadoop 3) using the file from the URL:

wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz

Extract it and move it to /usr/local/spark:

tar -xvf spark-3.3.2-bin-hadoop3.tgz
mv spark-3.3.2-bin-hadoop3/ /usr/local/spark

Set up the PATH variable:

export PATH=$PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin
Set up the config files at /usr/local/spark/conf.

Add the following configuration to the spark-env.sh file:

export SPARK_MASTER_HOST=172.31.39.218
export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6
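
If spark-env.sh does not exist yet, it can be created from the template that ships with Spark (the Dockerfile below does the same):

cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh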
Perform this setup on all nodes of the cluster and gather all their IPs.

Add the worker nodes' information to the workers file inside the conf directory:

# A Spark Worker will be started on each of the machines listed below.
master
worker1
worker2
worker3
Note: make sure to map the IPs of the worker nodes to an alias in the /etc/hosts file, as described below.
For e.g.: 172.31.40.11 worker1
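
A full /etc/hosts on each node might then look like this (the worker IPs below are placeholders; the master IP matches SPARK_MASTER_HOST above):

172.31.39.218 master
172.31.40.11 worker1
172.31.40.12 worker2
172.31.40.13 worker3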
To start Spark, run the following commands:
Navigate to the /usr/local/spark/sbin folder, and from there run the script called "start-all.sh".
Note: we can check that Spark is running on all nodes by looking at the output of "jps".
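On the master node, "jps" should list a Master process (plus a Worker if the master is also listed in the workers file), and each worker node should list a Worker process. An illustrative output on a worker node (PIDs will differ):

12345 Worker
12678 Jps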
Install Maven using:

sudo apt install maven
1. Copy the code from your local machine to the master instance using scp:

scp -i "keyname" -r code_folder/ <awsuser>@<awssystem>:~/Project/

2. Alternatively, pull the latest version of the code from GitHub using:

git pull <github url for code>
Get the xml file for training using:

cp pom_train.xml pom.xml

Then run the Maven build:

mvn clean package

The jar will be generated in the target folder.
To run the training, submit the job using spark-submit (7077 is the standalone master's default port):

spark-submit --class ClassificationModel --master spark://172.31.39.218:7077 target/wineClassification-2.0T.jar

Note: check the jar file name against the one given in the pom.xml file.
The model will be saved in the directory from which spark-submit was executed.
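
For orientation, below is a minimal sketch of what a training class of this shape might look like with Spark MLlib. It is not the repo's actual ClassificationModel: the dataset file name (TrainingDataset.csv), the feature and label column names, and the choice of logistic regression are all assumptions for illustration; only the saved-model directory name (classification_model) is taken from the Dockerfile below.

import java.io.IOException;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClassificationModel {
    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession.builder()
                .appName("WineClassificationTraining")
                .getOrCreate();

        // Training data: a CSV of numeric wine features plus a "quality" label (assumed schema)
        Dataset<Row> data = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("TrainingDataset.csv"); // hypothetical file name

        // Collect the feature columns into the single vector column MLlib expects
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"alcohol", "pH", "citric_acid"}) // illustrative subset
                .setOutputCol("features");

        LogisticRegression lr = new LogisticRegression()
                .setLabelCol("quality")
                .setFeaturesCol("features");

        PipelineModel model = new Pipeline()
                .setStages(new PipelineStage[]{assembler, lr})
                .fit(data);

        // Saved relative to the submission directory, matching the note above
        model.write().overwrite().save("classification_model");
        spark.stop();
    }
}

Whatever the real implementation trains, the important contract is that it writes the fitted model to a directory that the prediction job can later load.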
Rename the pom_pred.xml file to pom.xml, i.e. get the xml file for prediction using:

cp pom_pred.xml pom.xml

Then run the Maven build:

mvn clean package

The jar will be generated in the target folder.
To run the prediction, submit the job using spark-submit:

spark-submit --class PredictionApp --master local target/wineClassification_Prediciton-3.0T.jar
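
As with training, the sketch below is an assumption of what PredictionApp plausibly does: reload the saved model, score a validation CSV, and print accuracy. The file name ValidationDataset.csv and the column names are hypothetical.

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PredictionApp {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("WineClassificationPrediction")
                .getOrCreate();

        // Reload the pipeline model saved by the training job
        PipelineModel model = PipelineModel.load("classification_model");

        // Held-out data with the same schema as the training CSV (assumed file name)
        Dataset<Row> test = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("ValidationDataset.csv");

        Dataset<Row> predictions = model.transform(test);

        // Accuracy on the held-out data, matching the output described at the end of this README
        double accuracy = new MulticlassClassificationEvaluator()
                .setLabelCol("quality")
                .setPredictionCol("prediction")
                .setMetricName("accuracy")
                .evaluate(predictions);
        System.out.println("Accuracy = " + accuracy);

        spark.stop();
    }
}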
To create the Docker image, use:

docker build -t prediction_ml .
Note: the Dockerfile for this is:

# Base image assumed (the repo's Dockerfile starts without naming one)
FROM ubuntu:22.04

# Work from /app so the tar commands below find the copied archives
WORKDIR /app
COPY jdk-17.0.6_linux-x64_bin.tar.gz spark-3.3.2-bin-hadoop3.tgz /app/

RUN mkdir -p /usr/lib/jvm
RUN tar -xvf jdk-17.0.6_linux-x64_bin.tar.gz -C /usr/lib/jvm
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6/' >> ~/.bashrc
RUN echo 'export PATH=$PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin' >> ~/.bashrc
ENV JAVA_HOME /usr/lib/jvm/jdk-17.0.6/
ENV PATH $PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin

RUN apt-get update && apt-get install -y scala
RUN tar -xvf spark-3.3.2-bin-hadoop3.tgz
RUN mv spark-3.3.2-bin-hadoop3/ /usr/local/spark
RUN apt-get install -y maven
RUN cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6' >> /usr/local/spark/conf/spark-env.sh

COPY classification_model /app/
COPY Project /app/Project
WORKDIR /app/Project
RUN apt-get install -y vim
Once the image is built, push it to the repository using:

docker push <image name>
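
Note that pushing to a registry such as Docker Hub usually requires a registry-qualified tag first; for example (with a hypothetical account name):

docker tag prediction_ml <dockerhub-user>/prediction_ml
docker push <dockerhub-user>/prediction_ml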
Pull the Docker image onto the EC2 instance using:

docker pull <url>

Once the image is pulled, start an interactive shell in a container:

docker run -it prediction_ml /bin/bash

Inside the container, run the prediction command:

spark-submit --class PredictionApp --master local target/wineClassification_Prediciton-3.0T.jar
It will print the accuracy, as shown in the screenshot below.