A basic program that performs distributed training with Spark for a wine classification problem.
TO DEBUG: Currently the distributed training does not work when the number of nodes is greater than 2.
- Setup on Master Node
  - Download JDK
  - Set up scala
  - Setup Hadoop/spark
  - Setup clusters information on master node
  - Starting spark
  - Install Maven
- Training on Node
  - Code from local
  - Code from Github
  - Compile using maven
  - Run the training
  - Model Save
- Standalone Prediction
  - Renaming pom.xml for prediction
  - Compile using maven
  - Run the Prediction
- Docker Prediction
Download the JDK using the following command
wget https://download.oracle.com/java/17/archive/jdk-17.0.6_linux-x64_bin.tar.gz
Extract it to the /usr/lib/jvm folder using
tar -xvf jdk-17.0.6_linux-x64_bin.tar.gz -C /usr/lib/jvm
Set up the JAVA_HOME and PATH variables
export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6/
export PATH=$PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin
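The exports above only last for the current shell session. As a minimal sketch, assuming a bash shell and the same install paths, the lines can be appended to a profile file such as ~/.bashrc without creating duplicates on repeated runs. The helper name `persist_line` is hypothetical and not part of the project.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so the setup can be re-run safely.
persist_line() {
  local line="$1" file="$2"
  touch "$file"
  # -F: fixed string, -x: match the whole line, -q: no output
  grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage (writes to ~/.bashrc; uncomment to apply):
# persist_line 'export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6/' "$HOME/.bashrc"
# persist_line 'export PATH=$PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin' "$HOME/.bashrc"
```

After updating the file, open a new shell or run `source ~/.bashrc` for the variables to take effect.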
Install Scala using
sudo apt-get install scala
Download the Spark distribution (prebuilt for Hadoop 3) from the following URL
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
Extract the archive and move spark-3.3.2-bin-hadoop3/ to /usr/local/spark
Set up the PATH variable
export PATH=$PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin
Set up the config file in /usr/local/spark/conf
Add the following configuration to the spark-env.sh file
export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6
Perform this setup on every node in the cluster and collect all their IP addresses
Add the worker (slave) node information to the workers file (named slaves in older Spark releases) inside the conf directory
# A Spark Worker will be started on each of the machines listed below.
a. master
b. worker1
c. worker2
d. worker3
Note: make sure to map the IP address of each worker node to its alias in the /etc/hosts file, as shown below
For example: <worker1 ip> worker1
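The alias mapping and the worker list come from the same IP/alias pairs, so they can be generated from one place. The following is a sketch only: it assumes the pairs are kept in a hypothetical text file (here called nodes.txt) with one "ip alias" pair per line, and the function names are illustrative.

```shell
# Print /etc/hosts-style lines from "ip alias" pairs on stdin,
# skipping blank lines and lines starting with '#'.
hosts_entries() {
  awk 'NF >= 2 && $1 !~ /^#/ { printf "%s\t%s\n", $1, $2 }'
}

# Print only the aliases, one per line, in the form used by the
# worker list in the Spark conf directory.
worker_aliases() {
  awk 'NF >= 2 && $1 !~ /^#/ { print $2 }'
}

# Example usage (nodes.txt is a hypothetical file; uncomment to apply):
# hosts_entries < nodes.txt | sudo tee -a /etc/hosts
# worker_aliases < nodes.txt
```

Keeping a single source file avoids the alias list and the /etc/hosts entries drifting apart when nodes are added.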
To start Spark, run the following commands
Navigate to the /usr/local/spark/sbin folder and run the script called "start-all.sh" from there
Note: you can check that Spark is running on each node by looking at the output of "jps"
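In standalone mode the Spark daemons appear in the jps listing as Master and Worker. A small sketch of checking for them automatically follows; the function name `spark_daemons` is hypothetical, and it assumes the usual jps format of "<pid> <name>" per line.

```shell
# Read `jps` output on stdin and report whether the Spark standalone
# daemons (Master, Worker) appear in it.
spark_daemons() {
  awk '$2 == "Master" { m = 1 }
       $2 == "Worker" { w = 1 }
       END { printf "master=%s worker=%s\n", (m ? "up" : "down"), (w ? "up" : "down") }'
}

# Example usage on a node:
# jps | spark_daemons
```

On the master node both daemons are expected only if master is also listed as a worker; on a pure worker node only the Worker entry should be up.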
Install Maven using
sudo apt install maven
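Before moving on to training, it can help to confirm that every tool installed above is reachable on the PATH. This is a sketch with a hypothetical helper name, `missing_tools`; the list of tools passed to it is taken from the steps in this section.

```shell
# Print the name of each listed command that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example usage:
# missing_tools java scala spark-submit mvn
```

An empty result means the node is ready; any name printed points back to the corresponding install step.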
Copy the code to the master instance with scp using the following command
scp -i "keyname" -r code_folder/ <awsuser>@<awssystem>:~/Project/
Pull the latest version of the code from GitHub using
git pull <github url for code>
Copy the training POM file into place using
cp pom_train.xml pom.xml
Then build the project with Maven using
mvn clean package
The jar will be generated in the target folder
To run the training, submit the job using spark-submit
spark-submit --class ClassificationModel --master spark:// target/wineClassification-2.0T.jar
Note: check that the jar file name matches the one given in the pom.xml file.
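Since the jar name depends on the pom.xml in use, one option is to look the jar up in the target folder instead of typing it by hand. The sketch below assumes the build leaves exactly one jar directly under target/; the helper name `find_jar` is hypothetical, and `<master url>` is a placeholder for the cluster's master URL.

```shell
# Print the path of the single jar directly inside a directory
# (default: target). Fails if there is no jar or more than one.
find_jar() {
  local dir="${1:-target}" count
  count=$(find "$dir" -maxdepth 1 -name '*.jar' | wc -l | tr -d ' ')
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one jar in $dir, found $count" >&2
    return 1
  fi
  find "$dir" -maxdepth 1 -name '*.jar'
}

# Example usage:
# spark-submit --class ClassificationModel --master <master url> "$(find_jar target)"
```

If the build produces additional jars (for example from a shading plugin), the helper reports the mismatch instead of silently picking one.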
The model will be saved in the directory from which spark-submit was run
Rename the pom_pred.xml file to pom.xml
To run the prediction, submit the job using spark-submit
spark-submit --class PredictionApp --master local target/wineClassification_Prediciton-3.0T.jar
Build the Docker image using
docker build -t prediction_ml .
Note: the Dockerfile for this image is as follows
COPY *.tgz* /app
RUN mkdir -p /usr/lib/jvm
RUN tar -xvf jdk-17.0.6_linux-x64_bin.tar.gz -C /usr/lib/jvm
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6/' >> ~/.bashrc
RUN echo 'export
ENV JAVA_HOME /usr/lib/jvm/jdk-17.0.6/
ENV PATH $PATH:/usr/lib/jvm/jdk-17.0.6/bin:/usr/local/spark/bin
RUN export PATH
RUN apt-get install -y scala
RUN tar -xvf spark-3.3.2-bin-hadoop3.tgz
RUN mv spark-3.3.2-bin-hadoop3/ /usr/local/spark
RUN apt install -y maven
RUN cp /usr/local/spark/conf/spark-env.sh.template
RUN echo 'export JAVA_HOME=/usr/lib/jvm/jdk-17.0.6' >>
COPY classification_model /app/
COPY Project /app/Project
WORKDIR /app/Project
RUN apt install -y vim
Once the image is built, push it to the repository using
docker push <image name>
Pull the Docker image onto the EC2 instance using
docker pull <url>
Once the Docker image is pulled, start a shell inside a container using
docker run -it prediction_ml /bin/bash
Inside the container, run the prediction command
spark-submit --class PredictionApp --master local target/wineClassification_Prediciton-3.0T.jar
It will print the accuracy, as shown in the screenshot below