This project processes and analyzes scraped text data using PySpark, Kafka, Redis, and Neo4j, with Hadoop HDFS for storage. The aim is to store, process, and analyze text data efficiently.
- Open PowerShell in Administrator mode and start WSL:
  wsl ~
- Start the Hadoop and Spark services:
  start-dfs.sh
  start-yarn.sh
- Start Zookeeper, then Kafka. Zookeeper must be running before Kafka starts, so wait about 30 seconds between the two commands:
  zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties &
  kafka-server-start.sh $KAFKA_HOME/config/server.properties &
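Instead of waiting a fixed 30 seconds, you can poll the service ports until they accept connections. A minimal stdlib sketch, assuming the default ports (Zookeeper on 2181, the Kafka broker on 9092):

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Default ports: Zookeeper on 2181, the Kafka broker on 9092.
    for name, port in [("Zookeeper", 2181), ("Kafka", 9092)]:
        state = "up" if port_is_open("localhost", port) else "not reachable yet"
        print(f"{name} (port {port}): {state}")
```

Run it after each start command; once both ports report "up", it is safe to proceed.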
- Switch to the student user:
  su - student
- Activate the virtual environment, then start Jupyter Lab:
  source de-prj/de-venv/bin/activate
  jupyter lab
- Open two more PowerShell terminals from Windows and repeat the steps above in each (enter WSL, switch to student, activate the virtual environment) until the prompt reads (de-venv) student@R2D3:~/urdirectory$
- To demonstrate Kafka working, cd into the directory that contains both scripts, then run:
  - Producer Terminal:
    python kafka_producer_show.py
  - Consumer Terminal:
    spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1 kafka_consumer_show.py
> [!IMPORTANT]
> Do not run `python kafka_producer_show.py` while `scrape_articles_into_words.ipynb` or `neo4j.ipynb` is running. `kafka_consumer_show.py` can run in the background.
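The producer and consumer scripts themselves are not reproduced here, but a producer of this kind typically sends small JSON-encoded records as the Kafka message value. A hypothetical sketch of the payload shape — the `word` and `count` fields and the `words` topic name are assumptions for illustration, not taken from `kafka_producer_show.py`:

```python
import json
import time

TOPIC = "words"  # assumed topic name, not taken from the actual scripts

def make_word_event(word: str, count: int) -> bytes:
    """Encode one word record as UTF-8 JSON, the usual Kafka message value."""
    record = {"word": word, "count": count, "ts": int(time.time())}
    return json.dumps(record).encode("utf-8")

# A kafka-python KafkaProducer would send this with something like:
#   producer.send(TOPIC, value=make_word_event("spark", 3))
event = make_word_event("spark", 3)
print(event)
```

On the consumer side, Spark Structured Streaming reads the same bytes back from the topic and parses the JSON into columns.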
- Run the notebooks in this sequence:
  1. scrape_articles_into_words.ipynb
  2. neo4j.ipynb
- Stop Kafka, then Zookeeper, waiting about 30 seconds between the two commands:
  kafka-server-stop.sh
  zookeeper-server-stop.sh
- Stop the Hadoop and Spark services:
  stop-yarn.sh
  stop-dfs.sh
| What to Store | Where to Store | Tool |
| --- | --- | --- |
| Raw scraped text data | Hadoop HDFS | PySpark for ingestion, Hadoop for storage |
| Cleaned and tokenized text | Hadoop HDFS or a relational database | PySpark for preprocessing |
| Words with definitions, relationships, and POS annotations | Neo4j for relationships; Redis for fast retrieval | Neo4j and Redis |
| Analytical results | Local files, Neo4j, and Redis | Neo4j |
| New and updated words | Kafka for message streaming | Kafka and Spark Structured Streaming |
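The "cleaned and tokenized text" stage can be illustrated with a plain-Python tokenizer. The project performs this with PySpark; this stdlib version only sketches the transformation applied to each document:

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens,
    dropping punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize("Spark 3.5.1 reads RAW, scraped text!")
print(tokens)  # ['spark', 'reads', 'raw', 'scraped', 'text']
```

In PySpark the same logic would typically run inside a UDF or via the built-in `lower` and `regexp_extract_all` column functions before the tokens are written back to HDFS.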
- Neo4j: For storing and querying word relationships.
- Redis: For fast key-value lookups.
- Hadoop HDFS: For scalable storage of raw and processed data.
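Word relationships in Neo4j are created with Cypher `MERGE` statements. A hypothetical example of the kind of query involved — the `Word` label and `RELATED_TO` relationship type are illustrative assumptions, not taken from `neo4j.ipynb`:

```python
# Hypothetical Cypher for the word graph; the Word label and RELATED_TO
# relationship type are assumptions, not taken from neo4j.ipynb.
MERGE_RELATED = """
MERGE (a:Word {text: $word_a})
MERGE (b:Word {text: $word_b})
MERGE (a)-[:RELATED_TO]->(b)
"""

# With the official neo4j Python driver this would run as:
#   session.run(MERGE_RELATED, word_a="cat", word_b="feline")
print(MERGE_RELATED.strip())
```

`MERGE` creates the nodes and the relationship only if they do not already exist, which keeps repeated notebook runs idempotent.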