Data Engineering Assignment

All API keys in this repository are dead! Use your own keys.

Description

This project involves processing and analyzing scraped data using PySpark, Redis, and Neo4j. The aim is to store, process, and analyze text data efficiently.

Usage

Starting Services

  1. Open PowerShell as Administrator and run wsl:

    wsl ~
  2. Start Hadoop and Spark services:

    start-dfs.sh
    start-yarn.sh
  3. Start Zookeeper and Kafka:

    Note: Wait for about 30 seconds before performing the next step.

    zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties &
    kafka-server-start.sh $KAFKA_HOME/config/server.properties &
  4. Switch to the student user:

    su - student

Running Notebooks (currently not working if you run scrape_articles_into_words while the consumer is running)

  1. Activate the virtual environment and start Jupyter Lab:

    source de-prj/de-venv/bin/activate
    jupyter lab
  2. Open two PowerShell terminals from Windows, enter WSL, and activate the virtual environment in each, so the prompt reads (de-venv) student@R2D3:~/your-directory$.

  3. To demonstrate Kafka, cd into the directory containing both scripts, then run them in separate terminals (a minimal sketch of both scripts appears after this list):

    • Producer Terminal:
      python kafka_producer_show.py
    • Consumer Terminal:
      spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.1 kafka_consumer_show.py

      [!IMPORTANT]
      Do NOT run "python kafka_producer_show.py" while scrape_articles_into_words.ipynb or neo4j.ipynb is running. "kafka_consumer_show.py" can run in the background.

  4. Run the notebooks in this sequence:

    • scrape_articles_into_words.ipynb
    • neo4j.ipynb
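
For reference, here is a minimal sketch of what the producer/consumer pair might look like. It assumes a local broker on localhost:9092, a hypothetical topic name "words", and the kafka-python package; the repository's actual scripts may differ.

    # kafka_producer_show.py -- minimal producer sketch (kafka-python assumed)
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: v.encode("utf-8"),
    )
    for word in ["spark", "redis", "neo4j"]:
        producer.send("words", word)  # topic name "words" is an assumption
    producer.flush()

    # kafka_consumer_show.py -- minimal Structured Streaming consumer sketch
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("KafkaConsumerShow").getOrCreate()
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "words")
              .load()
              .selectExpr("CAST(value AS STRING) AS word"))
    query = stream.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()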

Stopping Services

  1. Stop Kafka and Zookeeper:

    Note: Wait for about 30 seconds before performing the next step.

    kafka-server-stop.sh
    zookeeper-server-stop.sh
  2. Stop Hadoop and Spark services:

    stop-yarn.sh
    stop-dfs.sh

Data Storage and Processing

Data Collection and Raw Storage

  • What to Store: Raw scraped text data.
  • Where to Store: Hadoop HDFS.
  • Tool: PySpark for ingestion and Hadoop for storage.
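
A minimal ingestion sketch in PySpark, assuming scraped text sits on the local filesystem and HDFS listens on the default localhost:9000 (both paths are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RawIngestion").getOrCreate()

    # Read raw scraped text from the local filesystem (assumed path).
    raw = spark.read.text("file:///home/student/scraped/articles.txt")

    # Persist it unchanged into HDFS for downstream processing.
    raw.write.mode("overwrite").text("hdfs://localhost:9000/user/student/raw_articles")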

Processed Data

  • What to Store: Cleaned and tokenized text.
  • Where to Store: Hadoop HDFS or a relational database.
  • Tool: PySpark for preprocessing.
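
A preprocessing sketch in PySpark that lowercases, strips non-letters, and tokenizes; the HDFS paths are assumptions carried over from the ingestion sketch above:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lower, regexp_replace, split

    spark = SparkSession.builder.appName("Preprocess").getOrCreate()
    raw = spark.read.text("hdfs://localhost:9000/user/student/raw_articles")

    # Normalize case, replace punctuation/digits with spaces, split on whitespace.
    tokens = raw.select(
        split(regexp_replace(lower(raw.value), r"[^a-z\s]", " "), r"\s+").alias("tokens")
    )
    tokens.write.mode("overwrite").parquet("hdfs://localhost:9000/user/student/tokenized")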

Lexicon

  • What to Store: Words with definitions, relationships, and POS annotations.
  • Where to Store: Neo4j for relationships; Redis for fast retrieval.
  • Tool: Neo4j and Redis.
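
A lexicon-write sketch using the official neo4j and redis Python drivers. The connection details, the Word label, the SYNONYM_OF relationship, and the word:* key scheme are illustrative assumptions, not the repository's actual schema:

    import redis
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def store_word(word, definition, pos, synonym):
        # Upsert the word, its annotations, and one relationship into Neo4j.
        with driver.session() as session:
            session.run(
                "MERGE (w:Word {text: $word}) "
                "SET w.definition = $definition, w.pos = $pos "
                "MERGE (s:Word {text: $synonym}) "
                "MERGE (w)-[:SYNONYM_OF]->(s)",
                word=word, definition=definition, pos=pos, synonym=synonym,
            )
        # Mirror the definition into Redis for fast key-value retrieval.
        cache.set(f"word:{word}", definition)

    store_word("rapid", "moving quickly", "adjective", "fast")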

Analytics

  • What to Store: Analytical results.
  • Where to Store: Local files, Neo4j, and Redis.
  • Tool: Neo4j.
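
As an example of the kind of analytics this enables, a Cypher query against the same assumed schema, ranking words by synonym degree:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    # Hypothetical analysis: the ten most-connected words in the lexicon.
    with driver.session() as session:
        result = session.run(
            "MATCH (w:Word)-[:SYNONYM_OF]-() "
            "RETURN w.text AS word, count(*) AS degree "
            "ORDER BY degree DESC LIMIT 10"
        )
        for record in result:
            print(record["word"], record["degree"])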

Real-Time Updates

  • What to Store: New and updated words.
  • Where to Store: Kafka for message streaming.
  • Tool: Kafka and Spark Structured Streaming.
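
A streaming-update sketch with Spark Structured Streaming's foreachBatch hook; the "word-updates" topic name is an assumption, and the console sink is a placeholder for the project's real Neo4j/Redis writes:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordUpdates").getOrCreate()

    updates = (spark.readStream.format("kafka")
               .option("kafka.bootstrap.servers", "localhost:9092")
               .option("subscribe", "word-updates")  # assumed topic name
               .load()
               .selectExpr("CAST(value AS STRING) AS word"))

    def upsert_batch(batch_df, batch_id):
        # Placeholder sink: per-batch Neo4j/Redis upserts would go here.
        batch_df.show(truncate=False)

    query = updates.writeStream.foreachBatch(upsert_batch).start()
    query.awaitTermination()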

Decision Highlights

  • Neo4j: For storing and querying word relationships.
  • Redis: For fast key-value lookups.
  • Hadoop HDFS: For scalable storage of raw and processed data.
