This project was carried out at the University of Fribourg in the context of the course FS2023: 63091 Social Media Analytics.
To use the Influence on Sales application, please follow the steps below for installation and usage.
To install the Influence on Sales application, you will need to:
- First, create a directory on your computer and open a terminal in this directory.
- Go to https://git-lfs.com/ and install the Git LFS tool as explained on the website.
- Alternatively, after cloning the repository (next step), download the dataset amazon-meta.txt from https://github.com/qnater/InfluenceOnSales/blob/master/dataset/origin_dataset/amazon-meta.txt and place it in the folder ./dataset/origin_dataset/
- In the terminal, clone the GitHub repository by copying and pasting this command:
git clone https://github.com/qnater/InfluenceOnSales.git
- Once cloned, enter the cloned directory with the command:
cd InfluenceOnSales
- Install the requirements with the command:
pip install -r requirements.txt
- If an error appears, you may need to install pip first.
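If pip is missing, it can usually be bootstrapped with Python's standard ensurepip module:
python -m ensurepip --upgrade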
- Once installed, the application can be run with one of the following commands, according to your operating system:
python project_launcher.py (for Windows)
python3 project_launcher.py (for Mac)
- To understand how to use the different features and commands, please follow the user guide below.
The Influence on Sales application enables the analysis of datasets of Amazon products that are registered with the Amazon Standard Identification Number (ASIN). The app consists of different modules: Pre-Processing, Enrichment, Analytics, Exploration, Persistence and Visualization. These modules are integrated into the scenarios described below.
To call a scenario, simply run the project_launcher.py file in a terminal and enter the scenario to launch in the console.
This class provides six scenarios that conduct the analysis modules.
In this scenario, the initial dataset is cleaned and sampled into four different graphs. The process displays the number of nodes and the clustering quality of each graph. During the cleaning operation, unnecessary nodes are removed: nodes without outgoing edges, nodes without incoming edges, and isolated nodes (a code sketch follows the outputs below).
Datasets: amazon-meta.txt (700'000 nodes), dataset_off_amazon_enrichment.txt (180'000 nodes), dataset_off_amazon_big.txt (120'000 nodes), dataset_off_amazon_small.txt (60'000 nodes)
Outputs: runtime, clustering coefficient, number of nodes, number of edges, average degree
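As an illustration of the cleaning and metric steps, here is a minimal sketch assuming the graph is loaded as a networkx.DiGraph; the function names are hypothetical, not the project's actual API:

```python
import time
import networkx as nx

def clean_graph(g: nx.DiGraph) -> nx.DiGraph:
    """Single cleaning pass: drop nodes with no outgoing edges,
    no incoming edges, or no edges at all (isolated)."""
    g = g.copy()
    drop = [n for n in g if g.out_degree(n) == 0 or g.in_degree(n) == 0]
    g.remove_nodes_from(drop)
    return g

def graph_metrics(g: nx.DiGraph) -> dict:
    """Compute the figures reported by this scenario."""
    start = time.perf_counter()
    metrics = {
        "nodes": g.number_of_nodes(),
        "edges": g.number_of_edges(),
        "average_degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
        # Clustering is computed here on the undirected projection.
        "clustering_coefficient": nx.average_clustering(g.to_undirected()),
    }
    metrics["runtime_s"] = time.perf_counter() - start
    return metrics
```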
In this scenario, we compare community detection algorithms across different datasets. On each graph, three different community detection algorithms are executed (a simple homemade algorithm, an enhanced homemade algorithm with edge weights, and the NetworkX library implementation), popular nodes are identified, and the quality of the community partition is evaluated with metrics such as accuracy, precision, recall, Jaccard similarity and silhouette index (see the sketch after the outputs below).
Datasets: dataset_off_amazon_enrichment.txt (180'000 nodes), dataset_off_amazon_big.txt (120'000 nodes)
Outputs: runtime, silhouette index, accuracy, precision, recall, Jaccard similarity, communities detected, popular nodes of each community with centrality value
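A minimal sketch of the NetworkX side of this comparison, using greedy modularity maximization as a stand-in; the homemade algorithms would be swapped in at the same place, and the function name is hypothetical:

```python
import networkx as nx
from networkx.algorithms import community

def detect_communities(g: nx.Graph):
    """Detect communities and the most central (popular) node of each."""
    # NetworkX community detection; the project's homemade variants
    # would replace this call for the comparison.
    communities = community.greedy_modularity_communities(g)
    centrality = nx.degree_centrality(g)
    popular = [max(c, key=centrality.get) for c in communities]
    return communities, popular
```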
In this scenario, a small sample of the dataset is used to visualize the graph. After running the community detection algorithm, the graph is plotted with each community in a different color and the most popular node of each community highlighted (see the sketch below).
Datasets: dataset_off_amazon_test.txt (11'000 nodes)
Outputs: plot image
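A minimal plotting sketch with matplotlib, assuming the communities and popular nodes from the previous scenario; colors and layout are illustrative choices, not the project's exact rendering:

```python
import matplotlib.pyplot as plt
import networkx as nx

def plot_communities(g: nx.Graph, communities, popular):
    """Draw each community in its own color; highlight popular nodes."""
    pos = nx.spring_layout(g, seed=42)  # fixed seed for reproducibility
    for i, comm in enumerate(communities):
        nx.draw_networkx_nodes(g, pos, nodelist=list(comm),
                               node_size=20, node_color=f"C{i % 10}")
    nx.draw_networkx_nodes(g, pos, nodelist=popular,
                           node_size=120, node_color="red")
    nx.draw_networkx_edges(g, pos, alpha=0.2)
    plt.axis("off")
    plt.show()
```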
In this scenario, a small sample of the initial dataset is used to conduct a deep analysis of the quality of the graph, as well as of the connections (paths) between nodes and communities (see the sketch below).
Datasets: dataset_off_amazon_test.txt (11'000 nodes)
Outputs: plot image
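One way to explore the connections between communities is to compute shortest paths between their popular nodes; here is a minimal sketch under that assumption, not the project's actual exploration code:

```python
import networkx as nx

def inter_community_paths(g: nx.Graph, popular):
    """Shortest paths between the popular nodes of each community."""
    paths = {}
    for i, src in enumerate(popular):
        for dst in popular[i + 1:]:
            if nx.has_path(g, src, dst):
                paths[(src, dst)] = nx.shortest_path(g, src, dst)
    return paths
```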
This scenario analyses the relationship between the betweenness centrality of the popular nodes of each community and their actual sales ranks (see the sketch after the outputs below).
Datasets: dataset_off_amazon_enrichment.txt (180'000 nodes), dataset_off_amazon_test.txt (11'000 nodes)
Outputs: ASIN, betweenness centrality value, sales rank
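A minimal sketch of that comparison, assuming sales ranks are available as a dict keyed by ASIN; the sampling parameter k is an illustrative performance trade-off, not a project setting:

```python
import networkx as nx

def centrality_vs_sales_rank(g: nx.Graph, popular, sales_rank: dict):
    """Pair each popular node's betweenness centrality with its sales rank."""
    # Exact betweenness is costly on large graphs; approximate it by
    # sampling k source nodes.
    bc = nx.betweenness_centrality(g, k=min(500, len(g)))
    return [(asin, bc[asin], sales_rank.get(asin)) for asin in popular]
```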
This scenario conducts the four scenarios described above in succession.
Datasets (based on the user's choice): dataset_off_amazon_enrichment.txt (180'000 nodes), dataset_off_amazon_big.txt (120'000 nodes), dataset_off_amazon_small.txt (60'000 nodes), dataset_off_amazon_test.txt (11'000 nodes)
Outputs: runtime, clustering coefficient, number of nodes, number of edges, average degree, silhouette index, accuracy, precision, recall, Jaccard similarity, communities detected, popular nodes of each community with centrality measures, plot images
This file runs the unit tests of every implementation in the CircleCI pipeline connected to GitHub.
This file compares the different algorithms for group-based community detection.
This file merges the main Amazon dataset with the enrichment dataset.
This file populates the online Neo4j database.
- You can find the database at this link: https://workspace-preview.neo4j.io/workspace/query
- To connect, please go to "Query", click on the central button "No connection", then on "Connect".
- Connection URL: 95147e5a.databases.neo4j.io:7687
- Database user: neo4j
- Password: GslPkJDwnmAZC_COZUcHQ1hFymVSQTzS_f6loACAyNY
- To import the queries, please go to "Saved Cypher" and import the file ./docs/neo4j_queries.csv from the project tree.
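For programmatic access, here is a minimal sketch using the official neo4j Python driver and the connection details above; the Product label and CO_PURCHASED relationship are illustrative assumptions, not necessarily the project's actual schema:

```python
from neo4j import GraphDatabase

URI = "neo4j+s://95147e5a.databases.neo4j.io"
AUTH = ("neo4j", "GslPkJDwnmAZC_COZUcHQ1hFymVSQTzS_f6loACAyNY")

def push_edges(edges):
    """Merge (source, target) ASIN pairs into the online database."""
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        with driver.session() as session:
            for src, dst in edges:
                session.run(
                    "MERGE (a:Product {asin: $src}) "
                    "MERGE (b:Product {asin: $dst}) "
                    "MERGE (a)-[:CO_PURCHASED]->(b)",
                    src=src, dst=dst,
                )
```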