Toxic Comment Classification Challenge

This repository contains a Jupyter notebook that classifies toxic comments into six categories. The goal is to develop machine learning models that outperform existing benchmarks on the challenge's evaluation metric, mean column-wise ROC AUC. The challenge is hosted on Kaggle: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview

Notebook Overview

Key Features:

Exploratory Data Analysis (EDA):

Visualization of word frequencies using word clouds.
Distribution analysis of the six toxicity categories.
Correlation analysis between labels.
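
The label distribution and label correlation steps can be sketched with pandas. This is a minimal illustration on a made-up six-row frame, not the notebook's code; the column names are the six label names used by the Kaggle challenge, and the toy values are assumptions.

```python
import pandas as pd

# The six label columns of the Kaggle challenge.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical toy labels standing in for the real training data.
df = pd.DataFrame({
    "toxic":         [1, 1, 0, 0, 1, 0],
    "severe_toxic":  [0, 1, 0, 0, 0, 0],
    "obscene":       [1, 1, 0, 0, 0, 0],
    "threat":        [0, 0, 0, 0, 1, 0],
    "insult":        [1, 1, 0, 0, 0, 0],
    "identity_hate": [0, 0, 0, 1, 0, 0],
})

# Distribution: number of comments carrying each label.
counts = df[LABELS].sum()
print(counts)

# Correlation: pairwise Pearson correlation between the binary label columns.
corr = df[LABELS].corr()
print(corr.round(2))
```

On the real data, the same two calls show how imbalanced the labels are and which labels tend to co-occur.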

Data Preprocessing:

Removal of special characters and punctuation using regular expressions.
Lowercasing text for consistency.
Removing multiple, leading, and trailing spaces from all comments.
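
The three preprocessing steps above can be sketched as a single function. This is an illustrative version using only the standard library; the function name `clean_comment` and the exact regular expressions are assumptions, and the notebook's own patterns may differ.

```python
import re

def clean_comment(text: str) -> str:
    """Lowercase, drop special characters and punctuation, normalize spaces."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim leading and trailing spaces.
    return re.sub(r"\s+", " ", text).strip()

print(clean_comment("  Hello,   WORLD!!  You're #1...  "))
# prints: hello world you re 1
```

Replacing punctuation with a space rather than deleting it keeps words such as "well-written" from fusing into a single token.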

Feature Engineering:

Implementation of text vectorization techniques:
- TfidfVectorizer: Computes term frequency-inverse document frequency (TF-IDF) scores and converts each comment into a vector of those scores for model training.
Creation of custom embeddings using SpaCy and pre-trained word vectors.

Model Training:

Evaluation of various machine learning models:
- Logistic Regression
- Random Forest Classifier
- Linear Support Vector Classifier (SVC)
Application of OneVsRestClassifier for multi-label classification.
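
A minimal sketch of the one-vs-rest setup follows, assuming TF-IDF features and Logistic Regression as the base estimator. The toy comments, the two label columns, and the `max_iter` value are made up for illustration; the real task has six label columns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data with two label columns: [toxic, threat].
comments = [
    "you are an idiot",
    "thanks for the helpful edit",
    "i will hurt you",
    "great article well written",
    "stupid idiot i will find you",
    "please see the talk page",
]
y = np.array([[1, 0], [0, 0], [1, 1], [0, 0], [1, 1], [0, 0]])

X = TfidfVectorizer().fit_transform(comments)

# OneVsRestClassifier fits one independent binary classifier per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

# One probability per comment and label; these scores feed the ROC AUC metric.
proba = clf.predict_proba(X)
print(proba.shape)
```

Swapping in `RandomForestClassifier` works the same way. `LinearSVC` has no `predict_proba`, so its scores would come from `decision_function` instead.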

Evaluation Metrics:

Accuracy
Precision, Recall, F1-Score (macro/micro/weighted averages)
Mean column-wise ROC AUC (primary challenge metric)
Confusion matrix and detailed classification reports
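
Mean column-wise ROC AUC is the ROC AUC computed separately for each label column and then averaged. The helper below is a sketch of that definition; the name `mean_columnwise_roc_auc` and the toy arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_columnwise_roc_auc(y_true, y_score):
    """Average the per-label ROC AUC over all label columns."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = [
        roc_auc_score(y_true[:, j], y_score[:, j])
        for j in range(y_true.shape[1])
    ]
    return float(np.mean(aucs))

# Hypothetical labels and predicted scores for four comments and two labels.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]])

result = mean_columnwise_roc_auc(y_true, y_score)
print(result)  # column AUCs are 1.0 and 0.75, so the mean is 0.875
```

For multi-label indicator input, `roc_auc_score(y_true, y_score, average="macro")` computes the same value. Each column must contain both classes, otherwise the per-column AUC is undefined.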

Technologies Used:

Natural Language Processing (NLP)
Machine Learning classification algorithms
Data visualization for interpretability

Requirements

To run the notebook, install the following Python libraries:

pandas
nltk
matplotlib
scikit-learn
spacy
numpy
wordcloud
seaborn

You can install the required libraries with:

pip install pandas nltk matplotlib scikit-learn spacy numpy wordcloud seaborn

Additionally, download the NLTK stopwords and the spaCy language model (if not already installed):

python -m nltk.downloader stopwords
python -m spacy download en_core_web_sm

How to Use

Clone the repository:

git clone https://github.com/Vimal-Raghubir/Toxic-Comment-Classification-Challenge.git

Open the Jupyter notebook:

jupyter notebook toxic_comment_classification_challenge.ipynb

Follow the structured steps in the notebook to:

Preprocess the dataset.
Explore data through visualizations.
Train and test models.
Analyze and interpret performance metrics.

Performance Evaluation

The notebook includes a comprehensive analysis of model performance:

Detailed metric evaluation for each label.
Mean column-wise ROC AUC as the key metric to align with the challenge requirements.
Comparison of model results to identify the best-performing approach.

Future Enhancements

Potential improvements to the notebook could include:

Implementing deep learning techniques such as RNNs or Transformers.
Hyperparameter optimization using grid search or random search.
Integration of external datasets to enhance model training.

Acknowledgments

This notebook was inspired by the Toxic Comment Classification Challenge, which aims to improve detection and classification of toxic online content, fostering a safer and more inclusive digital environment.
