Hate Speech Detection using Machine Learning for Roman Urdu

Machine Learning / AI

Project Details

Project Information

Project Title: Hate Speech Detection using Machine Learning for Roman Urdu

Category: Machine Learning / AI

Semester: Fall 2024

Course: CS619

Complexity: Complex

Project Description

 

Project Domain / Category

 

Data Science / Machine Learning / Natural Language Processing (NLP)

 

Abstract / Introduction

 

With the rise of online social media platforms, hate speech has become increasingly prevalent. Hate speech can lead to social tension and harm, especially in multilingual countries like Pakistan, where Roman Urdu is commonly used online. This project aims to develop a machine learning model to detect hate speech in Roman Urdu comments. The focus is on gathering a robust dataset of Roman Urdu comments from social media, pre-processing it, extracting relevant features, and training machine learning models to classify hate speech effectively. Once the model is complete, a web interface will be developed so that users can test its performance on real-time comments.

 

Functional Requirements:

 

The admin (student) will perform all of the functional requirements listed below.

 

1.      Data Collection

 

         For this project, the student will collect comments from any social media platform (such as YouTube, Facebook, Twitter, or Instagram) for hate speech detection. The dataset must contain at least 5000 comments, primarily in Roman Urdu. A sample dataset is shared in the link below for reference.

 

2.      Data Preparation

 

         Prepare the dataset by labelling it as "Hate Speech (HS)" or "Non-Hate Speech (NHS)." This step involves manually reviewing the data to assign appropriate labels, ensuring the dataset is clean and ready for use in machine learning.

 

3.      Data Pre-Processing

 

         Most real-world data is incomplete and contains noisy or missing values, so the student must pre-process the data. In this step, the student will normalize the dataset, handle stop words, missing values, noise, and outliers, and remove duplicate entries.
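
The pre-processing step can be sketched in plain Python as follows. This is a minimal illustration, not a prescribed implementation: the stop-word list, the function names `normalize` and `preprocess`, and the specific cleaning rules (URL and mention removal, squeezing repeated letters) are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real project would curate its own.
ROMAN_URDU_STOP_WORDS = {"hai", "ka", "ki", "ke", "ko", "se", "aur", "ye", "wo", "tha"}


def normalize(comment):
    """Lowercase, strip URLs, mentions, digits and punctuation, squeeze repeats."""
    text = comment.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # keep only Latin letters
    text = re.sub(r"(.)\1{2,}", r"\1", text)            # "bohaaat" -> "bohat"
    tokens = [t for t in text.split() if t not in ROMAN_URDU_STOP_WORDS]
    return " ".join(tokens)


def preprocess(comments):
    """Drop missing or empty comments and duplicates (compared after normalization)."""
    seen, cleaned = set(), []
    for comment in comments:
        if comment is None or not comment.strip():
            continue                   # missing value
        text = normalize(comment)
        if text and text not in seen:  # skip empty results and duplicates
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Comparing duplicates after normalization means that two comments differing only in case or punctuation count as one entry.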

 

4.      Feature Extraction

 

         After pre-processing, the student will apply a feature extraction method. The student can use Term Frequency-Inverse Document Frequency (TF-IDF) or N-Gram features such as Uni-Grams (1-Grams), Bi-Grams (2-Grams), or Tri-Grams (3-Grams).
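
To make the feature extraction options concrete, the sketch below combines word N-Grams with TF-IDF weighting in plain Python. It uses the textbook formula tf × log(N / df); libraries such as scikit-learn apply smoothed variants, so their numbers will differ. The function names and the two sample comments are invented for illustration.

```python
import math
from collections import Counter


def ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of every size from n_min to n_max."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(documents, n_min=1, n_max=2):
    """Return (sorted vocabulary, one {term: weight} dict per document)."""
    doc_grams = [ngrams(doc, n_min, n_max) for doc in documents]
    doc_freq = Counter()
    for grams in doc_grams:
        doc_freq.update(set(grams))  # count each term once per document
    n_docs = len(documents)
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        total = len(grams) or 1      # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return sorted(doc_freq), vectors


# Invented placeholder comments, already pre-processed.
comments = ["tum achay ho", "tum buray ho"]
vocabulary, vectors = tfidf(comments)
```

Terms that occur in every document, such as "tum" here, receive a weight of zero, which is how TF-IDF plays down words that do not help separate the classes.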

 

5.      Train & Test Data

 

         Split the dataset into 70% training and 30% testing data for the machine learning models.
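
A 70/30 split can be sketched with the standard library alone, as below. The function name `split_dataset` and the fixed seed are assumptions for the example. Note that this simple shuffle does not preserve class proportions; scikit-learn's `train_test_split` offers a `stratify` parameter for that.

```python
import random


def split_dataset(samples, labels, test_ratio=0.30, seed=42):
    """Shuffle indices reproducibly and split them into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = round(len(samples) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [samples[i] for i in train_idx]
    x_test = [samples[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test
```

With 5000 comments this yields 3500 training and 1500 testing examples.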

 

6.      Machine Learning Techniques

 

         The student must use at least three classifiers/models (e.g. Naïve Bayes, Multinomial Naïve Bayes, SVM with a polynomial kernel, SVM with an RBF kernel, Decision Tree, Random Tree, or Random Forest) drawn from three different machine learning techniques/algorithms.
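
As an illustration of one of the listed classifiers, here is a minimal from-scratch sketch of Multinomial Naïve Bayes over raw word counts with add-one (Laplace) smoothing. The class name, the choice to ignore unseen words, and the four training comments are assumptions invented for the example; in practice the three required classifiers would more likely come from a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in text.split():
                if word not in self.vocabulary:
                    continue                          # ignore unseen words
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder training data, far too small for real use.
train_texts = ["tum log ghatiya ho", "in logon se nafrat hai",
               "aap ka shukriya", "bohat acha kaam hai"]
train_labels = ["HS", "HS", "NHS", "NHS"]
model = MultinomialNaiveBayes().fit(train_texts, train_labels)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.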

 

7.      Confusion Matrix

 

         Generate a confusion matrix to evaluate the performance of each classification model.
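
For a two-class problem the confusion matrix reduces to four counts. The sketch below treats "HS" as the positive class, which is an assumption made for the example, as are the function name and the dictionary layout.

```python
def confusion_matrix(y_true, y_pred, positive="HS"):
    """Count true/false positives and negatives for a binary classifier."""
    matrix = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            matrix["TP" if actual == positive else "FP"] += 1
        else:
            matrix["FN" if actual == positive else "TN"] += 1
    return matrix
```

Here a false positive is a harmless comment flagged as hate speech, and a false negative is hate speech that the model missed.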

 

8.      Accuracy Evaluation

 

         Compute the accuracy of every technique and compare the results.

 

         This comparison will also show which machine learning technique is better at detecting hate speech in Roman Urdu comments.
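
Accuracy is the share of test comments that a model labels correctly, i.e. (TP + TN) / total. A minimal sketch of computing it for several models and ranking them is shown below; the function names are assumptions, and `predictions_by_model` is a hypothetical mapping from a model name to that model's predictions on the same test set.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true:
        raise ValueError("y_true must not be empty")
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


def rank_models(y_true, predictions_by_model):
    """Return (model name, accuracy) pairs sorted from best to worst."""
    scores = {name: accuracy(y_true, preds)
              for name, preds in predictions_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

All models must be evaluated on the same 30% test split for the comparison to be meaningful.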

 

9.      Web Interface Integration

 

         After the model development, integrate a web interface to allow users to test the model’s performance using real-time comments.
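
The shape of such an interface can be sketched with Python's built-in `wsgiref` module, used here in place of Flask or Django only to keep the example free of dependencies. The `classify` function is a hypothetical stand-in for the trained model (it merely checks for one placeholder keyword), and the page layout, port, and function names are likewise assumptions.

```python
import html
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server


def classify(comment):
    """Hypothetical stand-in for the trained model: flags one placeholder keyword."""
    return "HS" if "nafrat" in comment.lower() else "NHS"


PAGE = """<html><body>
<form method="post">
  <input name="comment" placeholder="Roman Urdu comment">
  <button type="submit">Check</button>
</form>
<p>{result}</p>
</body></html>"""


def app(environ, start_response):
    """WSGI app: GET shows the form, POST classifies the submitted comment."""
    result = ""
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(size).decode("utf-8")
        comment = parse_qs(body).get("comment", [""])[0]
        # Escape user input before echoing it back into the page.
        result = "Prediction for '{}': {}".format(html.escape(comment), classify(comment))
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [PAGE.format(result=result).encode("utf-8")]


def run(port=8000):
    """Serve the app on the given port until interrupted."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling `run()` starts the server locally. In the actual project, `classify` would load the saved model and feature extractor and apply the same pre-processing used during training.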

 

Tools/Techniques:

 

         Anaconda: Python distribution platform for development.

 

         Jupyter Notebook: For implementing machine learning models.

 

         Python: Programming language used for data pre-processing, model training, and feature extraction.

 

         Machine Learning Algorithms: For training and testing hate speech detection.

 

         Web Interface: Basic HTML/CSS, Flask, or Django.

 

Prerequisite:

 

         Knowledge of Artificial Intelligence, Machine Learning, and Natural Language Processing concepts is required. Students should complete a short course on these concepts, or study the links below, alongside preparing the initial SRS and Design documents.

 

Helping Material:

 

Python:

 

https://www.python.org/

 

https://www.w3schools.com/python/

 

https://www.tutorialspoint.com/python/index.htm

 

Feature Extraction Method:

 

https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be

 

https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/

 

https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

 

https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlp-a-beginners-guide-to-understand-natural-language-processing/

 

http://uc-r.github.io/creating-text-features

 

Machine Learning Techniques:

 

https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0

 

https://towardsdatascience.com/top-10-algorithms-for-machine-learning-beginners-149374935f3c

 

https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9

 

https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623

 

https://www.youtube.com/watch?v=fG4e4TUrJ3E

 

https://www.youtube.com/watch?v=7eh4d6sabA0

 

Dataset:

 

https://drive.google.com/file/d/1Jq62ErAQiMpWfEz9_DwSkjmyYdmwWWu6/view

 

Supervisor:

 

Name: Tayyab Waqar

 

Email ID: tayyab.waqar@vu.edu.pk

 

Skype ID: maliktayyab786_1

 

Languages

  • Python Language

Tools

  • Any Modern Tool

Project Schedules

Assignment #   Title               Start Date                            End Date                               Sample File
1              SRS Document        Friday, 8 November 2024, 12:00 AM     Wednesday, 4 December 2024, 12:00 AM
2              Design Document     Thursday, 5 December 2024, 12:00 AM   Thursday, 27 February 2025, 12:00 AM
3              Prototype Phase     Friday, 28 February 2025, 12:00 AM    Tuesday, 18 March 2025, 12:00 AM
4              Final Deliverable   Wednesday, 19 March 2025, 12:00 AM    Monday, 5 May 2025, 12:00 AM
