Classification of sexual harassment personal stories
Problem Statement
The Me Too (#MeToo) movement is a social movement against sexual abuse and sexual harassment in which people publicize allegations of sexual crimes. The purpose of the Me Too movement is to empower people who have been sexually assaulted through empathy, solidarity and strength in numbers.
Over the past few years, social media has come into wide use in social movements. The Me Too movement spread virally as a hashtag on social media.
From the above statistics we can see that social media has helped victims come forward and share their stories with everyone, which was not usually the case before.
With the vast number of personal stories shared by people on the internet, it is difficult to manually sort and understand the information in these stories.
How can machine learning be used to solve this problem?
Each personal story includes one or more tagged forms of sexual harassment, along with a description of the incident. Our problem is to categorize these stories. Each story can fall into one or more classes, or even none of them. We can treat this as a multi-label classification problem, or we can convert the multi-label problem into a multi-class one. We will see later in the blog how this can be achieved.
Data Overview
The data is collected from -
Safecity is a platform-as-a-service product that powers communities, police and city governments to prevent violence in public and private spaces. It is one of the largest publicly available online forums for reporting sexual harassment.
We have been provided three CSV files: Train, Dev and Test.
Train.csv — This file has around 7201 training samples and four columns.
Dev.csv — For development/validation we have 990 samples.
Test.csv — For testing we have 1701 samples.
We have three classes — Commenting, Ogling/Facial Expressions/Staring and Touching/Groping.
Existing Solutions
In the research paper, deep learning architectures like CNN, RNN and hybrid CNN-RNN with word and character embeddings have been used.
Improvements to existing approaches
The research paper uses only the textual data. With the help of feature engineering we will generate some new features, such as:
- Sentiment Score
- Noun Count
- Verb Count
- Adjective Count
- Adverb Count
- Pronoun Count
We will experiment with classical machine learning algorithms like Logistic Regression, KNN, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting, XGBoost and CatBoost.
Performance Metric
As we will be converting our multi-label problem to multi-class, our performance metric will be log loss.
Log loss: it is the average of the negative log of the predicted probability of the correct class label. Its value lies between 0 and infinity, and the smaller it is, the better.
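In symbols, if y_ij is 1 when sample i belongs to class j (and 0 otherwise), and p_ij is the probability our model predicts for that class, then for N samples and C classes:

```latex
\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \, \log(p_{ij})
```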
Precision: in layman's terms, the precision for a class m is the number of points our model predicted as class m that actually belong to class m, divided by all the points our model predicted as class m.
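As a quick illustration, here is a minimal sketch of both metrics using scikit-learn; the arrays below are made-up toy values, not our data:

```python
import numpy as np
from sklearn.metrics import log_loss, precision_score

# Toy example: 4 samples, 3 classes (made-up numbers)
y_true = np.array([0, 2, 1, 2])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.6, 0.2],
                   [0.5, 0.3, 0.2]])

print(log_loss(y_true, y_prob))                       # lower is better
y_pred = y_prob.argmax(axis=1)                        # predicted class labels
print(precision_score(y_true, y_pred, average=None))  # per-class precision
```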
Exploratory Data Analysis
Before solving the main problem, we will visualize our data and look for insights that may be useful later on.
From the above plot we can see that the majority of our samples belong to the class Commenting, while the class Ogling/Facial Expressions/Staring has the fewest samples.
We have 600 samples that belong to both Commenting and Ogling, while only 145 samples belong to both Ogling and Touching. From the above plots we can conclude that our data is highly imbalanced.
From the above bar plot we see that the most common words are man, bus, touched, tried and touching when one of the classes is Touching/Groping.
The most common words are staring, man, boys, commenting etc. when one of the classes is Ogling/Staring.
Another bar plot of the most common words belonging to class Commenting.
It can be seen here that group boys, took place, tried touch, ran away etc. were the most common bigrams in our corpus.
The most common trigrams were incident took place, survey carried safecity and red dot foundation.
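These plots can be reproduced with a simple n-gram count. Below is a minimal sketch using scikit-learn's CountVectorizer; the function name and the example sentences are illustrative, not taken from the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, ngram_range=(2, 2), top_k=10):
    """Return the top_k most frequent n-grams across a list of documents."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1                  # total occurrences of each n-gram
    vocab = vec.get_feature_names_out()
    order = totals.argsort()[::-1][:top_k]
    return [(vocab[i], int(totals[i])) for i in order]

# Tiny usage example with made-up sentences
print(top_ngrams(["a group of boys tried to touch me",
                  "the incident took place in a crowded bus"], (2, 2), 5))
```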
We converted our text data into 300-dimensional word vectors using pre-trained GloVe embeddings. We can see that our data points are not completely separable in 2 dimensions.
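A minimal sketch of this visualization, assuming a pre-loaded GloVe dictionary glove (word → 300-d vector) and a simple average of the word vectors; the original plot may have used the TF-IDF weighted version described later:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def avg_glove_vector(text, glove, dim=300):
    """Average the GloVe vectors of the words that appear in the text."""
    vecs = [glove[w] for w in text.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def plot_tsne(texts, labels, glove):
    """Project the 300-d story vectors to 2-D with t-SNE and colour them by class."""
    X = np.array([avg_glove_vector(t, glove) for t in texts])
    X_2d = TSNE(n_components=2, random_state=42).fit_transform(X)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10)
    plt.title("t-SNE of 300-d GloVe story vectors")
    plt.show()
```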
Insights from data analysis
- Our dataset is highly imbalanced, with the majority of the samples belonging to the class Commenting.
- t-SNE is not able to separate our samples in 2 dimensions.
Data Preprocessing and Feature Engineering
Before feature engineering, it is important to clean the data. The steps involved in cleaning the data are listed below (a short code sketch follows this list):
- Deconcatenation of words
- Removing all the stop words
- Converting every word into lower case.
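A minimal sketch of such a cleaning function, assuming "deconcatenation of words" refers to expanding contractions such as didn't → did not (the exact rules in the original code may differ):

```python
import re
from nltk.corpus import stopwords   # assumes the NLTK stopwords corpus is downloaded

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    """Expand common contractions, keep letters only, lower-case, drop stop words."""
    text = text.lower()
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"[^a-z ]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("He didn't stop even when I shouted!"))   # -> "stop shouted"
```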
The next step will be to map our multi-label problem to multi-class. We have 3 classes, so a total of 8 combinations are possible (a small mapping sketch follows this list):
- Only Commenting
- Only Ogling
- Only Touching
- Commenting and Ogling
- Commenting and Touching
- Touching and Ogling
- Commenting, Touching and Ogling
- Doesn’t belong to any class
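One simple way to do the mapping is to treat the three binary labels as bits of a single class id; the original blog may order the eight classes differently, so this encoding is only illustrative:

```python
def to_multiclass(commenting, ogling, touching):
    """Encode the three binary labels as one class id in the range 0-7."""
    return 4 * commenting + 2 * ogling + touching

# A story tagged as both Commenting and Touching maps to a single class id
print(to_multiclass(1, 0, 1))   # -> 5
```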
Feature Engineering —
We will convert our text data into vectors using TF-IDF weighted Word2Vec (tfidf-W2V), with pre-trained GloVe embeddings. Some additional features will also be generated (see the sketch after this list):
- Sentiment score of our text
- Counting the number of nouns, pronouns, adverbs, adjectives and verbs in our text.
We will have 305 features in total.
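A minimal sketch of how these features could be computed, assuming a pre-loaded GloVe dictionary glove (word → 300-d vector); the sentiment score here comes from NLTK's VADER and the POS counts from nltk.pos_tag, which are stand-ins since the original libraries are not specified:

```python
import numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer   # needs the vader_lexicon data
from sklearn.feature_extraction.text import TfidfVectorizer

sia = SentimentIntensityAnalyzer()

def tfidf_w2v(texts, glove, dim=300):
    """TF-IDF weighted average of GloVe word vectors for every document."""
    tfidf = TfidfVectorizer().fit(texts)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
    doc_vectors = []
    for text in texts:
        words = [w for w in text.split() if w in glove and w in idf]
        if words:
            weights = np.array([idf[w] for w in words])
            vecs = np.array([glove[w] for w in words])
            doc_vectors.append((weights[:, None] * vecs).sum(axis=0) / weights.sum())
        else:
            doc_vectors.append(np.zeros(dim))
    return np.array(doc_vectors)

def extra_features(text):
    """Sentiment score plus counts of nouns, pronouns, verbs, adjectives and adverbs."""
    tags = [tag for _, tag in nltk.pos_tag(text.split())]   # needs the POS tagger data
    pos_counts = [sum(t.startswith(p) for t in tags)
                  for p in ("NN", "PRP", "VB", "JJ", "RB")]
    return [sia.polarity_scores(text)["compound"]] + pos_counts
```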
Experiments with different models
First we will train a random model. It serves as a baseline against which the performance metrics of our actual models can be compared. The log loss of our random model came out to be 2.39 on our test dataset.
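A minimal sketch of such a random baseline, where the function name is illustrative and y_test holds the 8-way class ids:

```python
import numpy as np
from sklearn.metrics import log_loss

def random_model_log_loss(y_test, n_classes=8, seed=42):
    """Assign random (row-normalized) class probabilities and compute the log loss."""
    rng = np.random.default_rng(seed)
    probs = rng.random((len(y_test), n_classes))
    probs /= probs.sum(axis=1, keepdims=True)   # each row now sums to 1
    return log_loss(y_test, probs, labels=list(range(n_classes)))
```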
Logistic Regression
From the log loss we can see that our model is overfitting on the train dataset. The precision matrix shows that our model favors the dominant classes (1, 3 and 0) and is not able to predict class 6, i.e. Touching and Ogling.
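A rough sketch of this step, assuming feature matrices X_train, X_test and 8-way labels y_train, y_test (the hyperparameter value and the column-normalized precision matrix are my interpretation of the setup, not the exact original code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, confusion_matrix

def evaluate_logreg(X_train, y_train, X_test, y_test):
    """Fit logistic regression and print train/test log loss plus a precision matrix."""
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
    print("train log loss:", log_loss(y_train, clf.predict_proba(X_train)))
    print("test  log loss:", log_loss(y_test, clf.predict_proba(X_test)))
    cm = confusion_matrix(y_test, clf.predict(X_test)).astype(float)
    print(np.round(cm / cm.sum(axis=0, keepdims=True), 2))   # column-normalized = precision matrix
    return clf
```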
XGBoost
We ran a grid search on the XGBoost classifier and fine-tuned hyperparameters like n_estimators, max_depth, min_child_weight and reg_alpha.
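A minimal sketch of that search, assuming the xgboost package and feature matrices X_train, y_train; the grid values shown are illustrative, not the exact ones used:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

def tune_xgboost(X_train, y_train):
    """Grid search over the XGBoost hyperparameters mentioned above."""
    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 5, 7],
        "min_child_weight": [1, 3, 5],
        "reg_alpha": [0, 0.1, 1],
    }
    clf = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
    grid = GridSearchCV(clf, param_grid, scoring="neg_log_loss", cv=3, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_, -grid.best_score_)   # best cross-validated log loss
    return grid.best_estimator_
```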
Our log loss reduced from 1.66 to 1.4 by using XGBoost, but our model is still not able to predict class 6 at all.
Results on all of our models —
From the above table we can see that XGBoost was our best model.
Custom Stacking Classifier
We created a custom stacking classifier which takes k and a list of classifiers as hyperparameters. Suppose k = 20; then the classifier iterates 20 times over our list of classifiers, and at every iteration it randomly picks a model and trains it on our training samples. For the base models we trained our custom stacking classifier on [Logistic Regression, SVM, Random Forest, CatBoost, LightGBM, Decision Tree, XGBoost] with k = 20, and for the meta classifier our model randomly picked one classifier from [LightGBM, XGBoost, CatBoost].
The custom stacking classifier is implemented as a Python class. The __init__ method is similar to constructors in C++ and Java: it initializes the values of the class object and runs as soon as an object of the class is instantiated. train_base trains our base models by iterating k times over the list of classifiers, and train_meta trains our meta classifier: it randomly picks an index between 0 and the length of the list of meta classifiers, and the returned index decides which classifier gets trained. The evaluation function is used for testing; it prints our log loss, confusion matrix, precision matrix and recall matrix.
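A minimal sketch of this idea is below; the class and method names mirror the description above, while the bootstrap sampling in train_base and the probability stacking used as meta features are my assumptions about details not spelled out here:

```python
import random
import numpy as np
from sklearn.base import clone
from sklearn.metrics import log_loss, confusion_matrix

class CustomStackingClassifier:
    def __init__(self, k, base_classifiers, meta_classifiers):
        # Runs when the object is instantiated: store k and the (unfitted) model lists
        self.k = k
        self.base_classifiers = base_classifiers
        self.meta_classifiers = meta_classifiers
        self.fitted_base = []
        self.meta = None

    def train_base(self, X, y):
        """Iterate k times, each time randomly picking and fitting one base model."""
        for _ in range(self.k):
            clf = clone(random.choice(self.base_classifiers))
            idx = np.random.choice(len(X), size=len(X), replace=True)   # bootstrap sample
            clf.fit(X[idx], y[idx])
            self.fitted_base.append(clf)

    def _meta_features(self, X):
        # Stack the class probabilities predicted by every fitted base model
        return np.hstack([clf.predict_proba(X) for clf in self.fitted_base])

    def train_meta(self, X, y):
        """Randomly pick an index into the meta-classifier list and train that model."""
        i = random.randint(0, len(self.meta_classifiers) - 1)
        self.meta = clone(self.meta_classifiers[i]).fit(self._meta_features(X), y)

    def evaluate(self, X, y):
        """Print the log loss and confusion matrix on held-out data."""
        meta_X = self._meta_features(X)
        print("log loss:", log_loss(y, self.meta.predict_proba(meta_X)))
        print(confusion_matrix(y, self.meta.predict(meta_X)))
```

Usage would then look like: create the object with k = 20 and the two model lists, call train_base and train_meta on the training data, and call evaluate on the test data.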
Results
Our model is overfitting and favoring the dominant classes 0, 1 and 3. But considering we had only 7201 training samples, it is good enough as a first-cut solution.
End-to-End pipeline and Deployment
Our model is deployed on Heroku -
All the files for deployment can be found in my GitHub repository. You can go and check it out!
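A Heroku deployment like this typically wraps the trained model in a small Flask app; the sketch below is only illustrative, with "featurize" standing in for the repository's preprocessing code and "model.pkl" for the saved model file:

```python
import pickle
from flask import Flask, request, jsonify

# "featurize" is a hypothetical module standing in for the repo's preprocessing code
from featurize import clean_text, build_features

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))        # illustrative file name

@app.route("/predict", methods=["POST"])
def predict():
    story = request.form["story"]
    features = build_features(clean_text(story))    # assumed to return a 1 x 305 array
    return jsonify({"predicted_class": int(model.predict(features)[0])})

if __name__ == "__main__":
    app.run()
```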
Future Work
- LIME can be used to increase interpretability and check on what basis our model predicts a certain class.
- We can go a step further and use deep learning architectures like RNN and hybrid CNN-RNN.
- Extracting a few more training samples from the Safecity forum.
References
- https://www.appliedaicourse.com/
- https://arxiv.org/abs/1809.04739
- https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
- https://www.kaggle.com/josh24990/simple-stacking-approach-top-12-score
For any code-related files you can go to my GitHub repository —
Connect with me on LinkedIn —
Thank you for reading!