Classification of sexual harassment personal stories
Problem Statement
The Me Too (#MeToo) movement is a social movement against sexual abuse and sexual harassment in which people publicize allegations of sexual crimes. The purpose of the Me Too movement is to empower people who have been sexually assaulted through empathy, solidarity and strength in numbers.
Over the past few years, social media has come into wide use in social movements. The Me Too movement spread virally as a hashtag on social media.
From the above statistics we can see that social media has helped victims come forward and share their stories with everyone, which was not usually the case before.
With the vast number of personal stories shared by people on the internet, it is difficult to manually sort and understand the information in these stories.
How can machine learning be used to solve this problem?
Each personal story includes one or more tagged forms of sexual harassment, along with a description of the incident. Our problem is to categorize these stories. Each story can fall into one or more classes, or even none of them. We can treat this as a multi-label classification problem, or we can convert the multi-label problem into a multi-class one. We will see later in the blog how this can be achieved.
Data Overview
The data is collected from -
Safecity is a platform-as-a-service product that powers communities, police and city governments to prevent violence in public and private spaces. It is one of the largest publicly available online forums for reporting sexual harassment.
We have been provided three CSV files: Train, Dev and Test.
Train.csv — This file has around 7201 training samples and four columns.
Dev.csv — For development/validation we have 990 samples.
Test.csv — For testing we have 1701 samples.
We have three classes — Commenting, Ogling/Facial Expressions/Staring and Touching/Groping.
Existing Solutions
In the research paper, deep learning architectures like CNN, RNN and hybrid CNN-RNN with word and character embeddings have been used.
Improvements to existing approaches
The research paper uses only the textual data. With the help of feature engineering we will generate some new features, such as:
- Sentiment Score
- Noun Count
- Verb Count
- Adjective Count
- Adverb Count
- Pronoun Count
We will experiment with classical machine learning algorithms like Logistic Regression, KNN, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting, XGBoost and CatBoost.
Performance Metric
As we will be converting our multi-label problem to multi-class, our performance metric will be log loss.
Log loss: it is the average of the negative log of the predicted probability of the correct class label. Its value lies between 0 and infinity, and the smaller it is, the better.
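In symbols, if y_ij is 1 when sample i belongs to class j (and 0 otherwise), and p_ij is the probability our model predicts for that class, then for N samples and C classes:

```latex
\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \, \log(p_{ij})
```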
Precision: in layman's terms, the precision for a class m is the number of points our model predicted as class m that actually belong to class m, divided by all the points our model predicted as class m.
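As a quick illustration, here is a minimal sketch of both metrics using scikit-learn; the arrays below are made-up toy values, not our data:

```python
import numpy as np
from sklearn.metrics import log_loss, precision_score

# Toy example: 4 samples, 3 classes (made-up numbers)
y_true = np.array([0, 2, 1, 2])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.6, 0.2],
                   [0.5, 0.3, 0.2]])

print(log_loss(y_true, y_prob))                       # lower is better
y_pred = y_prob.argmax(axis=1)                        # predicted class labels
print(precision_score(y_true, y_pred, average=None))  # per-class precision
```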
Exploratory Data Analysis
Before solving the main problem, we will visualize our data and look for insights that may be useful later on.
From the above plot we can see that the majority of our samples belong to the class Commenting, while the class Ogling/Facial Expressions/Staring has the fewest samples.
We have 600 samples that belong to both Commenting and Ogling, while only 145 samples belong to both Ogling and Touching. From the above plots we can conclude that our data is highly imbalanced.
From the above bar plot we see that the most common words are man, bus, touched, tried and touching when one of the classes is Touching/Groping.
The most common words are staring, man, boys, commenting etc. when one of the classes is Ogling/Staring.
Another bar plot of the most common words belonging to class Commenting.
It can be seen here that group boys, took place, tried touch, ran away etc. were the most common bigrams in our corpus.
The most common trigrams were incident took place, survey carried safecity and red dot foundation.
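These plots can be reproduced with a simple n-gram count. Below is a minimal sketch using scikit-learn's CountVectorizer; the function name and the example sentences are illustrative, not taken from the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, ngram_range=(2, 2), top_k=10):
    """Return the top_k most frequent n-grams across a list of documents."""
    vec = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1                  # total occurrences of each n-gram
    vocab = vec.get_feature_names_out()
    order = totals.argsort()[::-1][:top_k]
    return [(vocab[i], int(totals[i])) for i in order]

# Tiny usage example with made-up sentences
print(top_ngrams(["a group of boys tried to touch me",
                  "the incident took place in a crowded bus"], (2, 2), 5))
```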
We converted our text data into 300-dimensional word vectors using pre-trained GloVe embeddings. We can see that our data points are not completely separable in 2 dimensions.
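A minimal sketch of this visualization, assuming a pre-loaded GloVe dictionary glove (word → 300-d vector) and a simple average of the word vectors; the original plot may have used the TF-IDF weighted version described later:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def avg_glove_vector(text, glove, dim=300):
    """Average the GloVe vectors of the words that appear in the text."""
    vecs = [glove[w] for w in text.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def plot_tsne(texts, labels, glove):
    """Project the 300-d story vectors to 2-D with t-SNE and colour them by class."""
    X = np.array([avg_glove_vector(t, glove) for t in texts])
    X_2d = TSNE(n_components=2, random_state=42).fit_transform(X)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10)
    plt.title("t-SNE of 300-d GloVe story vectors")
    plt.show()
```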
Insights from data analysis
- Our dataset is highly imbalanced, with the majority of the samples belonging to the class Commenting.
- t-SNE is not able to separate our samples in 2 dimensions.
Data Preprocessing and Feature Engineering
Before feature engineering, it is important to clean the data. The steps involved in cleaning the data are listed below (a short code sketch follows this list):
- Deconcatenation of words
- Removing all the stop words
- Converting every word into lower case.
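A minimal sketch of such a cleaning function, assuming "deconcatenation of words" refers to expanding contractions such as didn't → did not (the exact rules in the original code may differ):

```python
import re
from nltk.corpus import stopwords   # assumes the NLTK stopwords corpus is downloaded

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text):
    """Expand common contractions, keep letters only, lower-case, drop stop words."""
    text = text.lower()
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"[^a-z ]", " ", text)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("He didn't stop even when I shouted!"))   # -> "stop shouted"
```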
The next step will be to map our multi-label problem to multi-class. We have 3 classes, so a total of 8 combinations are possible (a small mapping sketch follows this list):
- Only Commenting
- Only Ogling
- Only Touching
- Commenting and Ogling
- Commenting and Touching
- Touching and Ogling
- Commenting, Touching and Ogling
- Doesn’t belong to any class
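One simple way to do the mapping is to treat the three binary labels as bits of a single class id; the original blog may order the eight classes differently, so this encoding is only illustrative:

```python
def to_multiclass(commenting, ogling, touching):
    """Encode the three binary labels as one class id in the range 0-7."""
    return 4 * commenting + 2 * ogling + touching

# A story tagged as both Commenting and Touching maps to a single class id
print(to_multiclass(1, 0, 1))   # -> 5
```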
Feature Engineering —
We will convert our text data into vectors using TF-IDF weighted Word2Vec (tfidf-W2V), with pre-trained GloVe embeddings. Some additional features will also be generated (see the sketch after this list):
- Sentiment score of our text
- Counting the number of nouns, pronouns, adverbs, adjectives and verbs in our text.
We will have 305 features in total.
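A minimal sketch of how these features could be computed, assuming a pre-loaded GloVe dictionary glove (word → 300-d vector); the sentiment score here comes from NLTK's VADER and the POS counts from nltk.pos_tag, which are stand-ins since the original libraries are not specified:

```python
import numpy as np
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer   # needs the vader_lexicon data
from sklearn.feature_extraction.text import TfidfVectorizer

sia = SentimentIntensityAnalyzer()

def tfidf_w2v(texts, glove, dim=300):
    """TF-IDF weighted average of GloVe word vectors for every document."""
    tfidf = TfidfVectorizer().fit(texts)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))
    doc_vectors = []
    for text in texts:
        words = [w for w in text.split() if w in glove and w in idf]
        if words:
            weights = np.array([idf[w] for w in words])
            vecs = np.array([glove[w] for w in words])
            doc_vectors.append((weights[:, None] * vecs).sum(axis=0) / weights.sum())
        else:
            doc_vectors.append(np.zeros(dim))
    return np.array(doc_vectors)

def extra_features(text):
    """Sentiment score plus counts of nouns, pronouns, verbs, adjectives and adverbs."""
    tags = [tag for _, tag in nltk.pos_tag(text.split())]   # needs the POS tagger data
    pos_counts = [sum(t.startswith(p) for t in tags)
                  for p in ("NN", "PRP", "VB", "JJ", "RB")]
    return [sia.polarity_scores(text)["compound"]] + pos_counts
```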
Experiments with different models
First we will train a random model. It serves as a baseline against which the performance metrics of our actual models can be compared. The log loss of our random model came out to be 2.39 on our test dataset.
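A minimal sketch of such a random baseline, where the function name is illustrative and y_test holds the 8-way class ids:

```python
import numpy as np
from sklearn.metrics import log_loss

def random_model_log_loss(y_test, n_classes=8, seed=42):
    """Assign random (row-normalized) class probabilities and compute the log loss."""
    rng = np.random.default_rng(seed)
    probs = rng.random((len(y_test), n_classes))
    probs /= probs.sum(axis=1, keepdims=True)   # each row now sums to 1
    return log_loss(y_test, probs, labels=list(range(n_classes)))
```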
Logistic Regression
From the log loss we can see that our model is overfitting on the train dataset. The precision matrix shows that our model favors the dominant classes (1, 3 and 0) and is not able to predict class 6, i.e. Touching and Ogling.
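A rough sketch of this step, assuming feature matrices X_train, X_test and 8-way labels y_train, y_test (the hyperparameter value and the column-normalized precision matrix are my interpretation of the setup, not the exact original code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, confusion_matrix

def evaluate_logreg(X_train, y_train, X_test, y_test):
    """Fit logistic regression and print train/test log loss plus a precision matrix."""
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
    print("train log loss:", log_loss(y_train, clf.predict_proba(X_train)))
    print("test  log loss:", log_loss(y_test, clf.predict_proba(X_test)))
    cm = confusion_matrix(y_test, clf.predict(X_test)).astype(float)
    print(np.round(cm / cm.sum(axis=0, keepdims=True), 2))   # column-normalized = precision matrix
    return clf
```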
XGBoost
We ran a grid search on the XGBoost classifier and fine-tuned hyperparameters like n_estimators, max_depth, min_child_weight and reg_alpha.
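A minimal sketch of that search, assuming the xgboost package and feature matrices X_train, y_train; the grid values shown are illustrative, not the exact ones used:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

def tune_xgboost(X_train, y_train):
    """Grid search over the XGBoost hyperparameters mentioned above."""
    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 5, 7],
        "min_child_weight": [1, 3, 5],
        "reg_alpha": [0, 0.1, 1],
    }
    clf = XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
    grid = GridSearchCV(clf, param_grid, scoring="neg_log_loss", cv=3, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_, -grid.best_score_)   # best cross-validated log loss
    return grid.best_estimator_
```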
Our log loss reduced from 1.66 to 1.4 by using XGBoost, but our model is still not able to predict class 6 at all.
Results on all of our models —
From the above table we can see that XGBoost was our best model.
Custom Stacking Classifier
We created a custom stacking classifier which takes k and a list of classifiers as hyperparameters. Suppose k = 20; then the classifier iterates 20 times over our list of classifiers, and at every iteration it randomly picks a model and trains it on our training samples. For the base models we trained our custom stacking classifier on [Logistic Regression, SVM, Random Forest, CatBoost, LightGBM, Decision Tree, XGBoost] with k = 20, and for the meta classifier our model randomly picked one classifier from [LightGBM, XGBoost, CatBoost].
The custom stacking classifier is implemented as a Python class. The __init__ method is similar to constructors in C++ and Java: it initializes the values of the class object and runs as soon as an object of the class is instantiated. train_base trains our base models by iterating k times over the list of classifiers, and train_meta trains our meta classifier: it randomly picks an index between 0 and the length of the list of meta classifiers, and the returned index decides which classifier gets trained. The evaluation function is used for testing; it prints our log loss, confusion matrix, precision matrix and recall matrix.
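A minimal sketch of this idea is below; the class and method names mirror the description above, while the bootstrap sampling in train_base and the probability stacking used as meta features are my assumptions about details not spelled out here:

```python
import random
import numpy as np
from sklearn.base import clone
from sklearn.metrics import log_loss, confusion_matrix

class CustomStackingClassifier:
    def __init__(self, k, base_classifiers, meta_classifiers):
        # Runs when the object is instantiated: store k and the (unfitted) model lists
        self.k = k
        self.base_classifiers = base_classifiers
        self.meta_classifiers = meta_classifiers
        self.fitted_base = []
        self.meta = None

    def train_base(self, X, y):
        """Iterate k times, each time randomly picking and fitting one base model."""
        for _ in range(self.k):
            clf = clone(random.choice(self.base_classifiers))
            idx = np.random.choice(len(X), size=len(X), replace=True)   # bootstrap sample
            clf.fit(X[idx], y[idx])
            self.fitted_base.append(clf)

    def _meta_features(self, X):
        # Stack the class probabilities predicted by every fitted base model
        return np.hstack([clf.predict_proba(X) for clf in self.fitted_base])

    def train_meta(self, X, y):
        """Randomly pick an index into the meta-classifier list and train that model."""
        i = random.randint(0, len(self.meta_classifiers) - 1)
        self.meta = clone(self.meta_classifiers[i]).fit(self._meta_features(X), y)

    def evaluate(self, X, y):
        """Print the log loss and confusion matrix on held-out data."""
        meta_X = self._meta_features(X)
        print("log loss:", log_loss(y, self.meta.predict_proba(meta_X)))
        print(confusion_matrix(y, self.meta.predict(meta_X)))
```

Usage would then look like: create the object with k = 20 and the two model lists, call train_base and train_meta on the training data, and call evaluate on the test data.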
Results
Our model is overfitting and favoring the dominant classes 0, 1 and 3. But considering we had only 7201 training samples, it is good enough as a first-cut solution.
End-to-End pipeline and Deployment
Our model is deployed on Heroku -
All the files for deployment can be found in my GitHub repository. You can go and check it out!
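A Heroku deployment like this typically wraps the trained model in a small Flask app; the sketch below is only illustrative, with "featurize" standing in for the repository's preprocessing code and "model.pkl" for the saved model file:

```python
import pickle
from flask import Flask, request, jsonify

# "featurize" is a hypothetical module standing in for the repo's preprocessing code
from featurize import clean_text, build_features

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))        # illustrative file name

@app.route("/predict", methods=["POST"])
def predict():
    story = request.form["story"]
    features = build_features(clean_text(story))    # assumed to return a 1 x 305 array
    return jsonify({"predicted_class": int(model.predict(features)[0])})

if __name__ == "__main__":
    app.run()
```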
Future Work
- LIME can be used to increase interpretability and check on what basis our model predicts a certain class.
- We can go a step further and use deep learning architectures like RNN and hybrid CNN-RNN.
- Extracting a few more training samples from the Safecity forum.
References
- https://www.appliedaicourse.com/
- https://arxiv.org/abs/1809.04739
- https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
- https://www.kaggle.com/josh24990/simple-stacking-approach-top-12-score
For any code-related files you can go to my GitHub repository —
Connect with me on LinkedIn —
Thank you for reading!