What Women DON’T Want.

Jonathan P. Scaccia
Oct 4, 2019


The Wandersman Center has been working with multiple bases across the U.S. Department of Defense, helping to implement sexual assault prevention activities. At one of these sites, the scope of work has focused specifically on helping people know the difference between sexual harassment and sexual assault, the behaviors associated with each, and, importantly, where to go to report them.

The project coordinator at this site found a thread on Reddit that seemed relevant: “Women of Reddit, what do men do that they think is okay but is actually creepy?” He said,

“I came across an incredibly unscientific treasure trove of possible content for sexual harassment training. I apologize for the vulgarity of some of the comments, but it appears to be an almost comprehensive list of how men (most aren’t really gender-specific, but it’s the way the question is framed) make women feel uncomfortable by coming across as creepy. No idea what to do anything with this, but it certainly appears to be of some value. After my analytics seminar, and I caveat that this is completely unreasonable, I want to scrape it somehow and use it for some version of evidence-based reasoning for scanning social media posts or something glamorous in that regard.”

CHALLENGE ACCEPTED.

First, we had to find a way to pull the data from Reddit. We tried a few methods, including copying and pasting the thread into a Word document. We ultimately settled on the RedditExtractoR package so we could preserve the metadata (like comment IDs). Due to limitations with Reddit’s API, only the top 500 threaded comments were available. At the time we ran this analysis, that was less than 1% of the comments in the overall thread. So, it’s only a sliver, but hopefully a useful sample.

We started with a process called “tokenization” to split the comments into unigrams (single words). We also removed stop words, which are very common words like “the” and “about” that don’t add useful information. After some additional cleaning to take care of things like plurals, we ended up with the graph below.

So it’s cool, but not that interesting.
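For readers who want to see what tokenization and stop-word removal look like in practice, here is a minimal sketch. It is written in Python with only the standard library, not the R tooling used for the actual analysis, and the stop-word list and example comments are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "and", "about", "at", "to", "of",
              "is", "it", "in", "on", "me", "my"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (unigrams)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def unigram_counts(comments):
    """Count every token that is not a stop word, across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(t for t in tokenize(comment) if t not in STOP_WORDS)
    return counts

# Hypothetical comments, made up for this example (not taken from the thread).
comments = [
    "Staring at me on the train.",
    "Standing too close and staring.",
]
counts = unigram_counts(comments)
print(counts.most_common(3))
```

A bar chart of the most common tokens is then just a plot of `counts.most_common(n)`.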

We then ran a topic model on the unigram data. Topic modeling presumes that documents are made up of topics and that topics are made up of words. To determine the optimal number of topics, we wrote a function that fit Latent Dirichlet Allocation (LDA) models with different numbers of topics and picked the one with the lowest perplexity score. After finding the optimal number, we graphed the results. This method yielded the grouping below.
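For context, perplexity is a measure of how well a fitted model predicts a set of documents, and lower is better. The usual definition for LDA is sketched below; the symbols are introduced here for illustration, and the post does not say which documents the score was computed on.

```latex
\mathrm{perplexity}(D) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```

Here $D$ is a collection of $M$ documents, $\mathbf{w}_d$ is the sequence of words in document $d$, $N_d$ is the number of words in that document, and $p(\mathbf{w}_d)$ is the probability the fitted model assigns to it.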

Now we’re cooking with gas. We can see some potential behaviors emerging here. Topic 1 appears to be about unwelcome touching. Topic 8 might be related to walking someone home. However, we can do much better using bigrams.

Bigrams are two-word phrases. We used a similar tokenization and stop-word removal process to create these. A pretty clear pattern emerges in the graph below.
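As a rough illustration of the bigram step, the sketch below pairs each word with the one that follows it and drops any pair containing a stop word. It is standard-library Python rather than the R tooling used for the analysis; the filtering rule, stop-word list, and example comments are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "an", "and", "about", "at", "to", "of",
              "is", "it", "in", "on", "he", "she"}

def bigram_counts(comments):
    """Count adjacent word pairs in which neither word is a stop word."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", comment.lower())
        # zip(tokens, tokens[1:]) yields each word with its successor.
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                counts[(first, second)] += 1
    return counts

# Hypothetical comments, made up for this example (not taken from the thread).
comments = [
    "He kept making eye contact on the bus.",
    "Too much eye contact is unsettling.",
]
bigrams = bigram_counts(comments)
print(bigrams.most_common(3))
```

The same pairs can also serve as edges in a word network, with each word as a node and the pair count as the edge weight.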

A related visualization is a network chart. We used this to see whether there were any chains of words. I highlighted the somewhat obvious one.

We then replicated the topic modeling method with bigrams. The topics this time were a bit messier, and we could have done a little bit more to clean these. Topics 1 and 14 are pretty straightforward. Not sure what is going on with topic 8!

And last, but not least, we ran a quick sentiment analysis of the data. Sentiment analysis maps specific words to an underlying emotion. We used the “bing” lexicon, which labels words as positive or negative. The graph below shows the top 10 words under each sentiment. At this point, the message is really hitting home.
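Lexicon-based sentiment analysis of this kind boils down to a dictionary lookup, which the sketch below illustrates in standard-library Python (not the R tooling used for the analysis). The small lexicon is a hypothetical stand-in and is not the actual “bing” word list; the tokens are invented as well.

```python
from collections import Counter

# Hypothetical stand-in lexicon (an assumption); NOT the real "bing" word list.
LEXICON = {
    "creepy": "negative",
    "uncomfortable": "negative",
    "unwanted": "negative",
    "friendly": "positive",
    "nice": "positive",
}

def top_words_by_sentiment(tokens, lexicon, n=10):
    """Return the n most frequent words under each sentiment label."""
    counts = {"positive": Counter(), "negative": Counter()}
    for token in tokens:
        label = lexicon.get(token)  # None when the word is not in the lexicon
        if label is not None:
            counts[label][token] += 1
    return {label: c.most_common(n) for label, c in counts.items()}

# Hypothetical tokens, made up for this example (not taken from the thread).
tokens = ["creepy", "nice", "creepy", "train", "uncomfortable", "nice", "nice"]
result = top_words_by_sentiment(tokens, LEXICON)
print(result)
```

Words missing from the lexicon, like “train” here, are simply ignored, so the results depend heavily on which lexicon is used.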

There are a few conclusions from this analysis. This is the first time we’ve pulled and studied data from Reddit. Increasingly, we are interested in observational and unobtrusive measures that give us information about specific topic areas. Natural language processing seems to be a promising method for looking at this kind of data.

Most importantly: DO NOT SEND DICK PICS.


What helps organizations function better to make an impact in the community? Views and analyses my own. Sometimes cross-posted to www.dawnchorusgroup.com
