Can we forecast ‘meme stock’ movements using Reddit data? Sentiment analysis of WSB.

Apr 6, 2021

Unless you ignore stock market news entirely, you have probably heard about GME (very briefly: users of one Reddit community coordinated and pushed the stock price of a dying video game retailer ‘to the moon’). In this post, we’ll explore what happened on Reddit during these volatile times.

In order to do so we’ll:

Scrape comments from Reddit’s WallStreetBets (WSB) subreddit. Alternatively, it’s possible to use this dataset, but it’s not very extensive
Perform sentiment analysis using VADER Sentiment and a BERT model
Get some interesting graphs and talk about them
Create an RNN to see if we can use the data above to forecast the direction of the stock price change over the next hour. Spoiler: no.

1. Scraping and initial analysis

The dataset obtained with our own scraper (or from Kaggle) looks like this:

It consists of around 90,000 non-empty posts scraped over the last 3 months using the Reddit API.

We can also get a nice representation using wordcloud:

Once we have extracted tickers (by downloading all tickers listed on NASDAQ and NYSE and looking for them among the words that either start with ‘$’ or consist only of capital letters), we can plot their distribution:
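The ticker matching described above could be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: `KNOWN_TICKERS` is a tiny hypothetical stand-in for the full NASDAQ/NYSE list, and the 1 to 5 letter limit on symbols is an assumption.

```python
import re
from collections import Counter

# Hypothetical stand-in for the full NASDAQ/NYSE symbol list.
KNOWN_TICKERS = {"GME", "AMC", "BB", "NOK", "PLTR"}

# A candidate is either "$" followed by letters (any case),
# or a standalone word made only of capital letters.
# Assumption: symbols are 1 to 5 letters long.
CANDIDATE = re.compile(r"\$([A-Za-z]{1,5})\b|\b([A-Z]{1,5})\b")


def extract_tickers(text, known=KNOWN_TICKERS):
    """Return the set of known tickers mentioned in `text`."""
    found = set()
    for dollar, caps in CANDIDATE.findall(text):
        symbol = (dollar or caps).upper()
        if symbol in known:
            found.add(symbol)
    return found


def ticker_distribution(posts, known=KNOWN_TICKERS):
    """Count in how many posts each ticker is mentioned."""
    counts = Counter()
    for post in posts:
        counts.update(extract_tickers(post, known))
    return counts
```

Note that with the real exchange lists, ordinary capitalised words that happen to be valid symbols will show up as false positives, so some manual filtering is probably needed.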

As you can see, more than half of the posts mention GME, which is why we’ll mostly focus on this stock from now on.

2. Sentiment

Now, let’s add sentiment analysis. The graphs below are based on VADER Sentiment, a lexicon- and rule-based model designed for analysing social media posts. We’ll use it through nltk. The notebook linked at the end also contains a similar analysis using the BERT model.

Here, we extracted 4 parameters using VADER (negative, neutral, positive and compound), but we’ll stick to ‘compound’, which is the best metric if we want to map sentiment to a single number.
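As a rough sketch of this step, VADER can be called through nltk as shown below. The helper names `make_vader` and `score_comments` are made up for illustration; only `SentimentIntensityAnalyzer`, `polarity_scores` and the `vader_lexicon` download come from nltk itself.

```python
def make_vader():
    """Build NLTK's VADER analyzer (fetches the lexicon if it is missing)."""
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()


def score_comments(comments, analyzer):
    """Return one dict per comment with VADER's four scores attached.

    `analyzer` is any object exposing `polarity_scores(text)`, which for
    VADER returns a dict with the keys neg, neu, pos and compound.
    """
    rows = []
    for text in comments:
        scores = analyzer.polarity_scores(text)
        rows.append({"text": text, **scores})
    return rows
```

Calling `score_comments(comments, make_vader())` on a list of comment strings then gives one row per comment, and the ‘compound’ value (between -1 and 1) is the number used in the rest of the post.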

3. Cool graphs

Now, finally, the fun part. We’ll consider the ‘total sentiment’ for each day and link it to the stock price. In the graphs below, you will see two parameters: ‘hype’ and ‘sentiment’. The first one is the sum of the absolute values of sentiment, so it basically indicates how popular the stock is. The second keeps the sign, so it shows whether Redditors are positive or negative about it. What should we expect to see? Assuming these people, led by ‘Roaring Kitty’, moved the GME share price, the hype and the stock price should be correlated.

But what we see is very interesting: the stock price went up first, and only about a week later did sentiment peak.
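To make the two definitions concrete, here is a minimal sketch of the daily aggregation. It assumes each comment has already been reduced to a (timestamp, compound score) pair; the function name is ours.

```python
from collections import defaultdict
from datetime import datetime


def daily_hype_and_sentiment(comments):
    """Aggregate per-comment compound scores into daily totals.

    comments: iterable of (datetime, compound_score) pairs.
    Returns {date: {"hype": sum of |score|, "sentiment": sum of score}}.
    """
    totals = defaultdict(lambda: {"hype": 0.0, "sentiment": 0.0})
    for ts, compound in comments:
        day = totals[ts.date()]
        day["hype"] += abs(compound)   # magnitude only: how much is being said
        day["sentiment"] += compound   # signed: positive vs negative mood
    return dict(totals)
```

A day with many strongly worded but evenly split comments would therefore show high hype and a sentiment close to zero.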

This is strange.

One might think this means that WallStreetBets users didn’t have much impact on this ‘exposure’, but we know that is not true. In my opinion, most WallStreetBets users got interested and started to comment about GME only after it ‘went to the moon🚀🚀🚀’. What’s also interesting is that the second peak, at the beginning of March, happened exactly when the stock price started to grow, as we would expect. This can be explained by the fact that a lot of people were watching the stock after it rapidly went up and down. Alternatively, the problem may be that we analyse comments instead of posts, and the situation with posts may be different, but it’s reasonable to assume they appear at the same time.

If we consider other stocks, the pattern is the same (with a couple of exceptions): the first peak is delayed, and all other peaks happen simultaneously with the stock price.

If you have another explanation for this, please feel free to drop a comment.

What follows from these graphs and discussions: since commenting peaked only after the price had already jumped, if Redditors wrote comments and bought GME at the same time, most of them were likely to lose money.

4. Bonus: RNN

What if we take all these sentiment features and create a Recurrent Neural Network to forecast the sign of the change in the stock price?

The challenge we face is that the stock exchange is open for 40 hours a week, and Reddit for 168 (unfortunately we didn’t get more frequent data). That’s why we added night data to the opening hours. Also, we have only 3 months of data at 40 points a week, so about 400 points in total. We split the data into train/validation/test as 70/15/15, so our test set contains only 62 points. To avoid overfitting too much, we’ll use 2 layers with 8 neurons each. We can represent the result as a confusion matrix:

We can see that, basically, it doesn’t work; there is certainly not enough statistical evidence to conclude otherwise. However, an RNN may not be the best tool here, and time series models like ARIMA may perform better. Finally, we saw that this sentiment data is not very helpful for forecasting, so we shouldn’t expect a miracle.
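For reference, the data preparation described in this section could be sketched as follows. This is one possible reading, not the notebook’s code: the trading window (`OPEN_HOUR`, `CLOSE_HOUR`) is an assumption chosen to give 40 hourly points a week, holidays are ignored, off-hours comments are assumed to roll forward to the next opening hour, and the split is assumed to be chronological rather than shuffled.

```python
from datetime import datetime, timedelta

# Assumed trading window: 8 hourly buckets per weekday = 40 per week.
OPEN_HOUR, CLOSE_HOUR = 9, 17


def trading_bucket(ts):
    """Map a comment timestamp to the hourly trading bucket it is counted in.

    Comments posted while the market is open stay in their own hour; anything
    posted at night or on a weekend rolls forward to the next opening hour.
    """
    ts = ts.replace(minute=0, second=0, microsecond=0)
    if ts.weekday() < 5 and OPEN_HOUR <= ts.hour < CLOSE_HOUR:
        return ts
    # After the close, or on a weekend, start from the following day.
    if ts.weekday() >= 5 or ts.hour >= CLOSE_HOUR:
        ts += timedelta(days=1)
    # Skip Saturday and Sunday.
    while ts.weekday() >= 5:
        ts += timedelta(days=1)
    return ts.replace(hour=OPEN_HOUR)


def chronological_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows into train/validation/test without shuffling."""
    n = len(rows)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return rows[:i], rows[i:j], rows[j:]
```

Keeping the split in time order means the test set is always the most recent stretch of data, so the model is never evaluated on hours that precede its training data.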

The notebook with code and more (different) examples can be accessed here.

This article was inspired by the Turing Machine & Deep Learning course at Erasmus University Rotterdam, the Netherlands. All analyses were performed together with Robin van Merle.