I enjoyed this recent article on fivethirtyeight.com analysing Trump followers on Reddit.com. It covered an
interesting range of technical concepts:
- Working with a very large dataset on Google’s BigQuery
- Applying a natural language processing technique in R
- Exploring the behaviour of a defined group of users online
I was interested to see if using a similar approach would allow insight into sports fans on Reddit, so I forked
the code from fivethirtyeight’s github repo and went about
adapting it to sports content. In this post I explain the analysis and provide some
interactive tables and charts for readers to play with. Overall I found that:
- US-specific sports generated the most activity (unsurprisingly given a strong bias on Reddit towards US users), although soccer was up there too
- The top Premier League clubs were on a par with NFL and NBA teams in terms of activity, particularly Manchester United, but lower-profile teams registered very little activity
- Individual sports subreddits tended to cluster, in behavioural terms, around geographies (e.g. high profile US sports) or the type of sport (e.g. boxing and mma)
- Supporters of teams behaved very similarly to other teams in the same league, and more so for teams that are close geographically (in the case of the NFL) or competitively (in the case of the Premier League)
Part of the motivation in writing this post was to showcase some of the analysis and web development tools and techniques we
use day-to-day in our client work. In this spirit, the SQL and R code to recreate all results and static versions of the plots can be
found on a forked repo on my github page, and the code for the interactive plots below, created using Chart.js and
Echarts 3, is visible in the page source.
Reddit.com
Before we dive into the dataset, some quick notes on Reddit. While public sources disagree about traffic to the site
(as is often the case), Reddit is consistently ranked among the largest websites in the US (4th and 11th) and
globally (7th and 26th).
The audience is heavily skewed to large English speaking markets, with almost half of visits over the last 2 years
from the US, and two thirds from the top 4 markets (US, Canada, UK and Australia), according to comScore. This
regional bias can also be observed in relative Google search volume (see the Google Trends result below, noting that the numbers
refer to the relative proportion of search within a market as opposed to the absolute number of searches, which is why the
US is not top) and should be kept in mind when considering results.
The Dataset
I have taken the Reddit comments data from a collection hosted on Google’s BigQuery. This contains a total of 1.7
billion comments from January 2015 to February 2017. A dataset of this size would be challenging to download, host
and analyse locally, but BigQuery allows us to query directly using normal SQL code. I was therefore able to generate some
summary statistics before exporting a smaller and more manageable dataset for additional analysis.
First up, I created a set of the largest property/sport subreddits and queried the following for each:
- total number of unique authors over the range
- number of comments they made
- average score (Reddit users can upvote or downvote each comment; the score is the number of upvotes minus the number of downvotes)
- number of comments per author
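As a rough illustration of the four metrics listed above, here is a minimal Python sketch that computes them from a handful of made-up comment records. The actual queries were written in SQL and run on BigQuery; the record layout, the sample values and the `summarise` helper below are assumptions for illustration only, not the code from the repo.

```python
from collections import defaultdict

# Hypothetical sample rows: (subreddit, author, score). Not real data.
comments = [
    ("AFL", "alice", 3),
    ("AFL", "alice", 5),
    ("AFL", "bob", 1),
    ("nhl", "carol", 2),
    ("nhl", "dave", 4),
]


def summarise(rows):
    """Return authors, comments, average score and comments per author for each subreddit."""
    authors = defaultdict(set)
    n_comments = defaultdict(int)
    total_score = defaultdict(int)
    for subreddit, author, score in rows:
        authors[subreddit].add(author)   # a set keeps each author once
        n_comments[subreddit] += 1
        total_score[subreddit] += score
    return {
        sub: {
            "authors": len(authors[sub]),
            "comments": n_comments[sub],
            "avg_score": total_score[sub] / n_comments[sub],
            "comments_per_author": n_comments[sub] / len(authors[sub]),
        }
        for sub in n_comments
    }


stats = summarise(comments)
for sub, row in stats.items():
    print(sub, row)
```

On the real dataset the same grouping would be done inside BigQuery, so that only the small summary table needs to be exported.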
There were over 80 million comments in the above subreddits alone over a two year period, corresponding to over 100,000
comments per day, and tens of millions more in the team subreddits described below. Reddit clearly represents a
significant platform for sports fans to get together. In pure volume terms, the US sports dominated, with the NFL and
NBA comfortably ahead of the pack. /r/soccer was the next highest, highlighting the fact that most soccer activity goes
through a general subreddit, as opposed to league-specific domains like /r/PremierLeague and /r/LaLiga. A similar situation
arises for /r/ufc and /r/mlb, with more activity in /r/MMA and /r/baseball respectively.
The comments per author metric gives an indication of the volume of activity by subreddit contributors. The high
engagement in Australian sports stands out, accounting for three of the top four subreddits: /r/AFL, /r/nrl and /r/Cricket
(although clearly UK and Indian users will be significant with cricket). Just 7,340 AFL authors contributed almost
1.4m comments. By contrast, /r/nhl, with almost triple the number of authors, generated just 7% of the number of comments
(this may be due to greater engagement in team-specific subreddits, see below). US-sport focused subreddits with
similarly high engagement were /r/CFB and /r/NASCAR.
Average scores are less interpretable since they will be naturally higher for larger subreddits, and the dataset does
not record upvotes and downvotes separately so I could not normalise. Despite this, /r/soccer stands out as having
particularly high average scores relative to its size, although this could be due to more ‘lurkers’ – users voting
but not commenting.
Leagues and Teams
Next I focused in on the three largest leagues/sports, looking at summary statistics for individual team subreddits in the NFL,
NBA and Premier League.
NFL
Unsurprisingly /r/Patriots came out top (again), with the volume of activity for other teams seemingly driven
by a combination of recent and historical success. I was surprised to see the highest number of comments per user
in /r/Browns – fans were seemingly not deterred by poor performance on the field, or maybe they just had a lot to
complain about.
NBA
A similar story for the NBA, with the two most recent champs topping the table, followed by the historically strong
Lakers and Bulls. /r/torontoraptors at 5th is perhaps a consequence of the higher relative Reddit interest in Canada.
Premier League
It felt neater to consider Premier League clubs, although arguably the top European clubs would have been a better set
since the smaller Premier League teams have relatively low levels of activity. The historical strength and high profile
of Manchester United (/r/reddevils) meant they topped the table, despite a lack of titles over the period. Behind them were
the remainder of the current ‘top six’. Of note is that Leicester (/r/lcfc) had a very low number of comments per author
– relatively unengaged fans jumping on the championship bandwagon perhaps?
It was interesting to see that the top Premier League clubs had similar numbers of authors and higher numbers of
comments than top NFL and NBA teams, despite the strong bias towards US users – Manchester United, Liverpool and
Arsenal all saw more comments than any NFL or NBA team, and only the Patriots had more authors than Manchester
United. Having said that, the drop off for other English clubs is stark.
This could be partially explained by the long term financial advantages enjoyed by top Premier League clubs,
in contrast to the forced egalitarianism of US leagues. This also may explain why the relatively smaller UK market
manages to generate high levels of top team activity relative to US counterparts – the smaller number of fans is
distributed much more unevenly across teams. Finally, as we know from our analysis of TV audience data, the Premier
League is incredibly international in its fanbase, and international fans are far more likely to follow a high
profile club like Manchester United than lower level clubs like West Bromwich Albion, whose presence in the Premier
League is likely to be transient.
Subreddit Similarity
Finally I stepped into a more detailed analysis of behaviour using latent semantic analysis (LSA), a natural
language processing technique often used to analyse text and speech. You can read the fivethirtyeight article
for more detail, but the basic concept and steps were:
- Use BigQuery to generate the author cross-over for all subreddit pairs (i.e. for every pair among the 50,323
subreddits, the number of authors who commented at least 10 times in both)
- Export this dataset to R and calculate a vector for every subreddit comprising its cross-over with 2,133 of the most
important subreddits (i.e. each subreddit is defined by a 2,133 dimensional vector of author cross-overs)
- Write functions in R to calculate the geometrical similarity between subreddits (i.e. the angle between each subreddit
vector), and run so-called subreddit algebra
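To make the geometry in the steps above more concrete, here is a minimal Python sketch of cosine similarity (the cosine of the angle between two vectors) and a toy version of subreddit algebra. The analysis itself was done in R on 2,133-dimensional vectors; the three-dimensional vectors, their values and the helper names below are all made up for illustration, and the original code may weight the raw cross-over counts before comparing them, which this sketch does not attempt.

```python
import math

# Hypothetical cross-over counts: for each subreddit, the number of shared
# authors with each of three made-up reference subreddits. Not real data.
vectors = {
    "nfl": [900.0, 800.0, 50.0],
    "nba": [850.0, 700.0, 80.0],
    "Cricket": [40.0, 60.0, 700.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combine(a, b, sign=1.0):
    """Add (sign=1.0) or subtract (sign=-1.0) two vectors element-wise."""
    return [x + sign * y for x, y in zip(a, b)]


def most_similar(target, candidates):
    """Name of the candidate whose vector is closest in angle to target."""
    return max(candidates, key=lambda name: cosine_similarity(target, candidates[name]))


# Subreddits with similar cross-over profiles score close to 1.
print(round(cosine_similarity(vectors["nfl"], vectors["nba"]), 3))
print(round(cosine_similarity(vectors["nfl"], vectors["Cricket"]), 3))

# "Subreddit algebra": combine two vectors, then look up the nearest subreddit.
mixed = combine(vectors["nfl"], vectors["Cricket"])
print(most_similar(mixed, vectors))
```

Because only the angle matters, a large subreddit and a small one can still come out as very similar if their authors spread across other subreddits in the same proportions.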
Memory limitations on my machine meant I was not able to run the analysis locally in R. Instead I spun up a higher memory
Google Cloud Compute Engine instance, installed R, and carried out all analysis remotely through SSH.
I was interested in the relationships between my sets of sports subreddits, so I created an additional function to calculate
a matrix of similarities for a set of subreddits, and plot the results in the form of a heatmap. The resulting plots are
shown below, with lighter squares indicating greater similarity.
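A function of this kind could look roughly like the Python sketch below, which builds a square matrix of pairwise cosine similarities for a chosen set of subreddits and averages the off-diagonal entries. The original function was written in R; the vectors here are made-up numbers, the helper names are hypothetical, and taking the mean over distinct pairs is my assumption about how an average similarity figure would be defined.

```python
import math
from itertools import combinations

# Hypothetical subreddit vectors (made-up numbers, not real cross-over counts).
vectors = {
    "boxing": [10.0, 200.0, 30.0],
    "MMA": [20.0, 180.0, 60.0],
    "formula1": [300.0, 20.0, 40.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_matrix(names, vecs):
    """Square matrix (list of lists) of pairwise similarities, in the order of names."""
    return [[cosine_similarity(vecs[r], vecs[c]) for c in names] for r in names]


def average_similarity(matrix):
    """Mean over distinct pairs, ignoring the diagonal (an assumed definition)."""
    pairs = list(combinations(range(len(matrix)), 2))
    return sum(matrix[i][j] for i, j in pairs) / len(pairs)


names = sorted(vectors)
matrix = similarity_matrix(names, vectors)
for name, row in zip(names, matrix):
    print(f"{name:>10}", " ".join(f"{value:.2f}" for value in row))
print("average:", round(average_similarity(matrix), 2))
```

The resulting matrix could then be handed to any plotting library to draw the heatmap, with the diagonal always equal to 1.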
Sports Subreddits - Average Similarity = 0.53
Light patches indicate pairs or groups of similar subreddits. Groups tended to emerge along geographical
and sporting lines:
- /r/nfl, /r/nba, /r/baseball, /r/cfb and /r/collegebasketball – similarities of 0.80 to 0.91 (maximum is 1)
- /r/afl and /r/nrl – similarity of 0.79
- /r/boxing and /r/mma – 0.79
Conversely, some sports subreddits emerged as being particularly different to others, most notably Cricket
and Formula 1, with average similarities of 0.46 and 0.40 respectively to all other subreddits, driven largely by their
low similarity to the US based properties.
What does this mean exactly? It means that authors on /r/collegebasketball are very similar to authors on
/r/cfb in terms of what other subreddits they comment on. My initial assumption was that this was driven by
the same authors posting on each subreddit, but this is only partially true – for /r/collegebasketball and
/r/cfb there was a cross-over of 8,500 authors (just 17% of authors from the smaller subreddit). This would
not completely explain the high level of similarity, and suggests that different users are behaving in
similar ways on the site.
NFL Subreddits - Average Similarity = 0.73
Performing the same analysis for NFL teams, the first thing to note is that the overall average similarity
was much higher than for the sports subreddits above. In other words NFL fans of different teams behave
very similarly. Not too surprising, but I’m not sure it’s something rival fans would be happy to admit.
The light spots tended to arise for teams located close to each other geographically e.g. /r/oaklandraiders
and /r/chargers, /r/minnesotavikings and /r/GreenBayPackers, /r/Saints and /r/Tennesseetitans. Unsurprisingly
/r/StLouisRams (a defiantly still-active subreddit with the tagline “Never Forget”) was very closely related
to /r/LosAngelesRams.
A few teams did emerge that were strikingly different from most other teams (darker lines on the plot),
namely /r/Seahawks, /r/CHIBears and /r/49ers.
NBA Subreddits - Average Similarity = 0.73
Overall similarity was the same for NBA teams as NFL teams, although NBA team subreddits appeared to be
less driven by geography. The NBA is often described as more of a ‘lifestyle’ league, corroborated by the finding in
the fivethirtyeight article that /r/sneakers is closely related to /r/nba. Perhaps this leads fans to
be less defined by their region, and more by a broader national culture. Again, curiously, a Chicago
team is noticeably different from most other teams.
Premier League Subreddits - Average Similarity = 0.58
Finally, the Premier League. Interestingly, the 'top six' formed a tight-knit set, with /r/reddevils,
/r/MCFC, /r/LiverpoolFC, /r/Gunners, /r/coys and /r/chelseafc all having similarities above 0.8. For all
these fans' apparent differences, they behave in remarkably similar ways on Reddit. Despite this,
the overall similarity for the league is significantly lower than for the NFL and NBA, albeit driven
by three of the least followed teams (Hull City, Burnley and Bournemouth).
This may be another function of the international nature of Premier League fans, particularly on Reddit
where the skew away from the UK will accentuate the bias. The smaller team subreddits are more likely
to be populated by local fans who are significantly different to international fans, driving the overall
similarity down.
Final Thoughts
It was interesting to get an idea of some macro trends across sports subreddits – the volume and share of
activity across sports and teams, and some insight into the behaviour of different groups of users – but
equally exciting (for me, at least…) was being able to tap into a massive dataset and produce meaningful
results relatively easily. If you are interested in the analysis, please feel free to build on it (there
is scope to go much deeper, even encompassing the content of each comment), or reach out with any questions
or comments.