In an article published this week in Forbes, Kalev Leetaru offers an interesting point of view on how social media data fits in with big data for research purposes.
Social media has become synonymous with “big data” thanks to its widespread availability and stature as a driver of the global conversation. Its massive size, high update speed and range of content modalities are frequently cited as a textbook example of just what constitutes “big data” in today’s data drenched world. However, if we look a bit closer, is social media really that much larger than traditional data sources like journalism?
We hold up social media platforms today as the epitome of “big data.” However, the lack of external visibility into those platforms means that nearly all of our assessments are based on the hand-picked statistics those companies choose to report to the public, and on the myriad ways those figures, such as “active users,” are constantly redefined to reflect the rosiest possible image of the growth of social media as a whole.
Much of our reverence for social platforms comes from the belief that their servers hold an unimaginably large archive of global human behavior. But is that archive really that much larger than the media that preceded it, like traditional journalism?
Facebook announced its first large research dataset last year, consisting of “a petabyte of data with almost all public URLs Facebook users globally have clicked on, when, and by what types of people.” Despite its petabyte stature, the actual number of rows was estimated to be relatively small. In all, the dataset was projected to contain just 30 billion rows when it was announced, growing at a rate of just 2 million unique URLs across 300 million posts per week, once completed.
To many researchers, 30 billion rows sounds like an extraordinary amount of data that they couldn’t possibly analyze in their lifetime. By modern standards, however, 30 billion records is a fairly tiny dataset and the petabyte as a benchmark of “big data” is long passé.
In fact, my own open data GDELT Project has compiled a database of more than 85 billion outlinks from worldwide news outlet homepages since March 2018, making it 2.8 times larger than Facebook’s dataset in just half the time.
Compared to news media, social media isn’t necessarily that much larger. It is merely that we have historically lacked the tools to treat news media as big data. In contrast, social media has aggressively marketed itself as “big data” from the start, with data formats and API mechanisms designed to maximize its accessibility to modern analytics.
In its 13 short years, Twitter has become the de facto face of the big data revolution when it comes to understanding global society. Its hundreds of billions of tweets give it “volume,” its hundreds of millions of tweets a day give it “velocity” and its mix of text, imagery and video offers “variety.”
Just how big is Twitter anyway?
The company itself no longer publishes regular reports of how many tweets are sent per day or how many have been sent since its founding, and it did not immediately respond to a request for comment on how many total tweets have been sent in its history. However, extrapolating from previous studies, we can reasonably estimate that, if trends have held, slightly over one trillion tweets have been sent since the service’s founding 13 years ago.
At first glance a trillion tweets sounds like an incredibly large number, especially given that each of those trillion tweets consists of a JSON record with a number of fields.
However, tweets are extremely small, historically maxing out at just 140 characters of text. This means that while there are a lot of tweets, each of those tweets says very little.
In reality, few tweets come anywhere near Twitter's historical 140-character limit. The average English tweet is around 34 characters while the average Japanese tweet is 15 characters, reflecting the varying information conveyed by a single character in each language.
Moreover, while raw Twitter data can be quite large (a month of the Decahose was 2.8TB in 2012), just 4% of a Twitter record is the tweet text itself. The remaining 96% is a combination of all of the metadata Twitter provides about each tweet and JSON’s highly inefficient storage format.
Since most Twitter analyses focus on the text of each tweet, this means the actual volume of data that must be processed to conduct common social analytics is quite small.
Assuming that all one trillion tweets were the maximum 140 characters long, that would yield just 140TB of text (the actual number would be slightly higher accounting for UTF8 encoding).
In 2012 the average tweet length Twitter-wide was 74 bytes (bytes, unlike characters, account for the additional length of UTF8 encoding of non-ASCII text), which would mean those trillion tweets would consume just 74TB of text: a large, but hardly unmanageable collection.
If we extrapolate from the 2012-2014 Twitter trends to estimate that somewhere in the neighborhood of 35% of all trillion tweets have been retweets (assuming no major changes in retweet behavior), then using that 74-byte average length would yield just 48TB of unique text.
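The three estimates above are simple multiplication, and a short sketch makes the arithmetic easy to check. It assumes decimal terabytes (10^12 bytes) and, for the upper bound, one byte per character; the function name `text_volume_tb` is purely illustrative and is not part of the original analysis.

```python
TRILLION = 10**12
TB = 10**12  # decimal terabyte in bytes (assumption: the article does not say which TB it uses)


def text_volume_tb(tweets, bytes_per_tweet, retweet_share=0.0):
    """Estimated tweet text volume in TB, optionally excluding retweets."""
    unique_tweets = tweets * (1 - retweet_share)
    return unique_tweets * bytes_per_tweet / TB


# Upper bound: every tweet is 140 characters, counted as 1 byte each (ASCII assumption).
max_length_tb = text_volume_tb(TRILLION, 140)

# The 2012 Twitter-wide average of 74 bytes per tweet.
average_tb = text_volume_tb(TRILLION, 74)

# The same average with an estimated 35% of tweets removed as retweets.
unique_tb = text_volume_tb(TRILLION, 74, retweet_share=0.35)

print(f"maximum length: {max_length_tb:.0f}TB")
print(f"average length: {average_tb:.0f}TB")
print(f"unique text:    {unique_tb:.0f}TB")
```

Because the tweet count, the average length and the retweet share are all extrapolations, the outputs are order-of-magnitude estimates rather than measurements.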
Of course, this is before the hyperlinks found in roughly a third of tweets are removed (again, assuming trends have held since 2014). It also ignores the prevalence of “@username” references in tweets that do not contribute to their analyzable text.
For comparison, the 2010 Google Books NGrams collection representing 4% of all books ever published totaled 500 billion words (361 billion English words) and was estimated to be around 3TB in size. That would make it 25 times smaller than the totality of Twitter. The Internet Archive’s collection of English language public domain books totals around 450GB of text, making it around 86 times smaller than Twitter.
The Google and Internet Archive digitized book collections include only a single copy of each book, making it unfair to compare them against Twitter with its myriad retweets. Filtering out retweets, we find that Twitter is just 16 times larger than the Google Books NGrams source collection, while the Internet Archive’s public domain books collection is around 54 times smaller.
It is a remarkable commentary on the digital era that just 13 years of tweets is larger than the two centuries of digitized books available to researchers today.
Partially this is due to the fact that such a small portion of our history has been digitized (less than 4% of known published books are represented in the Google Books NGrams dataset). In essence we are comparing the totality of 13 years of Twitter against just a 4% sample of two centuries of books.
A bigger factor is the fundamentally altered economics of publishing in the digital age. Through the two centuries of printed books in the two collections above, the cost of publishing a book was so substantial that very few authors were rewarded for their efforts with published volumes and every word of a book mattered.
In contrast, in the Twitter era one’s publishing volume is limited only by the speed one can type (or have a bot type on one’s behalf).
This means that to truly compare Twitter’s size to other datasets, we should compare it against a similar born-digital collection. Given that the news dataset above ended up being almost three times larger than the equivalent Facebook dataset in just half the time, how does Twitter stack up against traditional journalism?
Over the same 2014-present period that the news comparison covers, we can estimate based on previous trends that Twitter likely published in the neighborhood of 600 billion tweets, of which 330 billion were not retweets (assuming trends have held with retweet volume increasing over time).
This would work out to around 84TB of text during that period if every single tweet were the maximum 140 characters, or around 44TB using the 74-byte average tweet length. Excluding retweets, this would fall to just 24TB of text, again assuming the average tweet length.
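The figures for this shorter period follow from the same multiplication, applied to the estimated 600 billion tweets and 330 billion non-retweets. The sketch below assumes decimal terabytes and one byte per character for the 140-character bound; the variable names are illustrative only.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)

total_tweets = 600 * 10**9     # estimated tweets over the period
original_tweets = 330 * 10**9  # estimated tweets that were not retweets

# Upper bound: every tweet at the 140-character maximum, 1 byte per character.
upper_bound_tb = total_tweets * 140 / TB

# The 74-byte average tweet length reported for 2012.
average_tb = total_tweets * 74 / TB

# The same average applied only to non-retweets.
original_tb = original_tweets * 74 / TB

print(f"upper bound:      {upper_bound_tb:.0f}TB")
print(f"average length:   {average_tb:.0f}TB")
print(f"retweets removed: {original_tb:.0f}TB")
```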
News content can contain syndicated wire stories that are republished by multiple outlets, but the volume of such republication as a percent of the totality of daily journalistic output is unlikely to come close to the prominence of retweeting.
Counting all trillion tweets sent 2006-present and assuming all of them were the maximum 140 characters, the Twitter archive would be just 47 times larger than global online news output 2014-present as monitored by GDELT. Using the more realistic average tweet length, Twitter would be just 25 times larger and removing retweets it would be just 16 times larger.
Of course, those numbers compare a 13 year stretch of Twitter to just 4 years of news.
Comparing the two over the same four-year period, we find that Twitter was around 15 times larger than news, but just 8 times larger if retweets are removed.
Thus, if one had access to the complete Twitter firehose 2014-present, the total volume of text would likely be only around 8 times larger than the total volume of online news content over the same time period.
Seen in this light, Twitter is large, but it isn’t that much larger than global journalism, reminding us of just how much news is published each day across the world.
Precious few researchers have access to the full firehose, so the largest academic research is typically conducted with the Twitter Decahose, which contains around 10% of daily tweets.
The total Decahose output 2014-present is just 1.5 times larger than news. Removing retweets, the situation is reversed and news is actually 1.2 times larger than Twitter’s Decahose.
Few universities have the financial resources to subscribe to the Twitter Decahose, so the overwhelming majority of academic Twitter research is conducted either with Twitter’s search API or its 1% streaming API that makes available roughly 1% of daily tweets.
News is actually 6.7 times larger than the Twitter 1% stream over this period. If retweets are removed, news rises to 12.2 times larger than Twitter.
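The firehose, Decahose and 1% comparisons can be reproduced with a few lines of arithmetic. One input has to be assumed: the text as reproduced here never states the news volume, so the sketch back-solves it from the statement that 140TB of tweets would be 47 times larger than news, which gives roughly 3TB. The Twitter volumes come from the 600 billion and 330 billion tweet estimates at 74 bytes each, and the Decahose and 1% stream are treated as exact 10% and 1% samples. All names in the code are illustrative.

```python
# Assumption: news volume back-solved from "140TB would be 47 times larger than news".
NEWS_TB = 140 / 47  # roughly 2.98TB

# Estimated Twitter text over the comparison period, in TB.
TWITTER_TB = {"all tweets": 44.4, "retweets removed": 24.42}

# Assumption: the sampled products are treated as exact fractions of the firehose.
SAMPLE_SHARE = {"firehose": 1.0, "Decahose": 0.10, "1% stream": 0.01}


def compare(twitter_tb, news_tb=NEWS_TB):
    """Return (larger_source, factor) for a Twitter volume measured against news."""
    if twitter_tb >= news_tb:
        return "Twitter", twitter_tb / news_tb
    return "news", news_tb / twitter_tb


results = {}
for product, share in SAMPLE_SHARE.items():
    for label, full_tb in TWITTER_TB.items():
        results[(product, label)] = compare(full_tb * share)

for (product, label), (larger, factor) in results.items():
    print(f"{product:10s} {label:17s} {larger} is {factor:.1f}x larger")
```

Under these assumptions the output matches the ratios quoted in the text to within rounding. If the true news volume differs from the back-solved value, every ratio shifts proportionally.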
Thus, in terms of the 1% data that most academics work with, Twitter over the last four years is actually several times smaller than worldwide online news output over the same time period. Those academics lucky enough to work with the Decahose still have less content than they would get from news once retweets are removed. Yet, even if one had the entire firehose at one’s disposal, the totality of that content would be just 8 times larger than news content. Filtering out all of the hyperlinks and username references would drop that number even further.
In short, Twitter is certainly a large dataset, but in terms of the actual textual tweet contents that most analyses focus on, we see that a trillion tweets don’t actually work out to that much text due to their tiny size. In many ways, Twitter is more akin to behavioral messaging data than a traditional content-based platform, especially with the way its retweet behavior corresponds to the “like” and “engagement” metrics of other platforms.
Most importantly, we see that even at the full firehose level, Twitter isn’t actually that much larger than the traditional data sources that preceded it, like news media. Twitter may be faster, but it isn’t that much larger. In terms of the Decahose and 1% products that most researchers work with, news media actually offers a larger volume of analyzable content with far better understood provenance, stability and historical context.
Putting this all together, it has become the accepted wisdom of the “big data” era that the social media giants reign supreme over the data landscape, their archives forming the very definition of what it means to work with “big data.” Yet, as we’ve seen here, a trillion tweets quickly become just a few tens of terabytes of actual text, reminding us that high-velocity, small-message streams like Twitter may have very large record counts but contain very little actual data that is relevant to our analyses.
Just as importantly, we see that traditional data sources like news media are actually just as large as the social archives we work with, reminding us of the immense untapped data sources beyond the glittering novelty of social media.
Twitter certainly meets all of the definitions of “big data,” but if we look closely, we find that good old traditional journalism is not far behind. The difference is that social media has aggressively marketed itself as “big data” while journalism has failed to rebrand itself for the digital era.
In the end, rather than mythologize social media as the ultimate embodiment of “big data,” perhaps the biggest lesson here is that we should think creatively about how to harness the untapped wealth of data that surrounds us and bring it into the big data era.