Archiving Small Twitter Datasets for Text Analysis: A Workshop for Beginners

Ernesto Priego, City, University of London, United Kingdom


In this workshop for non-coders, participants will be guided through two tasks. The first is creating an application that taps into Twitter’s API in order to get Twitter data. The second is using a Google spreadsheet to capture streaming (live) data from Twitter so that it can be archived, downloaded and used for text analysis, data visualization and other studies. The workshop will include a brief introduction to good practice in social media data collection, including user data privacy issues.

Keywords: Archiving, Data Collection, Social Media, Twitter, Text Analysis


Twitter data can be very valuable for researchers in perhaps all disciplines, not just DH. Tweets are difficult to collect and analyse properly as they appear in the Twitter Web and mobile clients through which most people use Twitter, and search results cover only a very short span of time, so there is a danger of losing huge amounts of valuable historical material.

Tweets are like butterflies: one can only really look at them for long if one pins them down, out of their natural environment. We have access to Twitter data in any form thanks to Twitter’s API (Application Programming Interface). Free access to historic Twitter search results is limited to the last 7 days. There are several reasons for this, including the incredible amount of data that is requested from Twitter’s API and, as an educated guess, the fact that Twitter’s business model relies on its data being a commodity that can be resold for research. Twitter’s data is stored and managed by Twitter’s enterprise API platform.

For the researcher interested in Twitter data, this means that harvesting needs to be done not only through automated means but also in real time. It also puts scholars without the required coding and data mining skills at a disadvantage. In practice, there is no way to do proper research on Twitter data without understanding how it works at API level, including the limitations and possibilities this imposes on researchers.

What is an individual researcher without paid corporate access to do? The whole butterfly colony cannot be captured with the nets most of us have available. At small scale, however, and collecting in a timely fashion, it is still possible to capture interesting and more-or-less complete specimens using fairly simple methods that require no coding. (The Library of Congress now has 12 years’ worth of text-only Tweets. However, as of 26 December 2017 the Library of Congress Twitter collection remained embargoed and there was no projected timetable for providing public access.)

Most researchers are unlikely to benefit from access to huge Twitter data dumps. For researchers without many resources who are trying to talk the talk whilst walking the walk, conducting research both on Twitter and about Twitter, this workshop and tutorial will guide participants through creating a Twitter application in order to tap into the Twitter API, followed by setting up a Twitter Google Archiving Spreadsheet. Once a trial archive or dataset has been collected, we will attempt text analysis and basic visualisations using Excel and Voyant Tools. The workshop will include a brief introduction to good practice in social media data collection, including user data privacy and research ethics issues.
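The workshop itself requires no coding, but for readers who later want to script the term-counting step, the kind of frequency list that Voyant Tools produces can be roughly approximated as below. This is only a minimal sketch under stated assumptions: the column name `text`, the sample rows and the short stopword list are all hypothetical and are not taken from the archiving spreadsheet, so they would need adjusting to match the actual downloaded file.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "rt", "for", "on", "at", "it"}


def top_terms(csv_file, text_column="text", n=10):
    """Return the n most frequent terms found in one column of a CSV export.

    The column name "text" is an assumption; change it to match the archive.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        text = (row.get(text_column) or "").lower()
        text = re.sub(r"https?://\S+", " ", text)  # drop links before counting
        # Keep hashtags and @mentions as single tokens.
        for token in re.findall(r"[#@]?\w+", text):
            if token not in STOPWORDS and len(token) > 1:
                counts[token] += 1
    return counts.most_common(n)


# A tiny in-memory sample standing in for a downloaded archive (invented data).
sample = io.StringIO(
    "from_user,text\n"
    "user_a,Archiving tweets for #dh2016 https://example.org/x\n"
    "user_b,RT @user_a: Archiving tweets for #dh2016\n"
)
print(top_terms(sample, n=3))
```

With a real archive, the in-memory sample would be replaced by something like `open("archive.csv", newline="", encoding="utf-8")`, where the file name is again only a placeholder.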

Workshop Requirements

• Room with projector and screen

• Wifi access

• Power plugs for participants to charge devices if required

Participant Requirements

• Interest in collecting small Twitter datasets and basic text analysis

• Wifi-enabled laptop with Excel or similar spreadsheet software

• Twitter account, and the login credentials to access it (username and password)

Tools We’ll Use



• Voyant Tools

https://voyant –

The workshop can also be delivered in Spanish, or bilingually in English and Spanish.

Appendix A

  For complete references please follow the links in the referenced outputs below and in the body of the text above.

  1. Priego, E., 2018. #rfringe17: Top 230 Terms in Tweetage. [Accessed 30 January 2018].
  2. Priego, E., 2016. Bar Chart: Number of #DH2016 Tweets in Archive per Conference Day (Sunday 10 to Friday 15 July 2016 GMT). Available from:
  3. Priego, E., 2016. “Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets.
  4. Priego, E. and Zarate, C., 2014. #MLA14 Twitter Archive, 9–12 January 2014. Available from: [Accessed 31 January 2018].
  5. Priego, E., 2014. Some Thoughts on Why You Would Like to Archive and Share [Small] Twitter Data Sets. Available from:
  6. Priego, E., 2014. Publicly available data from Twitter is public evidence and does not necessarily constitute an “ethical dilemma”. London School of Economics Impact Blog. Available from:
