Twitter & The School of Data Science
Social media emerged into the mainstream as a research domain of intense interest in the mid-to-late 2000s.
As a result, UNC Charlotte’s School of Data Science was one of the first institutions of its kind to develop a formal relationship with Twitter for the purpose of acquiring and maintaining volumes of social-media-generated data sufficient for research. We remain, in many ways, a model for others.
Now, with more than twenty approved use cases on file, the School is thrilled to enter its sixth year of partnership with Twitter.
Available Types of Twitter Data
- The 1% Stream (the Spritzer stream) - A free, real-time feed accessible to virtually anyone. It is a portion of the Twitter real-time stream, but it is not a true statistical representation of the whole body of traffic. Twitter "massages" the feed to determine its content; for example, it excludes tweets containing certain information about the financial health of companies that readers could use to assess potential stock market actions. We store the content locally for use by researchers.
This data is ideal for researchers interested in time series analysis in which sampling is not an issue (e.g., aggregate analysis). The downside is that the data is currently available only in raw JSON form, split into numerous ten-minute files. To access it, researchers therefore need the programming skills to aggregate, filter, and handle raw JSON files.
- The Historical Power Track API (Application Programming Interface) - A subscription service for which SDS pays an annual fee. It is not a real-time feed. Instead, it lets us run queries, using parameters specified by UNC Charlotte researchers, to retrieve specific tweets from the Twitter Historical Archive.
Every tweet goes into the archive after 30 minutes (unless the author deletes it). We store the retrieved Tweets locally for access by researchers. Prior to using the API, each research topic must be reviewed and approved by Twitter through a Use-Case review process. Use of data obtained via the API is restricted to UNC Charlotte personnel or affiliates of the University.
This type of Twitter data is ideal for experienced Twitter researchers who are interested in full, cross-sectional samples. There are significant limits on the number of Tweets and the time range (number of days) that can be pulled per month across all researchers. Given these constraints, researchers need to craft their filtering rules carefully around their research question, so as to minimize noise (i.e., wasted Tweets) and maximize the number of Tweets relevant to their study.
- Existing Datasets - We currently hold datasets that we have previously extracted, available to University-affiliated researchers. These datasets can be used for classwork, and we strongly encourage new researchers to consider them before submitting requests for custom Gnip (Historical Power Track) data.
- The COVID-19 Data Feed - This is a real-time feed from Twitter. The content of the stream is based on parameters selected by Twitter, which they have determined to indicate content specific to COVID-19. Access to this feed is restricted to UNC Charlotte personnel. External (to UNC Charlotte) collaborators seeking to use data from this source must request access directly from Twitter.
SDS is currently working to develop a platform to host this feed. Until it is completed, any UNC Charlotte faculty interested in requesting a sample file should contact Trang Do.
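For researchers new to the 1% Stream files described above, the following Python sketch illustrates the kind of aggregation and filtering involved. It assumes that each ten-minute file stores one JSON object per line and that tweets carry the `created_at` and `text` fields of Twitter's standard tweet JSON; the actual file layout on SDS storage may differ. The function names and the `*.json` file pattern are hypothetical and chosen only for illustration.

```python
import json
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

# Timestamp format of the "created_at" field in Twitter's standard tweet JSON,
# e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def iter_tweets(lines):
    """Yield tweet dicts from an iterable of JSON lines, skipping non-tweets."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        # Control messages (e.g. delete notices) lack these fields.
        if isinstance(record, dict) and "text" in record and "created_at" in record:
            yield record


def hourly_counts(tweets, keyword):
    """Count tweets whose text contains `keyword`, grouped by UTC hour."""
    counts = Counter()
    needle = keyword.lower()
    for tweet in tweets:
        if needle in tweet["text"].lower():
            created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
            hour = created.astimezone(timezone.utc).strftime("%Y-%m-%d %H:00")
            counts[hour] += 1
    return counts


def iter_directory(directory, pattern="*.json"):
    """Stream tweets from every matching file in `directory`, in name order.

    The "*.json" pattern is an assumption about how the files are named.
    """
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            yield from iter_tweets(handle)
```

Under these assumptions, a call such as `hourly_counts(iter_directory("stream_files"), "vaccine")` would return a per-hour count of matching tweets, where `stream_files` stands in for whatever local directory holds the ten-minute files. Because the files are streamed one line at a time, the whole collection never has to fit in memory.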
Requesting Twitter Resources
- The 1% Stream and Existing Datasets: Contact Trang Do for more information.
- The Historical Power Track API: Interested researchers must fill out the following form, here
Sr. Project Manager Rick Hudson will contact the applicant to provide a unique use case number and, if needed, to request more information. The use case will then be submitted to Twitter for approval. Once it is approved, the applicant will be required to sign a Data Use Agreement (DUA). Finally, the SDS Twitter team will perform the data extraction before the applicant receives the dataset(s). Applicants are advised to plan their timing accordingly, given potential delays in Twitter’s approval process and our limited data extraction capacity.