Shawn Walker, University of Washington
Joe Eckert, University of Washington
Jeff Hemsley, University of Washington
Robert Mason, University of Washington
Karine Nahon, University of Washington
Social scientists who use large datasets gathered from social media face distinctive technical, methodological, and ethical challenges that are often not foregrounded in the academic literature. This work addresses these gaps by describing the evolving system used by the Social Media Lab at the University of Washington to collect cross-platform data related to the Occupy movement (38+ million tweets, Meetup.com check-ins, and Facebook posts). The system's design reflects the interrelationship of technical, methodological, and ethical considerations related to big data and social media. We justify our architecture and design choices vis-à-vis these gaps so that they are useful to other studies that involve collecting, curating, and analyzing large quantities of social media data.
Traditional databases and desktop computers are inadequate for the collection, storage, and analysis of larger, more complex social media data sets. More computing power, alternative tools for storage, and new methods for using computational clusters are needed. The primary technical challenges include setting up and maintaining computing system architectures suited to constant real-time data collection, and providing new forms of fast, flexible data storage for retrieval and curation. Additionally, each platform's application programming interface (API) is unique, complex, and changes over time. Longitudinal studies and cross-platform projects require agile tools and practices for collection, pre-processing, and data storage, or they risk losing relevant information because of the ephemeral nature of social media data and the data's potential volume. Traditional three-tier server architectures, with established database management approaches, are inappropriate design choices for social media data. In response, we developed an open-source system for the collection, storage, processing, filtering, and visualization of large social media data sets. Our system uses a clustered cloud computing architecture with tiers devoted to data collection, pre-processing and analysis via MapReduce, and storage and archiving via MongoDB (a high-performance, scalable NoSQL document-centric database), with a branched front end that supports both statistical analysis and web-based tools for the public presentation of, and interaction with, aggregated data.
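To make the pre-processing tier concrete, the following is a minimal sketch of the MapReduce pattern applied to counting hashtags. It is an illustration only, not the system described here: the record layout (a dictionary with a `hashtags` list) and the function names are hypothetical, and everything runs in a single process, whereas a real deployment would distribute the map and reduce steps across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_hashtags(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet.

    The 'hashtags' field is a hypothetical, simplified record layout.
    """
    for tag in tweet.get("hashtags", []):
        yield tag.lower(), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_hashtags(tweets):
    """Run the map step over every tweet, then reduce the combined output."""
    mapped = chain.from_iterable(map_hashtags(t) for t in tweets)
    return reduce_counts(mapped)


# Hypothetical sample records, invented for illustration.
sample_tweets = [
    {"id": 1, "hashtags": ["OWS", "OccupySeattle"]},
    {"id": 2, "hashtags": ["ows"]},
    {"id": 3, "hashtags": []},
]

hashtag_counts = count_hashtags(sample_tweets)
print(hashtag_counts)
```

Because the map step looks at one record at a time and the reduce step only needs pairs grouped by key, the same two functions can in principle be handed to a cluster framework without changing their logic, which is the property that makes the pattern attractive for large collections.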
Methodological issues influence not only technical system design choices, but also the credibility of knowledge claims. Discussions of data collection in the literature often omit the detail needed to evaluate those claims. For example, many papers note the number of tweets collected without mentioning the limits of the API. Without knowing which API was used and its sampling procedure, readers cannot critically assess the validity of a researcher's knowledge claims, as most of Twitter's multiple APIs offer only an undefined sample of the full Twitter stream. The APIs for social media data collection vary considerably and are not well documented, so the data that researchers hope to collect may not be the data they actually have. And while Kwak et al.'s work describes their method for dealing with noise in their Twitter data, most researchers report quantitative findings without discussing signal noise. Our method of sampling Twitter data draws on web-sphere analysis, wherein the researcher is required to understand deeply the sampling methods by which platforms provide data, the quality of the data in terms of signal-to-noise ratio, how the social norms prevalent on each platform affect the researcher's ability to collect the signal they expect, and the implications for knowledge claims.
Ethical considerations are rarely mentioned in empirical pieces, but emergent concerns surrounding the analysis of big data suggest some possible routes of inquiry. Issues of privacy and access stand out as points for possible intervention. Social media participants may intend their messages and actions for a select audience even though those messages and actions are publicly accessible. The "geoweb", or the collection of services combining Web 2.0 technologies with geographic location, is an area of contestation over the shifting definition of locational privacy. The addition of geographic components to big data introduces risks to the user that extend beyond the content of her discourse. Researchers need to reach beyond the ethically anemic terms of service (TOS) provided by each platform, as well as outdated institutional review board notions of "publicly available", in order to protect user privacy while still maintaining data usefulness and quality.
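One way to picture the trade-off between protecting user privacy and keeping data useful is the sketch below, which coarsens coordinates and replaces user identifiers with salted hashes before records are stored or shared. This is an illustrative assumption and not a description of our own procedure: the field names (`user_id`, `lat`, `lon`), the two-decimal precision, and the salt handling are all hypothetical, and such steps reduce rather than eliminate the risk of re-identification.

```python
import hashlib


def coarsen_coordinate(value, decimals=2):
    """Round a coordinate; 0.01 degrees of latitude is roughly one kilometre."""
    return round(value, decimals)


def pseudonymize_user(user_id, salt):
    """Replace an identifier with a truncated salted SHA-256 digest.

    This is pseudonymization, not anonymization: anyone who holds the salt
    can recompute the mapping for a known identifier.
    """
    digest = hashlib.sha256((salt + str(user_id)).encode("utf-8")).hexdigest()
    return digest[:16]


def protect_record(record, salt, decimals=2):
    """Return a copy of a record with a pseudonymous ID and coarse location."""
    protected = dict(record)
    protected["user_id"] = pseudonymize_user(record["user_id"], salt)
    if record.get("lat") is not None and record.get("lon") is not None:
        protected["lat"] = coarsen_coordinate(record["lat"], decimals)
        protected["lon"] = coarsen_coordinate(record["lon"], decimals)
    return protected


# Hypothetical record and salt, invented for illustration.
raw_record = {"user_id": 12345, "lat": 47.60621, "lon": -122.33207, "text": "..."}
protected_record = protect_record(raw_record, salt="project-secret")
print(protected_record)
```

The precision parameter makes the privacy and data-quality trade-off explicit: fewer decimals give more protection but coarser spatial analysis, so the choice should follow from the research question rather than from what the platform happens to expose.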
Researchers are ethically obliged to provide the detail necessary for others to replicate their findings. However, not all researchers have access to the technical expertise or resources required to collect and analyze social media data. We present a toolkit composed of free and open-source software that enables researchers to leapfrog these limitations. These tools lower the technological and methodological costs of social media analysis and so support the further production of knowledge.
In summary, our design is informed by the interrelationship of technical, methodological and ethical considerations related to big data and social media. For example, we place privacy at the foreground in developing research questions. Research questions in turn dictate our methodological approach and the data we collect. These all constrain the available technical design choices. Technical constraints influence the methods available to us, and thus also influence our questions. Indeed, careless researchers can unintentionally find their ethical considerations loosened by the affordances of technology. Our major contributions are that we illuminate the interrelationships of ethics, methods and technology in informing our technical design choices and offer an end-to-end lay description of our computing environment.
 boyd, d. and Crawford, K. Six Provocations for Big Data. (2011).
 boyd, d. and Marwick, A. Social Privacy in Networked Publics: Teens’ Attitudes, Practices, and Strategies. (2011).
 Elwood, S. and Leszczynski, A. Privacy, reconsidered: New representations, data practices, and the geoweb. Geoforum 42, 1 (2011), 6–15.
 Kwak, H., Lee, C., Park, H., and Moon, S. What is Twitter, a social network or a news media? Proceedings of the 19th international conference on World wide web, ACM (2010), 591–600.
 Scharl, A. and Tochtermann, K. The geospatial web: how geobrowsers, social software and the Web 2.0 are shaping the network society. Springer, London, 2007.
 Schneider, S.M. and Foot, K.A. Web sphere analysis: An approach to studying online action. In Virtual methods: Issues in social research on the Internet. 2005, 157–170.