Jorge Goncalves, University of Oulu
Millions of people log in daily to several different internet platforms (e.g., Facebook, Google, Twitter, and Reddit), which has fundamentally changed how humans communicate, seek information, discuss a plethora of topics, and follow their interests. For this reason, researchers have increasingly explored the stream of information originating from these social platforms. Previous work has highlighted how the information contained within these social streams can provide rich insights into people's opinions and perceptions on a number of different topics [1,2], what motivates certain online behaviours [3,4], and the effects of social networking sites on users [5]. However, an important challenge that scientists face with such large behavioural datasets is information overload due to the sheer quantity of available data [6].
In this paper we describe a web-based platform meant to provide an easy and approachable way to analyse Reddit's entire publicly available comment dataset. Reddit is an entertainment, social networking, and news website where registered community members can submit content, such as text posts or direct links, making it essentially an online bulletin board system. The dataset consists of ~1.8 billion JSON objects, each complete with the comment body, score (including up and down votes), author, subreddit, position in comment tree, creation time, and other fields that are available through Reddit's API. We created our platform by implementing a user-friendly web interface for making queries, which then presents the information visually in the form of multiple charts such as bar charts, histograms, line charts, and word clouds. We use the Shiny web framework with additional libraries to implement both the front end and the back end of the web application. Shiny Dashboard is a package for R that provides a set of functions for turning analyses into interactive web applications. Shiny Dashboard is easy to use and configure, making it well suited to analysing complex datasets such as the one described here. Shiny Dashboard also allows dynamic and responsive design, meaning a high level of compatibility across multiple devices. The web-based platform is connected to a MySQL database where the Reddit dataset is stored. The database is hosted on a large commercial server and the functional web page on a smaller local server. Through this platform we give Reddit users and interested researchers the possibility to easily analyse this rich dataset, whether on topics of Policy or Politics or on any other subject.
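As a rough illustration of the kind of aggregation that could sit behind one of the line charts, the following Python sketch counts, per day, the comments whose body mentions a keyword. It is not the platform's implementation, which is built with Shiny in R on top of MySQL. The field names (`body`, `score`, `author`, `subreddit`, `created_utc`) are assumed to match those commonly found in Reddit comment data, and the sample records and the function name `daily_keyword_counts` are hypothetical.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample comments, one JSON object per line.
# Field names are an assumption about the dataset's layout.
SAMPLE_LINES = [
    '{"body": "Historic ruling today", "score": 12, "author": "user_a", '
    '"subreddit": "politics", "created_utc": 1435320000}',
    '{"body": "The RULING surprised me", "score": 3, "author": "user_b", '
    '"subreddit": "news", "created_utc": 1435323600}',
    '{"body": "Still reading about the ruling", "score": 5, "author": "user_c", '
    '"subreddit": "politics", "created_utc": 1435406400}',
    '{"body": "Unrelated comment", "score": 1, "author": "user_d", '
    '"subreddit": "pics", "created_utc": 1435410000}',
]


def daily_keyword_counts(lines, keyword):
    """Count comments per UTC day whose body contains the keyword.

    Matching is a case-insensitive substring test, which is a
    simplification; a real query would likely need tokenisation.
    """
    needle = keyword.lower()
    counts = Counter()
    for line in lines:
        comment = json.loads(line)
        if needle in comment["body"].lower():
            created = datetime.fromtimestamp(
                int(comment["created_utc"]), tz=timezone.utc
            )
            counts[created.date().isoformat()] += 1
    # Sort by date so the result can feed a line chart directly.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    print(daily_keyword_counts(SAMPLE_LINES, "ruling"))
```

At the scale of ~1.8 billion comments, an aggregation like this would presumably be pushed down into the database as a grouped query rather than run in application code; the sketch only shows the shape of the computation.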
To showcase the usefulness of our platform and the potential for obtaining rich insights from this dataset, we present a number of examples focused on Policy and Politics. For instance, we present results on how Reddit users reacted to the June 26th, 2015 ruling that legalised same-sex marriage in the United States of America. As a more longitudinal analysis, we showcase changes in perceptions and opinions on the different candidates throughout the 2016 United States of America presidential nomination race. Finally, we discuss the need for better and simpler visualisation tools for social scientists to analyse Big Online Behavioural Datasets.
[1] Jaehyuk Park, Giovanni Luca Ciampaglia, and Emilio Ferrara. 2016. Style in the Age of Instagram: Predicting Success within the Fashion Industry using Social Media. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '16). ACM, New York, NY, USA, 64-73. DOI=http://dx.doi.org/10.1145/2818048.2820065
[2] Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie review mining and summarization. In Proceedings of the 15th ACM international conference on Information and knowledge management (CIKM '06). ACM, New York, NY, USA, 43-50. DOI=http://dx.doi.org/10.1145/1183614.1183625
[3] Suman Kalyan Maity, Ritvik Saraf, and Animesh Mukherjee. 2016. #Bieber + #Blast = #BieberBlast: Early Prediction of Popular Hashtag Compounds. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '16). ACM, New York, NY, USA, 50-63. DOI=http://dx.doi.org/10.1145/2818048.2820019
[4] Johann Schrammel, Christina Köffel, and Manfred Tscheligi. 2009. How much do you tell?: information disclosure behaviour in different types of online communities. In Proceedings of the fourth international conference on Communities and technologies (C&T '09). ACM, New York, NY, USA, 275-284. DOI=http://dx.doi.org/10.1145/1556460.1556500
[5] Yong Liu, Jayant Venkatanathan, Jorge Goncalves, Evangelos Karapanos, and Vassilis Kostakos. 2014. Modeling What Friendship Patterns on Facebook Reveal About Personality and Social Capital. ACM Trans. Comput.-Hum. Interact. 21, 3, Article 17 (June 2014), 20 pages. DOI=http://dx.doi.org/10.1145/2617572
[6] Bill Kovach and Tom Rosenstiel. 2011. Blur: How to know what's true in the age of information overload. Bloomsbury Publishing USA.