The Internet, Policy & Politics Conferences

Oxford Internet Institute, University of Oxford

Michael Jensen, Nick Anstead: Semantic Polling and Political Environments: Political Preferences, Political Institutions, and Unknown Unknowns

This paper has been published as: Michael J. Jensen and Nick Anstead (2013) Psephological investigations: Tweets, votes, and unknown unknowns in the Republican nomination process. Policy and Internet 5 (2) 161-182.

Michael Jensen, Autonomous University of Barcelona, Department of Political Science
Nick Anstead, London School of Economics


Avinash Kaushik recently hailed Donald Rumsfeld for his three-part typology of knowledge: there are known knowns, known unknowns, and unknown unknowns. The last category is of great interest to data scientists, who hope that analyses of large amounts of data collected from an entire field of activity will enable researchers to identify unknown unknowns and move them into the realm of known knowns. We know that the American presidential nomination process contains a series of unknown unknowns, as the results of primaries and caucuses have repeatedly belied polling forecasts made as little as a week before an election. Because scientific polls often require several days in the field to capture a representative sample and obtain a reasonable response rate, they are less sensitive to the shifts in momentum that often characterize the last week before votes are cast. Hence, semantic polling may help both analysts and campaigns identify otherwise unknown changes in electoral positioning before those changes manifest in survey and electoral results.

This paper compares semantic polling developed from analyses of Twitter messages and Google Search patterns with traditional survey research. It develops models for semantic polling based on the Iowa caucuses and the New Hampshire primary and then applies them to the nomination contests held on Super Tuesday to determine the extent to which these models are not only postdictive but also predictive. The combination of primary and caucus settings at both stages of the analysis makes it possible to identify the role of institutional differences, which have historically resulted in particularly unreliable survey results in states with caucuses.

Research Questions

1. Can semantic polling accurately predict electoral performance or only shifts in campaign momentum?

2. Under what institutional arrangements (caucuses or primaries) do the semantic analyses of Google Search and Twitter perform better?

3. Do the semantic polling models accurately predict electoral outcomes before surveys are able to accurately predict the results?

4. Given the differences between searching Google as a means to obtain information and posting on Twitter as a means to communicate information, do semantic analyses of these sources yield different results and do they have different roles in predicting electoral outcomes?

Methods

Data Collection

The data collection covers the Iowa caucuses and the New Hampshire primary as well as the Republican nomination contest's “Super Tuesday” (March 6, 2012), on which ten states held a primary or a caucus. The survey data consist of the rolling average of polls over the final two weeks of campaigning in each case. The Google data are drawn from Google's search trends site. The Twitter data were collected directly from Twitter's streaming application programming interface (API) using a program written in Python by the researchers. In total, this collection has produced a database of over two million tweets. The semantic polling models consider mentions of campaigns and the valences of messages, evaluated both in terms of individual words and word clusters. Analysis of the Twitter data will be conducted using natural language processing techniques in Python to identify trends in candidate mentions and cluster analyses of the words associated with each candidate. The Google Search data will be mined in a similar fashion.
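As a rough illustration of what counting candidate mentions and message valences could look like, the following minimal Python sketch tallies mentions per candidate and a crude word-level tone score. It is not the authors' program: the keyword lists, the tiny positive and negative word sets, and the function names score_tweets and mention_shares are all hypothetical, and attributing a whole message's tone to every candidate it mentions is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical keyword lists per candidate (illustrative, not from the paper).
CANDIDATE_TERMS = {
    "romney": ["romney"],
    "santorum": ["santorum"],
    "gingrich": ["gingrich", "newt"],
    "paul": ["ron paul", "ronpaul"],
}

# Hypothetical toy valence lexicon; a real study would use a validated one.
POSITIVE = {"win", "great", "support", "strong"}
NEGATIVE = {"lose", "bad", "weak", "fail"}

TOKEN_RE = re.compile(r"[a-z']+")

# One whole-word pattern per candidate, built from the keyword lists.
CANDIDATE_PATTERNS = {
    cand: re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    for cand, terms in CANDIDATE_TERMS.items()
}


def score_tweets(tweets):
    """Return (mentions, valence) Counters keyed by candidate."""
    mentions = Counter()
    valence = Counter()
    for text in tweets:
        lowered = text.lower()
        tokens = TOKEN_RE.findall(lowered)
        # Message-level tone: positive words minus negative words.
        tone = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        for cand, pattern in CANDIDATE_PATTERNS.items():
            if pattern.search(lowered):
                mentions[cand] += 1
                # Assumption: the message's tone applies to each candidate it names.
                valence[cand] += tone
    return mentions, valence


def mention_shares(mentions):
    """Convert raw mention counts into each candidate's share of all mentions."""
    total = sum(mentions.values())
    return {cand: n / total for cand, n in mentions.items()} if total else {}
```

Shares of mentions computed this way could then be set beside each candidate's rolling poll average for the same period; word-cluster analysis would require a separate step that this sketch does not attempt.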


The theoretical contributions of this paper are threefold. First, it aids the construction of more robust methods of election forecasting, particularly for contests where the result is not a foregone conclusion. Second, it systematically investigates differences in online activities between communicating and searching as modes of preference revelation and mechanisms of preference formation. Preliminary analysis of the data suggests that search patterns are unrelated to outcomes where there is a strong front runner, but are associated with candidate movement as individuals attempt to learn more about a lesser-known candidate in order to form an opinion. However, there is reason to believe Twitter might be implicated in a two-step flow as positive and negative re-tweets reinforce opinions. Third, the paper identifies relationships between the institutional features of the vote process and the processes of opinion formation manifest in online information-seeking and communication. Because caucuses, relative to primaries, entail deliberation and draw upon a small and motivated segment of citizens, the relative utility of Google Search and Twitter in predicting voting preferences may differ across these institutional settings. As both Twitter messaging and caucuses entail communication with others, it is possible that analyses of Twitter messages will approximate the results of caucuses more closely than analyses of search trends will. Further segmentation of the data based on the locational parameters in tweet metadata and Google Search results will enable the creation of state-by-state models, which can be contrasted with global-level data to provide geographic parameters on the process of voting preference formation and expression.

Michael J. Jensen, Nick Anstead