Beyond Trending Topics: Real-World Event Identification on Twitter

This paper examines approaches for analyzing Twitter messages to distinguish messages about real-world events from non-event messages. The authors validated the approach in large-scale experiments on 2.6 million Twitter messages. Their approach computes aggregate statistics over clusters of topically similar messages. The authors argue that Twitter is particularly useful for gaining insight into individual users' perspectives on events, and for collecting information on unplanned events faster than traditional media allow. To identify real-world event content, they first group topically similar tweets with an online clustering technique, so that each cluster is a candidate event with its associated messages. They then compute revealing features for each cluster and use these features to train a classifier that distinguishes event clusters from non-event clusters.
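
The online clustering step described above can be illustrated with a minimal sketch. It assumes plain term-count vectors, cosine similarity against cluster centroids, and a fixed similarity threshold; the tokenizer, the threshold value of 0.5 and the function names are illustrative assumptions, not the authors' settings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into word tokens (hashtags and mentions keep their prefix)."""
    return re.findall(r"[#@]?\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_stream(messages, threshold=0.5):
    """Assign each message, in arrival order, to its most similar cluster.

    A new cluster is opened when no existing centroid reaches `threshold`
    (an illustrative value, not one taken from the paper).
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [str]}
    for text in messages:
        vec = Counter(tokenize(text))
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # summed counts; cosine ignores scale
            best["members"].append(text)
        else:
            clusters.append({"centroid": Counter(vec), "members": [text]})
    return clusters
```

Each message is seen once and compared only against the current centroids, which is what makes this style of clustering incremental and suitable for a stream.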

This article will be of particular use to researchers and practitioners interested in event surfacing techniques for detecting and isolating real-time event content on Twitter. A key challenge was distinguishing real-world events from non-event content that also triggers substantial message volume over specific time periods. The authors addressed this with an online clustering and filtering framework built on a scalable, incremental online clustering algorithm. Four groups of cluster features were used to detect clusters associated with events: temporal, social, topical and Twitter-centric. Using these features, the authors trained an event classifier with standard machine learning techniques; the classifier, in turn, predicted which clusters corresponded to events. The authors found that this real-world event classifier outperformed a Naïve Bayes text classifier on both training and test sets, indicating that it is more effective overall at predicting whether a cluster contains real-world event information. For event surfacing, that is, selecting the top events in the stream per hour, the real-world event classifier beat both the Fastest and Random baselines in precision.
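
To make the four feature groups more concrete, the sketch below computes one simplified feature per group for a single cluster. The feature definitions, the `cluster_features` name and the message fields `text` and `hour` are hypothetical stand-ins chosen for illustration; they are not the feature definitions used in the paper.

```python
from collections import Counter


def cluster_features(messages, window_hours=1):
    """Compute illustrative features for one cluster.

    `messages` is a list of dicts with hypothetical keys:
    "text" (str) and "hour" (int index of the hour the message was posted).
    """
    total = len(messages)
    if total == 0:
        raise ValueError("cluster must contain at least one message")
    latest = max(m["hour"] for m in messages)
    token_sets = [set(m["text"].lower().split()) for m in messages]
    # Number of messages in which each term appears at least once.
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    return {
        # Temporal: share of messages posted within the latest window.
        "recent_fraction": sum(latest - m["hour"] < window_hours for m in messages) / total,
        # Social: share of messages that look like retweets.
        "retweet_fraction": sum(m["text"].lower().startswith("rt ") for m in messages) / total,
        # Topical: share of messages containing the cluster's most widespread term.
        "top_term_coverage": max(doc_freq.values(), default=0) / total,
        # Twitter-centric: share of messages carrying at least one hashtag.
        "hashtag_fraction": sum(any(t.startswith("#") for t in toks) for toks in token_sets) / total,
    }
```

A feature dictionary of this kind, computed per cluster and paired with an event or non-event label, is the sort of input a standard classifier could be trained on.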

Hila Becker, Mor Naaman and Luis Gravano

2011