Language Analysis on Twitter
School: Los Alamos High
Area of Science: Linguistic Analysis
Interim Report
Problem Definition:
Knowing the general location from which messages on Twitter, informally known as Tweets, were sent is frequently useful, for example when attempting to understand how ideas, in the form of hashtags or Tweets, spread. While knowing the origin of messages can be useful, the bulk of Tweets are not sent with geotags that record their location. Despite this, it is possible to infer the general location from which a Tweet was sent from the language of the message itself.
Previous work in the field estimated the location of Tweets with a precision of 900,000 square kilometers by breaking a large number of geotagged Tweets into tokens [1]. These tokens were then run through a Gaussian mixture model, optimized with MapReduce methodology, which correlated tokens with locations. The model was then able to accept Tweets without geotags, tokenize them, and produce a distribution that predicts each Tweet's location from the token-location correlation. Our intent is to improve upon this model's functionality through several methods, including better tokenization, further optimization of the code, and implementation of a method to track and visualize the spread of specified information.
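As a rough illustration of this token-to-location idea, the sketch below simplifies the MapReduce-optimized Gaussian mixture model to a single regularized Gaussian per token, fit to hand-made toy data. All names, coordinates, and the candidate-scoring heuristic are our own illustrations, not the actual code:

```python
import numpy as np

# Toy geotagged "Tweets" as (text, (longitude, latitude)).
# All coordinates are illustrative, not real data.
geotagged = [
    ("green chile season", (-106.3, 35.9)),
    ("green chile roast", (-106.6, 35.1)),
    ("beach day", (-118.5, 34.0)),
    ("beach sunset", (-117.2, 32.7)),
]

def tokenize(text):
    return text.lower().split()

# Collect the coordinates at which each token was observed.
token_points = {}
for text, (lon, lat) in geotagged:
    for tok in tokenize(text):
        token_points.setdefault(tok, []).append((lon, lat))

def fit_gaussian(points):
    """Fit a single regularized Gaussian to a token's observation points."""
    pts = np.array(points, dtype=float)
    mean = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False) if len(pts) > 1 else np.zeros((2, 2))
    # Regularization keeps the covariance invertible even for tokens
    # observed only once.
    cov = np.atleast_2d(cov) + 0.25 * np.eye(2)
    return mean, cov

token_models = {tok: fit_gaussian(pts) for tok, pts in token_points.items()}

def log_density(x, mean, cov):
    """Log-density of a 2-D Gaussian at point x."""
    d = x - mean
    return -0.5 * (d @ np.linalg.inv(cov) @ d
                   + np.log((2 * np.pi) ** 2 * np.linalg.det(cov)))

def locate(text):
    """Pick the observed coordinate that maximizes the summed
    per-token log-densities of the Tweet's known tokens."""
    toks = [t for t in tokenize(text) if t in token_models]
    candidates = {pt for pts in token_points.values() for pt in pts}
    return max(candidates,
               key=lambda c: sum(log_density(np.array(c), m, cv)
                                 for m, cv in (token_models[t] for t in toks)))

print(locate("love green chile"))  # one of the two New Mexico coordinates
```

A real run would fit full mixtures over millions of geotagged Tweets and score a dense grid rather than only the observed points, but the structure (tokenize, fit per-token location models, score an unseen Tweet) is the same.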
Progress so Far:
Most of the work done so far has been research to familiarize ourselves with current approaches to estimating Tweet location and to find ways they can be improved. Apart from research, we have obtained a working version of the aforementioned code from our mentor and have begun implementing improvements to tokenization. Starting the tokenization work is an important step toward allowing the code to differentiate the hashtags and "@" mentions used by Twitter, which we need in order to fulfill our goal of tracking and visualizing the spread of information. We will use this as a foundation to expand upon in the coming months.
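A minimal example of the kind of tokenization change described here: a regular expression that keeps "#" and "@" prefixes attached to their tokens, so hashtags and mentions survive as distinct token types instead of collapsing into plain words. The pattern and function names are illustrative, not taken from the project's code:

```python
import re

# Keep '#' and '@' prefixes attached so hashtags and mentions remain
# distinguishable from ordinary words. Illustrative pattern only.
TOKEN_RE = re.compile(r"[#@]\w+|\w+")

def tokenize(tweet):
    return TOKEN_RE.findall(tweet.lower())

print(tokenize("Watching the #Superbowl with @friend tonight"))
# → ['watching', 'the', '#superbowl', 'with', '@friend', 'tonight']
```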
We hope first to replicate results similar to those already reported in the literature, but with higher accuracy. Ideally, our improvements to tokenization will increase the accuracy of the results, or at least make it possible to analyze and visualize other data. We then intend to begin tracing and visualizing how words transmitted on Twitter spread across the physical world, with a representation built in ParaView from the information we acquire. Furthermore, we expect the optimization work to yield faster and more memory-efficient code, which could be useful either for reducing runtime or for increasing the amount of data analyzed.
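One plausible bridge to ParaView (an assumption about our eventual pipeline, not something implemented yet) is to write each located Tweet as a timestamped row in a CSV file, which ParaView can open directly and convert to renderable geometry with its Table To Points filter. The column names and values below are placeholders:

```python
import csv

# Hypothetical export of located Tweets: time step, longitude, latitude,
# and the tracked token. All values are placeholders.
rows = [
    (0, -106.3, 35.9, "#example"),
    (1, -106.6, 35.1, "#example"),
    (2, -104.8, 38.8, "#example"),
]

with open("spread.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["t", "lon", "lat", "token"])
    writer.writerows(rows)
```

Animating over the `t` column would then show the token's spread through space and time.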
1. Mahmud, Jalal, Jeffrey Nichols, and Clemens Drews. "Where Is This Tweet From? Inferring Home Locations of Twitter Users." AAAI. Association for the Advancement of Artificial Intelligence, 2012. Web. 1 Nov. 2013.
2. Priedhorsky, Reid, Sara Y. Del Valle, and Aron Culotta. "Inferring the Origin Locations of Tweets with Quantitative Confidence." ArXiv. Cornell University Library, 16 May 2013. Web. 19 Aug. 2013.
Sponsoring Teacher: Lee Goodwin