New Mexico Supercomputing Challenge

Language Analysis on Twitter

Team: 63

School: Los Alamos High

Area of Science: Linguistic Analysis


Interim Report

Problem Definition:
Knowing the general location from which messages sent on Twitter, informally known as Tweets, originate is frequently useful, such as when attempting to understand how ideas, in the form of hashtags or Tweets, spread. While knowing the origin of messages can be useful, the bulk of Tweets are not sent with geotags that contain their location. Despite this, it is possible to infer the general location from which a Tweet was sent from the language of the message itself.

Problem Solution:
Previous work done by scientists in the field managed to estimate the location of Tweets with a precision of 900,000 square kilometers by breaking a large number of geotagged Tweets into tokens [1]. These tokens were then run through a Gaussian mixture model, optimized with a MapReduce methodology, which correlated the tokens with locations. The model was then able to accept Tweets without geotags, tokenize them, and produce a distribution predicting each Tweet's location using the token-location correlation. Our intent is to improve upon this model's functionality through several methods, including better tokenization, further optimization of the code, and implementation of a method to track and visualize the spread of specified information.
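The token-to-location idea above can be sketched in a few lines. This is a deliberately simplified stand-in, not the actual model from the literature: each token gets a single Gaussian (mean and variance of the coordinates of geotagged Tweets containing it) rather than a full mixture, and a new Tweet's location is estimated as the variance-weighted mean of its known tokens. All names and the toy corpus here are illustrative.

```python
def tokenize(text):
    # lowercase whitespace split; the real model uses richer tokenization
    return text.lower().split()

def fit_token_gaussians(geotagged):
    """Map each token to the mean (lat, lon) and variance of the geotagged
    Tweets containing it -- a one-component stand-in for the per-token
    Gaussian mixture described above."""
    coords = {}
    for text, lat, lon in geotagged:
        for tok in set(tokenize(text)):
            coords.setdefault(tok, []).append((lat, lon))
    model = {}
    for tok, pts in coords.items():
        n = len(pts)
        mlat = sum(p[0] for p in pts) / n
        mlon = sum(p[1] for p in pts) / n
        var = sum((p[0] - mlat) ** 2 + (p[1] - mlon) ** 2 for p in pts) / n
        model[tok] = (mlat, mlon, var)
    return model

def predict(model, text):
    """Estimate a Tweet's origin as the variance-weighted mean of its
    tokens' Gaussians; tokens never seen with a geotag are skipped."""
    num_lat = num_lon = denom = 0.0
    for tok in tokenize(text):
        if tok in model:
            mlat, mlon, var = model[tok]
            w = 1.0 / (var + 1e-9)  # tightly clustered tokens get more weight
            num_lat += w * mlat
            num_lon += w * mlon
            denom += w
    return None if denom == 0 else (num_lat / denom, num_lon / denom)

# toy geotagged corpus: (text, latitude, longitude)
corpus = [
    ("green chile breakfast burrito", 35.1, -106.6),  # near Albuquerque
    ("green chile harvest today", 35.0, -106.7),
    ("deep dish pizza tonight", 41.9, -87.6),         # near Chicago
    ("deep dish and the lake", 41.8, -87.7),
]
model = fit_token_gaussians(corpus)
est = predict(model, "craving green chile")
```

The un-geotagged query shares only "green" and "chile" with the corpus, so its estimate lands near the Albuquerque cluster; scaling this up is where the MapReduce optimization mentioned above becomes necessary.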

Progress so Far:
Most of our work so far has been research to familiarize ourselves with current approaches to estimating Tweet location and to find ways they can be improved. Apart from research, we have obtained a working version of the aforementioned code from our mentor and have begun implementing the improvements to tokenization. Starting the tokenization work is an important step toward allowing the code to differentiate the hashtags and "@" operators used by Twitter, which is necessary for our goal of tracking and visualizing the spread of information. We will use this as a foundation to expand upon in the coming months.
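One way to sketch the tokenization improvement described above: keep hashtags and "@" mentions as distinct token classes rather than stripping their punctuation, so the model can treat a hashtag's spread separately from ordinary words. The regular expression and function names here are our own illustration, not the team's actual code.

```python
import re

# match hashtags first, then mentions, then plain word tokens
TOKEN_RE = re.compile(r"#\w+|@\w+|\w+")

def tokenize(tweet):
    """Split a Tweet into (hashtags, mentions, words), preserving the
    leading # and @ so each class can be tracked separately."""
    tokens = TOKEN_RE.findall(tweet.lower())
    hashtags = [t for t in tokens if t.startswith("#")]
    mentions = [t for t in tokens if t.startswith("@")]
    words = [t for t in tokens if t[0] not in "#@"]
    return hashtags, mentions, words

tags, ats, words = tokenize("Snow in Santa Fe! #NMwx thanks @NWSAlbuquerque")
```

Separating the classes this way is what would let a later stage trace the geographic spread of a specific hashtag over time.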

Expected Results:
We hope first to replicate results similar to those already reported in the literature, but with higher accuracy. Our improvements to tokenization should yield more accurate results, or at the least make it possible to analyze and visualize other data. We then intend to begin tracing and visualizing how words transmitted on Twitter spread across the physical world, using a representation built in ParaView from the information we acquire. Furthermore, we expect the optimization work to yield faster and more memory-efficient code, which could either reduce runtime or increase the amount of data analyzed.

Bibliography
1. Mahmud, Jalal, Jeffrey Nichols, and Clemens Drews. "Where Is This Tweet From? Inferring Home Locations of Twitter Users." AAAI. Association for the Advancement of Artificial Intelligence, 2012. Web. 1 Nov. 2013.

2. Priedhorsky, Reid, Sara Y. Del Valle, and Aron Culotta. "Inferring the Origin Locations of Tweets with Quantitative Confidence." arXiv. Cornell University Library, 16 May 2013. Web. 19 Aug. 2013.


Team Members:

  Colin Redman
  Sudeep Dasari

Sponsoring Teacher: Lee Goodwin
