New Mexico Supercomputing Challenge

Can a program learn and analyze topics in human language?

Team: 47

School: La Cueva High

Area of Science: Linguistics


Interim Report

Team Number: 47
Team Members: Ari Echt-Wilson, Eli Echt-Wilson, Justin Sanchez
Sponsoring Teacher: Samuel Smith
School Name: La Cueva High School
Area of Science: Linguistics
Project Title: Can a program learn and analyze topics in human language?

Problem definition: Use computer programming to analyze topics in human language and apply reasoning to different forms of language, including but not limited to debates, speeches, and poetry.

Problem solution:

We plan to use a Java program to analyze an input text file and report results. We have decided to use Spearman’s rank correlation to measure the correlation between words. First, our program will split a large text file into smaller groups of related words. Then, we will calculate how each word in a group correlates with each of the other words. In this way, we hope to find how certain words are related to each other in a given sample of text. We will ignore words such as “and”, “if”, and “or” that are unimportant to our model. We can then use the output data to analyze the general topic, topic changes, and forms of figurative language. Our plan is to first apply our model to something fairly straightforward, such as presidential debates. Once the model is solid there, we will apply it to a less objective and less formulaic kind of writing, such as poetry.
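As a rough illustration of the correlation step, here is a minimal sketch of Spearman’s rank correlation in Java. It assumes each word is represented by an array of counts, one entry per group of related words; that layout, the class name SpearmanSketch, and the method names are assumptions made for this example and are not taken from our project code. The sketch ranks both arrays (tied values share an average rank) and then takes the Pearson correlation of the ranks.

```java
import java.util.Arrays;

class SpearmanSketch {

    /** Returns 1-based ranks; tied values share the average of their positions. */
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Sort the indices by the value they point to.
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] rank = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            // Extend j to cover the whole run of equal values.
            while (j + 1 < n && values[order[j + 1]] == values[order[i]]) {
                j++;
            }
            double average = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) {
                rank[order[k]] = average;
            }
            i = j + 1;
        }
        return rank;
    }

    /** Pearson correlation; returns NaN if either input is constant. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0.0 || varY == 0.0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Spearman correlation: the Pearson correlation of the ranks. */
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }
}
```

With this sketch, two count arrays that always rise together give a Spearman value of 1 even when the relationship is not a straight line, while the Pearson value for the same data falls below 1.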

Progress to Date:

So far, we have brainstormed and laid out a plan for our project. We have worked hard to ensure that the project is feasible without being too easy. In addition, we have testable code that works on a small scale. Currently, our code finds the Pearson correlation between words, which we have decided against because the Pearson correlation measures only linear relationships. Words do not necessarily have a linear relationship with each other, so we would like a better measure. We recently found the Spearman rank correlation and plan to implement it in our code as soon as possible. Our code is currently tested on the first 2012 presidential debate. We separate words into related groups, with a new group starting each time a new person speaks. Then, we calculate the correlation for every combination of two words from each “bag” of related words.
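The grouping step described above could look roughly like the following Java sketch. It is only an illustration: it assumes that each speaker’s turn in the transcript begins with an all-capitals label followed by a colon (for example “ROMNEY:”), and the class name SpeakerBagsSketch, the method names, and the short list of ignored words are placeholders rather than our actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpeakerBagsSketch {

    // Assumed transcript format: a turn starts with a label such as "OBAMA:".
    static final Pattern SPEAKER_LABEL = Pattern.compile("^[A-Z][A-Z .'-]*:");

    // Placeholder list of "unimportant" words; a real list would be longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "if", "or", "the", "a", "to"));

    /** Builds one bag of word counts per speaker turn. */
    static List<Map<String, Integer>> bagsBySpeakerTurn(String transcript) {
        List<Map<String, Integer>> bags = new ArrayList<>();
        Map<String, Integer> current = null;
        String currentSpeaker = null;

        for (String line : transcript.split("\\R")) {
            Matcher m = SPEAKER_LABEL.matcher(line);
            if (m.find()) {
                String speaker = line.substring(0, m.end() - 1).trim();
                // Start a new bag only when the speaker actually changes.
                if (!speaker.equals(currentSpeaker)) {
                    currentSpeaker = speaker;
                    current = new HashMap<>();
                    bags.add(current);
                }
                line = line.substring(m.end());
            }
            if (current == null) {
                continue; // skip any text before the first speaker label
            }
            for (String token : line.toLowerCase().split("[^a-z']+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) {
                    continue;
                }
                current.merge(token, 1, Integer::sum);
            }
        }
        return bags;
    }

    /** Returns how often a word occurs in each bag, in bag order. */
    static double[] countsAcrossBags(List<Map<String, Integer>> bags, String word) {
        double[] counts = new double[bags.size()];
        for (int i = 0; i < bags.size(); i++) {
            counts[i] = bags.get(i).getOrDefault(word, 0);
        }
        return counts;
    }
}
```

The arrays returned by countsAcrossBags for two different words are the kind of input a correlation function would take, with one entry per speaker turn.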

Our code currently produces correlations, but they are very difficult to interpret. For example, some words that we would expect to have a high correlation, such as “tax” and “cuts”, have a low correlation. Other words that we would expect to have a high correlation, such as “four” and “years”, do in fact have one. Our current goal is to fix and interpret these results. One thing we need to look at is increasing the amount of data we use. We need a larger learning set to make sure that the correlations we find are real and not just due to chance. Changing from the Pearson correlation to the Spearman rank correlation is one of our steps in this direction. After we change which correlation we are using, we have to work out how we will “teach” the program where changes in topic occur and which words should be evaluated for correlation. For example, we need to decide whether it is necessary to evaluate the correlation between every pair of words in the transcript, and which pairs of words are the most important to analyze.
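One possible way to cut down the number of word pairs, shown here only as a sketch, is to keep words that appear in at least some minimum number of bags before forming pairs. This rule is an assumption made for illustration; we have not decided on it. The class name PairSelectionSketch, the method names, and the minBags parameter are hypothetical. With n words kept, the number of pairs is n(n-1)/2, so dropping rare words shrinks the work quickly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PairSelectionSketch {

    /** Returns, in alphabetical order, the words found in at least minBags bags. */
    static List<String> frequentWords(List<Map<String, Integer>> bags, int minBags) {
        // TreeMap keeps the words sorted, so the output order is predictable.
        Map<String, Integer> bagFrequency = new TreeMap<>();
        for (Map<String, Integer> bag : bags) {
            for (String word : bag.keySet()) {
                bagFrequency.merge(word, 1, Integer::sum);
            }
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : bagFrequency.entrySet()) {
            if (entry.getValue() >= minBags) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    /** Lists every unordered pair of distinct words exactly once. */
    static List<String[]> allPairs(List<String> words) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                pairs.add(new String[] {words.get(i), words.get(j)});
            }
        }
        return pairs;
    }
}
```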

We have not yet applied our correlations between words to analyzing the topics of texts.

We have also asked Mrs. Bedeaux, a La Cueva High School AP English teacher, to be our mentor. We believe that our project will run into many issues in trying to understand the English language, and we hope she will be able to help us with them. After meeting with Mrs. Bedeaux, we gained some new thoughts on different ways we can look at this problem. First, we might look at presidential speeches instead of debates, because speeches are written rather than spoken language. Second, we might look at ways of analyzing text other than correlation between words, such as the position of words, repetition, and transition words that signal changes in topic.

Expected Results:

We plan to take the project as far as we can. We know that we can take it in many directions, so we first want to build a solid base. After that, we will branch out and experiment with new steps. We hope to teach a computer program to detect relatively easy topic changes in a way that matches what humans would identify intuitively. We then plan to use the same program to explore how figurative language and literature, such as poetry, respond to computer analysis.

Acknowledgements

The team would like to thank Mrs. Bedeaux for her helpful guidance on the structure of written and spoken English. We also thank Dr. Thomas Robey for suggesting the use of Spearman rank correlation. Finally, we would like to thank Mr. Smith and Randy Wilson for guiding our choice of project and for helping with the programming.

Bibliography

Nguyen, Viet-An, Boyd-Graber, Jordan, and Resnik, Philip. "SITS: A Hierarchical Nonparametric Model Using Speaker Identity for Topic Segmentation in Multiparty Conversations." Association for Computational Linguistics, 2012.

2012 Presidential Debate at the University of Denver between President Barack Obama and Governor Mitt Romney, moderated by Jim Lehrer. October 3, 2012. Accessed at nytimes.com on November 1, 2012.

