New Mexico Supercomputing Challenge

Quantifying Literature's Quality

Team: 67

School: Los Alamos High

Area of Science: Behavioral and Social Sciences

Interim:
Problem Definition:
The goal of this project is to better understand what makes a piece of literature simple or complex. A work's complexity may be related to how long it endures and which audience it appeals to. However, complexity is not equivalent to quality. This year I will focus on vocabulary size, word length, and word frequency. For my data, I will acquire a number of novels from Project Gutenberg, an online source of books whose copyrights have expired. I also hope to use fiction writing samples from my peers and some of my own writing.
Problem Solution:
Lexile measures for texts are based on a book's word use and syntactic elements. My complexity measures will focus on similar elements. Code for this project will be written exclusively in C. I will begin by creating a separate dictionary for each of my data samples. Each dictionary will consist of a list of all the distinct words used within the novel or other piece of writing. I will use this information to determine the size of the vocabulary the author used and then the frequency of each word in the text. Next, I will consider sentence structure. From the frequency of periods and other punctuation marks, I can estimate average sentence length and sentence type (simple, compound, etc.). In addition, the frequencies of conjunctions such as “and” and “or” may help indicate compound sentence structure.
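The sentence-level counts described above could be sketched in C roughly as follows. This is only a minimal illustration under stated assumptions, not the project's actual code: the names text_stats, analyze_text, and average_sentence_length are hypothetical, a sentence is assumed to end at '.', '!' or '?', and the conjunction list ("and", "or", "but") is an arbitrary choice. Abbreviations such as "Mr." would be miscounted as sentence ends by this simple rule.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the counts; names are illustrative only. */
struct text_stats {
    size_t words;        /* runs of letters, apostrophes allowed inside */
    size_t sentences;    /* '.', '!' or '?' following at least one word */
    size_t conjunctions; /* occurrences of "and", "or", "but" (assumed list) */
};

static int is_conjunction(const char *w)
{
    return strcmp(w, "and") == 0 || strcmp(w, "or") == 0 ||
           strcmp(w, "but") == 0;
}

/* Scan a NUL-terminated text once and collect the three counts. */
struct text_stats analyze_text(const char *text)
{
    struct text_stats s = {0, 0, 0};
    char word[64];                 /* longer words are truncated */
    size_t len = 0;                /* length of the word being read */
    size_t words_in_sentence = 0;  /* words seen since the last sentence end */

    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c) || (c == '\'' && len > 0)) {
            if (len < sizeof word - 1)
                word[len++] = (char)tolower(c);
        } else {
            if (len > 0) {
                /* Drop trailing apostrophes (closing quotes, plural possessives). */
                while (len > 0 && word[len - 1] == '\'')
                    len--;
                word[len] = '\0';
                s.words++;
                words_in_sentence++;
                if (is_conjunction(word))
                    s.conjunctions++;
                len = 0;
            }
            /* Count a sentence only if it contained a word, so "..." counts once. */
            if ((c == '.' || c == '!' || c == '?') && words_in_sentence > 0) {
                s.sentences++;
                words_in_sentence = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return s;
}

/* Average words per sentence; 0.0 when no sentence end was found. */
double average_sentence_length(struct text_stats s)
{
    return s.sentences ? (double)s.words / (double)s.sentences : 0.0;
}
```

Dividing the conjunction count by the sentence count would give a rough indicator of how often sentences are compound, though a conjunction can also join two nouns, so the ratio is only a hint.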
Progress So Far:
I am currently writing the code that will create a dictionary of all the words contained within a specific literature sample. Each word in the sample will appear once in the dictionary. The completed dictionary will show the size of the vocabulary the writer used in that sample. After I create dictionaries for a sufficient number of samples, I will be able to write code that counts the occurrences of each word in a sample based on that sample's dictionary.
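One way such a dictionary might be built in C is sketched below. It is an assumption-laden illustration rather than the code under construction: the names dictionary, dict_add, dict_add_text, dict_count, and dict_free are hypothetical, words are taken to be runs of letters folded to lowercase, and lookups use a simple linear search. This sketch also keeps a count with each entry, which combines the dictionary step and the later counting step into one pass.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD 64

/* One dictionary entry: a distinct word and how often it occurred. */
struct entry {
    char word[MAX_WORD];
    size_t count;
};

struct dictionary {
    struct entry *entries;
    size_t size;      /* number of distinct words, i.e. vocabulary size */
    size_t capacity;  /* allocated slots */
};

/* Record one occurrence of word. Returns 0 on success, -1 if out of memory. */
int dict_add(struct dictionary *d, const char *word)
{
    for (size_t i = 0; i < d->size; i++) {
        if (strcmp(d->entries[i].word, word) == 0) {
            d->entries[i].count++;
            return 0;
        }
    }
    if (d->size == d->capacity) {
        size_t new_cap = d->capacity ? d->capacity * 2 : 256;
        struct entry *grown = realloc(d->entries, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        d->entries = grown;
        d->capacity = new_cap;
    }
    strncpy(d->entries[d->size].word, word, MAX_WORD - 1);
    d->entries[d->size].word[MAX_WORD - 1] = '\0';
    d->entries[d->size].count = 1;
    d->size++;
    return 0;
}

/* Split text into lowercase alphabetic words and add each one. */
int dict_add_text(struct dictionary *d, const char *text)
{
    char word[MAX_WORD];
    size_t len = 0;

    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c)) {
            if (len < MAX_WORD - 1)   /* longer words are truncated */
                word[len++] = (char)tolower(c);
        } else {
            if (len > 0) {
                word[len] = '\0';
                if (dict_add(d, word) != 0)
                    return -1;
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return 0;
}

/* Number of occurrences of word, or 0 if it is not in the dictionary. */
size_t dict_count(const struct dictionary *d, const char *word)
{
    for (size_t i = 0; i < d->size; i++)
        if (strcmp(d->entries[i].word, word) == 0)
            return d->entries[i].count;
    return 0;
}

void dict_free(struct dictionary *d)
{
    free(d->entries);
    d->entries = NULL;
    d->size = d->capacity = 0;
}
```

A dictionary initialized with `struct dictionary d = {0};` can be filled by calling dict_add_text on each chunk of a sample as it is read; afterwards d.size is the vocabulary size. Linear search is slow for a full-length novel, so a sorted array or hash table would likely be a better fit once the basic approach works.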
Expected Results:
I expect that my writing and my peers' writing will have smaller vocabularies, shorter words, and more frequent repetition of the same words than the novels I take from Project Gutenberg. Project Gutenberg contains novels by fully developed and often very famous writers, who have usually had more writing experience than amateur authors. However, there will certainly be exceptions. Some authors choose to write in a simple style in order to convey their ideas more easily, so experience will not necessarily lead to more complex works. These exceptions should make my final results more interesting.
Research Citations:
Kernighan, Brian W. and Ritchie, Dennis M. The C Programming Language, Second Edition. Upper Saddle River, NJ: Prentice Hall, 1988. Print.
"Free Ebooks - Project Gutenberg." Project Gutenberg. N.p., n.d. Web. 31 Dec. 2013.
"What Is a Lexile Measure?" What Is a Lexile Measure? MetaMetrics Inc., 2013. Web. 31 Dec. 2013.

Team Members:

  Tabitha Welch

Sponsoring Teacher: Lee Goodwin
