New Mexico Supercomputing Challenge

Quantifying Literature's Quality

Team: 57

School: Los Alamos High

Area of Science: Behavioral and Social Sciences

Interim:

Problem Definition:
The goal of this project is to better understand what makes certain pieces of literature sell well and appeal to a large audience. A work's popularity may be connected not just to plot elements and characterization but to syntactic complexity as well. This year I am studying four measures of complexity: the frequencies of various punctuation marks, the frequencies of conjunctions, the distribution of individual word frequencies in the text, and the potential exponential relationship between an author's vocabulary size and the length of a work, measured as its number of sentences.
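As a rough illustration of the conjunction measure, the following is a minimal sketch in Java. The class name `ConjunctionFrequency`, the method `frequency`, and the word list (the seven English coordinating conjunctions) are assumptions made for this example and are not taken from the project's code. Words such as "for", "so", and "yet" are not always used as conjunctions, so a plain word match like this one can overcount.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ConjunctionFrequency {
    // Assumed word list: the seven English coordinating conjunctions.
    static final Set<String> CONJUNCTIONS = new HashSet<>(
            Arrays.asList("for", "and", "nor", "but", "or", "yet", "so"));

    /** Returns conjunction tokens divided by total word tokens (0.0 for empty text). */
    static double frequency(String text) {
        int words = 0;
        int conjunctions = 0;
        // Split on any run of characters that are neither letters nor apostrophes.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}']+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            words++;
            if (CONJUNCTIONS.contains(token)) {
                conjunctions++;
            }
        }
        return words == 0 ? 0.0 : (double) conjunctions / words;
    }

    public static void main(String[] args) {
        // Hypothetical sample sentence, not taken from the data set.
        System.out.println(frequency("She ran, but he walked and sang."));
    }
}
```

Dividing by the total word count makes the value comparable across books of different lengths.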
Problem Solution:
Data for this project are taken from Project Gutenberg, an online library of books whose copyrights have expired. This year I am studying fiction only. Code will be written in Perl and/or Java. To make data collection easier, I plan to write a master program that calls two to six analysis programs for each book in my set, so that I need to run only a single program by hand instead of hundreds. First, I will study the frequencies of punctuation marks and conjunctions in each text. I expect texts with higher punctuation and conjunction frequencies to have more complex sentence structures, which may lead to higher popularity. Then, I will consider my other two metrics. In many cases I will be able to modify or build on programs from last year's project rather than write a new program for every metric. To analyze word frequencies and vocabulary size, I will create a “dictionary” for every sample: the list of all distinct words that the author used in that book. I will also determine the number of times each word in a dictionary is used in the book.
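The “dictionary” idea can be sketched as a map from each distinct word to its count. This is a minimal example in Java, not the project's program: the class name `BookDictionary`, the method `build`, and the tokenization rule (lowercase everything, split on anything that is not a letter or apostrophe) are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class BookDictionary {
    /** Maps each distinct lowercased word to the number of times it occurs. */
    static Map<String, Integer> build(String text) {
        // TreeMap keeps the words in alphabetical order for readable output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}']+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            counts.merge(token, 1, Integer::sum); // add 1, or start the count at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from the data set.
        Map<String, Integer> dict = build("The cat saw the dog. The dog ran.");
        System.out.println("Vocabulary size: " + dict.size());
        dict.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

The number of keys in the map is the vocabulary size of the sample, and the values give the word-frequency distribution. In this sketch, hyphenated words are split into separate words and capitalization is ignored; a real analysis would need to decide how to handle both.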
Progress So Far:
This year I have expanded my data set to about 200 novels and short stories. I have also modified an older program to count the periods, commas, semicolons, colons, exclamation points, and question marks in a text. That program originally removed all punctuation marks from a book before further analysis. It now also counts each of these marks as it encounters them, while still removing the punctuation. I am currently creating dictionaries for my samples and developing the master program that will call all the other programs for every book.
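The count-while-stripping step described above could look roughly like the following. This is a minimal sketch in Java under stated assumptions: the class name `PunctuationCounter` and the method `countAndStrip` are hypothetical, and the sketch strips only the six marks it counts, whereas the program described above removes all punctuation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PunctuationCounter {
    // The six marks named in the report: . , ; : ! ?
    static final String MARKS = ".,;:!?";

    /**
     * Counts each tracked mark in the text and appends every other
     * character, unchanged, to the given StringBuilder.
     */
    static Map<Character, Integer> countAndStrip(String text, StringBuilder stripped) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char mark : MARKS.toCharArray()) {
            counts.put(mark, 0); // report a zero for marks that never appear
        }
        for (char c : text.toCharArray()) {
            if (MARKS.indexOf(c) >= 0) {
                counts.merge(c, 1, Integer::sum);
            } else {
                stripped.append(c);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample text, not taken from the data set.
        StringBuilder stripped = new StringBuilder();
        Map<Character, Integer> counts = countAndStrip("One. Two, three. Four!", stripped);
        counts.forEach((mark, n) -> System.out.println(mark + "\t" + n));
        System.out.println(stripped);
    }
}
```

Counting and stripping in a single pass means each book is read only once. Note that an ellipsis written as three periods is counted as three separate periods here.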
Expected Results:
I expect that novels with higher syntactic complexity will generally be more popular than those with low complexity. Material with repetitive, simple sentence structure and a small vocabulary is often uninteresting to read. However, there is likely a limit to how complex a book can be before it loses readers: a book with a very complicated structure can be just as difficult to read. Thus, there is probably an optimal level of syntactic complexity at which books are most popular.
Research Citations:
"Project Gutenberg." Project Gutenberg. N.p., 11 Oct. 2014. Web. 06 Dec. 2014.
Wall, Larry, et al. Programming Perl, Second Edition. Beijing: O'Reilly, 1996. Print.

Team Members:

  Tabitha Welch

Sponsoring Teacher: Adam Drew


