Scholar Search – School Oriented Search Engine
School: Quemado High School
Area of Science: Computer Science
Interim: Team 087
Schools across the nation spend millions of dollars each year filtering internet content to protect students. Modern filtering technology does a decent job overall, and one step in filtering the internet is filtering search results. Many filters on the market enforce safe search on search engines, and some even remove results that could be inappropriate for students. These methods work to an extent, but the filter often has a hard time determining which results are safe. How can schools be more confident in the search results presented? Is there a way to enhance the PageRank algorithm, or does it need to be recreated from scratch to present the most relevant and safest results?
Instead of buying expensive filters that filter search results and page content, a possible solution would be to build a search engine that crawls the internet and pulls in everything. The information the search engine pulls in should be compared to blacklisted entries and, if similar, placed in a “Graylist.” Data in this “Graylist” will not show up in the search results; however, the administration of each school can decide to include those entries in the results for their school. The PageRank algorithm (PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1 through Tn are the pages that link to page A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1) will be analyzed, and new algorithms will be created and tested to find which gives the most relevant results. Another system will be designed to analyze page content and determine how relevant and appropriate the content is. If a user with elevated privileges (higher than student privileges) flags the content, the result will automatically be placed on the Graylist, checked by the administration, and then either blocked or whitelisted. If a member of the administration flags the content, the result will automatically be placed on the school's blacklist and on the global Graylist. All data collected by the search engine is saved into its respective database. Blacklist databases will be cached, and the main content will be deleted after 15 days to conserve disk space.
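As a rough illustration of the PageRank formula quoted above, the following Python sketch applies it repeatedly to a tiny link graph until the scores settle. The function name pagerank, the damping value 0.85, the iteration count, and the three-page example graph are assumptions made for illustration; this is not the project's ranking code.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # inbound[p] lists the pages that link to p; duplicate links count once.
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)

    # C(T): the number of distinct outbound links on page T.
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    # Start every page at 1.0 and re-apply the formula a fixed number of times.
    ranks = {p: 1.0 for p in pages}
    for _ in range(iterations):
        ranks = {
            p: (1 - d) + d * sum(ranks[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return ranks


# Example graph (made up): A links to B and C, B links to C, C links to A.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
example_ranks = pagerank(example_links)
```

In this example page C ends up with the highest score because both A and B link to it, and B ends up lowest because it only receives half of A's vote.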
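The flagging rules described above can also be sketched as a small amount of Python. Everything named here (Role, Lists, flag, review, is_visible) is hypothetical, and two details are assumptions rather than stated design: flags from student accounts are ignored, and a school's own blacklist always overrides its whitelist.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    STUDENT = 1
    ELEVATED = 2  # assumed: any account above student level, such as a teacher
    ADMIN = 3     # a member of a school's administration


@dataclass
class Lists:
    graylist: set = field(default_factory=set)      # global; hidden by default
    blacklists: dict = field(default_factory=dict)  # school -> set of blocked URLs
    whitelists: dict = field(default_factory=dict)  # school -> set of approved URLs


def flag(lists, url, role, school):
    """Record a flag on url raised by a user with the given role at a school."""
    if role is Role.STUDENT:
        return  # assumption: student flags have no effect
    lists.graylist.add(url)  # elevated and admin flags both graylist globally
    if role is Role.ADMIN:
        # An admin flag also blocks the result for that admin's own school.
        lists.blacklists.setdefault(school, set()).add(url)
        lists.whitelists.get(school, set()).discard(url)


def review(lists, url, school, allow):
    """Administration decision on a graylisted url: whitelist it or block it."""
    if allow:
        lists.whitelists.setdefault(school, set()).add(url)
        lists.blacklists.get(school, set()).discard(url)
    else:
        lists.blacklists.setdefault(school, set()).add(url)
        lists.whitelists.get(school, set()).discard(url)


def is_visible(lists, url, school):
    """Return True if url may appear in search results for the given school."""
    if url in lists.blacklists.get(school, set()):
        return False
    if url in lists.graylist:
        return url in lists.whitelists.get(school, set())
    return True
```

Under these assumptions, a result flagged by a teacher disappears for every school until an administrator reviews it, and approving it at one school does not make it visible at any other.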
Progress to Date:
Currently, a computer cluster is being built. The cluster was designed to be both energy efficient and compact. Since the search engine has the potential to see high demand, speed is critical; therefore network latency, network reliability, and computer uptime have all been carefully addressed. The cluster is enclosed to prevent accidental hardware damage. It contains 9 computers which, when enclosed and indexing, bring the temperature inside the enclosure up to around 175 degrees Fahrenheit. The ideal temperature is 75 degrees Fahrenheit, so cooling solutions have been designed. All 9 computers are configured, and 7 of them are currently indexing. The indexing tool was named 4x Technology and is programmed in Python; however, because of performance issues, we are currently working on reprogramming the crawler in C++. A total of 45,000 whitelisted entries are indexed, and a total of 150,000 blacklisted entries have been filtered, confirmed as inappropriate, cached, and deleted. There is no public UI online for viewing yet, because security is a main priority and we are working to ensure the safety of all users.
This project is by no means easy. We expect to have a working model ready by mid-February; however, security, data maintenance, and network redesign work will be required as we expand the cluster. We also expect that, with the working model, users will be able to search for most school topics and find results that are as relevant as Google’s results. Overall, we expect to have a small-scale working search engine online by April and ready to present by May 1st.
Team Members: Justin Miller, Stefan Beauchamp, Sam Eberle, Sam Farr
Sponsoring Teacher/Mentor: Tim Angelus