Vanity Search: A Narcissistic Text Classifier (5/12/03)

Final Project Proposal
CS 224N - Natural Language Processing

Vanity Search: A Narcissistic Text Classifier

I don't know what other people search Usenet and the web for ("sex AND britney spears AND pictures", from what I hear), but on occasion I like to search them for references to me. Even though my full name is fairly rare, searching for it misses references that can be found by searching for only my first name. However, searching by first name alone returns a large number of false positives. Thus I propose to write a text classifier to aid in such searches. It would be designed for two corpora: webpages and text messages (such as from Usenet or email lists).

First, the user would try tight searches, and tag a number of URLs as being positive matches. The user could then either provide looser searches (ones with many false positives), or the system could generate them itself from the initial pool. The resulting documents would be analyzed for similarity to the initial pool based on some NLP metric, and matches reported. User feedback about the quality of the matches would be used to further refine the search. Creating and tuning this metric would be the main work of the project. Constructing search strings to find documents similar to the pool could be part of it as well. However, it seems likely that substantial work has already been done in that area (search engines have "similar to" searches already).
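
As a rough illustration of the matching step, here is a minimal Python sketch. The proposal leaves the actual metric open, so this uses cosine similarity over raw word counts purely as a stand-in. Every name in it (tokenize, vectorize, centroid, cosine, rank_candidates) and the threshold value are hypothetical, made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vectorize(text):
    """Build a raw term-count vector (bag of words) for one document."""
    return Counter(tokenize(text))


def centroid(vectors):
    """Average several term-count vectors into one profile of the pool."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    n = len(vectors)
    return {term: count / n for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(positive_docs, candidate_docs, threshold=0.2):
    """Score loose-search results against the tagged pool, best first.

    The threshold is an arbitrary assumption; in the proposed system it
    would be tuned from user feedback.
    """
    profile = centroid([vectorize(doc) for doc in positive_docs])
    scored = [(cosine(profile, vectorize(doc)), doc) for doc in candidate_docs]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Calling rank_candidates(tagged_texts, loose_search_texts) returns (score, text) pairs for the candidates that clear the threshold. User feedback could then be folded in by adding confirmed matches to the positive pool and re-ranking, though a real metric would probably need term weighting and some handling of very common words.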

Actually, it seems as though such a classifier would be useful for refining any search, but it's nice to have an exact use in mind for inspiration. I suspect that a vanitysearch.com, or a VanitySearch Java applet, would be a moderately popular internet amusement. In the old days, of course, it would have been the basis for an entire company with millions of dollars of venture capital and eventually an IPO :).
