INTELLIGENT INDEXING

Dale Reed* and Rahul Matta+

 

Searching the Internet can result in an intractably large list of matches. Web indexing is not keeping pace with the projected Internet growth of 1000% over the next several years [J. M. Barrie and D. E. Presti, Science 274, 371 (1996)].  In addition, the percentage of the web that is indexed dropped from 60% to 42% in a little over a year [S. Lawrence and C. L. Giles, Nature 400, 107 (1999)]. Focused search is needed to give meaningful results in the face of the explosive growth in the size of the Internet.  This is particularly important since 61% of educators try to incorporate Internet resources into their curriculum but find it difficult to match educational sites and software with what they are looking for [http://www.edweek.org/sreports/tc99/, 9/23/99].

This paper describes Intelligent Indexing, a work in progress in which software allows Internet knowledge to be treated as a readily accessible extension of a user’s local machine. There are two pieces to this work: 1. A Web Personal Information Manager (Web PIM) that allows a user to locally create a searchable index of the contents of bookmarked pages, and 2. A search assistant that orders a user’s search results according to that user’s personalized context.

This work is sponsored in part by an NSF PFSMETE fellowship, DGE-9809497.

 

Finding something on the Internet  - The Problem

75% of users cite “the inability to find desired information” as one of their biggest frustrations in Internet use [www.thewebtools.com/hotstuff/pr070199.htm].  General searches can yield hundreds of thousands of hits.  Consider the example of searching for a “snake,” a piece of equipment used in sound systems for concerts that connects the microphones and instruments on stage with a soundboard further back in an auditorium.  Searching AltaVista for “snake” gives 487,070 hits.  Adding the word “audio” to “snake” gives only 185 web pages.  As this example illustrates, it is easy to forget refining words for searches, or to not know what additional words should be used. 

 

Related Search Tools

            Various search mechanisms have evolved to aid users’ searches. Sherlock [www.apple.com/sherlock/search.html] bases its search on the contents of files, not just their names, while Google [www.google.com] ranks results based on link popularity. Amazon [www.amazon.com] gives book recommendations based on the similarities between what you have bought recently and other books bought by those who purchased the same books you did.  Direct Hit [www.directhit.com] tailors search based on geographical location, sex, and age.  This gives the ability, for instance, to direct male users searching for “flowers” to sites offering flowers for sale, while directing female users searching for “flowers” to sites containing images of flowers.

Metasearch tools concurrently use multiple search engines, refining or sorting the results.  One such tool is BullsEye Pro 1.5 [www.intelliseek.com], which can give a Yahoo-like directory structure based on its conceptual analysis of the contents of the web pages.  Another is MataHari [www.thewebtools.com], which provides a uniform interface for performing complex boolean queries on multiple search engines.  According to the MataHari site, on average “users only issue 1.5 keywords per query and have little knowledge of effective query construction and how Internet search services operate.”

            A trend we can see in these tools is that context is important.  Successful searches exploit related words in: 1. The search or hyperlinks (Google), 2. Demographic or historical information (Amazon, Direct Hit), or 3. The contents of files or sites as compared to the headers or simple descriptive information (Sherlock). 

 

The Solution: Exploiting Context for Intelligent Indexing through a User Profile

The solution is to use a client-side metasearch tool that uses a personalized user profile to sort queries and provide boolean query enhancement.  The user profile keeps track of keywords that are significant for each user, tracking the frequency and relative proximity between keywords.

A user profile represents the interests of a user as a set of keywords.  For instance, a middle school science teacher might have the words “science,” “photosynthesis,” and “assessment,” while a programmer might have the words “web,” “information,” and “software.”  Keywords such as these can be extracted from three sources: 1. Local text documents, 2. Self-descriptive text typed by the user, and 3. The contents of bookmarked pages.

The contents of local text documents (e.g. word processor files and other text files) give an indication of the subjects in which a user is interested. Similarly, the user can simply type self-descriptive text to be used directly in the profile. Finally, the contents of bookmarked pages also indicate subjects and keywords in which a user is interested, as currently implemented in the Web PIM. The user can select which (or all) of these three sources should be used in creating the profile.

            The application will then keep track of which keywords are most common, and with what other keywords they are commonly found.  This will be done by creating a matrix of the 100 most common keywords, with weighting factors describing how often each combination of keywords occurs together.  This approach has been described as “Indexing by Latent Semantic Analysis” [S. Deerwester, S.T. Dumais, et al., Journal of the American Society for Information Science, 41, 6 (1990), 391-407].
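The keyword matrix described above can be illustrated with a minimal sketch. It only counts how often pairs of the most frequent keywords appear in the same document; it does not perform the singular value decomposition used in Latent Semantic Analysis. The names `tokenize`, `build_profile`, and `STOPWORDS`, the stopword list, and the sample documents are all hypothetical and are not taken from the Web PIM.

```python
from collections import Counter
from itertools import combinations
import re

# Illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_profile(documents, top_n=100):
    """Return (keyword frequencies, top keywords, pairwise co-occurrence counts)."""
    doc_tokens = [tokenize(d) for d in documents]
    freq = Counter(w for toks in doc_tokens for w in toks)
    top = [w for w, _ in freq.most_common(top_n)]
    top_set = set(top)
    cooc = Counter()
    for toks in doc_tokens:
        # Count each unordered pair once per document in which both words occur.
        present = sorted(set(toks) & top_set)
        for a, b in combinations(present, 2):
            cooc[(a, b)] += 1
    return freq, top, cooc

# Hypothetical sample documents standing in for bookmarked pages.
documents = ["web software information", "web information", "science assessment"]
freq, top, cooc = build_profile(documents)
```

Storing only the pairs that actually occur keeps the structure sparse; with 100 keywords a dense 100 by 100 array would work equally well.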

 

The Web Personal Information Manager (Web PIM)

            The Web PIM currently allows a user to browse the local file system to select a bookmarks file. All the web pages in the bookmarks file are then searched and the contents of those pages are indexed.  The search can be limited by domain, for instance limiting the search to pages that exist within a particular business or university.  The search can also be limited by depth, with the default being to index the contents of the bookmarked pages only (depth 1).  For instance, web page A could contain keywords as well as links to web page B.  By default only the keywords on page A would be indexed, but increasing the search depth to 2 then also indexes the keywords on web page(s) B. 
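The depth and domain limits can be sketched as a breadth-first traversal. This is an illustration under stated assumptions, not the Web PIM's actual code: the function `index_bookmarks`, its parameters, and the in-memory `PAGES` dictionary (which stands in for fetching real pages over the network) are all hypothetical.

```python
from collections import Counter, deque
from urllib.parse import urlparse
import re

def index_bookmarks(bookmarks, fetch, max_depth=1, domain=None):
    """Count keywords on bookmarked pages, following links up to max_depth.

    fetch(url) is assumed to return (text, links) or None if unavailable.
    Depth 1 means the bookmarked pages only, matching the default in the text.
    """
    index = Counter()
    seen = set()
    queue = deque((url, 1) for url in bookmarks)
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue
        host = urlparse(url).hostname or ""
        # Skip pages outside the requested domain, if one was given.
        if domain is not None and not (host == domain or host.endswith("." + domain)):
            continue
        seen.add(url)
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        index.update(re.findall(r"[a-z]+", text.lower()))
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in links)
    return index

# Hypothetical pages: A links to B (same domain) and to an outside site.
PAGES = {
    "http://a.example.edu/": ("Science assessment",
                              ["http://b.example.edu/", "http://other.example.com/"]),
    "http://b.example.edu/": ("photosynthesis science", []),
    "http://other.example.com/": ("software", []),
}

shallow = index_bookmarks(["http://a.example.edu/"], PAGES.get)
deep = index_bookmarks(["http://a.example.edu/"], PAGES.get,
                       max_depth=2, domain="example.edu")
```

Here `shallow` contains only the keywords of page A, while `deep` also includes those of page B but excludes the page outside `example.edu`.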

            To give an idea of the potential size of these indices, searching a bookmark file containing 126 entries yielded just under 9000 unique keywords, ranging in frequency as shown in figure 1.  The 100 most frequent words, which will be used for the creation of the user profile, were found between 15 and 64 times.  Note that 95% of the words were found fewer than 6 times and are not shown in the table.  Words such as “web” and “information” were found around 60 times.  Words such as “page” and “site” were found around 50 times, and words such as “internet,” “software,” and “technology” were found around 35 times.

 

Sorting MetaSearch Results

            Once the user profile is created, it can be used to sort the metasearch results as follows:  First the initial query is sent out simultaneously to multiple search engines.  The result of this search is a list of header information corresponding to web pages matching the search criteria.  This list of headers can itself be quite extensive.  The list is then sorted according to the most significant keywords represented in the user profile, presenting the potentially most significant web pages first.
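As a rough sketch of this sorting step, each header can be scored by summing the profile weights of the keywords it contains, and the list can then be ordered by that score. The scoring rule, the function names `score_header` and `sort_results`, and the sample profile and headers are assumptions made for illustration; the paper does not specify the exact ranking formula.

```python
import re

def score_header(header, profile):
    """Sum the profile weights of the profile keywords found in the header."""
    words = set(re.findall(r"[a-z]+", header.lower()))
    return sum(weight for keyword, weight in profile.items() if keyword in words)

def sort_results(headers, profile):
    """Order headers by descending score; ties keep their original order."""
    return sorted(headers, key=lambda h: score_header(h, profile), reverse=True)

# Hypothetical profile (keyword -> frequency) and metasearch headers.
profile = {"audio": 40, "sound": 25, "concert": 15}
headers = [
    "Snake care for reptile owners",
    "Audio snake cables for concert sound",
    "Stage audio snake",
]
ranked = sort_results(headers, profile)
```

With this profile, the header about audio snake cables is listed first and the reptile page last, which mirrors the “snake” example from the problem statement.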

            While the above metasearch and sorting steps are being carried out, a second, more thorough analysis of potentially matching web pages will be initiated.  Just as the Web PIM did a keyword analysis of bookmarked pages, a keyword analysis is now done of the sorted metasearch result pages. The current implementation of the Web PIM is server based so that this refinement step can be done as quickly as possible, though it cannot be construed as “real-time.”

 

Future work: Expert Query Refinement

            The key idea behind Intelligent Indexing is to give expert-level performance to a non-expert. Query refinement will take the words of a query and use them to index into the user profile.  The most significant words most commonly associated with the words in the query will then be automatically added (using AND) to the query.  Note that the automatic query refinements would have to be additive (AND, OR) and not subtractive (NOT); otherwise the problem becomes much more difficult, since the set of keywords a user is interested in is much smaller than the set of keywords in which a user is not interested.
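Since query refinement is future work, the following is only a sketch of one way it might operate. It assumes the profile stores co-occurrence counts keyed by keyword pairs, looks up the words most often associated with the query terms, and appends them with AND. The function `refine_query`, the `max_added` parameter, the choice to join the original terms with AND, and the sample counts are all hypothetical.

```python
def refine_query(query, cooc, max_added=1):
    """Append the keywords most associated with the query terms, joined by AND.

    cooc maps keyword pairs (a, b) to how often the two words occur together.
    """
    terms = query.lower().split()
    scores = {}
    for (a, b), count in cooc.items():
        # Check the pair in both directions so the key order does not matter.
        for term, other in ((a, b), (b, a)):
            if term in terms and other not in terms:
                scores[other] = scores.get(other, 0) + count
    # Highest association first; alphabetical order breaks ties.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return " AND ".join(terms + ranked[:max_added])

# Hypothetical co-occurrence counts for a user interested in sound equipment.
cooc = {("audio", "snake"): 12, ("cable", "snake"): 7, ("audio", "mixer"): 9}
refined = refine_query("snake", cooc)
```

For this profile the query “snake” becomes “snake AND audio”, which is the refinement a user would otherwise have to remember to type. A query with no associated keywords in the profile is returned unchanged.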



*  University of Illinois at Chicago, EECS Dept. and Northwestern University, Learning Sciences.  reed@eecs.uic.edu, www.eecs.uic.edu/~reed

+ University of Illinois at Chicago, EECS Dept.  rmatta@eecs.uic.edu