Searching the Internet can
result in an intractably large list of matches. Web indexing is not keeping
pace with the projected Internet growth of 1000% over the next several years [J.
M. Barrie and D. E. Presti, Science 274, 371 (1996)]. In addition, the percentage of the web that is indexed dropped from 60% to 42% in a little over a year [S. Lawrence and C. L. Giles, Nature 400, 107 (1999)]. Focused search is needed to give meaningful
results in the face of the explosive growth in the size of the Internet. This is particularly important since 61% of
educators try to incorporate Internet resources into their curriculum but find
it difficult to match up educational sites and software with what they are
looking for [http://www.edweek.org/sreports/tc99/,
9/23/99].
This paper describes Intelligent Indexing, a work in progress in which software allows Internet knowledge to be treated as a readily accessible extension of a user’s local machine. There are two pieces to this work: 1. A Web Personal Information Manager (Web PIM) that allows a user to create a local, searchable index of the contents of bookmarked pages, and 2. A search assistant that orders a user’s search results according to that user’s personalized context.
This work is sponsored in
part by an NSF PFSMETE fellowship, DGE-9809497.
75% of users cite “the
inability to find desired information” as one of their biggest frustrations in
Internet use [www.thewebtools.com/hotstuff/pr070199.htm]. General searches can yield hundreds of thousands of hits. Consider the example of searching for a
“snake,” a piece of equipment used in sound systems for concerts that connects
the microphones and instruments on stage with a soundboard further back in an
auditorium. Searching AltaVista for
“snake” gives 487,070 hits. Adding the
word “audio” to “snake” gives only 185 web pages. As this example illustrates, it is easy to forget refining words
for searches, or to not know what additional words should be used.
Various
search mechanisms have evolved to aid users’ searches. Sherlock [www.apple.com/sherlock/search.html] bases its search on the contents
of files, not just their names, while Google [www.google.com] ranks results based on link
popularity. www.amazon.com gives book recommendations based on the similarities
between what you have bought recently and other books bought by those who purchased
the same books you did. Direct Hit [www.directhit.com] tailors searches based on geographical location, sex, and age.
This gives the ability, for instance, to direct male users searching for
“flowers” to sites offering flowers for sale, while directing female users
searching for “flowers” to sites containing images of flowers.
Metasearch tools
concurrently use multiple search engines, refining or sorting the results. One such tool is BullsEye Pro 1.5 [www.intelliseek.com], which can give a Yahoo-like
directory structure based on its conceptual analysis of the contents of the web
pages. Another is MataHari [www.thewebtools.com], which provides a uniform
interface for performing complex Boolean queries on multiple search
engines. According to the MataHari site, on
average “users only issue 1.5 keywords per query and have little knowledge of
effective query construction and how Internet search services operate.”
A trend we can see in these tools is that context is important. Successful searches exploit related words in: 1. The search or hyperlinks (Google), 2. Demographic or historical information (Amazon, Direct Hit), or 3. The contents of files or sites, as opposed to only the headers or simple descriptive information (Sherlock).
A user profile represents the interests of a user as a set of keywords. For instance, a middle school science teacher might have the words “science,” “photosynthesis,” and “assessment,” while a programmer might have the words “web,” “information,” and “software.” Keywords such as these can be extracted from three sources: 1. Local text documents, 2. Self-descriptive text typed by the user, and 3. The contents of bookmarked pages.
The contents of local text documents (e.g. word processor files and other text files) indicate the subjects in which a user is interested. Similarly, the user can simply type self-descriptive text. Finally, the contents of bookmarked pages also indicate subjects and keywords in which a user is interested, as currently implemented in the Web PIM. The user can select which (or all) of these three sources should be used in creating the profile.
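As a minimal sketch of how keywords might be gathered from the sources a user selects, the following counts words across the chosen sources and keeps the most common ones. The function names, the source labels, the sample texts, and the small stopword list are all assumptions made for illustration; they are not taken from the Web PIM.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real profile builder would need a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}

def extract_keywords(text):
    """Lowercase the text and return its words, minus stopwords and very short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]

def build_profile(sources, selected, top_n=100):
    """Count keywords across the selected sources and keep the top_n most common."""
    counts = Counter()
    for name in selected:
        for text in sources[name]:
            counts.update(extract_keywords(text))
    return counts.most_common(top_n)

# Illustrative sample data for the three kinds of source.
sources = {
    "local_documents": ["Lesson plan: photosynthesis lab and assessment rubric."],
    "self_description": ["I teach middle school science."],
    "bookmarked_pages": ["Science assessment ideas for photosynthesis units."],
}

# The user chooses two of the three sources here.
profile = build_profile(sources, selected=["local_documents", "bookmarked_pages"])
```

Passing a different `selected` list corresponds to the user choosing which of the three sources contribute to the profile.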
The application will then keep track of which keywords are most common, and with what other keywords they are commonly found. This will be done by creating a matrix of the 100 most common keywords, with weighting factors describing how often each combination of keywords occurs together. This approach has been described as “Indexing by Latent Semantic Analysis” [S. Deerwester, S. T. Dumais, et al., Journal of the American Society for Information Science 41, 6 (1990), 391-407].
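The keyword matrix can be illustrated with a small sketch. Note the assumptions: this version simply counts, for each pair of frequent keywords, the number of documents in which both appear. It does not perform the singular value decomposition used in Latent Semantic Analysis, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_matrix(documents, top_n=100):
    """documents: a list of keyword lists, one per page or file.

    Returns (keywords, matrix), where matrix[i][j] is the number of
    documents containing both keywords[i] and keywords[j]. The diagonal
    holds the number of documents containing each keyword.
    """
    totals = Counter(word for doc in documents for word in doc)
    keywords = [word for word, _ in totals.most_common(top_n)]
    index = {word: i for i, word in enumerate(keywords)}
    matrix = [[0] * len(keywords) for _ in keywords]
    for doc in documents:
        # Distinct frequent keywords present in this document.
        present = sorted({index[w] for w in doc if w in index})
        for i in present:
            matrix[i][i] += 1
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1
    return keywords, matrix

# Illustrative sample data.
docs = [
    ["web", "software", "information"],
    ["web", "information"],
    ["science", "web"],
]
keywords, matrix = cooccurrence_matrix(docs)
```

A symmetric count matrix like this is enough to answer "which keywords are commonly found with this one"; a fuller treatment would weight and reduce the matrix as the cited paper describes.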
The Web Personal Information Manager (Web PIM)
The Web PIM currently allows a user to browse
the local file system to select a bookmarks file. All the web pages in the
bookmarks file are then searched and the contents of those pages are
indexed. The search can be limited by
domain, for instance limiting the search to pages that exist within a
particular business or university. The
search can also be limited by depth, with the default being to index the
contents of the bookmarked pages only (depth 1). For instance, web page A
could contain keywords as well as links to web page B. By default only the
keywords on page A would be indexed,
but increasing the search depth to 2 then also indexes the keywords on web
page(s) B.
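The depth and domain limits can be sketched as a breadth-first traversal. This is an illustration under stated assumptions rather than the Web PIM's actual code: `index_bookmarks`, the caller-supplied `fetch` function, and the in-memory `PAGES` table (a stand-in for the network) are all hypothetical.

```python
from collections import Counter, deque
from urllib.parse import urlparse

def index_bookmarks(bookmarks, fetch, max_depth=1, domain=None):
    """Build a keyword index of bookmarked pages, breadth first.

    fetch(url) is supplied by the caller and returns (text, links).
    Depth 1 indexes the bookmarked pages only; depth 2 also indexes
    the pages they link to. If domain is given, pages on other hosts
    are skipped without being fetched.
    """
    index = Counter()
    seen = set()
    queue = deque((url, 1) for url in bookmarks)
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue
        host = urlparse(url).hostname or ""
        if domain and not (host == domain or host.endswith("." + domain)):
            continue
        seen.add(url)
        text, links = fetch(url)
        index.update(text.lower().split())
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in links)
    return index

# Hypothetical pages: page A links to page B and to an outside site.
PAGES = {
    "http://a.example.edu/": ("photosynthesis lesson",
                              ["http://b.example.edu/", "http://other.example.com/"]),
    "http://b.example.edu/": ("assessment rubric", []),
    "http://other.example.com/": ("advertising", []),
}

def fetch(url):
    return PAGES[url]

depth_one = index_bookmarks(["http://a.example.edu/"], fetch)
depth_two = index_bookmarks(["http://a.example.edu/"], fetch,
                            max_depth=2, domain="example.edu")
```

Here `depth_one` holds only the keywords of page A, while `depth_two` adds those of page B but skips the page outside the `example.edu` domain.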
To give an idea of the potential size of these indices, searching a bookmark file containing 126 entries yielded just under 9000 unique keywords, ranging in frequency as shown in figure 1. The 100 most frequent words, which will be used for the creation of the user profile, were each found between 15 and 64 times. Note that 95% of the words were found fewer than 6 times and are not shown. Words such as “web” and “information” were found around 60 times, words such as “page” and “site” around 50 times, and words such as “internet,” “software,” and “technology” around 35 times.
Once the user profile is created, it can be used to sort the metasearch results as follows. First, the initial query is sent out simultaneously to multiple search engines. The result of this search is a list of header information corresponding to web pages matching the search criteria. This list of headers can itself be quite extensive. The list is then sorted according to the most significant keywords represented in the user profile, presenting the potentially most significant web pages first.
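The sorting step above might look like the following sketch. The scoring rule (summing the profile weights of the keywords that appear in a header) is an assumption chosen for simplicity, and the profile weights and headers are invented sample data.

```python
def score(header, profile):
    """Sum the profile weights of the distinct profile keywords found in a header."""
    words = set(header.lower().split())
    return sum(weight for word, weight in profile.items() if word in words)

def sort_results(headers, profile):
    """Order headers from highest to lowest profile score; ties keep their original order."""
    return sorted(headers, key=lambda h: score(h, profile), reverse=True)

# Hypothetical profile (keyword -> weight) and metasearch result headers.
profile = {"audio": 40, "sound": 35, "stage": 20}
headers = [
    "Snake care and feeding for reptile owners",
    "Audio snake cables for stage and studio",
    "Sound reinforcement: choosing a snake",
]
ranked = sort_results(headers, profile)
```

With this sample profile, the audio cable page is ranked first and the reptile page last.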
While the above metasearch and sorting steps are being carried out, a second, more thorough analysis of potentially matching web pages will be initiated. Just as the Web PIM did a keyword analysis of bookmarked pages, a keyword analysis is now done of the pages in the sorted metasearch results. The current implementation of the Web PIM is server based so that this refinement step can be done as quickly as possible, though it cannot be construed as “real-time.”
The key idea behind Intelligent Indexing is to give expert-level performance to a non-expert. Query refinement will take the words of a query and use them to index into the user profile. The most significant words most commonly associated with the words in the query will then be automatically added (using AND) to the query. Note that the automatic query refinements would have to be additive (AND, OR) and not subtractive (NOT); otherwise the problem becomes much more difficult, since the set of keywords in which a user is interested is much smaller than the set of keywords in which a user is not interested.
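Query refinement of this kind can be sketched as follows. The `associations` table stands in for the weighted keyword matrix of the user profile; its contents, the function name, and the `max_added` limit are assumptions made for illustration.

```python
def refine_query(query, associations, max_added=2):
    """Append the profile keywords most strongly associated with the query words.

    associations maps a keyword to {other keyword: co-occurrence weight}.
    Only additive refinement is done: new terms are joined with AND.
    """
    terms = query.lower().split()
    candidates = {}
    for term in terms:
        for other, weight in associations.get(term, {}).items():
            if other not in terms:
                candidates[other] = candidates.get(other, 0) + weight
    # Highest weight first; alphabetical order breaks ties.
    best = sorted(candidates, key=lambda w: (-candidates[w], w))[:max_added]
    return " AND ".join(terms + best)

# Hypothetical associations for a user who works with sound systems.
associations = {
    "snake": {"audio": 12, "cable": 9, "stage": 4},
}
refined = refine_query("snake", associations, max_added=1)  # "snake AND audio"
```

A query whose words have no entry in the profile is returned unchanged, so refinement never removes or negates anything the user typed.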
* University of Illinois at Chicago, EECS Dept. and Northwestern University, Learning Sciences. reed@eecs.uic.edu, www.eecs.uic.edu/~reed
+ University of Illinois at Chicago, EECS Dept. rmatta@eecs.uic.edu