Lori's lis 2600 blog: Assigned Reading 10 Search Engines Part 1 & 2

Search Engines Part 1

GYM (Google, Yahoo, Microsoft) search engines

index the most data and provide reliable sub-second responses
provide high-quality answers, and rank and present results
respond quickly to changes in content
eliminate duplication, dead links and off topic spam

The crawling algorithm uses a queue of URLs. It works by fetching a page, scanning the content for links, and saving the content for indexing. It also addresses the following:

speed
politeness
exclusion of content
avoidance of duplicate content
continuous crawling
spam rejection
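
The queue-based crawl described above can be sketched roughly as follows. This is only a toy illustration, not how any real search engine does it: the names crawl, fetch, is_allowed, and TOY_WEB are made up for this example, the "web" is a small in-memory dictionary instead of real HTTP requests, and speed, politeness delays, and spam rejection are left out entirely.

```python
from collections import deque
import re

# Very rough link extractor; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seeds, fetch, is_allowed=lambda url: True, max_pages=100):
    """Fetch a page, save its content, and queue any links not seen before."""
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # every URL ever queued, to avoid duplicates
    saved = {}             # url -> content, kept for the indexer
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        if not is_allowed(url):      # exclusion of content (robots.txt style)
            continue
        html = fetch(url)
        if html is None:             # dead link
            continue
        saved[url] = html
        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved


# Hypothetical stand-in for the web: URL -> page content.
TOY_WEB = {
    "a": '<a href="b">b</a> <a href="c">c</a>',
    "b": '<a href="a">a</a> <a href="private">p</a>',
    "c": '<a href="missing">gone</a>',
    "private": "secret",
}

pages = crawl(["a"], TOY_WEB.get, is_allowed=lambda u: u != "private")
print(sorted(pages))  # ['a', 'b', 'c']
```

The seen set is what keeps the crawler from fetching page "a" a second time when "b" links back to it, and the is_allowed check is where the exclusion rules would plug in.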

Search Engines Part 2

An inverted file is used to rapidly identify the documents that contain a search term. The file is built in 2 phases:

scanning--the text of each document is scanned, and for each indexable term a posting is created with the document number and term number. The postings are put into a temporary file in document number order.

inversion--the temporary file is sorted into term number order, with document number as a secondary sort. This provides the start point and length of the posting list for each entry in the term dictionary.
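
To make the two phases more concrete, here is a small sketch that keeps everything in memory. The function names and the three sample documents are my own assumptions for illustration, and a Python list stands in for the temporary file; a real indexer would work on disk with far more data.

```python
def build_inverted_file(docs):
    """Return (term_ids, dictionary, postings) for a list of document texts."""
    term_ids = {}   # term -> term number
    temp = []       # the "temporary file": (doc_no, term_no) in document order

    # Phase 1, scanning: one posting per term occurrence.
    for doc_no, text in enumerate(docs):
        for word in text.lower().split():
            term_no = term_ids.setdefault(word, len(term_ids))
            temp.append((doc_no, term_no))

    # Phase 2, inversion: sort by term number, then by document number.
    # set() drops repeats of the same term within one document.
    temp = sorted(set(temp), key=lambda p: (p[1], p[0]))

    postings = [doc_no for doc_no, _ in temp]
    dictionary = {}  # term number -> (start point, length) in postings
    for pos, (_, term_no) in enumerate(temp):
        start, length = dictionary.get(term_no, (pos, 0))
        dictionary[term_no] = (start, length + 1)
    return term_ids, dictionary, postings


def postings_for(term, term_ids, dictionary, postings):
    """Look up a term and slice out its list of document numbers."""
    if term not in term_ids:
        return []
    start, length = dictionary[term_ids[term]]
    return postings[start:start + length]


docs = ["the cat sat", "the dog sat", "the cat ran"]
term_ids, dictionary, postings = build_inverted_file(docs)
print(postings_for("cat", term_ids, dictionary, postings))  # [0, 2]
```

After the sort, all postings for one term sit next to each other, which is why a start point and a length are enough to find them.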

Indexes store additional information in each posting.

Query processing algorithms look up each query term in the term dictionary and locate its posting list. This returns only the documents that contain the query words.
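
A simple way to picture this lookup is below. It is a sketch under a few assumptions of mine: the index is a plain dictionary from term to a sorted list of document numbers, a document must contain every query word to be returned, and the sample index is invented. Ranking is not shown.

```python
def search(index, query):
    """Return document numbers that contain every word in the query."""
    lists = []
    for term in query.lower().split():
        if term not in index:        # a term with no postings matches nothing
            return []
        lists.append(index[term])
    if not lists:
        return []
    lists.sort(key=len)              # start from the shortest posting list
    result = set(lists[0])
    for plist in lists[1:]:
        result &= set(plist)         # keep only documents present in both
    return sorted(result)


# Hypothetical index: term -> sorted document numbers.
index = {
    "search": [1, 2, 4],
    "engine": [2, 3, 4],
    "crawler": [4],
}

print(search(index, "search engine"))   # [2, 4]
print(search(index, "search crawler"))  # [4]
```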

Query processing speeds up searches by:

skipping
early termination
clever assignment of document numbers
caching
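
As one example from the list, skipping might look something like this. The sketch assumes both posting lists are sorted, and it uses Python's bisect module as a stand-in for the skip information a real index would store; the function name and sample lists are made up for illustration.

```python
from bisect import bisect_left


def intersect_with_skipping(short, long):
    """Intersect two sorted posting lists, jumping ahead in the longer one."""
    result = []
    pos = 0
    for doc in short:
        # Jump straight to the first entry >= doc instead of stepping one by one.
        pos = bisect_left(long, doc, pos)
        if pos == len(long):     # the longer list is exhausted, so stop
            break
        if long[pos] == doc:
            result.append(doc)
    return result


rare = [3, 50, 97]                  # short list for an uncommon term
common = list(range(0, 100, 2))     # long list for a common term
print(intersect_with_skipping(rare, common))  # [50]
```

The saving comes from never looking at most of the entries in the long list, which matters when one query word is very common and the other is rare.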

1 comment:

Joan said...: Hi Lori,

I also didn't get the OAI information that we read. It was a lot of stuff that I had no idea existed.
The search engine parts were kind of interesting to read about. How they acquired their information, and how they are not supposed to access some websites if that is posted somewhere on the site.
It was good meeting you last weekend. Can't believe we just got home last Sunday. Seems like it was a long time ago.
See you later.

November 9, 2008 at 6:34 PM

Thursday, November 6, 2008
