Thursday, November 6, 2008

Assigned Reading 10 Search Engines Part 1 & 2

Search Engines Part 1

The GYM search engines (Google, Yahoo, and Microsoft) aim to:
  • index the most data and provide reliable sub-second responses
  • provide higher-quality answers, and rank and present results
  • respond quickly to changes in content
  • eliminate duplication, dead links and off topic spam
The crawling algorithm uses a queue of URLs. It works by fetching a page, scanning its content for links, and saving the content for indexing. It also addresses the following:
  • speed
  • politeness
  • exclusion of content
  • duplicate content
  • continuous crawling
  • spam rejection
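
To make the queue idea concrete, here is a minimal sketch of the fetch, scan, and save loop in Python. It is only an illustration, not how any real engine does it: the names `crawl`, `fetch`, and `FAKE_WEB` are made up for this example, the "web" is a small dictionary instead of real HTTP requests, and the sketch only covers duplicate URLs and dead links. Speed, politeness, exclusion, and spam are left out.

```python
from collections import deque
import re

# Very rough link extractor; a real crawler would use a proper HTML parser.
LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(seed_urls, fetch, max_pages=100):
    """Fetch each queued page, scan it for links, and save its content."""
    queue = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)      # URLs already queued, so none is queued twice
    saved = {}                 # url -> content kept for indexing
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        content = fetch(url)
        if content is None:    # dead link: nothing to save or scan
            continue
        saved[url] = content
        for link in LINK_RE.findall(content):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return saved

# Hypothetical three-page "web" standing in for real pages.
FAKE_WEB = {
    "a": '<a href="b">b</a> <a href="c">c</a>',
    "b": '<a href="a">a</a> <a href="missing">x</a>',
    "c": "no links here",
}

pages = crawl(["a"], FAKE_WEB.get)
print(sorted(pages))
```

Passing `fetch` in as a parameter keeps the loop separate from the network code, so the same loop could be pointed at a real downloader later.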
Search Engines Part 2

An inverted file is used to rapidly identify the documents that contain a search term. The file is built in 2 phases:
  • scanning: the text of each document is scanned. For each indexable term, a posting is created with a document number and a term number. The postings are put into a temporary file in document number order.
  • inversion: the temporary file is sorted into term number order, with document number as a secondary sort. This provides a start point and length of the posting list for each entry in the term dictionary.
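
Scanning and inversion can be sketched in a few lines of Python. This is a toy version under some assumptions of my own: the "temporary file" is just a list in memory, terms are split on whitespace, and the function name `build_inverted_file` and the sample documents are made up for the example.

```python
def build_inverted_file(docs):
    """Return (term dictionary, posting file) for a list of document texts."""
    term_numbers = {}  # term -> term number, assigned in order of first sight
    postings = []      # stands in for the temporary file: (doc no., term no.)

    # Scanning: one posting per term occurrence, in document number order.
    for doc_number, text in enumerate(docs):
        for term in text.lower().split():
            term_number = term_numbers.setdefault(term, len(term_numbers))
            postings.append((doc_number, term_number))

    # Inversion: sort by term number, with document number as secondary sort.
    postings.sort(key=lambda p: (p[1], p[0]))

    # Record where each term's list starts and how long it is.
    spans = {}
    for position, (_, term_number) in enumerate(postings):
        start, length = spans.get(term_number, (position, 0))
        spans[term_number] = (start, length + 1)

    dictionary = {term: spans[n] for term, n in term_numbers.items()}
    doc_list = [doc_number for doc_number, _ in postings]
    return dictionary, doc_list

docs = ["web search engines", "search engines crawl", "crawl the web"]
dictionary, doc_list = build_inverted_file(docs)
start, length = dictionary["search"]
print(doc_list[start:start + length])
```

Here the term dictionary only needs to hold a start point and a length per term, because after the sort all postings for one term sit next to each other.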
Real indexes also store additional information in each posting, such as how often the term occurs or where it appears in the document.
Query processing algorithms look up each query term in the term dictionary and locate its posting list. The lists are then scanned together, which returns only the documents that contain the query words.
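
As a rough illustration of that lookup, the sketch below intersects sorted posting lists so that only documents containing every query word come back. The names `intersect`, `search`, and `INDEX` are hypothetical, and the index is a plain dictionary of term to sorted document numbers rather than a real on-disk inverted file.

```python
def intersect(a, b):
    """Walk two sorted posting lists together and keep the shared doc numbers."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common

def search(index, query):
    """Return the documents that contain all the words in the query."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)          # start with the shortest list
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

# Hypothetical index: term -> sorted list of document numbers.
INDEX = {"web": [0, 2], "search": [0, 1], "engines": [0, 1], "crawl": [1, 2]}

print(search(INDEX, "search engines"))
print(search(INDEX, "web crawl"))
```

Starting with the shortest list is a small design choice that keeps the intermediate result as small as possible.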
Query processors speed up searches by:
  • skipping
  • early termination
  • clever assignment of document numbers
  • caching

1 comment:

Joan said...

Hi Lori,

I also didn't get the OAI information that we read. It was a lot of stuff that I had no idea existed.
The search engine parts were kind of interesting to read about: how they acquire their information, and how they are not supposed to access some websites if that is posted somewhere on the site.
It was good meeting you last weekend. Can't believe we just got home last Sunday. Seems like it was a long time ago.
See you later.