The listed resources are sites that have been tried and found useful or of interest; however, no guarantee, endorsement, affiliation or association with these sites and resources is implied, and no confirmation can be made that they will remain active and available in the future. These links were live as of the date of inclusion, but no personal views, opinions or guarantees are expressed by the author.



SEARCH ENGINES














Definition of a Search Engine

Search Engine Components

A Search Engine has 3 Basic Parts

1. Spider (crawler, link finder):  a computer program that harvests web links from page to page

2. Index: an organized, searchable database of the results harvested by the Spider

3. Search and retrieval mechanism: Software that allows users to search the Index and return results in a predetermined order.

But "search engine" is also commonly used to refer to any software that searches an index of words or material types.
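
As a rough illustration of parts 2 and 3 above, here is a minimal sketch in Python that builds a tiny index and searches it. The pages, URLs and text are invented for the example and stand in for a spider's harvest.

    # Minimal sketch of an Index (part 2) and a search-and-retrieval
    # mechanism (part 3); the pages stand in for a spider's harvest (part 1).
    from collections import defaultdict

    pages = {  # hypothetical harvested pages
        "http://example.edu/usenet.html": "A short history of Usenet news groups",
        "http://example.edu/web.html": "Searching the World Wide Web with search engines",
    }

    # Build the Index: each word points to the URLs that contain it.
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)

    def search(term):
        """Return the URLs whose harvested text contains the term."""
        return sorted(index.get(term.lower(), set()))

    print(search("Usenet"))   # ['http://example.edu/usenet.html']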

Examples

A. "Small" page related Search Engine - To search this page for the word   "Usenet" Click on EDIT in your browser menu, then Click (Find (on this page) .  Enter your term and search.

B. Database or Index Specific - Searches only for content within an enclosed site

C. Directory Search Engines - Searching for content or web pages submitted by hand; in other words, materials are found and maintained by a human being.
    Example - Yahoo used to search only a directory of submitted pages

D. Large Search Engines - These search engines use the "3 Basic Parts" listed above. They try to find everything on the Internet and fall short for a number of reasons.


Spiders or Robots

1. Robot software (spiders, crawlers) uses HTTP to request documents associated with a certain URL. 


2. Robots use either a depth-first or breadth-first search strategy for following URLs (see the crawler sketch at the end of this list).
- A depth-first robot follows the first link on the initial page, then the first link on the resulting page, and so on.  This is used more commonly for subject-specific search engines.

- A breadth-first robot follows the first link on the initial page, then returns to the initial page and follows the second link, and so on.  This is used most commonly for broad search engines.

3. URLs are organized in a database. 

4. The URLs from the database are "reharvested" and text from the sites is put in an index.  How much text is harvested varies among the various search engines.

5. Harvesters generate text summaries.  Most copy the <title> and a fixed amount of the initial text.

6. The search engine uses search software to search the index created by the robot searches.

7. Algorithms set each individual search engine's search parameters: Boolean operators, wildcards, etc.

8. Algorithms are used by search engines to rank the results of a search.  Factors that may be considered in ranking: which fields the search terms are found in (<title>, the URL field), the number of times the word appears in a single document, where the search term appears in the document, and payment by companies to have their pages ranked high or first.

9. Netiquette for robots.  A file named robots.txt can be placed in the root directory of a Web server; robots should leave the files and directories it lists alone, for privacy reasons (see the crawler sketch below).  In our SMC web account, Web files may be located in a folder named "private" to prevent a "local search engine" from viewing them.
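
To make points 2, 4 and 9 concrete, here is a minimal breadth-first crawler sketch in Python. The starting URL and the "ExampleBot" user-agent name are hypothetical, and a real robot would also need politeness delays, duplicate filtering and far more error handling.

    # Minimal breadth-first crawler sketch: fetch a page, harvest its links,
    # queue them, and honour robots.txt before each request.
    import urllib.request
    import urllib.robotparser
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkParser(HTMLParser):
        """Collect the href values of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def allowed(url, agent="ExampleBot"):
        """Check the site's robots.txt before requesting the URL (netiquette)."""
        parts = urlparse(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()
        except OSError:
            return True                 # no robots.txt reachable; proceed cautiously
        return rp.can_fetch(agent, url)

    def crawl(start_url, limit=10):
        """Breadth-first: visit the start page, then all its links, then theirs."""
        queue, seen, harvested = deque([start_url]), {start_url}, {}
        while queue and len(harvested) < limit:
            url = queue.popleft()
            if not allowed(url):
                continue
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            harvested[url] = html[:200]   # keep a fixed amount of text (see point 5)
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return harvested

    # crawl("http://www.example.com/")   # hypothetical starting point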


Invisible Web - What is it and what is it composed of?

According to "Invisible Planet" - The Invisible Web is the content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the directed query, the database does not publish the result. When queried, deep Web sites post their results as dynamic Web pages in real-time. Though these dynamic pages have a unique URL address that allows them to be retrieved again later, they are not persistent.

The Visible or Surface Web - Search engines are the primary means for finding information on the "surface" Web. Authors may submit their own Web pages for listing [directories like Yahoo], or search engines "crawl" or "spider" documents by following one hypertext link to another.

Major Point - what can be found via a search engine like Google is much less than what exists in total on the Internet

But search engines might not reach the "deep web" because:

1. Dynamically (database) driven websites - search engines may have difficulty harvesting content that is not marked up in HTML.

2. Search engines CANNOT search password-protected sites like EBSCOhost journal databases or online catalogs.

3. Search engines may have "difficulty" searching within Adobe PDF, Word, PowerPoint, and similar files on a web page.

The Invisible Web is 500 to 1,000 times the size of the Surface Web.

The pie chart below displays the distribution of deep Web sites by type of content.


[Pie chart: Distribution of Deep Web Sites by Content]

Comparing and ranking different search engines: how one ranks the different search engines depends on the emphasis given to the following evaluation criteria:

1. Size of the database 
- everything included, dual numbers
- selected and reviewed content

2. File Types
    - Web Pages, Usenet News, gopher, FTP, PDF (Adobe), Word, 
    - Other [software, sound, images, video]
    - Material type: Location (country), language, newspapers, journals, blogs, wikis

3. Interface
    - modes: simple or complex; look over the details of Boolean searching, etc.

4. Ranking of results - what search engines consider when ordering search results (see the scoring sketch after this list)
    - whether the word was found in the URL address
    - frequency of the search terms on the web page
    - location: words found in meta tags or the first paragraph
    - reviewed sites
    - fees paid to rank sites higher in the results list
    - proximity of the words to each other
    - Link Popularity (Google, Inktomi), also known as Peer Ranking
    - bundling of results into concepts, domains, and sites

5. Limitations
    - Language, Geography

6. Timeliness
    - Frequency of Discovery
    - Timelag
    - Weeding

7. Description of sources (annotations) found in hit list

8. Speed
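
As a simplified illustration of criterion 4 above, the sketch below scores two invented documents using a few of the listed factors (word in the title, word in the URL, frequency on the page). The documents and weights are made up purely for the example; real engines combine many more signals.

    # Toy ranking sketch: score each document by a few of the factors above.
    docs = [
        {"url": "http://example.edu/usenet-history.html",
         "title": "A history of Usenet",
         "text": "Usenet news groups predate the Web..."},
        {"url": "http://example.edu/web-search.html",
         "title": "Web search engines",
         "text": "Search engines index pages, including Usenet archives."},
    ]

    def score(doc, term):
        term = term.lower()
        points = 0
        points += 5 if term in doc["title"].lower() else 0   # term in <title>
        points += 3 if term in doc["url"].lower() else 0     # term in the URL
        points += doc["text"].lower().count(term)            # term frequency
        return points

    def rank(docs, term):
        """Return documents ordered by descending score."""
        return sorted(docs, key=lambda d: score(d, term), reverse=True)

    for d in rank(docs, "usenet"):
        print(score(d, "usenet"), d["url"])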


What Search Engines Often Don't Search
- the following listing is often referred to as the "Invisible Web"

  1. Contents of Adobe PDF and other formatted files
  2. The content of sites requiring a log-in
  3. CGI-bin output, such as data requested by a form
  4. Intranets
  5. Commercial or proprietary indexes like ERIC, UMI, Lexis-Nexis [but Google is making an agreement with WorldCat to allow library catalogs to be available]
  6. Sites that use a robots.txt file to keep robots (search engines) away
  7. Non-HTML resources: Telnet, FTP, Gopher, etc.
  8. Web sites that are "Database Driven" - note that such URLs often do not end with .htm or .html and usually contain a ? in the URL.

Search Engine Spiders will generally not retrieve or harvest these URLs.
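
As an illustration, a spider could apply a simple heuristic like the one below to decide whether a URL looks database-driven; the extension list is an assumption, and modern engines do index many dynamic pages.

    # Heuristic sketch: a query string (?) or a script-style extension
    # suggests a database-driven page rather than a static .htm/.html file.
    from urllib.parse import urlparse

    def looks_dynamic(url):
        parts = urlparse(url)
        return bool(parts.query) or parts.path.lower().endswith((".php", ".asp", ".cgi"))

    print(looks_dynamic("http://example.edu/catalog/search.php?id=123"))  # True
    print(looks_dynamic("http://example.edu/about.html"))                 # False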



Search Engine Comparison Chart


Search Engines


Ask Both - Dual Search


Bing


Boolify - easy boolean searching


Cross Language - search Arabic sites using English words


Exalead


Faganfinder.com


Google Search Engine


Izik.com


Lycos


Mojeek


Quintura


Search3 - Search Google, Yahoo and Live simultaneously


Slikk Search Engine  


Yandex


Yebol


Cluster Search Engines



Carrot clusters from differing search engines


Touch Graph- Top choice


Yippy advanced search with cloud display


Webclust Clustering engine





Country Specific Search Engines


Colossus  country specific search engines


Haystacks  country specific Google sites



Meta Mega and Multi Search Engines


Allplus


All the web


Ask Both Meta search tool that displays multiple results


Deeperweb


Dog pile


Excite


Info grid


Ithaki


IX quick


Kartoo


Mamma


Metacrawler


Namecheck username search


Neuskool


Profusion


Query server


Real-time meta search and analysis site.


Soovle Multiple search engine


Searchenginewatch


Search online info


Slikk multiple match


Surf wax


Vivisimo


Web crawler


Real time Social Media search


Alternion

Gramfeed

Flumes

Omgili

Topsy

Trendsmap

Visibium



Reverse search and Domain Links


Webboar



Semantic and Specific Field Search


  

Addictomatic.com


Anonymising meta search engine


Arabic and English search engine


Blog and posts search


Browsys.com


Glearch Country and Language Specific


Haika Semantic search engine


Highly customisable Search engine


Incy wincy


Kngine Semantic Search


Knowem Semantic search and image engine


Mahalo


Mash pedia


Multiple engine comparison site


Net search, in depth, date ranged


Realtime commercial search engine


Real time social search engine


Search cloud


Search engines from other countries


Sensebot semantic search engine


Social media keyword current search



Stealth  search engine - does not collect IP details or save search data


Split screen search engine


Wink Social network search engine


Worldwide search engines


Translation Search Engines


Linguee    


UK Search Directories

Google UK


AltaVista UK


Excite UK


Lycos UK


UK Plus


Yahoo! UK & Ireland


Mirago the UK search engine



Visual Search Directories

                             

Kartoo


Locate Metadata within documents

                                                                                    

Nexplore 


Oskope - excellent visual search return  


Search Cube - graphical search engine that presents a compact, visual format in three dimensions


Search me    


Search to find keywords within a document



Social pinup board


TOP visual search engine

                                    

Touchgraph  


Ujiko                                                                           


Visual meta search engine



Visual search engine





Search Engines - Brief Overview


A search engine consists of the interface that you use to type in a query, an index of web sites that is matched against the queried data, and a software program called a spider or bot which trawls the web at set intervals and gathers new sites for the index. When you use a search engine you are searching its index for matches with your search terms.


Global


Engines of this type, such as Google and Yahoo, read pages from all over the world in many languages and may index more than a billion pages.


Regional


Search engines that are limited geographically, such as UK sites only.


Targeted


Search engines that are limited to a single subject or topic area


Reference


Search engines that index only specific reference works


Directories


Smaller, human-edited databases that give more specific, targeted matches to sites or directories.


When searching, consider the following:


1 -- Use more than one search engine -- Use, or at least try, your search terms in two or three individual search engines to get an overview of results. Search engines cover different parts of the net and may return differing results.


2 -- Use AND to increase relevance -- Use the AND operator to significantly reduce the number of returned items and at the same time increase relevance. Check the search engine's operator terms to confirm it accepts AND, and in which format. (Type AND in capitals in most search engines.)


3 -- Use OR to include synonyms -- Use the OR operator to broaden the search and increase the number of relevant terms returned, at the expense of precision. (Type OR in capitals in most search engines.)


4 -- Use semantics -- When looking for keywords to search with, use different spellings, abbreviations, translations, synonyms, plural and singular forms, truncation, etc. Use professional terms when looking for 'professional' information and consider variants of the word.


5 -- Use NOT to exclude unwanted terms -- Use the NOT operator to exclude unwanted terms from the results. Again, verify the search engine's operator requirements; most require a dash ("-") added before the term.


6 -- Consider using web directories -- Consider using some of the larger web directories if you are unsure about search terms, want tips about experts, or an introduction to a certain subject.


7 -- Consider using meta search engines -- Use meta search engines as a last resort if none of the major search engines produce the results you want. They cover larger fields but return more general data in most cases.


8 -- Use field names to restrict your search -- Use field names to significantly increase relevance and lower recall. Use intitle: to search for title words only, use site: to search within a particular domain, and use filetype: to search for particular document formats (see the query sketch after this list).


9 -- Consider using Serial search engines -- Use one of the Serial Search Engines to quickly search more than one single search engine in succession without having to retype the query.


10 -- Use restrictive phrase searching -- When searching for specific phrases, try putting inverted commas ("like this") around your search terms or words to restrict the search field's coverage.
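
Several of the tips above (AND, OR, exclusion, field restrictions and phrase searching) can be combined in one query string. The sketch below builds such a query and the corresponding Google URL; operator syntax varies between engines, so verify it against each engine's help pages.

    # Sketch: combine Boolean operators, field restrictions and a quoted
    # phrase into one query string, then URL-encode it for Google.
    from urllib.parse import urlencode

    query = '"invisible web" AND (database OR directory) -shopping site:edu filetype:pdf'
    print("https://www.google.com/search?" + urlencode({"q": query}))
    # https://www.google.com/search?q=%22invisible+web%22+AND+%28database+OR+...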



Use the Google Advanced Search page to apply a number of the above operators.


Google Advanced Search - CLICK HERE


For advanced search operators and procedures - CLICK HERE





The inclusion of any link should not be taken to be an endorsement of the information, views or opinions expressed.