by RICHARD EINER PETERSON
Search engines fall into five categories: robotic Internet search engines which use a Web robot to retrieve a significant number of documents from the World Wide Web; mega-indexes which have links to the robotic search engines; simultaneous mega-indexes which access the robotic search engines simultaneously; subject directories which are manually-maintained collections of Web sites organized by topic; and robotic specialized search engines which focus on a small or specialized segment of the Internet.
There are eight robotic Internet search engines: AltaVista, Excite, HotBot, InfoSeek Guide, Lycos, Open Text, Ultra, and WebCrawler. The power of these search engines is compared by their performance on a search term "embargo" and a search phrase "Woodrow Wilson's Fourteen Points" over three time periods in 1996, in February, May, and November.
A comparison was made of eleven special features of these search engines: a) full text indexed; b) number of total matches given; c) maximum number of returns; d) returns: size (bytes) given; e) returns: date given; f) returns: number at a time; g) returns numbered; h) search for pictures and sound; i) browsable categories; j) daily news available; and, k) NASDAQ ticker symbol.
Five Major Categories
Big Eight Search Engine Comparison
Special Features of Search Engines
World's Fair Redux
Appendix: Web-Based Bibliographies
At the futuristic New York World's Fair of 1939, there was much talk of robots which could vacuum rugs. People expected robots that would soon clean their homes. This hasn't happened. Yet, a robot of another kind is alive and well on the Internet. It travels on the Net, retrieving documents as well as supplementary materials and data. Working day and night, these virtual robots garner information by the gigabytes.
A search engine is a computer program that searches for documents containing keywords or phrases of interest to you. The retrieving robot and the search engine act as an Information Robot or "Info-bot," a sort of obedient servant digging up dozens or even thousands of documents, quickly. To make it all the more surreal, search engines allow you to examine each retrieved file with just a click of the computer's mouse.
In an information-driven society and economy, information about information reigns supreme. The Internet-connected computer stands ready with its power and speed to interact with the information needs of the query-maker. Information retrieval is an essential skill in this age of terabytes of files and millions of computer servers. Search engines are essential in our collective quest for the Holy Grail of information.
With so many search engines, it can be quite bewildering to understand which engine will work best for a given search. Occam's razor - the razor that slices through complexity and chaos to produce simplicity and clarity - is needed. It is important to realize that there are many types of search engines.
I've identified five major segments to the search engine "industry."
(1) Robotic Internet Search Engines
A classic article on the robots that rove the World Wide Web gathering information is Martijn Koster's "Robots in the Web: threat or treat?" Koster explains the role of the robot:
A Web robot is a program that traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. These programs are sometimes called "spiders," "web wanderers," or "web worms."
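The traversal Koster describes can be sketched in a few lines. This is only an illustration under stated assumptions: TOY_WEB, fetch_links, crawl, and the example.* addresses are hypothetical names invented here, the "Web" is an in-memory dictionary rather than the live Internet, and the walk uses a queue instead of literal recursion so that each document is retrieved only once.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to the URLs its page links to.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Stand-in for retrieving a document and extracting its hypertext links."""
    return TOY_WEB.get(url, [])


def crawl(seed, max_documents=100):
    """Visit every document reachable from the seed, each one only once."""
    seen = {seed}            # URLs already queued, so loops are not followed twice
    queue = deque([seed])    # documents waiting to be retrieved
    visited = []             # documents retrieved so far, in order
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    for url in crawl("http://a.example/"):
        print(url)
```

A real robot would replace fetch_links with an HTTP request and an HTML parser, and would also need politeness rules that this sketch leaves out.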
The robotic search engines attempt to cover at random a significant portion of the World Wide Web. They examine that portion of the Internet with Universal Resource Locator (URL) addresses starting with http:// or with www. as well as parts of the Internet with HyperText Markup Language (HTML) links. At this point in time, I have identified eight engines that fit this definition: AltaVista, Excite, HotBot, InfoSeek Guide, Lycos, Open Text, Ultra, and WebCrawler.
(2) Mega-Indexes
Also known as meta-indexes, these engines do not have their own databases. Instead, they are linked to robotic search engines. There are thousands of such mega-indexes - many are just personal Web pages with search engine links. Here are a few:
All-in-One Search Page
(3) Simultaneous (Parallel) Mega-Indexes
Also known as Multi-Threaded Meta-Indexes, these are mega-indexes which access robotic Internet search engines in parallel (simultaneously) and present the unified results as a single package. Two of the better-known and free simultaneous mega-indexes are MetaCrawler and Savvy Search. For-fee simultaneous mega-indexes were recently reviewed in the literature [ 1 ].
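The idea of querying several engines at once and merging the answers can be sketched as follows. Everything here is an assumption made for illustration: engine_a, engine_b, and engine_c are hypothetical stand-ins that return fixed lists rather than calling any real service, and the merge rule (URLs reported by more engines come first) is simply one plausible choice, not the method used by MetaCrawler or Savvy Search.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for real engines: each returns a ranked list of URLs.
def engine_a(query):
    return ["http://one.example/", "http://two.example/"]


def engine_b(query):
    return ["http://two.example/", "http://three.example/"]


def engine_c(query):
    return ["http://three.example/", "http://one.example/", "http://four.example/"]


ENGINES = {"A": engine_a, "B": engine_b, "C": engine_c}


def metasearch(query, engines=ENGINES):
    """Query all engines in parallel and merge their results into one list."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # Submit every query before waiting, so the engines work simultaneously.
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        results = {name: future.result() for name, future in futures.items()}

    # Record which engines returned each URL, keeping first-seen order.
    merged = {}
    for name, urls in results.items():
        for url in urls:
            merged.setdefault(url, []).append(name)

    # URLs reported by more engines come first; the sort is stable for ties.
    return sorted(merged.items(), key=lambda item: -len(item[1]))


if __name__ == "__main__":
    for url, sources in metasearch("embargo"):
        print(url, sources)
```

Threads suit this task because each query spends most of its time waiting on a remote server, so the total delay is close to that of the slowest engine rather than the sum of all of them.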
(4) Subject Directories
These are manually-maintained and browsable, and are often searchable with robotic search engines (that index all documents at a particular site). The most famous is Yahoo!, which covers about one or two percent of the Web, has twenty-one subject headings or categories, and is both browsable and searchable. In a sense, Yahoo! is also a simultaneous mega-index since it automatically connects to AltaVista for searching the Web at large. In another sense, Yahoo! is also a mega-index since its hypertext links will take you to other robotic search engines besides AltaVista.
In its simplest form, a subject directory could be merely a set of bookmarks on someone's personal Web page. On my own home page, the most popular link is probably my subject directory entitled "Top Ten Online Newspapers and Magazines." Each newspaper and magazine is itself a subject directory which can be both browsed and searched.
(5) Robotic Specialized Search Engines
These are robotic search engines which focus on a portion of the Internet. These segments include the World Wide Web; newsgroups and discussion lists; files available by file transfer protocol (FTP); people (White Pages); companies (Yellow Pages); and, software.
A convenient one-stop location with links to the specialized search engines has been created by InterNIC. At the InterNIC server, you can find Software and FTP, Usenet and Discussion Lists, the White Pages, and the Yellow Pages.
An analysis of the "Big Eight" search engines - AltaVista, Excite, HotBot, InfoSeek Guide, Lycos, Open Text, Ultra, and WebCrawler - was undertaken in 1996 during February, May, and November. Two basic forms of queries were used: a single keyword, "embargo," and a keyword phrase, "Woodrow Wilson's Fourteen Points" [ 2 ]. The mode of analysis is that of a cross-section time-series database in which the cross section or panel consists of the robotic Internet search engines. The February 1996 panel and search results are available online [ 3 ]. Of the search engines mentioned above, HotBot appeared for the first time in May 1996 and Ultra in November 1996.
The eight search engines are compared in Table 1. The choice of two search terms - "embargo" and "Woodrow Wilson's Fourteen Points" - is arbitrary. No formal statistical tests are warranted. There are some differences that are, however, worthy of mention.
Although just two months old in February 1996, AltaVista represented a new generation of search engine power. Its 20,000 matching documents for "embargo" far exceeded its closest rival at that time - Open Text - which had 1,026 matching documents. AltaVista, with its four relevant returns for "Woodrow Wilson's Fourteen Points," outdistanced its nearest rival InfoSeek Guide with its two returns.
In terms of sheer numbers of matching documents for "embargo," growth of tenfold or more occurred between February and November 1996 for Excite, InfoSeek Guide, Lycos, and WebCrawler. In the case of Excite, the number of matching documents for "embargo" rose from a low of 40 items in February 1996 to a high of 61,650 files in November (Table 1).
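The growth claim can be checked with simple arithmetic on the "embargo" counts in Table 1. The sketch below copies the February and November 1996 figures from that table; the names EMBARGO_HITS and growth_factor are illustrative only, and AltaVista and Open Text are included for contrast.

```python
# "embargo" hit counts from Table 1: (February 1996, November 1996).
EMBARGO_HITS = {
    "Excite": (40, 61650),
    "InfoSeek Guide": (100, 1044),
    "Lycos": (366, 4219),
    "WebCrawler": (49, 612),
    "AltaVista": (20000, 30000),
    "Open Text": (1026, 3758),
}


def growth_factor(february, november):
    """How many times larger the November count is than the February count."""
    return november / february


factors = {
    engine: round(growth_factor(*counts), 1)
    for engine, counts in EMBARGO_HITS.items()
}

# Engines whose counts grew at least tenfold over the period.
tenfold = [engine for engine, factor in factors.items() if factor >= 10]

if __name__ == "__main__":
    for engine, factor in factors.items():
        print(f"{engine}: {factor}x")
    print("Tenfold or more:", ", ".join(tenfold))
```

On these figures the four engines named in the text clear the tenfold mark, while AltaVista and Open Text grew by much smaller multiples from larger starting counts.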
A powerful newcomer called Ultra, a companion of InfoSeek Guide, has appeared on the scene. It seems to represent a new generation of processing capability. For the phrase "Woodrow Wilson's Fourteen Points," Ultra's 32 relevant returns are actually an underestimate. I examined only the first fifty returns; most of the eighteen items which were not relevant were dead links. The fiftieth item in the search was relevant.
In Table 2 - "Eight Search Engines: Their Special Features" - differences and similarities in the search engine output are outlined. It appears that all the search engines index the full text of the documents retrieved by their robots, with the single exception of Lycos. On the other hand, every one of them indicates the total number of matches found for "embargo" and for "Woodrow Wilson's Fourteen Points."
Amazingly, five of the search engines provide all of their results, even if the search produces a very large number of hits. AltaVista and InfoSeek Guide limit returns to 200 items and Lycos to 100 hits. The size of a return is useful information since the lower the number of bytes, the skimpier the site. Only Excite and WebCrawler fail to provide this information.
The date of a return is especially useful if you are looking for recent documents. Excite, InfoSeek Guide, Lycos and WebCrawler do not provide the document's date.
Most engines give you ten hits at a time. HotBot, Lycos, and WebCrawler allow you to choose 10, 20, or even 100 returns at a time. AltaVista gives the option of retrieving the first, second, and up to the twentieth block of 10 returns, in any order. This allows a searcher who has gone offline during a session to move straight to the twentieth block without downloading the previous blocks. Lycos gives the same sort of option, but only up to the tenth block. The other search engines have only the option of downloading the "next" ten returns.
Numbering the returns is a convenient feature, especially in comparing search engine output over time. Only HotBot, Lycos and Open Text have this feature.
The ability to search for pictures is available with HotBot, Lycos, and Ultra. Daily news is available in Excite, InfoSeek Guide, Lycos, Ultra, and WebCrawler. If you are interested in investing in search engines on the over-the-counter market, NASDAQ ticker symbols are provided for Excite, Lycos, and Open Text.
Table 1: Eight Search Engines Compared [ 4 ]

Search Engine   Indexed URLs  Date            Search Term:      Search Phrase:
                (millions)                    Embargo           Woodrow Wilson's Fourteen Points
                                              (number of hits)  (number of hits)
AltaVista       21            February 1996   20,000            4
                              May 1996        30,000            4
                              November 1996   30,000            4
Excite          50            February 1996   40                0
                              May 1996        104               0
                              November 1996   61,650            0
HotBot          54            May 1996        38,085            3
                              November 1996   62,511            4
InfoSeek Guide  1             February 1996   100               2
                              May 1996        947               2
                              November 1996   1,044             4
Lycos           19            February 1996   366               0
                              May 1996        3,553             0
                              November 1996   4,219             1
Open Text       1.5           February 1996   1,026             0
                              May 1996        929               0
                              November 1996   3,758             0
Ultra           50            November 1996   23,441            32
WebCrawler      0.5           February 1996   49                0
                              May 1996        330               0
                              November 1996   612               2
Table 2: Eight Search Engines: Special Features

Special Features                AltaVista  Excite    HotBot    InfoSeek Guide  Lycos     Open Text  Ultra     WebCrawler
Full text indexed?              Yes        Yes       Yes       Yes             No        Yes        Yes       Yes
Number of total matches given?  Yes        Yes       Yes       Yes             Yes       Yes        Yes       Yes
Maximum number of returns       200        infinite  infinite  200             100       infinite   infinite  infinite
Returns: size given in bytes?   Yes        No        Yes       Yes             Yes       Yes        Yes       No
Returns: date given?            Yes        No        Yes       No              No        Yes        Yes       No
Returns: number at a time?      10         10        variable  10              variable  10         10        variable
Returns: numbered?              No         No        Yes       No              Yes       Yes        No        No
Search for illustrations,       No         No        Yes       No              Yes       No         Yes       No
Browsable categories?           No         Yes       No        Yes             Yes       No         No        Yes
Daily news feed?                No         Yes       No        Yes             Yes       No         Yes       Yes
NASDAQ ticker symbol            -          XCIT      -         -               LCOS      OTEXF      -         -
Most search engines do not index common words or indefinite or definite articles such as a, the, to, and be. AltaVista, however, indexes all words and is able to handle exact phrases such as "to be or not to be."
Excite has Intranet search capability, in a free version for use locally. It can be downloaded directly from the Excite server.
With HotBot, you can restrict searches to certain localities in the world. You can also narrow your search to a particular site or server.
When the InfoSeek Guide started in January 1994, it was a for-fee search engine at $US9.95 per month. In August 1994, it became free with revenues from advertisers. The ad revenue was estimated to be $US1 million in 1995 and $US5 million for the first six months of 1996 [ 5 ].
Lycos is able to search for pictures, sounds, videos, and other multimedia files on the Internet. The implications are profound:
Know a student working on an assignment for history class? Go get a sound file of Martin Luther King's "I Have a Dream" speech, or Neil Armstrong's "one small step for man" quote. Want to see Bill Clinton playing the saxophone? Go get a picture of it [ 6 ].
WebCrawler has the capability of adjacency searches. The query "arthritis NEAR/25 nutrition" locates files in which both words appear within 25 words of each other, in either direction. The query "budget NEAR deficit" returns only those pages in which the words are next to each other, in either order.
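An adjacency test of this kind can be sketched in a few lines. This is not WebCrawler's implementation: the function names tokenize and near are hypothetical, and the sketch assumes that "within N words" means the two word positions differ by at most N, with a default of 1 standing in for a bare NEAR.

```python
import re


def tokenize(text):
    """Split text into lowercase words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(text, word_a, word_b, distance=1):
    """True if the two words occur within `distance` words of each other,
    in either order. distance=1 means the words are adjacent."""
    words = tokenize(text)
    positions_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(i - j) <= distance for i in positions_a for j in positions_b)


if __name__ == "__main__":
    page = "Good nutrition may ease the symptoms of arthritis in some patients."
    print(near(page, "arthritis", "nutrition", distance=25))
    print(near("The deficit in the budget grew.", "budget", "deficit"))
```

Comparing every pair of positions is quadratic in the number of occurrences, which is acceptable for a single page; an engine working from a positional index would merge the two sorted position lists instead.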
With Ultra, you can choose to have relevant matches sorted by last modification date. This feature is helpful in locating up-to-date news on a specific topic. Using "Special Searches," you can discover how popular a particular Web site might be by typing its URL on the search form; Ultra will count and list links to the server. Ultra is able to search for an image, a Java applet, or manuscripts. It also keeps track of how often pages change and re-indexes each page at its frequency of change. You can determine how many pages refer to a given Web page or site ("searching backwards"). Suppose, for example, that you are interested in pages which refer to Infoseek. Just type +link.infoseek.com -url.infoseek.com in the query box. To discover, on a day-by-day basis, how many sites are indexed by Ultra, just type url.http as your query term [ 7 ].
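"Searching backwards" amounts to asking which pages link to a site while excluding the site's own pages. The sketch below shows that logic on a tiny in-memory index. It is an illustration under assumptions, not Ultra's implementation: LINK_INDEX, host_matches, and external_backlinks are hypothetical names, and the example.* pages are invented.

```python
from urllib.parse import urlparse

# Hypothetical index: page URL -> URLs that the page links to.
LINK_INDEX = {
    "http://www.infoseek.com/": ["http://www.infoseek.com/help"],
    "http://reviews.example/engines": ["http://www.infoseek.com/", "http://www.example.org/"],
    "http://fan.example/links": ["http://www.infoseek.com/help"],
    "http://other.example/": ["http://www.example.org/"],
}


def host_matches(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def external_backlinks(domain, index=LINK_INDEX):
    """Pages that link to `domain` but are not themselves hosted on it."""
    return sorted(
        page
        for page, links in index.items()
        if not host_matches(page, domain)
        and any(host_matches(link, domain) for link in links)
    )


if __name__ == "__main__":
    pages = external_backlinks("infoseek.com")
    print(len(pages), "external pages link to infoseek.com")
    for page in pages:
        print(page)
```

Counting the result gives the rough popularity measure described above, since the site's links to itself are left out.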
The robot dream of the 1939 World's Fair is yet to be fulfilled. John McCarthy, of Stanford University, has conjectured that household robots might become available in twenty years [ 8 ]. And, borrowing from the language of World War II, the Economist concluded that "Rosie, the Jetson's friendly house-cleaning robot, is likely to remain Rosie the plain old Riveter for a while yet" [ 9 ].
At the 1939 World's Fair, television made its first public appearance. Decades later, Marshall McLuhan described the "global village" and, inspired by the television revolution, told the world that "The medium is the message" [ 10 ]. The Internet, the most recent embodiment of a digital revolution, by offering content and by "placing the focus on information instead of infrastructure," has upended the famous McLuhan quote - the message is indeed the medium [ 11 ].
We are in a new age of "information ad hominem" - information to the person. Imagine the limitations of the card catalogs in libraries in 1939. These catalogs limited your search to authors, titles, and a finite assortment of topics, eventually leading you on a trek through shelves of paper and cloth for information.
Fast forward to the present with an estimated fifty million documents available electronically. Replace call numbers with Universal Resource Locators assigned to each document. Dispose of those arbitrary and finite categories in the card catalog.
The ultimate digital information revolution will consist of extremely personal robots that, in a matter of moments, find and electronically deliver relevant documents on "Woodrow Wilson's Fourteen Points." These search capabilities by personal and digital information assistants are the fulfillment of a dream that was not even anticipated in the 1939 World's Fair "World of Tomorrow."
Richard Einer Peterson is Professor, Financial Economics and Institutions, in the College of Business Administration at University of Hawaii, Manoa Campus, Honolulu, HI 96822. phone: (808) 956-7563, fax: (808) 956-9887; E-mail: email@example.com
1. R. Santalesa, 1996. "Search tools hit the second generation," NetGuide Magazine, (November), and W. Cunningham, 1996. "We never meta search engine..." Net Magazine, (December).
2. I remember submitting the word "embargo" to search engines early in 1995 and getting few returns. The situation has improved considerably since then. "Woodrow Wilson's Fourteen Points" is not a scavenger hunt item, but it is certainly challenging since it requires a match for all four terms. A speech on the "Fourteen Points" was made by U.S. President Woodrow Wilson just before the end of World War I, appealing for a forgiving, rather than a punitive, peace and requesting the establishment of what became the League of Nations. For more details see, for example, G. M. Gathorne-Hardy, 1939. The Fourteen points and the Treaty of Versailles. Oxford: Clarendon Press, pp. 8-11.
3. Internet search engines by Richard Einer Peterson, March 1996. The panel of ten search engines in March 1996 was marred by the inclusion of two subject directories (Harvest Broker and Magellan), a for-fee search engine (NlightN), and a now-inactive search engine (WWW Worm). The panel at that time did not include the not-yet-born search engines - HotBot and Ultra.
4. For AltaVista, InfoSeek Guide, Lycos, Open Text, and WebCrawler, data on the number of URLs indexed are from G. Venditto, 1996. "Search engine showdown," Internet World, (May). October 1996 data are used for Excite, HotBot, and Ultra and are from Excite, HotBot, and Ultra.
5. L. Armstrong, 1996. "The Education of a Web searcher," Business Week, (September 23).
6. Lycos, September 17, 1996.
7. For an analysis by Ultra of six search engines and their various search features, along with a summary table, see Ultra.
8. J. McCarthy, 1995. "Robot servants," (December).
9. "Roboflops," The Economist, October 19, 1996, p. 86.
10. See, for example, G. Wolf, 1996. "The Wisdom of Saint Marshall, the holy fool," Wired, vol. 4, no. 1 (January), pp. 122-125, 182, 184, 186, and Marshall McLuhan and Quentin Fiore, 1996. The Medium is the massage: an inventory of effects. Produced by Jerome Agel. San Francisco: HardWired.
11. S. Bourne, 1996. "This Time, the message is the medium," NetGuide Magazine, (December).
Search Engine Bibliography # 1 and Hot Link Search Engine Bibliography represent sites with useful information on Internet search engines. Bibliography # 1 mentions 76 articles; 41 of these articles include links to the actual articles. The Hot Link Bibliography includes fifty live links that lead you either to an article or to the search engine for a particular magazine or journal.
Copyright © 1997, First Monday