After the Dot-Bomb

After the Dot-Bomb: Getting Web Information Retrieval Right This Time by Marcia J. Bates
In the excitement of the "dot-com" rush of the 1990's, many Web sites were developed that provided information retrieval capabilities poorly or sub-optimally. Suggestions are made for improvements in the design of Web information retrieval in seven areas. Classifications, ontologies, indexing vocabularies, statistical properties of databases (including the Bradford Distribution), and staff indexing support systems are all discussed.

Contents

Introduction
Improving Web Retrieval
Conclusion

Introduction

At the height of the 1990's information technology bubble, an information broker, researching a question for a client, called me and explained that her client was having a dispute with another dot-com company over which company had been the first to invent the idea of "push" technology, i.e., automatically sending information to people in interest areas they had designated in advance. The goal of the query was to determine that no third party had had the idea earlier.

I explained to the broker that the idea of "push" technology was first called "selective dissemination of information," or SDI, and, to my knowledge, had first been proposed in 1961 - yes, 1961 - in an article in the journal American Documentation by an IBM computer scientist by the name of H.P. Luhn (1961). He worked out the idea in considerable detail; the only key difference was that the old mainframe computer would spit out informative postcards to be mailed to customers, rather than sending the information online - since there was no "online" to use in those days.

I have had many experiences like this one since the Internet burst on the scene in the 1990's. I have watched as hundreds of millions of dollars have been invested to re-invent the wheel - often badly. Everybody understands and takes for granted that there is an expertise needed for the application and use of technology. Unfortunately, many Web entrepreneurs fail to recognize that there is a parallel expertise needed about information - collecting it, organizing it, embedding it successfully in information systems, presenting it intelligently in interfaces, and providing search capabilities that effectively exploit the statistical characteristics of information and human information seeking behavior.

"Content" has been treated like a kind of soup that "content providers" scoop out of pots and dump wholesale into information systems. But it does not work that way. Good information retrieval design requires just as much expertise about information and systems of information organization as it does about the technical aspects of systems.

Much of what one naturally assumes about how people need, search for, and retrieve information is wrong - the truth is counter-intuitive. How people cope with not knowing and with trying to find out, and how they use information resources, is a complicated and subtle business (Bates, 1998). Likewise, the good information system solutions for enabling people to search and retrieve information effectively are also counter-intuitive. Good systems don't work the way one would assume. Had the dot-com businesses consulted the research in information science on SDI, they would have learned that SDI was largely unsuccessful, except in certain specific situations (Packer and Soergel, 1979). It comes as no surprise, then, that "push" technology has also largely failed.

In the Internet gold rush, the Web entrepreneurs and the venture capitalists who funded them all had the same conventional - and mistaken - ideas about how information retrieval works. So they made a wonderful match. The company founders and their financial backers shared a vision for their Internet companies that was wrong-headed and unproductive in many ways, but, crucially, was the same vision.

In fact, an "information industry" had already been around for decades before the 1990's. These were the companies that developed and published giant databases of patent and legal information, biological, chemical and other scientific and humanities information resources, newspaper and government information databases, etc. - companies with names like Chemical Abstracts, Infotrac, Inspec, Lexis-Nexis, and Engineering Index. These organizations had learned the hard way about information systems and information retrieval. When the dot-com newcomers came along in the 1990's, the established companies were not about to give away their hard-earned knowledge to the new kids on the block. The new companies probably would not have listened anyway.

Likewise, in the 1970's and 1980's, librarians had also created multi-million-item online public access library catalogs, when online access was a brand-new concept, and had developed a tremendous amount of expertise about how to handle large, messy databases of textual information. In fact, the largest of these catalog databases, the Online Computer Library Center's "WorldCat" database, holds over 47 million records from 41,000 libraries worldwide (http://www.oclc.org/about/). Yet it has been almost an article of faith in the Internet culture that librarians have nothing to contribute to this new age.

Improving Web Retrieval

This author has been researching and consulting in information retrieval system design for decades (see http://www.gseis.ucla.edu/faculty/bates/). Described below are some "pet peeves," some problem areas identified in the design of Web information retrieval to date. These problems are accompanied by suggested solutions, or, at least, by directions in which to look for solutions in the next round of Web information retrieval development.

Using old-fashioned hierarchical classifications.
When classifications are used in Internet databases, they are almost invariably hierarchical classifications. These are in the conventional "tree" shape, a broad area subdivided, then subdivided again and again, with each possible category contained within the one above. Librarians invented a better kind of classification decades ago, called faceted classification. It is too involved to explain in this brief article, but a good analogy is to say that faceted classification is to hierarchical classification as relational databases are to hierarchical databases. Most system designers would not dream of using hierarchical files these days, so why are hierarchical classifications of information content still being used?

Librarians implemented some faceted classifications during the twentieth century, but the technology to exploit faceting fully for online systems has become available only recently. Thus the theory as described in information organization textbooks is generally not fully adapted to the new technology, but easily can be. See, for example, Rowley and Farrow (2000) or Ellis and Vasconcelos (1999). A brief comparison of the two types of classification schemes is provided in Bates (1988).
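
As a rough illustration of the contrast, and not a substitute for the treatments cited above, the following Python sketch describes a few hypothetical records two ways: by a single path in a tree, and by independent facets. The facet names (subject, place, form), the tree paths, and the sample records are all invented for this example and are not drawn from any published scheme.

```python
# Hypothetical records; paths, facet names, and values are illustrative assumptions.
RECORDS = [
    {"id": 1, "path": ("Literature", "Poetry", "American"),
     "facets": {"subject": "poetry", "place": "United States", "form": "bibliography"}},
    {"id": 2, "path": ("Literature", "Poetry", "American"),
     "facets": {"subject": "poetry", "place": "United States", "form": "image collection"}},
    {"id": 3, "path": ("History", "Architecture", "American"),
     "facets": {"subject": "architecture", "place": "United States", "form": "image collection"}},
    {"id": 4, "path": ("Literature", "Poetry", "British"),
     "facets": {"subject": "poetry", "place": "United Kingdom", "form": "bibliography"}},
]


def browse_hierarchy(records, *prefix):
    """Return ids of records whose tree path starts with the given prefix."""
    return [r["id"] for r in records if r["path"][:len(prefix)] == prefix]


def search_facets(records, **wanted):
    """Return ids of records that match every requested facet value."""
    return [r["id"] for r in records
            if all(r["facets"].get(name) == value for name, value in wanted.items())]


# The tree answers only questions that follow its fixed order of subdivision.
print(browse_hierarchy(RECORDS, "Literature", "Poetry"))                 # [1, 2, 4]

# Facets can be combined in any order, including across branches of the tree.
print(search_facets(RECORDS, form="image collection"))                   # [2, 3]
print(search_facets(RECORDS, place="United States", subject="poetry"))   # [1, 2]
```

In this toy data, "all image collections" cuts across two branches of the tree, so the hierarchy cannot answer it by browsing a single branch, while the faceted description answers it directly.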

Succumbing to the "ontology" fallacy.
The hot new term in information organization is "ontology." Everybody's inventing, and writing about, ontologies, which are classifications, lists of indexing terms, or concept term clusters (Communications of the ACM, 2002). But here's the problem: "Ontology" is a term taken from philosophy; it refers to the philosophical issues surrounding the nature of being. If you name a classification or vocabulary an "ontology" then that says to the world that you believe that you are describing the world as it truly is, in its essence, that you have found the universe's one true nature and organization. But, in fact, we do not actually know how things "really" are. Put ten classificationists (people who devise classifications) in a room together and you will have ten views on how the world is organized.

Librarians had to abandon this "one true way" approach to classification in the early twentieth century. As many are (re-)discovering today, information indexing and description need to be adjusted and adapted to a myriad of different circumstances. Why, then, use the misleading term "ontology"?

Apart from philosophical issues, there is another, more important reason to abandon use of the term. Recorded information does not work the same way the natural world does. Information is a representation of something else. A book, or a Web site, can mix and match informational topics any way its developer feels like doing. There's no such thing as a creature that is half squirrel and half cat, but there are many mixes of half-squirrel/half-cat topics in information resources and Web sites. Methods of information indexing have to recognize what's distinctive to information, as opposed to classifications of nature, and design the systems accordingly.

For example, one fan of the poet Emily Dickinson creates a Web site that contains a one-paragraph biography of her, along with a list of every poem she ever wrote. Another fan of Dickinson devotes his site to images of the house and community where Dickinson lived. Still another has collected a bibliography of every book or article written about Dickinson and her poetry. Elsewhere on the Web are sites that group Dickinson with other nineteenth century poets or other women poets or other American poets. Beyond just using her name, how can these sites be usefully indexed so people can find the angle they want to explore about the poet?

Long-term solutions to the problems of indexing the Web will probably involve multiple overlapping methods of classifying and indexing knowledge, so that people coming from every possible angle can find their way to resources that work best for them. Instead of calling it an "ontology," label the system of description what it really is - a classification, thesaurus, set of concept clusters, or whatever (see also Soergel, 1999).

Using standard dictionaries or Roget's Thesaurus for information retrieval.
These days, many information retrieval research experiments and commercial applications are being developed that are based on the sensible-seeming assumption that if we want people to be able to retrieve text, we should build into the system a standard dictionary, or a Roget's-type thesaurus (Bartlett's Roget's Thesaurus, 1996), or an experimental mapping of vocabulary such as WordNet (http://www.cogsci.princeton.edu/~wn/). Linguists are particularly prone to this fallacy. Linguists know the most about languages, and so they assume, quite reasonably, that they should make the decisions about what linguistic resources to use for information retrieval experiments.

However, linguists are not experts in information retrieval. Through decades of experimentation, the IR community has learned how ineffectual such conventional dictionary and thesaurus sources are for real-world information retrieval. Instead, another type of thesaurus has been developed specifically for the case where the purpose is information description for later retrieval. These IR thesauri number in the hundreds, and have been developed for virtually every kind of subject matter. Many "ontologists" are truly re-inventing the wheel - an already-developed thesaurus for their subject matter may be hiding in the stacks of their local university library.

Information retrieval thesauri have a different internal logical structure, and contain words and phrases that are designed to be effective in information retrieval. Take a look at any one of these IR thesauri, and the differences from basic dictionaries and Roget's will be immediately evident. Examples:

Art and Architecture Thesaurus, (1994)

Ei Thesaurus (engineering), (1992- )

Legislative Indexing Vocabulary, (Loo, 1998)

Los Angeles Times Thesaurus (news), (1987- )

Thesaurus of Psychological Index Terms, (2001)

There is also a thesaurus for use with free-text searching, where there may or may not be formal indexing vocabulary ("controlled vocabulary") assigned to the records. Knapp's The Contemporary Thesaurus of Search Terms and Synonyms (2000) was developed by a search expert over decades of experience, and with lots of input from other searchers.

These are the kinds of thesauri that should be used to index and retrieve from online information resources.
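
The internal structure of these thesauri is not spelled out here. As a minimal sketch, assuming the relationship types commonly found in IR thesauri (a preferred term, "use for" synonyms, and broader, narrower, and related terms), an entry and a simple lookup might be modeled in Python as follows. The class, the function names, and the sample terms are invented for illustration and are not taken from any of the thesauri listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One preferred term in a hypothetical IR thesaurus."""
    preferred: str
    use_for: list = field(default_factory=list)    # non-preferred synonyms (UF)
    broader: list = field(default_factory=list)    # broader terms (BT)
    narrower: list = field(default_factory=list)   # narrower terms (NT)
    related: list = field(default_factory=list)    # related terms (RT)


# Invented sample entries, for illustration only.
ENTRIES = [
    Entry("automobiles", use_for=["cars", "motor cars"],
          broader=["motor vehicles"], narrower=["electric automobiles"],
          related=["highways"]),
    Entry("motor vehicles", narrower=["automobiles", "trucks"]),
]

# Map every entry term, preferred or not, to its preferred form.
LEAD_IN = {}
for entry in ENTRIES:
    LEAD_IN[entry.preferred] = entry.preferred
    for synonym in entry.use_for:
        LEAD_IN[synonym] = entry.preferred

BY_PREFERRED = {entry.preferred: entry for entry in ENTRIES}


def resolve(term):
    """Return the preferred term for a searcher's word, or None if unknown."""
    return LEAD_IN.get(term.lower())


def expand(term):
    """Return the preferred term plus its narrower terms, for a wider search."""
    preferred = resolve(term)
    if preferred is None:
        return []
    return [preferred] + BY_PREFERRED[preferred].narrower


print(resolve("Cars"))        # automobiles
print(expand("motor cars"))   # ['automobiles', 'electric automobiles']
```

The point of the sketch is that the searcher's word is routed to one agreed indexing term and its explicit neighbors, which a general dictionary or a Roget's-type listing is not built to do.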

Ignoring the Bradford Distribution.
We might call this the "You can't mess with Mother Nature" Principle. As they grow in size, databases and other bodies of information follow something called the Bradford Distribution - pretty much no matter what you do. In other words, all sorts of things related to information do not conform to the standard Gaussian or "normal" distribution, but rather to the Bradford Distribution. Frequencies of popular queries to a Web search engine, rates of assignment of indexing terms or classification categories to documents or sites, sizes of retrieval sets, etc., all conform to Bradford.

There are numerous sources that will explain the mathematics of Bradford (Bookstein, 1990; Brookes, 1977; Chen and Leimkuhler, 1986). In ordinary language, Bradford distributions do not have the conventional bulge in the middle, but instead have very long tails. For instance, typically there will be a few topics that are requested by huge numbers of people at a Web site, and a huge number of topics requested very little or not at all. Likewise, for retrieval sets, instead of most of the retrievals containing a middling number of "hits," some will contain huge numbers of hits and others few, with not very many retrievals producing middling numbers.

This Bradford Distribution (related to the "Pareto Distribution" in economics) is extremely robust, and virtually impossible to defeat. Systems have to be designed to work with the Bradford Distribution, rather than trying to fight it. See discussion in Bates (1998) and references within that article.
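
To make the long-tail shape concrete, here is a small Python sketch. It assumes that request frequency falls off as 1/rank, a Zipf-type form of the kind that Chen and Leimkuhler (1986) relate to Bradford's law. The 1/rank form, the figure of 10,000 topics, and the function names are assumptions chosen for illustration, not measurements from any real site.

```python
def zipf_shares(num_topics):
    """Share of all requests going to each topic, assuming frequency ~ 1/rank."""
    weights = [1.0 / rank for rank in range(1, num_topics + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def topics_needed(shares, fraction):
    """Number of top-ranked topics needed to cover the given fraction of requests."""
    covered = 0.0
    for count, share in enumerate(shares, start=1):
        covered += share
        if covered >= fraction:
            return count
    return len(shares)


shares = zipf_shares(10_000)
for fraction in (0.25, 0.50, 0.75):
    print(f"{fraction:.0%} of requests -> top {topics_needed(shares, fraction)} topics")
```

Under these assumptions, fewer than a hundred of the 10,000 topics account for half of all requests, and the other half is spread thinly across the remaining thousands. A design tuned to a "typical" middle-sized topic would fit neither end.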

Ignoring size-sensitivity of information retrieval databases.
Every type of indexing vocabulary or classification has an explicit or implicit structure, and that structure works well with only certain sized databases. That cute little classification scheme you devise when you have 1,000 records will be driving you crazy with its inadequacies by the time there are 10,000 records. The indexing vocabulary that was good for a one-million-item database bogs down at five million items - and so on.

The long-term history of libraries and online databases reflects this size-sensitivity problem in slow motion, as it were. Library cataloging systems that worked well in the early nineteenth century have had to be drastically modified every few decades since then to deal with the consequences of growth in the resource base. After World War II, when scientific research was growing rapidly, and scientific literature was exploding in quantity, whole new systems of information access had to be devised to handle it - that is, new intellectual systems, not only technological improvements.

On the Web, this explosion in growth is happening in months, not years or decades. The smart information developer must anticipate growth from the beginning and design for all planned size levels of the database from the beginning. Otherwise, you are always scrambling and always behind the curve. I have repeatedly seen dot-coms assume that they can start with some simple little classification or index vocabulary and worry about growth later. The trouble is, the growth comes in a few weeks! By then, a commitment has been made to the earlier, small system. No one wants to re-index existing records, yet the fuller development or modification of the indexing system requires re-indexing for clarity for users and indexers alike. Eventually, the classification or other metadata system becomes a hodgepodge of work-arounds and bad solutions; see also Bates (1998).

Often, the chief product a company has to offer to its Web site users is some form of indexed information. Yet, figuring out how to optimize the indexing and retrieval of that information is the last thing that is attended to during the ramp-up to going online. If you believe your information resource will grow, then design for growth from the beginning. Otherwise, trust me: It will get worse.
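
As a rough arithmetic sketch of why a scheme that suits 1,000 records strains at larger sizes, suppose a fixed set of 50 headings whose use follows a 1/rank pattern, in the spirit of the Bradford discussion above. Both the number of headings and the 1/rank pattern are assumptions made for illustration. The Python below estimates how many records pile up under the single busiest heading as the database grows.

```python
def largest_category_size(num_records, num_categories):
    """Records under the busiest heading, assuming heading use ~ 1/rank."""
    total_weight = sum(1.0 / rank for rank in range(1, num_categories + 1))
    return round(num_records / total_weight)


for num_records in (1_000, 10_000, 1_000_000, 5_000_000):
    size = largest_category_size(num_records, num_categories=50)
    print(f"{num_records:>9,} records -> about {size:,} under the busiest heading")
```

Because the scheme is fixed, the busiest heading grows in step with the database. A heading that is comfortable to browse at the smallest size becomes far too large to scan at the later ones, which is the point at which subdivision, and with it re-indexing, becomes unavoidable.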

Getting human content processing wrong.
Do you want to keep human indexing costs down? Then pay attention to the design of the indexing support system. Many Web sites today offer information that is in some part indexed or categorized by human beings. Needless to say, this is the most expensive part of many operations, and the point where efficiency produces the highest payoff. Efforts to improve processing efficiency may be limited simply to pressuring indexers to work faster. But more can be done than that to help the human indexers.

Think of the indexing support system as a separate information system, with its own requirements and users. What the indexers need in order to find their way around your system of indexing vocabulary or categories is different from what the system end users need to find their way efficiently around a body of information. Often it is the indexing support software that makes indexers inefficient, not the people themselves. It is important to keep the cognitive processing load of the indexers moderate, so that they are neither bored nor feel in overload. That, in turn, requires segmenting the indexing process into easily manageable parts, with support from the indexing system at key points.

For example, suppose you have a 5,000-term indexing vocabulary. Instead of just listing it in alphabetical order for your staff, create groups of related terms (concept clusters) on broad-concept screens. Then, instead of having to move back and forth through a lengthy alphabetical listing, the indexer can see, on the screen at once, all the terms likely to be relevant to the record in hand. Now the indexer does not have to think up half a dozen possible terms, and look them each up separately to identify the best one, but can instead, at a glance, determine the right term and assign it quickly. In sum, study indexing itself as a process, with "users" (indexers) who need to be accommodated for best performance and satisfaction, then design a targeted indexing support system.
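
A minimal Python sketch of the concept-cluster idea follows. It assumes that each vocabulary term has already been assigned to one broad concept by whoever maintains the vocabulary. The sample terms, the concept labels, and the function name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (term, broad concept) pairs; a real vocabulary would have thousands.
VOCABULARY = [
    ("adoption", "Family"),
    ("bankruptcy", "Finance"),
    ("child custody", "Family"),
    ("divorce", "Family"),
    ("loans", "Finance"),
    ("mortgages", "Finance"),
]


def build_screens(vocabulary):
    """Group terms by broad concept so related terms appear together on one screen."""
    screens = defaultdict(list)
    for term, concept in vocabulary:
        screens[concept].append(term)
    return {concept: sorted(terms) for concept, terms in sorted(screens.items())}


for concept, terms in build_screens(VOCABULARY).items():
    print(f"{concept}: {', '.join(terms)}")
```

In the flat alphabetical list, "adoption," "child custody," and "divorce" are separated by unrelated terms. On the grouped screen the indexer sees them side by side and can choose among them at a glance.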

Ignoring information expertise.
Many Web companies, in the process of developing an information-providing site, assemble a powerful team of technology, content, and graphic design experts. Programmers come from the top schools and companies, Ph.D. experts in the subject matter are brought on board, and top graphic designers are hired to present a gorgeous interface to the Web user. But even though the purpose of the site is to present information, or "content," to the world, no one who knows anything about information is brought on board.

Understanding content is not the same as understanding information; see also Bates (1999). The information specialist - the person who creates classifications, designs metadata protocols, crafts search capabilities for information system users, designs systems specifically for information retrieval - that person has an entirely different expertise than either the content expert or the programmer. All these individuals have to work together to produce a good system; see also Bates (2002). But if the information expert is left out, the resulting system will be good in every way - except at providing information!

Conclusion

In sum, the following improvements in Web information retrieval design are recommended for the process of recovering from the late-1990's "dot-bomb":

Use faceted classifications, rather than hierarchical.

Develop an understanding of what distinguishes information classifications and vocabularies from the physical-world equivalents, and stop using the misleading term "ontology."

Use the many vocabularies specifically designed for information retrieval, rather than general English language vocabularies.

Understand and work with the underlying statistical characteristics of information in designing information retrieval. Failing to understand these factors simply leads to sub-optimal systems.

Recognize that systems of information description are extremely size-sensitive. Design for all anticipated database size ranges from the beginning.

Be kind to your indexers: Design a targeted indexing-support system specifically for your human staff, and you will save much staff time.

If you develop a site with any information retrieval component at all, then hire information expertise.

About the Author

Marcia J. Bates is Professor in the Department of Information Studies at the University of California at Los Angeles. She has consulted and published widely in her specialties of subject access, user-centered design of information retrieval systems, and information seeking behavior. She is a Fellow of the American Association for the Advancement of Science, and has twice won the "Best Journal of the American Society for Information Science Paper of the Year Award."
Web: http://www.gseis.ucla.edu/faculty/bates/
E-mail: mjbates@ucla.edu

References

Art and Architecture Thesaurus, 1994. New York: Oxford University Press.

Bartlett's Roget's Thesaurus, 1996. Boston: Little, Brown.

Marcia J. Bates, 1999. "The Invisible Substrate of Information Science," Journal of the American Society for Information Science, volume 50, number 12 (October), pp. 1043-1050.

Marcia J. Bates, 1998. "Indexing and Access for Digital Libraries and the Internet: Human, Database, and Domain Factors," Journal of the American Society for Information Science, volume 49, number 13 (November), pp. 1185-1205.

Marcia J. Bates, 1988. "How to Use Controlled Vocabularies More Effectively in Online Searching," Online, volume 12, number 6 (November), pp. 45-56.

Marcia J. Bates, 2002. "The Cascade of Interactions in the Digital Library Interface," Information Processing and Management, volume 38, number 3 (May), pp. 381-400.

Abraham Bookstein, 1990. "Informetric Distributions, Part I: Unified Overview," Journal of the American Society for Information Science, volume 41, number 5 (July), pp. 368-375.

B.C. Brookes, 1977. "Theory of the Bradford Law," Journal of Documentation, volume 33, number 3 (September), pp. 180-209.

Y.S. Chen and Ferdinand F. Leimkuhler, 1986. "A Relationship between Lotka's Law, Bradford's Law, and Zipf's Law," Journal of the American Society for Information Science, volume 37, number 5 (September), pp. 307-314.

Communications of the ACM, 2002. Special issue: "Ontology Applications and Design," volume 45, number 2 (February), pp. 39-65.

Ei Thesaurus, 1992- . Hoboken, N.J.: Engineering Information, Inc.

David Ellis and Ana Vasconcelos, 1999. "Ranganathan and the Net: Using Facet Analysis to Search and Organize the World Wide Web," Aslib Proceedings, volume 51, pp. 3-10.

Sara D. Knapp, 2000. The Contemporary Thesaurus of Search Terms and Synonyms: A Guide for Natural Language Computer Searching. Second edition. Phoenix, Ariz.: Oryx Press.

Shirley Loo, (compiler), 1998. Legislative Indexing Vocabulary: The CRS Thesaurus. 22nd edition. Washington, D.C.: Library Services Division, Congressional Research Service, Library of Congress.

Los Angeles Times Thesaurus, 1987- . Los Angeles: Los Angeles Times Editorial Library.

H.P. Luhn, 1961. "Selective Dissemination of New Scientific Information with the Aid of Electronic Processing Equipment," American Documentation, volume 12, number 2 (April), pp. 131-138.

Katherine H. Packer and Dagobert Soergel, 1979. "The Importance of SDI for Current Awareness in Fields with Severe Scatter of Information," Journal of the American Society for Information Science, volume 30, number 3 (May), pp. 125-135.

Jennifer Rowley and John Farrow, 2000. Organizing Knowledge: An Introduction to Managing Access to Information. Aldershot, Hampshire, U.K.: Ashgate.

Dagobert Soergel, 1999. "The Rise of Ontologies or the Reinvention of Classification," Journal of the American Society for Information Science, volume 50, number 12 (October), pp. 1119-1120.

Thesaurus of Psychological Index Terms, 2001. Ninth edition. Washington, D.C.: American Psychological Association.

WordNet at http://www.cogsci.princeton.edu/~wn/, accessed 30 May 2002.

Editorial history

Paper received 31 May 2002; accepted 14 June 2002.

Copyright ©2002, First Monday

Copyright ©2002, Marcia J. Bates

After the Dot-Bomb: Getting Web Information Retrieval Right This Time by Marcia J. Bates
First Monday, volume 7, number 7 (July 2002),
URL: http://firstmonday.org/issues/issue7_7/bates/index.html