Who Will Create Metadata for the Internet? by Charles F. Thomas and Linda S. Griffin

After metadata schemes are established for the Internet, the next major obstacle will be to persuade information creators to use these schemes. To date, no mechanism has been effective in promoting the use of metadata for the Internet. This article examines the reasons why major categories of information providers may not see metadata as a worthwhile investment. Once these reasons are explained, the authors offer an alternative solution to create a "metadata explosion."

Contents

Introduction
Literature Review
Statement of Problem
Discussion
Conclusion
Notes

Introduction

Electronic metadata has never been more relevant in the information professions, and with good reason. Metadata is information that describes other information sources. For example, records in online library catalogs consist of metadata about books, including information such as physical dimensions, publication dates, and author names. Metadata has existed under various names in the computer science and bibliographic description professions for decades, providing enough information to manage and retrieve resources such as files or books.
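To make the definition concrete, the following minimal Python sketch models a catalog record as metadata about a book. The class name, field names, and sample values are hypothetical, chosen only for illustration; they are not drawn from any cataloging standard.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class CatalogRecord:
    """Metadata that describes a book; it is not the book itself."""
    title: str
    authors: List[str]
    publication_year: int
    height_cm: float  # one physical dimension of the item


# Hypothetical sample values, for illustration only.
record = CatalogRecord(
    title="An Example Book",
    authors=["A. Author"],
    publication_year=1997,
    height_cm=24.0,
)


def matches(rec: CatalogRecord, field: str, value) -> bool:
    """Return True if the named metadata field equals (or, for lists, contains) value."""
    stored = asdict(rec)[field]
    if isinstance(stored, list):
        return value in stored
    return stored == value
```

A retrieval system can answer a question such as "which items were written by A. Author?" by consulting records like this one, without ever opening the described resource.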

Within the past five years, a growing global Internet has placed unprecedented potential for information provision and retrieval at the fingertips of institutions and individuals alike. To the detriment of Internet users, however, the current information retrieval environment is more a "chaotic repository for the collective output of the world's digital printing presses" than an organized venue for the dissemination and discovery of information [1]. Internet publishers of all domains are turning to metadata as the most promising tool to tame this electronic tiger. Much of their faith and enthusiasm is a direct result of metadata's proven utility in other environments.

As Internet metadata initiatives gain momentum, information professionals must remember that appropriate descriptive schemes are only part of the solution to the Internet's chaos. To revitalize the Internet as a usable information environment, planners also must address the issue of who will generate the necessary metadata. This paper identifies some of the major communities of electronic information providers, and discusses reasons why each community may not wish to utilize any metadata scheme. Once the disincentives to Internet metadata implementation are explained, the authors propose an alternative that would promote a faster and more thorough use of metadata to describe, index and retrieve Internet resources.

Literature Review

Much literature over the past two years has reported on efforts to address the Internet's metadata deficiency. Popular magazines [2], professional journals, and scholarly papers have documented the progress of working groups to build and implement workable solutions. Some of the most current metadata information is also published on the Internet, including workshop summaries [3] and user guides for metadata implementation [4]. The variety and content of all of these publications can be overwhelming, but some print and electronic sources provide excellent overviews of the metadata problem. Bikson and Frinking's "Preserving the Present: Toward Viable Electronic Records" [5] and Ross and Higgs' "Electronic Information Resources and Historians: European Perspectives" [6] offer wide, general views of the scope and importance of metadata as a records management and research tool. Both sources differentiate among three major manifestations of electronic information: electronic records, electronic documents, and electronic data. This important distinction has shaped the full range of metadata schemes developed in the past few years, and has provoked scholarly discussions of appropriate uses of metadata for each of the three types [7].

Cooperative efforts to develop effective metadata schemes are documented in numerous publications. Three of the most extensively documented efforts are Scandinavia's "Nordic Metadata Project," [8] Great Britain's "Arts and Humanities Data Service" [9] and the United States Government's efforts to integrate statistical information across agency boundaries [10]. Governments are not the only bodies interested in metadata, however. Many other groups are striving to integrate metadata into existing or new modes of information provision. Some of these include the extensive work by the Online Computer Library Center (OCLC) and the National Center for Supercomputing Applications (NCSA) to implement an international metadata scheme [11], and work by library catalogers to make metadata and USMARC bibliographic description elements interchangeable [12]. Overviews of the wide range of metadata schemes are available through numerous locator sites online [13].

All of the sources mentioned above state the case for metadata as a potential remedy to the problem of finding relevant information on the Internet. The international consensus favoring electronic metadata solutions is also apparent. Much thought and discussion obviously has been invested by all interested communities. A review of these sources, however, shows that no clear solution has yet emerged for actively associating large bodies of documents and records with descriptive metadata. Once one or more metadata frameworks are in place, who is going to do the intensive work of creating vast amounts of metadata for the Internet? The bottom line seems to be that most planning groups have yet to address this issue. Since various implementations already are being tested, it is not too early to discuss this apparent oversight.

Statement of Problem

Recently, a federal grant of nearly one million dollars was awarded to the University of California at Berkeley's School of Information Management and Systems (SIMS). The School is investigating ways to make searching the Internet easier and more cost-effective. Building upon previous research, Berkeley is creating interfaces that translate simple queries into more sophisticated searches across multiple metadata schemes [14]. Such a grant award indicates both the seriousness of the problem of Internet searchability, and the focus of current efforts. The main challenges to be addressed in this research are accommodating users with varying purposes and expertise, and the lack of a dominant metadata scheme for generating, indexing and searching for information resources [15]. The award indicates how much faith is already placed in finding a descriptive language to accommodate most users. Most of the printed and electronic literature on this topic reaches the same conclusion, that an overarching standard will solve the problem of a chaotic Internet.
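The general idea of translating one simple query into searches across several metadata schemes can be sketched as follows. This is an illustrative Python sketch only, not a description of the Berkeley interfaces: the scheme names ("scheme_a", "scheme_b") and all field names are invented assumptions.

```python
# Hypothetical mapping from a user's plain field names to the field names
# used by two made-up metadata schemes.
FIELD_MAP = {
    "scheme_a": {"author": "creator", "title": "title", "year": "date"},
    "scheme_b": {"author": "main_entry", "title": "title_statement", "year": "pub_date"},
}


def translate_query(query: dict, scheme: str) -> dict:
    """Rewrite a simple query using one scheme's field names.

    Fields the scheme has no equivalent for are dropped.
    """
    mapping = FIELD_MAP[scheme]
    return {mapping[field]: value for field, value in query.items() if field in mapping}


def translate_for_all(query: dict) -> dict:
    """Produce one translated query per known scheme."""
    return {scheme: translate_query(query, scheme) for scheme in FIELD_MAP}
```

Even in this toy form, every additional scheme requires another hand-built mapping, which hints at why the lack of a dominant scheme is treated as a central research challenge.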

Careful consideration of this solution indicates a very significant likelihood that the popular approach will be inadequate to tame the Internet. Although a common metadata standard does offer much promise, it may be useless if it is not implemented widely. The challenge of persuading Internet information providers to implement a standard may be more difficult than any development issues. To achieve a successful metadata solution, we must discover ways to encourage extensive metadata generation.

Discussion

Most metadata framework developers misjudge the degree to which schemes will be implemented. In the scenarios currently envisioned by Internet planners, the burden of resource description falls upon those who create online content. Such an assumption is flawed, because information providers in many sectors will encounter strong disincentives to generating metadata for the Internet. Most of these disincentives relate to money. Whether electronic information is provided by governments, academic communities, or profit-driven enterprises, the key to promoting a metadata explosion is financial incentive.

Internet planning groups such as the World Wide Web Consortium are counting on the business community to aggressively develop and adopt a usable metadata system. Their faith is based upon the strong commercial potential of the Internet [16]. This trust may be misplaced, however, for any metadata generation and storage requirement that runs counter to the simple and effective operation of a business is likely to be viewed as an unworthy investment [17]. It is important to make a distinction here between two uses of metadata by the business community. The first is as an advertising function, to catch the attention of online searchers. In this regard, profit-driven institutions might indeed wish to use metadata.

The second use of metadata by businesses is a more significant consideration. Advertising is a small percentage of a company's activity. The much more important application of metadata is to an institution's electronic records and non-advertising documents. Given the quantities of internal information placed online by businesses, it is unlikely that metadata for external users will be a high priority. Furthermore, differing information needs and applications within individual business offices may preclude any one metadata scheme's adoption. Institutional pressures to make money from information resources are mounting. The profit motive may drive businesses to develop proprietary controls over their information that are geared more toward accounting than toward universal access [18]. Little research is available to show that businesses are likely to develop or use metadata on the Internet, or to willingly allocate time and money to maintaining "metabases" of descriptive information so voluminous that its management might overshadow the original information content [19]. For these reasons, for-profit information providers and other businesses are unlikely to embrace a descriptive scheme that promotes free access to their records.

The academic community also may move slowly to adopt an Internet metadata scheme. Lynch's "Searching the Internet" predicted that any metadata scheme adopted for electronic documents in the humanities and sciences will require two components, machine-generated metadata and human-generated annotations. Lynch's prediction is endorsed by other scholars, such as those involved with Great Britain's "Arts and Humanities Data Service" [20]. The scope of such metadata generation will be very labor-intensive, and will require huge investments on the parts of academic institutions, or scholarly electronic publishers, who will see little immediate return on their investment. Such obstacles will not be overcome easily. Any metadata implementation effort may receive diminishing support as academic institutions find themselves investing less time generating new knowledge, and more time describing what they create. Like the business community, the scholarly community therefore seems unlikely to adopt an Internet metadata paradigm that forces them to describe their own information resources.
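The two components described above can be illustrated with a short Python sketch. The machine-generated fields shown here (a word count and the first line of the text) and the human annotation fields are hypothetical choices made for illustration; they do not come from Lynch or from any published scheme.

```python
def machine_metadata(text: str) -> dict:
    """Derive descriptive fields automatically, with no human effort."""
    lines = text.splitlines()
    return {
        "word_count": len(text.split()),
        "first_line": lines[0] if lines else "",
    }


def describe(text: str, annotations: dict) -> dict:
    """Combine machine-generated metadata with human-supplied annotations.

    The two parts are kept separate so the costly, human-authored portion
    remains visible in the resulting record.
    """
    return {
        "machine": machine_metadata(text),
        "human": dict(annotations),
    }
```

The "machine" part costs almost nothing to produce, while every entry in the "human" part requires a person to read and judge the resource, which is where the labor discussed in this section accumulates.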

Who is likely to have the resources and will to adopt a metadata framework? Local and national government agencies offer a greater promise in this regard. National government information providers already utilize highly effective Internet metadata schemes. Unfortunately, governmental information represents only a limited portion of the information universe. Governmental metadata schemes are geared mainly toward managing large volumes of statistical data. As mentioned previously, statistical data is only one of the three major types of electronic information. Even if this segment of online information were fully augmented by searchable metadata, Internet users would still face a messy discovery process, with no overarching consistency across categories of information. Government information providers already have invested significant resources to create usable descriptive information, but their efforts alone will not achieve the goal of a comprehensive metadata implementation.

If the business, academic and governmental sectors of the Internet are not fully prepared for the task, who is willing to create descriptions of the growing body of information online? In the event that one or more Internet metadata schemes ever become dominant, the preceding discussion demonstrates that the majority of information creators may be unwilling to also be information describers. Without adequate metadata description, Internet users will be forced to devote increasing amounts of time to resource discovery, or perhaps to seek knowledge through other channels.

Another key community has not yet been mentioned, however. Of the major players in the current Internet environment, commercial indexing services such as Excite, Yahoo and InfoSeek have the strongest financial incentive to see a workable metadata system implemented. Since the Lycos Internet indexing service went online in 1994, numerous indexers have emerged to compete for usage. These services vary in their methods, and some are better suited to specific types of searches than others, but all of the Web engines are roughly equal in searching power [21]. In the current environment, commercial Internet indexers do not derive revenues directly from users. Instead, they compete with one another for usage based upon the quality and sophistication of their service. By leveraging their use statistics, they profit mainly by attracting advertisers. This information market relationship has thrust online users to the position of highest prominence, for they incur no cost to use these services, while the Internet indexers must continually seek ways to maintain or improve their overall use by the online community.

Commercial Internet indexing services have responded to this challenge with ingenuity, innovation and the will to impose order on the various categories of electronic information. If widespread generation of metadata really will improve searchability on the Internet, devising an incentive structure for commercial indexers as the creators of Internet metadata is a logical extension of the role they already play. Since the value of metadata is very subjective, competition would ensure the success of those enterprises that create the greatest quantities of useful metadata [22]. Shifting the responsibility for metadata provision from information creators to commercial indexing services would also assist in prioritizing the information marketplace, so that the most commonly sought resources would be the first to be thoroughly described by commercial indexers. Just as money will be the primary obstacle to metadata implementation for most information creators, profit derived indirectly from increased usage would be the greatest incentive for metadata generation by Internet indexers. Regardless of how financial reward is structured, profit is the catalyst most likely to usher in a rapid growth of metadata across the Internet.

Conclusion

Metadata for the Internet is an extremely complex issue for both developers and planners. Creating descriptive frameworks for a wide range of digital information formats has required enormous contributions of time and money from numerous communities. Because of metadata's success in other applications, Internet planners have good reasons to make these investments. Metadata on the Internet, whether as one common standard or multiple schemes within information categories, appears to be the remedy for disorder on the global information network.

As with most remedies, however, consistent treatment is vital to overcoming the illness. Much international effort is being devoted to the problem of developing a common metadata standard. The major oversight in such efforts is the issue of who is willing to implement any standard. Few incentives have been shown for encouraging metadata creation, or the Internet already would be filled with resource descriptors. The only reasonable way to encourage widespread metadata implementation is to provide a strong potential for profit from use of the information.

Such a scheme has not yet existed because information creators are still seen as the most appropriate source of metadata. Who, after all, can describe something better than its creator? This assumption overlooks the fact that the original function of a document or data is often not the use to which it is later put. For a great part of the Internet's content, information creators therefore are not necessarily any better at generating metadata than anyone else.

Taming the Internet will be an enormous task. Intimidated by the scope of the task, some may suggest that large government subsidization of the effort will be necessary. Indeed, as noted earlier, governments do not have to worry about profit potential, and often can muster vast resources of money to achieve such goals. National governmental organizations, such as the United States' Library of Congress, played key roles in similar challenges of the past. The standardization of bibliographic control among libraries would not have been achieved without substantial training, advocacy, and financial support from this governmental institution. If we were to use history as a justification for contemporary government funding of a metadata proliferation, however, what would be the result? How long would such funding be necessary, and which governments would assume responsibility for transnational information resources? Numerous questions such as these, as well as many conspicuous attempts by the U.S. government in recent years to regulate and control Internet content, should concern anyone who ponders this issue.

For the foreseeable future, no automated system is likely to be able to describe electronic information without frequent human assistance. Humans will continue to play an important role in the process of assisted information discovery and retrieval. If a machine could do all of the work for us, from metadata creation to search mediation, the human element would not matter. As this is not the case, Internet and metadata planners should appeal to an incentive that has proven most successful throughout history: profit potential. Commercial indexing services already have learned to profit indirectly from organizing the Internet's content. If metadata schemes are to succeed, these indexers, or their successors, must be able to realize a profit from investing the effort required to generate and index reliable metadata. Without such motivation, metadata will continue to be an unattainable solution.

About the Authors

Charles F. Thomas is an archivist at Louisiana State University, and currently serves as the Curator of the Louisiana and Lower Mississippi Valley Collections.
E-mail: chuck@seal.lib.lsu.edu
Linda S. Griffin is a cataloging librarian at Louisiana State University.
E-mail: notlsg@unix1.sncc.lsu.edu

Notes

1. C. Lynch, 1997. "Searching the Internet," Scientific American, volume 276, pp. 52-56.

2. E. Sullivan, 1997. "Standards aim to tame the Web," PC Week, volume 14, number 46, pp. 40-41.

3. S. Weibel and E. J. Miller, 1996. "Image description on the Internet: Summary of CNI/OCLC Image Metadata Workshop," Annual Review of OCLC Research, at http://www.purl.org/oclc/review1996

4. S. Weibel, J. Kunze, C. Lagoze and M. Wolf, 1998. RFC 2413: Dublin Core Metadata for Resource Discovery. Available at ftp://ftp.isi.edu/in-notes/rfc2413.txt

5. T. K. Bikson and E. J. Frinking, 1993. Preserving the Present: Toward Viable Electronic Records. The Hague, Netherlands: Sdu Publishers.

6. S. Ross and E. Higgs (editors), 1993. Electronic Information Resources and Historians: European Perspectives. Proceedings of a workshop held at the British Academy June 25-26, 1993. Berlin: Max-Planck-Institut für Geschichte.

7. D. A. Wallace, 1995. "Managing the present: Metadata as archival description," Archivaria, volume 39, pp. 11-21.

8. O. Husby and others, 1996. The Nordic Metadata Project, at http://linnea.helsinki.fi/meta/index.html

9. D. Greenstein and L. Dempsey, 1997. "Crossing the Great Divide: Integrating Access to the Scholarly Record," at http://ahds.ac.uk/public/metadata/disc_02.html

10. A. R. Tupek and C. S. Dippo, 1997. "Quantitative literacy: New website for federal statistics provides research opportunities," D-Lib Magazine, (December), at http://www.dlib.org/dlib/december97/stats/12tupek.html

11. Online Computer Library Center (OCLC), 1997. Dublin Core Metadata, at http://purl.oclc.org/metadata/dublin_core/

12. P. L. Caplan and R. S. Guenther, 1996. "Metadata for Internet Resources: The Dublin Core metadata elements set and its mapping to USMARC," Cataloging and Classification Quarterly, volume 22, numbers 3-4, pp. 43-58.

13. United Kingdom Office for Library and Information Networking (UKOLN), 1998. Metadata, at http://www.ukoln.ac.uk/metadata/

14. School of Information Management & Systems, University of California at Berkeley, 1998. "Search Support for Unfamiliar Metadata Vocabularies," at http://www.sims.berkeley.edu/research/metadata/index.html

15. C. S. Dippo and V. Grieg, 1997. National Statistical Information Infrastructure, at http://www.isi.edu/nsf/papers/dippo.htm

16. L. Dempsey, R. Russell and R. Heery, 1997. "In At the Shallow End: Metadata and Cross-domain Resource Discovery," at http://ahds.ac.uk/public/metadata/disc_07.html

17. H. MacNeil, 1995. "Metadata strategies and archival description: Comparing apples to oranges," Archivaria, volume 39, pp. 22-38.

18. Ross and Higgs, op.cit.

19. J. Doppke, D. Heimbigner and A. L. Wolf, 1998. "Software process modeling and execution within virtual environments," ACM Transactions on Software Engineering and Methodology, volume 7, number 1, pp. 1-40.

20. Dempsey, Russell and Heery, op.cit.

21. S. Nicholson, 1997. "Indexing and Abstracting on the Internet," Information Technology and Libraries, volume 16 (June), pp. 73-81.

22. Wallace, op.cit.

Copyright © 1998, First Monday