First Monday

INFOMINE: Promising Directions in Virtual Library Development

This article discusses academic virtual libraries (VLs), the challenges they face, and possible solutions that will support their continued sustainability. Among these solutions are INFOMINE's efforts to begin a cooperative VL content building program, together with open software development in the area of machine-learning based hybrid systems. Such a system will feature the best of VL expert-based content selection and description effort, augmented by machine-assisted resource collection, classification, and collection maintenance software.




There is a major need for expert-mediated virtual libraries (VLs) of well-selected and described links to scholarly and educational resources. VLs are one important component in providing for the Internet resource-finding needs of the academic community. Generalized commercial Web search engines, and even second generation engines such as Google, are often unable to produce consistently relevant results given their generalized focus, the immense amount of territory they cover, and the great number of audiences they serve (Chakrabarti et al., 1999a and 1999b; McCallum et al., 2000). Other problems include a lack of intelligible descriptive information in results displays, non-objective presentation of search results (since site placement can be purchased), distracting messages and advertisements, and the possibility that these engines will in the future become fee-based at costs that academia might not be able to bear.

Expert-mediated, objective finding tools that describe and provide well organized, uniform, integrated access are needed because they make important scholarly and educational resources visible and of use to researchers and students. The response to this need can be seen in the development over the last few years of virtual libraries of categorized links (i.e., Web directories, indexes, portals) of all sorts, including those which, over the last couple of years, have been produced by and now augment the general search engines themselves, such as AltaVista. We believe that academic virtual libraries such as INFOMINE (25 minutes per record) have been providing an important finding tool service by striking a balance between labor-intensive MARC cataloging and traditional library catalog approaches (1 to 1½ hours per record created) on the one hand and, on the other, the generalized minimal indexing approaches of virtual libraries such as Yahoo! (1½ minutes per record) and the general search engines. This service can be characterized, in most cases, by expert selected and described resources that adhere to various standards (new and traditional) in academic information organization and retrieval, while emphasizing efficient and streamlined approaches to content building.

The Web will continue to grow at an ever-accelerating pace. At their current levels of growth, the summed content of the community of academic virtual libraries will consequently represent a steadily shrinking portion of the total number of worthwhile Internet resources. Virtual library managers therefore face many challenges that they need to meet in order to sustain themselves while providing comprehensive, or even representative, coverage of useful research and educational resources.

The discussion below points out some of these challenges and outlines some possible solutions including those that our VL, INFOMINE, is currently pursuing.


INFOMINE Description

INFOMINE is currently (as of April 2000) a collection of close to 20,000 librarian selected and described scholarly and educational Internet resources. It was one of the very first library originated, Web-based information services of any kind. INFOMINE has been created by University of California, California State University and other librarians working together in a cooperative effort. INFOMINE is funded by the Library of the University of California, Riverside, the Fund for the Improvement of Post-Secondary Education (FIPSE, U.S. Department of Education) and the U.S. Institute for Museum and Library Services (IMLS National Leadership Grant). INFOMINE is easy to use and was designed to accommodate multiple skill levels among users. Sophisticated queries and browsing are supported. Most major disciplines are well covered via access to important databases, e-journals, e-texts and digital collections, among others. Our funding from FIPSE and IMLS is for research in developing software, labor efficiency improvements and cooperative organizational solutions for challenges which INFOMINE and, by extension, the virtual library community at large, are facing.

The Need for Library-based Academic Virtual Libraries

There are many pressing needs for library-based academic VL type finding tools. Driven by library user information needs, libraries have increasingly been augmenting print-source indexing and cataloging with efforts in Internet resource description and organization. Skills transfer easily. Organizing important information for scholarly and educational uses is our mission and is what we have traditionally done as librarians. The proof is in the pudding: academic, library-based virtual libraries of various sizes have begun to proliferate by the hundreds. Librarians, albeit in a generally disorganized fashion, are finding the time and see this as an important expenditure of effort for their users. An incentive apparently exists here as does a general approach or model. Librarians are a natural group to organize in the cause of serious VL building. The challenge now is to focus this energy.

Academic and educational community needs in Internet finding tools are different from most generalized search engine user communities. Academic library VLs proliferate because generalized search engines no longer scale, not so much in regard to Internet coverage, but rather in the sense that, given their general multiple audience focuses, they cannot consistently provide serious researchers with appropriate results. Moreover, the goal of their efforts naturally is to return maximum profit to their owners and this may in the long run be inconsistent with the needs of academic institutions for constant, high-usage access to finding tools that offer affordable, objectively presented, and described content. Just as academics have required expert-mediated, uniform, and objective practices in organizing print information and have thus organized and provided institutional support for the existence of libraries to provide access to information above and beyond the open book market, there is now a similar crucial institutional need for library-based efforts, approaches, and standards (modified appropriately) in organizing academically relevant Internet information. In serving our community of users, academic libraries may very well become the natural home and energy source behind organized, concerted, high quality, and properly funded academic Internet finding tools. Our project and a number of others indicate that new views of the roles of libraries and librarians are taking form. So envisioned, how do we, as librarians providing VLs, move from today's uncertainties to a more sustainable position?

VL Challenges and New Directions

Note that, while the following are some directions for the VL community to explore, there is no single "right" solution. The key to VL sustainability lies in the creative exploration and application of flexible solutions in many difficult areas by many different projects. There are, in addition, numerous other challenges and directions not mentioned here. Many of these solutions have been explored, discussed, and even implemented in small ways, in fits and starts, over the last couple of years.

Direction A: Improving Cooperative Organization and Aggregation Effort Through

  1. Interoperability Among Multiple Tools

    Many organizations which provide VL services are, in order to scale, increasingly interested in boosting their coverage through interoperating with (e.g., passing searches among) other services. This is a useful effort and one which INFOMINE is following. The interoperability standards and conventions that should come from efforts such as IMesh, Isaac Network, and ROADS should be important to all VL service providers. Cooperating partners in such projects are concerned with offering reasonably seamless searching of what are often very heterogeneous VL resources. Approaches here include building a common interface for many distributed resources and/or allowing each partner to retain its own interface through which its resources (usually primarily) and those of others (usually secondarily) are queried. Work that leads to passing queries and to the development of similar field structures and standards in resource description is inherent here. Perhaps more importantly, this kind of interoperability is the simplest means for initiating cooperation among VLs. This direction is, organizationally, a useful and strategic first step which allows for the retention of institutional identity and "ownership" while initiating substantial cooperative effort. Such work may eventually lead to more intense forms of cooperation which may yield even better tools.
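The query-passing approach described above, in which a partner's own resources are searched first and those of others second, can be sketched roughly as follows. This is only an illustrative sketch, not the method of IMesh, Isaac Network, ROADS, or INFOMINE: the `Record` fields, the `make_service` helper, and the in-memory catalogs are all hypothetical stand-ins for real, networked VL services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    title: str
    url: str
    source: str  # name of the VL that supplied the record


# A "service" is modeled as any function that takes a query and returns records.
SearchFn = Callable[[str], List[Record]]


def make_service(name: str, catalog: Dict[str, str]) -> SearchFn:
    """Build a toy search function over an in-memory {title: url} catalog."""
    def search(query: str) -> List[Record]:
        q = query.lower()
        return [Record(t, u, name) for t, u in catalog.items() if q in t.lower()]
    return search


def federated_search(query: str, local: SearchFn, partners: List[SearchFn]) -> List[Record]:
    """Query the local VL first, then each partner; skip URLs already returned."""
    seen = set()
    merged = []
    for service in [local, *partners]:
        for record in service(query):
            if record.url not in seen:
                seen.add(record.url)
                merged.append(record)
    return merged
```

Because local results are merged first, each partner keeps its own records at the top of its own results display, which is one simple way to preserve institutional identity while still extending coverage. Agreeing on a shared record structure (here the hypothetical `Record`) is the part that depends on the common field structures and standards mentioned above.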

  2. Cooperation in a Single Open Tool

    Some of the problems and promises of interoperability would be equally present here. However, some problems may be solved by aggregating effort in a single open, cooperatively built tool that has its own presence and runs as a separate organization. Given that interoperability and cooperation overlap greatly, they might even be seen as different stages of the same organizational effort: a loose confederated effort leading, as appropriate, into a more consolidated and focused effort. Multiple entities can be involved but the service provided by the tool would be managed by a cooperative organization that existed separately from its constituents.

    A single cooperative tool and organization might provide for greater efficiencies and improved features in a number of areas including:

    • Greater uniformity and perhaps quality in metadata development, interface, and other system features;
    • Speedier, more responsive systems development, and new technology implementation generally;
    • Better query and system response times through a centralized, non-distributed system architecture that wouldn't be hampered by accommodating the lowest common denominator or least developed/slowest participants;
    • Quicker decision-making through a less heterogeneous organization;
    • Elimination of redundancy in effort in not only content building but systems development as well;
    • Greater success in marshalling more resources around a single, well-defined funding target;
    • Pooling resources and effort around a single mutually owned and cooperatively created tool; and,
    • Supporting institutional identity and "ownership" through custom interface work to create appropriate look and feel.

  3. National NetGain

    INFOMINE is beginning to work with selected college and university libraries to methodically and organically develop a national cooperative network of content and system builders. This effort is known as National NetGain. NetGain is an effort with cooperating partners but one which is focused around building a single tool (as described above). We believe that there is a need for a number of national level, coordinated efforts such as INFOMINE. Our efforts will benefit the library community and all participants: contents entered are owned by their authors and/or the organizations that authors are affiliated with as well as the central NetGain project. System software will be co-developed by participants and be placed in the public domain. User interface options are flexible and would meet the needs of cooperating organizations regarding functionality and "look and feel". NetGain could help save resources among libraries which are now spending a great amount of energy to create frequently redundant content. Together we could do what we do better, and with significant savings in resources, than we can by continuing to work individually. If you or your institution are interested in NetGain, contact us.

Direction B: Improving Systems, Open Software, and Tiered Levels of Resource Description

Providing greater resource coverage while at the same time retaining quality in resource selection and description is key to sustainability. Achieving a good balance between respectable finding tool "reach" (typical of the larger Web search engines) and metadata quality (typical of the academic VLs) is a challenge involving technological as well as labor and other economic concerns. Part of the answer is in developing "smart" systems that are really hybrids which amalgamate expert based VL approaches with machine-learning based crawling and classification systems. This hybrid system will require increasingly interweaving expert effort with machine assistance in the major labor and time consuming tasks involved in finding tool collection development, resource description, and collection maintenance. Also involved will be the development of an approach to flexibly allocating expensive human expertise in resource discovery and description in a tiered way according to the information value of the resource to its audience. This hybrid system would combine the best of both worlds: the expertise of the librarian with the reach and assistance of the focused, smart crawling and classification system.

  1. Hybrid Virtual Library/Smart Crawling and Classification Systems

    Major advancements are being made in machine-learning based "smart", focused crawling, and automatic resource description and classification software. These technologies will be of great use in academic Internet finding tools and VLs. While some VLs (e.g., Social Science Information Gateway) do gainfully employ relatively simple crawling and auto-classification technology (e.g., Harvest-NG), there are many ways in which this work could be significantly enhanced. Recent literature reviewing machine learning in these and closely related applications is available (Paepcke et al., 1998; Chakrabarti et al., 1999b; Glickman and Jones, 1999; McCallum et al., 2000). A number of tools have incorporated these approaches including Google, Cora, and the New Zealand Digital Library, among others. Much of this software is in the public domain.

    Crawlers are starting to do their work in much more efficient, focused, and accurate ways (as opposed to undirected, shotgun crawling using simple filters) employing, among other techniques, reinforcement learning (McCallum et al., 1999a and 1999b; Rennie and McCallum, 1999). Smart crawling systems, with increasing accuracy, can find and harvest higher quality resources than previously, by being focused on and working within what are essentially self-defining Internet communities (think of citation analysis via Science Citation Index), such as many academic disciplines (Chakrabarti et al., 1999b; McCallum et al., 2000, 1999a, 1999b; Brin and Page, 1998; Gibson et al., 1998; Kleinberg, 1998; Chang et al., in press). Classifiers can now be "trained", with increasing effectiveness (depending on training data, information/document types crawled, and classification scheme complexity), to automatically recognize and categorize potential high quality documents and sites using increasingly smaller and less well labeled training data, including bibliographic records, than ever before (Craven, 1999; Jones et al., 1999; Riloff and Lorenzen, 1999; Nevill-Manning et al., 1999; Dumais et al., 1998; Hofmann, 1999 and 1998). Techniques featuring multiple document and site content evaluation schemes, each correcting for the deficiencies of the others, are promising (Freitag, 1998; Cleary and Trigg, 1998). We've been working with approaches that combine document (and site) similarity analysis (McCallum et al., 2000), linkage analysis (see Kleinberg, 1998; Chakrabarti et al., 1999a) and a wide array of statistical truing techniques. These approaches, as judged by the finding tools mentioned in the preceding paragraph, have started to produce increasingly effective, accurate, and high quality crawling and automatic classification systems. They represent a major watershed in Internet finding tool advancement and one of which the academic VL community needs to avail itself.
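To make the contrast between focused and undirected crawling concrete, here is a minimal best-first crawl over a toy, in-memory "Web". It is a sketch under stated assumptions rather than any of the cited systems: it uses plain bag-of-words cosine similarity to a topic description in place of a trained classifier, it does no reinforcement learning or link analysis, and the `Web` structure, the `threshold`, and the function names are all hypothetical.

```python
import heapq
import math
from collections import Counter
from typing import Dict, List, Tuple

# Toy "Web": url -> (page text, outgoing links). A real crawler would fetch over HTTP.
Web = Dict[str, Tuple[str, List[str]]]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def focused_crawl(web: Web, seeds: List[str], topic: str, budget: int,
                  threshold: float = 0.2) -> List[str]:
    """Visit up to `budget` pages, always expanding the most promising link first.

    A link's priority is the topical similarity of the page it was found on,
    so the crawl tends to stay among on-topic, interlinked pages.
    """
    topic_vec = Counter(topic.lower().split())
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, links = web[url]
        score = cosine(topic_vec, Counter(text.lower().split()))
        if score >= threshold:
            harvested.append(url)  # candidate record for the collection
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return harvested
```

With a limited budget, links found on off-topic pages sink to the bottom of the queue and are rarely fetched, which is the basic economy that focused crawling offers over shotgun crawling. A production system would replace the similarity score with a trained classifier and add link-based evidence of the kind the cited work describes.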

    Still there is great room for improvement in machine learning applications in these areas. We view these techniques as providing assistance which will boost time and labor savings and will amplify, not replace, expert effort in content building. The subject and resource description expertise of librarians remains at the core of our approach though it will be greatly augmented through "smart", machine-assisted collection development (crawling), resource description (classification and indexing), and machine-assisted URL and collection maintenance (content change checking/fixing) software. New roles for software that assists VL experts are being developed as, conversely, new roles for experts working with this class of software are being developed (see below).
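As one small illustration of machine-assisted collection maintenance, the sketch below flags resources whose content has changed or can no longer be retrieved since the last check. It is an assumption-laden toy, not INFOMINE's software: the function names and status labels are hypothetical, and retrieval itself is left to the caller so that the example stays self-contained.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(content: bytes) -> str:
    """Return a stable digest of a resource's content."""
    return hashlib.sha256(content).hexdigest()


def check_resource(url: str, fetched: Optional[bytes], known: Dict[str, str]) -> str:
    """Classify a resource as 'new', 'unchanged', 'changed', or 'unreachable'.

    `fetched` is the content retrieved for `url`, or None if retrieval failed.
    `known` maps URLs to the fingerprint recorded at the last successful check
    and is updated in place.
    """
    if fetched is None:
        return "unreachable"
    digest = fingerprint(fetched)
    previous = known.get(url)
    known[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "changed"
```

An exact digest reports any byte-level difference as a change, so a real tool would probably normalize pages first (for example, by stripping dates or counters) and route only "changed" and "unreachable" records to an expert for review.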

    What a user will see and benefit from in such hybrid systems are multiple types of records, more in-depth indexing, new approaches to search/browse access, and new approaches to displaying records according to different relevance-to-query ratings. More importantly, the records contained should be, on the whole, of a much higher quality and much more relevant to researchers and students than those typically seen in the large Web engines. And, crucially, there will be many more of them than typically found in standard VLs. In our case, we will continue to feature tens of thousands of expert created records while augmenting these with millions of crawling system created records. Both expert and crawling system created records will feature fielded indexing as well as near full-text indexing and retrieval. Expert records will be of an increasingly important, general, and/or reference-oriented nature while the crawling system records will provide the critical mass needed for the information specificity that enables the detailed searching often absent in VLs.

  2. Open Source Software Development

    Open source software (licensed under the GNU GPL) is being created that will be of use to the VL community. Our hope is that we can, in a more specialized area, contribute in a small way to virtual library system development in the same way that the creators of Linux and Apache have contributed to operating system and Web server software advancement. If co-development interests you, contact us.

  3. Resource Description in a Hybrid System: Three Tiers of Record Quality and Expert Labor Input

    Much of our concern with the human side of the hybrid system coin is in developing multiple tiers of labor expenditure in indexing and description that match the quality of the resource. How much human metadata creation effort, beyond basic machine crawling and automated classification and near full-text indexing, needs to be applied? The answer is dependent on your scaling and sustainability strategies and questions of adequate coverage vs. adequate resource description as well as, ultimately, labor costs. The shorter and quicker the description, the more labor that can be expended on increasing coverage, and vice-versa. The time issue is very much involved here. Our goals have been indexing, not cataloging, and adequate, objective description, not reviewing. Within our approach, though, we have always recognized that the higher the scholarly or educational value of a resource, the greater the amount of expert time which can be invested in its description.

    We are developing a three tiered approach to resource description, building on new system capabilities, in order to achieve better efficiencies in labor allocation. These tiers are:

    • Automatically indexed, minimal records for medium to high value resources;
    • This plus expert review and augmentation for high value resources;
    • These plus exporting records for very high value resources to allied, traditional library cataloging operations.

    Using our streamlined record as a foundation, MARC cataloging could be done and the more fully embellished record then moved into traditional library catalogs and/or even back to INFOMINE. All tiers would benefit from near full-text indexing (the exception would be the library catalogs importing records from us).

    Scenarios for the evolution of a resource's description in our hybrid system might include:

    • a.) A resource is discovered by our crawler/classifier as having value and is either:
      • a1.) Tier 1: Automatically classified (filling fields with data from title words and phrases, URL, author supplied metadata as available, author-emphasized text, significant keywords and phrases from full-text, general subjects from our subject tree) or,
      • a2.) Tier 2: In the case that the crawler/classifier determines very high linkage and similarity ratings (in comparison with known high quality resources), an expert would be notified to immediately review and enrich the initial auto-indexing (the auto-supplied information would be clarified and checked for accuracy while complex descriptive information such as an annotation or Library of Congress Subject Headings would be supplied). In addition, over time, as usage software detected patterns of high usage of a record for a resource not yet expert-reviewed, it would be flagged for review;
      • a3.) Tier 3: If the resource, now with expert created metadata, experiences even higher usage, it would again be flagged for expert review and, if determined to be of extremely high value and significance to warrant fuller description, could be copied to participating or allied cataloging departments (we wouldn't do it) for cataloging and from there moved into library catalogs and back to INFOMINE.
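The scenario above amounts to a small set of routing rules, which can be sketched as follows. The article gives no numeric cut-offs, so the thresholds, field names, and action labels here are hypothetical placeholders; the sketch only shows how crawler ratings and usage could together decide when an expert is brought in.

```python
from dataclasses import dataclass

# Hypothetical cut-offs; the article gives no numeric thresholds.
HIGH_RATING = 0.8
HIGH_USAGE = 500
VERY_HIGH_USAGE = 5000


@dataclass
class ResourceRecord:
    url: str
    linkage: float      # link-analysis rating from the crawler, scaled 0..1
    similarity: float   # similarity to known high quality resources, scaled 0..1
    monthly_uses: int = 0
    expert_reviewed: bool = False


def next_action(record: ResourceRecord) -> str:
    """Return the next step for a record under the three-tier scenario."""
    if not record.expert_reviewed:
        top_rated = record.linkage >= HIGH_RATING and record.similarity >= HIGH_RATING
        if top_rated or record.monthly_uses >= HIGH_USAGE:
            return "tier2: notify expert to review and enrich"
        return "tier1: keep automatic record"
    if record.monthly_uses >= VERY_HIGH_USAGE:
        return "tier3: flag for expert decision on export to cataloging"
    return "tier2: keep expert-reviewed record"
```

Note that the Tier 3 branch only flags the record: in the scenario described, the decision to export a record for full cataloging remains with the expert, not the software.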

    This implies that INFOMINE could import/export and generally "trade" records with traditional catalogs doing Web resource description. Reaching Tier 3 effort would not be common but would apply in cases where the resource, for example, was a mainstream A&I database, e-journal, or digital or virtual library. Note that the process would be flexible and wouldn't necessarily start with step a.). Experts would continue to introduce resources into the system unaided by the crawler. Some resources would be so highly rated by the system that they might be moved to step a3.) immediately.

    Over the next two to four years the ratio of different types of records might be, as a guesstimate to illustrate scale among the three tiers, something like several million records at the Tier 1 level of crawler/auto-classified records (minimal expert labor, e.g., truing crawls); between 100,000 and 300,000 records that had rated Tier 2 effort (expert labor at the level of streamlined indexing); and under 150,000 that had justified labor at the level of Tier 3 (expert labor at the level of MARC cataloging).
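A rough calculation shows why the tiers matter for labor planning. The sketch below assumes that a Tier 2 record costs about as much expert time as the 25 minutes per record quoted earlier for INFOMINE's streamlined indexing; that equivalence is our assumption for illustration, not a figure stated for Tier 2 itself, and the function name is hypothetical.

```python
MINUTES_PER_STREAMLINED_RECORD = 25  # INFOMINE figure quoted earlier in the article


def expert_hours(records: int, minutes_per_record: float) -> float:
    """Total expert hours needed to describe `records` resources."""
    return records * minutes_per_record / 60


# Tier 2 range from the guesstimate above: 100,000 to 300,000 records.
low = expert_hours(100_000, MINUTES_PER_STREAMLINED_RECORD)
high = expert_hours(300_000, MINUTES_PER_STREAMLINED_RECORD)
print(f"Tier 2 estimate: {low:,.0f} to {high:,.0f} expert hours")
# prints "Tier 2 estimate: 41,667 to 125,000 expert hours"
```

Applying the same per-record effort to the several million Tier 1 records would multiply these totals many times over, which is why expert description at that scale is not practical and automatic classification carries the bulk of the collection.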

    Our choices, as part of our scaling strategy, naturally emphasize auto-classification as well as streamlined indexing because expert time and labor are expensive. The Web, as a publishing medium for scholarly and educational resources, remains a spectacularly growing mass medium; the numbers of resources that will be of use are going to be immense. Immense amounts of intelligent and useful information, both that which is brokered and vetted by scholarly organizations and publishers and that which is not, are becoming available. So, given these trends and that many academic libraries are currently not handling print cataloging backlogs well, we need to become comfortable with new cooperative approaches towards creating effective, minimal metadata. Metadata records need to be created in increasingly automated, streamlined, and time efficient ways. Metadata will be created through indexing, not cataloging, approaches, except in rare instances. It will be geared towards providing access and finding (and refinding) information. Metadata will remain simple and capable of being authored by academics and others not trained in cataloging. It will be produced by, and/or will complement, new machine-assisted means of providing metadata.

Direction C: Developing Labor Savings in VLs and New Roles for Expert Content Builders

Though implied directly or indirectly throughout the discussion, labor savings deserves its own category of concern. VLs, like physical libraries, are not inexpensive to build and maintain. While we believe academic VLs will become crucial in modern scholarly research and education and will have an expanding role among the array of services which serious library systems routinely provide, it is vital that production and maintenance costs be lowered.

Our approaches to machine assistance, then, are not just about increasing the reach and quality of our tool. They are about more efficient, labor saving ways of doing this. Machine-assistance which results in greater efficiencies in the labor intensive VL tasks is critical. Also critical is the refocusing of and best use of expertise so as to make the best use of machine assistance.

How do we take the best of machine learning in our area of application and combine it with human expertise and effort in optimum ways? We see system optimization gains via defining new roles (in addition to continuing with many of the more established roles) for experts in a number of areas. These might include:


Virtual libraries are facing big challenges in their efforts to build a foundation for sustainable, ongoing effort. We have chosen to focus on a few by emphasizing the general themes of developing library-based, open, and cooperatively built content and software within a centralized, focused, cooperative organizational effort. In supporting our community of users (librarians, faculty, students, and instructors), in our mutual need to accurately, reliably, and affordably access important scholarly and educational Internet resources, INFOMINE and other library-based VLs are laying claim to vital new areas in scholarly and educational information service and technology provision.

About the Authors

Julie Mason is a member of the INFOMINE Development Team and the Facilitator for the K-12 subject file, at the Library of the University of California, Riverside.

Steve Mitchell is a science librarian and Co-coordinator of INFOMINE, at the Library of the University of California, Riverside.

Margaret Mooney is Head, Government Publications, Rivera Library and Co-coordinator of INFOMINE, at the University of California, Riverside.

Lynne Reasoner is Facilitator of the Government Information subject file, at the Library of the University of California, Riverside.

Carlos Rodriguez is a member of the INFOMINE Development Team, at the Library of the University of California, Riverside.


References

Sergey Brin and Lawrence Page, 1998. "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks, volume 30, numbers 1-7, pp. 107-117, and Proceedings of the Seventh International World Wide Web Conference, April 1998, Brisbane, Australia.

Soumen Chakrabarti, Martin van den Berg, and Byron Dom, 1999a. "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery," Proceedings of the Eighth International World Wide Web Conference, May 1999, Toronto, Canada.

Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, Jon M. Kleinberg, and David Gibson, 1999b. "Hypersearching the Web," Scientific American, volume 280 (June), pp. 54-60.

Huan Chang, David Cohn, and Andrew McCallum, in press. "Creating Customized Authority Lists," submitted to Digital Libraries 2000.

John Cleary and Leonard Trigg, 1998. "Experiences with OB1, an Optimal Bayes Decision Tree Learner."

M. Craven, 1999. "Learning to Extract Relations from Medline," AAAI-99 Workshop.

Susan Dumais, John Platt, David Heckerman, and Mehran Sahami, 1998. "Inductive learning algorithms and representations for text categorization," In: CIKM-98: Proceedings of the Seventh International Conference on Information and Knowledge Management, and at http://robotics.Stanford.EDU/users/sahami/papers-dir/cikm98.pdf

Dayne Freitag, 1998. "Multistrategy Learning for Information Extraction," Proceedings of the 15th International Conference on Machine Learning (ICML-98).

D. Gibson, Jon Kleinberg, and Prabhakar Raghavan, 1998. "Inferring Web Communities from Link Topology," Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, pp. 225-234.

Oren Glickman and Rosie Jones, 1999. "Examining Machine Learning for Adaptable End-to-End Information Extraction Systems," AAAI-99 Workshop on Machine Learning for Information Extraction (July 19), Orlando, Fla.

Thomas Hofmann, 1999. "Probabilistic Latent Semantic Indexing," Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, Calif.

Thomas Hofmann, 1998. "Learning and Representing Topic: A Hierarchical Mixture Model for Word Occurrences in Document," Conference for Automated Learning and Discovery, Workshop on Learning from Text and the Web, Carnegie Mellon University.

Rosie Jones, Andrew McCallum, Kamal Nigam, and Ellen Riloff, 1999. "Bootstrapping for Text Learning Tasks," In: IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications.

Jon M. Kleinberg, 1998. "Authoritative Sources in a Hyperlinked Environment," In: Howard Karloff (editor). Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms.

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, in press. "Automating the Construction of Internet Portals with Machine Learning."

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, 1999a. "A Machine Learning Approach to Building Domain-Specific Search Engines," In: Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99).

Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, 1999b. "Building Domain-Specific Search Engines with Machine Learning Techniques," AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace.

Craig G. Nevill-Manning, Ian H. Witten, and Gordon W. Paynter, 1999. "Lexically-Generated Subject Hierarchies for Browsing Large Collections," International Journal of Digital Libraries, volume 2, number 3, pp. 111-123.

Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom Mitchell, in press. "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning Journal.

Andreas Paepcke, Hector Garcia-Molina, Gerard Rodriguez-Mula, and Junghoo Cho, 1998. "Beyond Document Similarity: Understanding Value-Based Search and Browsing Technologies," Stanford Digital Library Technologies Working Papers, SIDL-WP-1998-0099.

Jason Rennie and Andrew McCallum, 1999. "Using Reinforcement Learning to Spider the Web Efficiently," In: Proceedings of the 16th International Conference on Machine Learning (ICML-99).

E. Riloff and J. Lorenzen, 1999. "Extraction-Based Text Categorization: Generating Domain-specific Role Relationships Automatically," In: Tomek Strzalkowski (editor). Natural Language Information Retrieval. Boston: Kluwer.

Editorial history

Paper received 1 May 2000; accepted 10 May 2000.

Copyright ©2000, First Monday

INFOMINE: Promising Directions in Virtual Library Development by Julie Mason, Steve Mitchell, Margaret Mooney, Lynne Reasoner, and Carlos Rodriguez
First Monday, volume 5, number 6 (June 2000).