Web Mining Technology and Academic Librarianship

John Naisbitt predicted in his book Megatrends (1982) that high technology would bring the need for "high human touch." This prediction is reflected in today's information-intense world. Due to the rapid development of technology, the library profession faces an uncertain future. Library professionals must use insight to identify technology's potential to benefit the academic library's role in the twenty-first century. This paper focuses on the human-machine connection between academic librarians and Web mining technology with respect to electronic reference service. The connection is featured in processes of: (a) identifying problems of electronic reference service; (b) selecting a technology to solve the problem; and, (c) envisioning the potential of the selected technology for librarianship. Scenarios address pertinent questions, including: (a) What role should librarians play to facilitate implementation of a technology? and, (b) What opportunities does technology offer to the profession in return?

Contents

Introduction
Human-Machine Connections
What are Data Mining and Web Mining?
Web Mining Technology as a Tool
Conclusion

Introduction

The uncertainties of the Internet challenge the modern workplace but also promise unexpected opportunities. Converting uncertainties to opportunities requires human efforts to enhance the capabilities of technology to benefit professional endeavor. In the library profession, practices of unstructured and unrestricted information dissemination and retrieval from the World Wide Web (WWW) not only create an information explosion but make uncertain the future of the profession as well. Facing the need to adapt to the new climate of the information world, academic librarians are charged to extend their services from the reference desk to the virtual environment on the WWW. To meet the challenge, new information technologies that can assist in organizing and retrieving information are constantly being investigated by academic libraries. Web mining is one of these new technologies that deserves special attention because of its capability to discover and analyze useful information from the Web.

This paper focuses on human-machine connections between academic librarians and Web mining technology. Scenarios, used as brainstorming tools, explore ways of converting the uncertainty of changing user habits and the Web information explosion to opportunities for advancing academic library services. The improvement of electronic reference services, from the reference desk to cyberspace, will strongly support the ongoing research interests of the institution. A well-designed process for human-machine connection is therefore vitally important. The speculative process is divided into three steps: First, identify electronic reference service problems with the WWW that need to be addressed; second, select a technology that has the potential to address these problems; and third, envision what values can be added to products of the selected technology to transfer intellectual interactions from the physical environment to a virtual one, supporting academic libraries' mission of nurturing those interactions.

Human-Machine Connections

Effective human-machine connections will enable librarians to fully utilize technology's capacity to transfer traditional reference skills from the desk in the library to the electronic realm of cyberspace. The success of this connection depends on three factors: First, library professionals need to identify what problems exist in electronic referencing; second, they must identify one or more technologies to solve these problems; and finally, the librarians need to be able to add value to the products of this technology to further their mission as information managers.

Let's identify some of the problems in conducting reference services on the Web.

Issues

In academia, students, researchers and faculty members are communicating their own information to the Web at the same time they are turning to the Internet to seek information. Academic libraries have to deal with this new habit of data exchange while they are also facing an information explosion on the WWW. In addition, the costs of new information technologies, shrinking resources and fast-changing student demographics all have an impact on the services provided by academic libraries. These libraries are challenged to account for delivery of efficient and direct customer-oriented services to both on- and off-campus clientele throughout their service areas. Meanwhile, the libraries are also pressured by the necessity to provide service 24 hours a day, seven days a week, so that their users can access information any time and anywhere. Along with unlimited access, it should also be a priority to ensure the quality of information, confirming that the information is relevant and from authoritative sources, therefore making it suitable for research purposes.

Problems

The Web is recognized as a system that provides opportunities for publishing and disseminating information globally. However, the WWW's unstructured environment creates an information overload and places obstacles in the way of users' access to relevant information. In addition, the existing search engines can each retrieve no more than about one-third of the "indexable Web" [1] and do not evaluate the content of the information retrieved. Removing these obstacles requires academic librarians' information management skills to facilitate the retrieval process for end users. However, the unstructured information environment of the Web requires librarians to upgrade their resources to include new Web-based information management technologies. The inclusion of these new technologies requires new considerations in operational procedures since the WWW is a global environment. A first consideration must be what role librarians can play to enhance implementation of a technology capable of managing and delivering more relevant information from the Web. Secondly, what values can librarians add to the technology so that it becomes a tool for nurturing quality intellectual interactions in a cyberspace environment?

Selecting a Technology to Solve the Problem

"The World Wide Web (WWW) is one of the many types of intelligence that requires human mastery due to its decentralized information arrays and the immense variety of materials available" [2]. These decentralized information arrays allow anyone to disseminate and retrieve information anywhere and any time. In order to make this connection, librarians need to choose a tool that can effectively organize and retrieve information on the WWW. Web mining technology appears to be an excellent tool to perform these functions.

What are Data Mining and Web Mining?

Data mining is one of the hottest topics in information technology. It automatically and exhaustively explores very large datasets, consequently uncovering otherwise hidden relationships among data [ 3]. This technology has been successfully applied in science, health, marketing and finance [ 4] to aid new discoveries and strengthen markets. In addition, data mining techniques are being applied to discover and organize information from the Web [ 5]. Data mining itself is in the second generation of artificial intelligence; the core concept of data mining is focused on machine learning [ 6]. This machine-learning ability allows modification of search criteria automatically before the next execution, once particular patterns or trends have been discovered in the data searched [ 7]. It is important to understand that data mining is a discovery-oriented data analysis technology and not a single product or a system. It is a highly focused data transformation framework [8].

This transformation process uses a series of analytical techniques, such as clustering, association and classification [9]. These techniques are taken from the fields of mathematics, cybernetics and genetics, and can be used independently or cooperatively. The function is to extract high-quality information to identify facts and draw conclusions based on relationships or patterns among the data [10]. Most important, data mining can ask a processing engine to "show answers to questions we do not know how to ask" [11]. For example, bank customers' data are often kept in different databases and are therefore isolated from each other. Data mining technology can search all the different databases together and provide a better customer view, so that the bank can concentrate more on potentially good customers [12]. The rationale is that when asking for specific relationships, more important relationships might be missed. Asking to find relationships that we do not know exist will yield more meaningful data or business knowledge [13].

The combination of these two areas, data mining and the WWW, is known as Web mining. When data mining is applied to the Web, it can perform several functions including:

Resource discovery
the discovery of locations of unfamiliar files on the network;
Information extraction
the acquisition of useful information from the WWW; and,
Generalization
the discovery of information patterns from said resources [14].

There are two primary dimensions of Web mining: Web content mining and Web usage mining.

Web Content Mining

Web content mining is the "process of information or resource discovery from millions of sources across the World Wide Web" [15]. There are two approaches in Web content mining: the agent-based and database approaches.

The agent-based approach involves artificial intelligence systems that can "act autonomously or semi-autonomously on behalf of a particular user, to discover and organize Web-based information" [16]. Some intelligent Web agents can use a user profile to search for relevant information, then organize and interpret the discovered information (e.g., Harvest). Some use various information retrieval techniques and the characteristics of open hypertext documents to organize and filter retrieved information (e.g., HyPursuit). Another kind of agent is programmed to learn user preferences and use those preferences to discover information sources for those particular users (e.g., XpertRule Miner) [17].

The database approach focuses on "integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources." These organized resources can then be accessed and analyzed [ 18]. These "metadata, or generalizations, are then organized into structured collections (e.g., relational or object-oriented databases) and can be analyzed" [19].

Web Usage Mining

The other dimension of Web mining is Web usage mining. This is the process of discovering user access patterns (or user habits) from the data that are automatically collected in daily access logs. Recently, referrer logs, which collect information about referring pages for each reference and user registration, also have been included. Web usage mining is crucial in establishing user profiles for a better structured Web site. "As the manner in which the Web is used continues to expand, there is a continual need to figure out new kinds of knowledge about user behavior that needs to be mined for" [20].
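
As an illustration of how access and referrer logs can be turned into simple usage counts, the following Python sketch parses a few log lines and tallies requested pages and referring pages. It is only a minimal sketch: the input format (the NCSA Combined Log Format), the sample URLs, and the function names parse_line and usage_summary are assumptions made for this example and are not taken from any of the systems cited in this paper.

```python
import re
from collections import Counter

# Assumed input: NCSA Combined Log Format, one request per line.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "[^"]*"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def usage_summary(lines):
    """Count requested pages and referring pages across successful requests."""
    pages, referrers = Counter(), Counter()
    for line in lines:
        entry = parse_line(line)
        # Skip malformed lines and requests that did not succeed (non-2xx).
        if entry is None or not entry["status"].startswith("2"):
            continue
        pages[entry["path"]] += 1
        # A referrer of "-" means the browser did not report one.
        if entry["referrer"] not in ("", "-"):
            referrers[entry["referrer"]] += 1
    return pages, referrers


# Hypothetical log lines, invented for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Mar/1999:10:00:00 -0800] "GET /gateway/animal-science/ HTTP/1.0" 200 5120 "http://library.example.edu/" "Mozilla/4.0"',
    '10.0.0.2 - - [01/Mar/1999:10:05:00 -0800] "GET /gateway/animal-science/ HTTP/1.0" 200 5120 "-" "Mozilla/4.0"',
    '10.0.0.3 - - [01/Mar/1999:10:07:00 -0800] "GET /gateway/horticulture/ HTTP/1.0" 200 4096 "http://library.example.edu/" "Mozilla/4.0"',
    '10.0.0.4 - - [01/Mar/1999:10:09:00 -0800] "GET /missing HTTP/1.0" 404 512 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    pages, referrers = usage_summary(SAMPLE_LOG)
    print(pages.most_common())
    print(referrers.most_common())
```

Counts of this kind are only a starting point; the mining techniques discussed in this paper look for patterns across many such records rather than simple totals.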

Mining Techniques

The common techniques for Web mining are: clustering/classification, association rules, path analysis, and sequential patterns.

Clustering/classification
A means to develop profiles of items with similar characteristics. This ability enhances the discovery of relationships that are otherwise not obvious. For example, classification of Web access logs allows a company to discover the average age of customers who order a certain product. This information can be valuable when developing advertising strategies [ 21].

Association rules
Rules that govern "databases of transactions where each transaction consists of a set of items." This technique is used to predict the correlation of items "where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items." For example, the technique can predict the percentage of clients accessing a particular URL who will place online orders for a certain product [22].

Path analysis
A technique that involves the generation of some form of graph that "represents relation[s] defined on Web pages." This can be the physical layout of a Web site in which the Web pages are nodes and the hypertext links between these pages are directed edges. Most graphs are involved in determining frequent traversal patterns or large reference sequences from physical layout, such as the most frequently visited paths in a Web site. For example, what paths do users travel before they go to a particular URL? [23].

Sequential patterns
Applied to Web access server transaction logs. The purpose is to discover sequential patterns that indicate user visit patterns over a certain period. For example, "30% of clients who visited /company/products/, had done a search in Yahoo within the past week on keyword W" [ 24].
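
Two of the techniques above can be sketched in a few lines of Python. The example below is a simplified illustration under stated assumptions, not the method of any cited system: it treats each user session as an ordered list of visited pages, computes the support and confidence of a single association rule, and counts which page sequences most often lead to a target page, a very reduced form of path analysis. The session data and the function names rule_confidence and frequent_paths are hypothetical.

```python
from collections import Counter


def rule_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all transactions containing both item sets
    confidence = share of transactions with the antecedent that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    if not transactions or not with_antecedent:
        return 0.0, 0.0
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent)
    return support, confidence


def frequent_paths(sessions, target, length=2):
    """Count the sequences of `length` pages visited immediately before `target`."""
    counts = Counter()
    for pages in sessions:
        for i, page in enumerate(pages):
            if page == target and i >= length:
                counts[tuple(pages[i - length:i])] += 1
    return counts


# Hypothetical sessions: each list is the ordered pages one client visited.
SESSIONS = [
    ["/", "/products/", "/order/"],
    ["/", "/search/", "/products/"],
    ["/", "/products/", "/order/"],
    ["/", "/about/"],
]

if __name__ == "__main__":
    support, confidence = rule_confidence(SESSIONS, {"/products/"}, {"/order/"})
    print(f"support={support:.2f} confidence={confidence:.2f}")
    print(frequent_paths(SESSIONS, "/order/").most_common(1))
```

In this toy data set, the rule "clients who visit /products/ also visit /order/" holds for two of the three sessions that include /products/. Production tools would search for all rules above chosen support and confidence thresholds instead of testing one rule at a time.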

Web Mining Technology as a Tool

Web mining can be a promising tool to address the shortcomings of search engines, which include incomplete indexing, retrieval of irrelevant information and unverified reliability of retrieved information. It is essential to have a system that helps the user find relevant and reliable information easily and quickly on the Web. Web mining not only discovers information from mounds of data on the WWW but also monitors and predicts user visit habits. This gives designers more reliable information in structuring and designing a Web site. Web mining technology can help librarians design Web sites with paths that can be traveled easily by end users, saving time and effort. Scenarios in this paper are used to envision what this technology could do for library reference services. These scenarios are designed to brainstorm new operational considerations if Web mining technology is chosen as a tool to organize information from the Internet.

Scenario A

A group of librarians applies Web mining technology to discover information from the WWW that is pertinent to different research interests currently sustained in different disciplines in their academic institution. This technology discovers information automatically and clusters it in groups, according to subject area (e.g., subject-specific), allowing easy access.

The librarians ensure that this information is from authoritative sources and of research quality by building sets of attributes (or profiles) for different subject areas. These attributes facilitate the process of clustering and present information in logical groups for systematic retrieval.
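
One very simple way to picture such subject profiles is as sets of keywords against which each discovered document is scored. The Python sketch below illustrates only that idea. The profiles, keywords, threshold, and function names are invented for the example, and keyword overlap does not capture the judgment of authority and research quality that the librarians in this scenario would supply.

```python
import re

# Hypothetical subject profiles: each subject area maps to a set of keywords.
SUBJECT_PROFILES = {
    "animal science": {"wolf", "livestock", "behavior", "cattle"},
    "horticulture": {"orchard", "greenhouse", "cultivar", "pruning"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def assign_subject(text, profiles, min_overlap=2):
    """Return the subject whose profile shares the most words with the text.

    Returns None when no profile shares at least `min_overlap` words,
    so that weakly matching documents are left for human review.
    """
    words = tokenize(text)
    best_subject, best_score = None, 0
    for subject, keywords in profiles.items():
        score = len(words & keywords)
        if score > best_score:
            best_subject, best_score = subject, score
    return best_subject if best_score >= min_overlap else None


if __name__ == "__main__":
    for title in (
        "Wolf pack behavior near livestock ranches",
        "Pruning schedules for a greenhouse cultivar",
        "Campus parking update",
    ):
        print(title, "->", assign_subject(title, SUBJECT_PROFILES))
```

The min_overlap threshold reflects a design choice in this sketch: documents that match no profile well are returned as unassigned instead of being forced into the nearest group.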

As a result, the librarians spend very little time maintaining this database because Web mining technology can discover and organize information automatically without human intervention. The end users are satisfied because they can find anything they want in a very short time without having to screen thousands of documents. Most importantly, they can access information any time and from anywhere in the world. It is actually a virtual operation in which librarians transfer their reference skills from the desk to the Internet.

Reality

Ideal situations like the one in Scenario A are rare. If the scenario were real, no human-machine connection would be needed, because the machine would be doing all the work and making all the necessary decisions (e.g., what sources to use). Unfortunately, no matter how sophisticated the system is, the unexpected always presents itself. Complications can be manifold. The second scenario addresses two considerations that librarians need to weigh in order to deal with unforeseen problems in the future.

These two considerations are:

What role should librarians play to facilitate smoother operations so that library patrons can access relevant and reliable information quickly?
What possibilities do librarians see for this technology to enhance their professional endeavors?

Scenario B

An academic library decides to incorporate Web mining technology into its reference services (e.g., mining data from the WWW and using the data to build subject-specific databases) to help patrons access library reference services any time and anywhere. An implementation committee, made up of reference and cataloging librarians as well as members of the computer staff, is charged with completing the task.

This committee realizes that all efforts have to be concentrated on facilitating machine intelligence, thereby achieving the full potential of the technology, especially in processing reference transactions. The experience of the librarians and their commitment to public service are essential in applying Web mining technology to enhance reference services. To fulfill this commitment, however, librarians need to be willing to participate from the start in the implementation process. This does not imply that librarians have to become computer gurus, but that effective communication and coordination throughout the process will result in full participation of all concerned, including contributions from members according to individual expertise.

The committee decides to tackle this task by considering two points:

Point A:
What should librarians do to facilitate smooth operations so that library patrons can access relevant and reliable information quickly?

To tackle this consideration, the committee sets up brainstorming sessions aimed at identifying important steps necessary to set up an operation in which information retrieved is relevant and reliable. As a result, six steps are drafted. They are:

Step 1: Establish guidelines specifying the scope for different subject databases with the help of reference librarians in their own subject areas.

These guidelines would resemble the collection policy for each subject area, emphasizing Web sites that are in line with the research interests of the institution.

Step 2: Investigate end-users' characteristics such as demographics, geographic locations, research objectives and search habits.

Step 3: Investigate the network capability, possibility for scaling, and compatibility of hardware platforms between the institution and its end users.

Step 4: Investigate data mining software on the market, including characteristics, capability, price, continuing improvement (new versions) and practicality. Most important, the committee should compile an inventory of software that exists in the institution, exploring the effectiveness and the integration possibilities of targeted data mining software with existing software, so that ultimately all would function together beneficially.

Step 5: Set up a training plan for patrons as well as library staff.

Step 6: Set up a device for user feedback and communication between content producers and end users.

Point B:
What possibilities do librarians see for this technology to enhance their professional endeavors?

Many ideas to add value to Internet resources are well documented. Suggestions - such as providing "packaged" answers, search assistance, and critical evaluation of relevant resources [25] - are common. Besides these ideas, the committee wants to add one more benefit to this database, which is to support and nurture an intellectual cyber-exchange environment.

For years, groups of prestigious scholars have shared preprints and findings in conferences to advance their research interests. Many professionals have benefitted from that sharing. Traditionally, however, libraries do not participate in such circles except for subscribing to publications generated by professional societies and conference proceedings.

As technology grows and changes, the circle of invisible colleges broadens. Currently, many learning circles [26] are being generated in cyberspace and are extending their membership beyond established scholarly groups.

The committee sees that these intellectual exchange activities are not that different from those taking place within the physical environment of the library. It sees the opportunity for starting new invisible colleges in cyberspace. Members think it may be a good time for the library to participate in an invisible college and to become more involved in generating information. Therefore, the committee agrees to provide a mechanism to facilitate communication between content producers (researchers) and end users (who may be researchers themselves). The idea is to let scholars (or interested parties) establish contacts of their own, so that people with similar interests can share experiences, ideas, failures and successes. To facilitate communication, Web-based conference software (e.g., Forum) is integrated into each subject gateway. Content producers and users can communicate directly, either in a group setting or individually (ensuring privacy). Information shared can be posted to the database for public consumption or be protected for access by members only.

Formal or informal publications produced can be included in the database, which will strengthen the gateway itself by providing sought-after information. The committee also anticipates that keeping statistics on subject keywords for each subject gateway may prompt a review of research needs. For example, statistics for the animal science gateway may show that many requests concern wolf communication, suggesting that demand for information on this topic exists. Researchers can then compare this information to their own research interests. Some may find this a welcome new research direction and start to look at this area.
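
The keyword statistics described above could be kept with very little machinery. The following Python sketch is one hypothetical way to do so: it assumes each request has already been recorded as a (gateway, keyword) pair, and the sample requests and function names are invented for illustration.

```python
from collections import Counter, defaultdict


def keyword_counts(requests):
    """Tally keyword requests per subject gateway.

    `requests` is an iterable of (gateway, keyword) pairs. Keywords are
    normalized to lowercase so that differently capitalized requests
    for the same topic are counted together.
    """
    counts = defaultdict(Counter)
    for gateway, keyword in requests:
        counts[gateway][keyword.strip().lower()] += 1
    return counts


def top_keywords(counts, gateway, n=3):
    """Return the n most requested keywords for one gateway."""
    return counts.get(gateway, Counter()).most_common(n)


# Hypothetical request records, invented for illustration only.
REQUESTS = [
    ("animal science", "wolf communication"),
    ("animal science", "Wolf Communication"),
    ("animal science", "cattle nutrition"),
    ("horticulture", "pruning"),
    ("animal science", "wolf communication"),
]

if __name__ == "__main__":
    counts = keyword_counts(REQUESTS)
    print(top_keywords(counts, "animal science"))
```

A report built from such counts would show, in this invented data, that wolf communication is the most requested topic in the animal science gateway, which is the kind of signal the committee hopes researchers will notice.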

Result:
The committee uses Web mining technology to build a database that facilitates dynamic research activities. More important, the library is finally a part of the invisible college, allowing it to take on a new role in the changing information age on campus.

Conclusion

This paper describes the use of Web mining as an example of the importance of human-machine connections. These connections will help transfer librarians' traditional information skills from the reference desk to cyberspace - a critical component of successful information retrieval. It has been asserted that "many information-gathering tasks are better handled by finding a referral to a human expert rather than by simply interacting with online information sources" [27]. If this statement is true, how will academic libraries function in this new digital information age?

Traditionally, the library has been a physical facility for intellectual activities. The profession has long been the advocate for intellectual freedom; in addition, the library also has provided a space for both academic and social interactions. Young students may unknowingly study with or encounter their future mates, colleagues or critics in the library. As technology changes, some of these meetings will be taking place in cyberspace. It is interesting to envision how libraries will facilitate these kinds of cyber-meetings and intellectual exchanges in the future. In Scenario B, digital invisible colleges would evolve in libraries with important contributions by librarians. When databases are established by Web mining technology, all of the research interests of an institution will easily be accessible on the Web to all sorts of audiences. Anyone with the same research interests will be able to contact the content producer for each Web component for further details or to ask additional questions. Such contact is easily available with conference software that could be part of certain databases.

Libraries are instrumental in the formation of invisible colleges generated in part by subject-specific Web mining databases. However, libraries do not participate in the creation of these invisible colleges because librarians generally do not actually create new information. It is anticipated that these invisible colleges will take on a life of their own and contact between interested parties may grow or shrink. It is not unlike watching cellular activity under the microscope. Live cells aggregate and form clusters for a time, then some of them may break off and form new clusters with other cells.

Many forms of invisible colleges are emerging around the world due to the advancement of information technology. For example, one invisible college at Portland State University, in Oregon, is designed to explore teaching and learning through community service by forming learning circles. These learning circles allow participants to learn from other practitioners and innovators in the field. The membership of this invisible college is no longer confined to a prestigious group of academicians. It has been opened to include members outside academia as well. The goal of such inclusiveness is to connect community service with academic study so that the connection "heightens the relevance of academic subjects by directly linking classroom learning to community experience" [28]. Galegher, Kraut and Egido, in their book Intellectual Teamwork: Social and Technological Foundations of Cooperative Work, demonstrate that an informal network of collaborators, colleagues, and friends is one of the most effective channels for information dissemination [29].

Clearly no one technology is at the center of the debate. Instead, it is crucial to identify problems and the opportunities offered by technology. Human-machine connections are vital in helping library professionals empower patrons as the new century approaches. These connections also will be an advantage in facing the challenges of fast-paced development in information technology. John Naisbitt predicted that "whenever new technology is introduced into society, there must be a counterbalancing human response" [30]. The human-machine connections proposed in this paper echo Naisbitt's idea that as society becomes more oriented toward high technology, more of the "human touch" will be needed. Human touch in this paper is represented by academic librarians using technology to support their fundamental mission of nurturing intellectual interactions, while transferring those interactions from a physical environment to a virtual one.

About the Author
May Y. Chau is Agricultural Reference Librarian and Assistant Professor at Oregon State University. Ms. Chau received a BA in Fine Arts, BS and MS degrees in Horticulture at Brigham Young University (Provo, Utah) and a MSLS from Wayne State University (Detroit).
E-mail: May.Chau@orst.edu

Notes

1. S. Lawrence and C. Lee Giles, 1998. "Searching the World Wide Web," Science, volume 280, number 3 (April), pp. 98-100.
2. May Y. Chau, 1997. "Finding Order in a Chaotic World: A Model for Organized Research Using the World Wide Web," Internet Reference Services Quarterly, volume 2, numbers 2/3, pp. 37-53.
3. Bruce Moxon, 1996. "Defining Data Mining," DBMS, volume 9, number 9 (August), pp. S11-13.
4. Helge Grenager Solheim, 1996. "Specific Data Mining Applications," at http://www.pvv.unit.no/~hgs/project/report/node80.html
5. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, 1997. "Web Mining: Information and Pattern Discovery on the World Wide Web," at http://www-users.cs.umn.edu/~mobasher/webminer/survey/survey.html
6. Philip Chapnick, 1996. "Data Mining at Redux," Database Programming & Design, volume 9, number 9 (September), p. S5.
7. Richard Yevich, 1997. "Data Mining," In: Joyce Bischoff, Ted Alexander and Sid Adelman (editors). Data Warehouse: Practical Advice from the Experts. Upper Saddle River, N.J.: Prentice Hall.
8. Mark M Davydov, 1997. "Exploiting Data Mining at the Application Level," Wall Street & Technology, volume 15, number 3 (March), p. 59.
9. Moxon, op.cit.
10. Davydov, op.cit.
11. Yevich, op.cit.
12. Stuart J. Johnston, 1996. "How To Get a Better Return on Data," Information Week, number 596 (September), pp. 82-84.
13. Yevich, op.cit.
14. Oren Etzioni, 1996. "The World Wide Web: Quagmire or Gold Mine," Communications of the ACM, volume 39, number 11 (November), pp. 65-68.
15. Cooley, Mobasher and Srivastava, 1997. op.cit. at http://www-users.cs.umn.edu/~mobasher/webminer/survey/survey.html
16. Bamshad Mobasher, 1997. "Agent-Based Approach," at http://maya.cs.depaul.edu/~mobasher/webminer/survey/node4.html#SECTION00021100000000000000
17. Ibid.
18. Bamshad Mobasher, 1997. "Database Approach," at http://maya.cs.depaul.edu/~mobasher/webminer/survey/node5.html#SECTION00021200000000000000
19. Ibid.
20. Bamshad Mobasher, 1997. "Web Usage Mining," at http://maya.cs.depaul.edu/~mobasher/webminer/survey/node6.html
21. Bamshad Mobasher, 1997. "Clustering and Classification," at http://maya.cs.depaul.edu/~mobasher/webminer/survey/node17.html#SECTION00032400000000000000
22. Bamshad Mobasher, 1997. "Association Rules," at http://maya.cs.depaul.edu/~mobasher/webminer/survey/node15.html#SECTION0003220000000000000
23. Bamshad Mobasher, 1997. "Path Analysis," at http://maya.cs.depaul.edu/~mobasher/webminer/survey/node14.html#SECTION00032100000000000000
24. Bamshad Mobasher, 1997. "Sequential Patterns," at http://maya.cs.depaul.edu/~mobasher/webminer/survey/node16.html#SECTION00032300000000000000
25. Alastair Smith, 1999. "Beyond the Basics: Role of librarian in the Internet environment," at http://www.vuw.ac.nz/~agsmith/beyond/2role.htm
26. Invisible College at Portland State University: Invisible College Bylaws, at http://www.invcol.pdx.edu/bylaws.htm
27. Henry Kautz and Bart Selman, 1997. "The Hidden Web," AI Magazine, volume 18, number 2 (Summer), p. 27.
28. "The Questions People Ask About the Invisible College," at http://www.invcol.pdx.edu/questi~1.htm
29. J. Galegher, R. Kraut and C. Egido (editors), 1990. Intellectual Teamwork: Social and Technological Foundations of Cooperative Work. Hillsdale, N.J.: Erlbaum Associates.
30. John Naisbitt, 1982. Megatrends: Ten New Directions Transforming Our Lives. New York: Warner Books.

Copyright © 1999, First Monday