First Monday

Data Mining Solutions and the Establishment of a Data Warehouse: Corporate Nirvana for the 21st Century? by Christine Maxwell and Howard Gutowitz

This paper addresses the need to broaden the traditional meaning of data mining and data warehousing to encompass information mining and knowledge retrieval. The authors contend that if data mining and data warehousing are to become 'corporate nirvana' in the next century, they must be built as complex adaptive systems with the business end user firmly in mind. Companies will need to link the concept of data mining to equally sophisticated information retrieval tools that combine machine and human intervention in more intelligent ways than today's information retrieval tools offer. A strong comparison is also made between the growth of the Geographic Information Systems (GIS) market and the growth that can be expected for the data mining and warehousing industries once the world of business co-opts these terms for its own.

Contents

The Expansion of the Appeal of Data Mining and Data Warehousing
GIS as an Example
Information Overload
The Risks in Data Warehousing
Impact of Cyberspace on Data Warehousing and Data Mining
The Distribution of Documents in Cyberspace
The Shortcomings of Information Technology
Mapping Cyberspace
The Importance of Agents
Conclusion

Introduction

Data warehousing is becoming "de rigueur" in every major company. And big business is betting more and more on a series of data mining tools to help predict future trends based on an analysis of historical behavior. In constructing data warehouses today, companies are pinning their hopes on being able to extract, from the mined content stored in those warehouses, likely new customers for products or services by integrating existing customer account information with demographic and lifestyle data.

According to major research companies like Forrester Research, the data mining industry is predicted to grow from its present size of 40 million dollars to over 800 million dollars by the turn of the century.

The Expansion of the Appeal of Data Mining and Data Warehousing

In the book "Advances in Knowledge Discovery and Data Mining," published by the MIT Press, Dr. Usama M. Fayyad and his fellow editors stated that:

" ... in combining the two terms "data mining" and "data warehousing", they are attempting to build bridges between the statistical, database and machine learning communities and appeal to a wider audience of information systems developers."

We think this approach is very helpful - in fact, we believe that the term "data mining" is actually on the cusp of appearing on the radar screens, for the first time, of millions of companies. We predict that the path of familiarity with the terms "data mining" and "data warehousing" will be very similar to what has happened to the term "GIS" in the last five years.

GIS as an Example

For over 25 years, Geographic Information Systems (better known as GIS) were the purview of companies involved with primary resources that possessed large mainframes on which to store and retrieve massive amounts of relational data. ESRI, based in Redlands, California, was one of the original leading companies and very much remains one today. There are now millions of companies using 'business GIS' from their desktops and laptops to display and extract all kinds of information relating to such common problems as where to locate a new shop or service store so that it has the largest "reservoir" of people in the surrounding area.

In other words, GIS went from a very specialized status, known only to a small group of professionals working in such fields as the oil and gas industry, forestry, and mining, to being an enabling application that could also be used on the desktop. It proved to be a strategic business advantage to small and large companies alike. A burst of new applications came on the market, and names like MapInfo and Strategic Mapping captured chunks of the global business GIS market. Interestingly enough, it was the GIS capability of linking demographics with psychographics that opened up the way for whole new suites of business applications to be built.

It is our prognosis that the terms data mining and data warehousing are going to follow the same trajectory. The terms themselves are on the verge of being "co-opted" by Business with a capital B. Once that happens, the growth of this industry, like that of GIS, will be exponential. It also means, however, that the terms risk expanding in meaning. Even today, many in business are confused over the use of the term "data." Often that term is used to cover textual and other forms of information and not just transactional data. In fact, the terms "data" and "information" are often used interchangeably by business people as they go about their day-to-day data/information-gathering tasks.

We believe that in the very near future, the present suites of data mining tools will start coming with a complement of new information retrieval tools. These tools will have a built-in ability to help even an untrained user extract, from textual and other media formats, the kind of gold nuggets of information presently being pulled from a company's transactional data by suites of data mining tools. It is important to bear in mind that a large chunk of any company's in-house information is to be found in textual form. That information, if retrieved from a data warehouse using advanced information retrieval tools, can yield value-added information, leading to better business decisions.

Information Overload

Advances in data technology have overwhelmed corporations with information, driving the urgent need to develop new tools that can help transform data into business advantage. The advent of the Internet and the World Wide Web has, in a matter of five years, made more information accessible to more individuals than at any other time in our history.

Companies know that their in-house databases contain untapped knowledge about themselves. This information, if properly stored, can then be retrieved and analyzed. The outcomes can provide them with a competitive edge in a world of saturated markets. The single biggest problem is that so much of this information is distributed across networks, divisions, and often continents!

What to do?

In our opinion, corporate nirvana in the use of data mining tools and data warehousing will only be achieved when companies link the concept of data mining to equally sophisticated information retrieval tools. These tools will combine machine and human intervention in more intelligent ways than today's information retrieval tools offer. Corporations will need to run two complementary data/information retrieval processes. One process will literally mine data and allow software to detect hidden patterns. The other will query information by posing specific questions and securing targeted answers.
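The two complementary processes described above can be illustrated with a minimal sketch. It assumes a tiny, made-up set of transactions; the names `mine_frequent_pairs` and `query` are hypothetical, and counting co-occurring items is only one simple stand-in for the many pattern-detection techniques a real data mining suite might use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data: each set is one customer's purchases.
transactions = [
    {"printer", "paper", "toner"},
    {"paper", "toner"},
    {"printer", "toner"},
    {"paper", "stapler"},
]

def mine_frequent_pairs(rows, min_count=2):
    """Process 1 (mining): surface item pairs that co-occur often,
    without anyone having asked about them in advance."""
    counts = Counter()
    for row in rows:
        for pair in combinations(sorted(row), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

def query(rows, item):
    """Process 2 (querying): pose one specific question and
    return only the rows that answer it."""
    return [row for row in rows if item in row]

frequent_pairs = mine_frequent_pairs(transactions)
stapler_baskets = query(transactions, "stapler")
```

The first function returns patterns the user did not know to look for; the second returns exactly what was asked. The argument of this section is that a corporation needs both.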

Machine learning in its present state is nowhere near the capacities of a human. After 30 years of artificial intelligence research, we can no longer claim with a straight face that machines can replace humans! Instead, we are looking for ways by which the machine can aid in a specially integrated fashion, rather than replacing human intervention in information retrieval. We believe that the remedy is still for human and machine to work together in order to extract the most useful knowledge from a database.

The Risks in Data Warehousing

We recognize that there is a major difference between data and information. In data warehousing, there is a tremendous risk that what will be delivered from these warehouses will be vast quantities of data rather than quality information. This is where the importance of metadata comes into play. There is no question that improving the quality of metadata automatically improves the quality of the information retrieved, and it also reduces the amount of data retrieved. Successful data warehouses are invariably individually small, heavily used by the target business unit, constantly changing in content in response to changing market needs, and controlled by the business unit of an enterprise.

Providing all in-house company users with a single, uniform view of information throughout a corporation is key to a company's efficiency and its ability to deliver higher quality customer service. It is only through the complementary use of data mining tools and data warehousing, alongside cutting-edge information retrieval tools and processes, that corporate nirvana in the area of knowledge extraction can be achieved.

Impact of Cyberspace on Data Warehousing and Data Mining

We see today a unification of decision-support technologies into a universal knowledge system. The advent of the Internet and the World Wide Web has not only allowed for the cheap publication of terabits of information around the world, but has also created a paradigm shift in how we view information, where it is, and how any piece of information relates to any other. This is good news for corporations if they can start being really creative about how to organize both data and information within their organizations.

The Distribution of Documents in Cyberspace

Ever since Gibson, we have been familiar with applying a spatial metaphor to the Internet. Documents on the Web are distributed in a "cyberspace." By extension of the metaphor, one "navigates" across the Web, visits sites, and so on.

Part of the excitement of the Web flows directly from this paradigm shift in how we view what information is, where it is, and how one piece of information is related to another. In the near future we will see the reality of the Web adhere even more strongly to this spatial metaphor. We will be able to navigate cyberspace with the aid of good maps, maps which tell us where we are and what is nearby. Cyberspace will acquire the textures of real space, with landmarks both personal and official. We will be able to mark our trails like a cyber-Hansel and cyber-Gretel. We will be able to measure distances in any number of useful ways, effectively warping that space to our specification. All this has very interesting implications for progress in achieving greater accuracy in information retrieval.

The Shortcomings of Information Technology

Information technology as offered to corporations today suffers from having yet to catch up with cyberspace, even in its current incoherent state. Corporate data is treated as a supermarket from which items are to be retrieved, or a pit from which data should be mined. "Search engines" on the Web are similarly crippled at a conceptual level. Users try to funnel their desires through keywords shot into the dark.

There are many well-known problems with keywords. For instance, very few recipes contain the word "recipe." The way to look for recipes is actually with such words as "teaspoon," "tablespoon," and "cuillere"! The same words can also mean different things to different people. An architect looking for documents on computer-aided architecture might form the query "computers AND architecture" and arrive at the home page of Intel. Mapping cyberspace will solve these problems, and the enabling technologies of data mining and data warehousing must quickly figure out how to proactively lead the business charge. If they do not, they risk being trampled by a stampede of business users, who will quickly brand those terms as their own, inventing a broader meaning for those concepts than they have today.
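The recipe problem described above can be reproduced in a few lines. This is a minimal sketch over two made-up documents; the function `keyword_search` is hypothetical and matches whole words only, which is a simplification of what any real search engine does.

```python
import re

# Two hypothetical documents: doc1 is a recipe that never says "recipe";
# doc2 says "recipe" but is not one.
documents = {
    "doc1": "Add one teaspoon of salt and two tablespoons of butter, then bake.",
    "doc2": "Our recipe database schema stores one row per dish.",
}

def keyword_search(docs, keywords):
    """Return the ids of documents containing at least one keyword
    as a whole word (case-insensitive)."""
    wanted = set(keywords)
    hits = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & wanted:
            hits.append(doc_id)
    return hits

# The obvious keyword finds the wrong document and misses the recipe.
literal_hits = keyword_search(documents, ["recipe"])

# Words that recipes actually contain find the right one.
indicator_hits = keyword_search(documents, ["teaspoon", "tablespoon", "cuillere"])
```

Here the literal query returns only the database document, while the indicator words return only the actual recipe.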

Mapping Cyberspace

In a sense, current search engines produce zero-dimensional maps. Points in this space correspond to vectors of words, and documents are returned in a list ordered by their distance (relevancy distance) from the query point. Such engines collapse the multiple dimensions of cyberspace into a point, and all orientation is lost.

Information retrieval tools need to learn that lesson and must 'figure out' how to ensure that orientation within information can be guaranteed to all users of a data warehouse.
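The ranked-list behavior described in this section can be sketched as follows. The sketch assumes a simple bag-of-words representation and cosine distance, which is one common choice and not necessarily what any particular search engine uses; the documents and the names `vectorize`, `cosine_distance`, and `rank` are invented for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a vector of word counts."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """Distance between two word-count vectors: 0 means same direction,
    1 means no words in common."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def rank(query, docs):
    """Return document names ordered by distance from the query point."""
    q = vectorize(query)
    return sorted(docs, key=lambda name: cosine_distance(q, vectorize(docs[name])))

# Hypothetical documents.
docs = {
    "intel": "computers chips processor architecture",
    "cad": "computers architecture buildings design",
    "cooking": "teaspoon tablespoon butter",
}

ranking = rank("computers architecture", docs)
q = vectorize("computers architecture")
intel_distance = cosine_distance(q, vectorize(docs["intel"]))
cad_distance = cosine_distance(q, vectorize(docs["cad"]))
```

In this toy example the chip-maker page and the building-design page sit at exactly the same distance from the query. The list says how far each document is from the query point, but nothing about the direction in which it lies, which is the loss of orientation the text describes.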

The Importance of Agents

Personal agents are entering our lives more and more. And over the next few years, they will come to have even more importance in the area of information retrieval.

Coming from many sources, notably artificial life, agents have been gaining popularity as a way of conceptualizing software design. The agent has a certain autonomy and inherent rules of behavior. There may be many agents in a system, each responsible for one or many tasks, and able to cooperate with other agents. For example, in the "Chiliad publishing system," an agent may be responsible for the maintenance of a particular document or subject area. When new relevant information appears, the agent will automatically incorporate it into the document to the extent possible, and will otherwise alert the responsible author/editor.

The author/editor's job is then not just writing the document, but also defining rules of behavior for the associated agent. These rules will help the agent decide what is relevant or not and set priorities. Other rules will specify what agent types the agent should interact with, and how. There will be rules on how to report information collected, where and on what schedule.
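The division of labor described above, in which the author/editor supplies rules and the agent applies them, can be sketched as follows. This is an illustrative toy only, not the design of the Chiliad system: the class `DocumentAgent`, its fields, and the two example rules are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DocumentAgent:
    """A hypothetical agent that maintains one document or subject area."""
    subject: str
    is_relevant: Callable[[str], bool]      # rule set by the author/editor
    can_auto_merge: Callable[[str], bool]   # rule set by the author/editor
    document: List[str] = field(default_factory=list)
    alerts: List[str] = field(default_factory=list)

    def receive(self, item: str) -> None:
        """Apply the rules to one piece of incoming information."""
        if not self.is_relevant(item):
            return                      # irrelevant: ignore
        if self.can_auto_merge(item):
            self.document.append(item)  # incorporate automatically
        else:
            self.alerts.append(item)    # otherwise alert the author/editor

# Example rules (assumptions): relevance is a keyword test, and only
# short items are considered safe to merge without human review.
agent = DocumentAgent(
    subject="data warehousing",
    is_relevant=lambda item: "warehouse" in item.lower(),
    can_auto_merge=lambda item: len(item) < 60,
)

incoming = [
    "New warehouse metadata standard released.",
    "Quarterly football results.",
    "A long analyst report argues that every data warehouse should be rebuilt around business units.",
]
for item in incoming:
    agent.receive(item)
```

Passing the rules in as functions keeps the agent's behavior under the author/editor's control: changing what counts as relevant, or what may be merged unattended, does not require changing the agent itself.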

Some of these ideas are extensions of the already familiar personalized newspaper, such as PointCast. However, by having many agents, having them interact in sophisticated ways and under expert control, and then having the results presented in a visual map, we create a system with both quantitative and qualitative advantages. The multiplicity of agents contributes to the robustness of the system, since imperfections in a given agent need not propagate. It also contributes to its speed, since an agent-based system is naturally scalable.

Conclusion

If data mining and data warehousing are to become corporate nirvana for the 21st Century, then they must be built as complex adaptive systems with the business end user firmly in mind. A complex, adaptive system is a social system, a physical system, and an artificial biological system, all at the same time. A think tank that particularly focuses on this kind of research is the Santa Fe Institute in Santa Fe, New Mexico.

As data mining companies add complementary information retrieval processes and tools to their present suites of offerings, they will vastly speed up the adaptation of the data mining industry to the broader needs of the business end user. When the worlds of data mining and knowledge extraction expand to include information retrieval from alternative media formats and text-based data, then "data mining" and "data warehousing" will become the hottest buzzwords for businesses in the information age.

References

John Seely Brown and Paul Duguid, 1996. "The Social Life of Documents," First Monday, Vol. 1, No. 1 (May), at http://www.firstmonday.org/issues/issue1/documents/

E. Jeffrey Conklin. Designing Organizational Memory: Preserving Intellectual Assets in a Knowledge Economy, at http://www.zilka.net/business/ifo/pubs/desp/ (conklin@cmsi.com)

Usama M. Fayyad and others (eds.), 1996. Advances in Knowledge Discovery and Data Mining. Cambridge, Mass.: MIT Press.

Laura Fillmore, 1995. "A Penny for Your Thoughts: Copyright into 'Cogniright,'" at www.obs-us.com/obs/english/papers/cogni.htm

Laura Fillmore, 1996. "Meme Machinery 101: The Evolution of a University Press Marketplace," at www.obs-us.com/obs/english/papers/mememach.htm

Murray Gell-Mann, 1997. "'Information' versus Knowledge and Understanding," talk given at ACM'97, 5 March 1997. Santa Fe Institute.

Howard Gutowitz, 1995. "Internet Two: Case Studies in the Diffusion of Scientific Information via the Internet," submitted to the Information Society Journal.

Marti A. Hearst, 1996. "Research in Support of Digital Libraries at Xerox PARC. Part I: The Changing Social Roles of Documents," D-Lib Magazine (May), at www.dlib.org/dlib/may96/05hearst.html

John H Holland, 1995. Hidden Order: How Adaptation Builds Complexity. Reading, Mass.: Addison Wesley.

Steven R. Holtzman, 1994. Digital Mantras: The Languages of Abstract and Virtual Worlds. Cambridge, Mass.: MIT Press.

R. Kownow, 1995. "Intellectual Property and the Arts: Molten Media and the Infiltration of the Law," at www-mitpress.mit.edu/Leonardo/

Wendy Lehnert, 1996. Center for Intelligent Information Retrieval Online Information Extraction Bibliography. Amherst, Mass.: Computer Science Department, University of Massachusetts, at ciir.cs.umass.edu/info/psfiles/tepubs/tepubs.html

Leonardo, 1996. "Digital Salon," special issue of Leonardo, Vol. 29, No. 5, at www-mitpress.mit.edu/Leonardo/isast/journal/journal.html

Christine Maxwell, 1997. "The Future of Publishing," Digital Publishing Strategies (January), pp. 10-11.

Christine Maxwell, 1995. "Cyberspace the Newest Indexing Frontier," Key Words, Vol. 3, No. 3 (July-August), presented at the American Society for Indexers Meeting, Montreal in June, 1995.

Le Monde Diplomatique, 1996. "Internet: L'extase et l'effroi," special issue of Le Monde Diplomatique (November).

Kamran Parsaye, 1996. Surveying Decision Support: New Realms of Analysis. Information Discovery, Inc.

Michael Dexter-Smith, 1996. Metadata: Data Warehouse Key, at www.data-warehouse.com/resource/articles/dexter.htm

Mark Stefik, 1996. Internet Dreams: Archetypes, Myths, and Metaphors. Cambridge, Mass.: MIT Press.

UMBC AgentNews Webletter, Vol. 2, No. 1 (Jan. 19, 1997), at www.cs.umbc.edu/agents/agentnews/1997/01/

Vista Computer Services, 1995. "The Organizational Impact of Publishing in the New Media."

The Authors

Christine Maxwell is President and Publisher of CHILIAD. A 25-year veteran of the publishing and research industries, Christine Maxwell is now focusing her extensive research, publishing and Internet expertise on creating a new information-mining, knowledge discovery and publishing vehicle for high value intellectual property.

As a co-founder of the Magellan online directory, one of the original top five most visited directory sites on the Internet, Ms. Maxwell has been the publisher and creative visionary behind the McKinley Group, owners of the Magellan directory, which was recently sold to Excite, another major directory service.

Recognising early on the need for a clear, concise guide to the Internet, Maxwell co-authored the original "New Riders Official Internet Yellow Pages" in 1994. She has since brought out the third edition of this work under the revised title of "McKinley Internet Yellow Pages." The work has achieved critical acclaim as an essential Internet reference tool.

Prior to launching The McKinley Group, Maxwell held senior marketing, strategic business and development positions with Pergamon Press Publishers, Science Research Associates (SRA) and Macmillan Publishing Company. She holds a teaching credential from Lady Spencer Churchill College of Education, Oxford, England, and BA degrees in Sociology and Latin American Studies from Pitzer College, in Claremont, California.

She serves as a trustee on the board of directors of the prestigious Santa Fe Institute (www.stafe.edu), a private, independent multi-disciplinary research and education center, founded in Santa Fe, New Mexico, in 1986 by Professor Murray Gell-Mann and Professor George Cowan. She is also on the Board of Directors of LEONARDO, a multidisciplinary journal of art, science and technology.

Email: maxwell@hyperactive.co.uk www.insync.demon.co.uk/christine_maxwell.html

Dr. Howard Gutowitz is a tenured Professor of Mathematics at l'Ecole de Physique et Chimie Industrielles (ESPCI) in Paris, France. He is a member of the Santa Fe Institute for Complex Systems Research, and is a frequent visitor there. He has degrees in Biology (B.Sc., Brown), Physics (Ph.D., Rockefeller), and Computer Science (Habilitation à Diriger des Recherches, U. Paris VIII).

He is the author of over 30 papers in these fields, as well as the editor of a book. Dr. Gutowitz held postdoctoral positions at the Rockefeller University, The Centre for Non-linear Studies, Los Alamos National Laboratory, and the Centre des Etudes Nucleaires, Saclay, before joining the staff of the ESPCI.

His research concerns complex systems, systems containing a multitude of simple individuals which, through their interactions, give rise to complex behavior. He has applied his theoretical work in these areas to practical concerns, such as traffic flow, cryptography, and the economics of pollution, leading to various industrial consultation positions.

Dr. Gutowitz has a long-standing interest in the collective behavior of societies of authors and readers, and has written two papers in this area. His research is now fully focused on the new vistas opened by the rapid development of the World Wide Web.

Email: hag@neurones.espci.fr www.santafe.edu/~hag

Christine Maxwell and Dr. Howard Gutowitz, Copyright, 1997



Copyright © 1997, First Monday