Fundamentals of Data Mining -- Article Index and Overview

Section 1 What Is Data Mining and What Can It Do?

Section 2 The Data Mining Process

Section 3 Problem Definition (Step 1)

Section 4 Data Evaluation (Step 2)

Section 5 Feature Extraction and Enhancement (Step 3)

Section 6 Prototyping Plan and Model Development (Step 4)

Section 7 Model Evaluation (Step 5)

Section 8 Implementation (Step 6)

Section 9 Genre Section 1--Detecting and Characterizing Known Patterns (Supervised Learning)

Section 10 Genre Section 2--Detecting, Characterizing, and Exploiting Hidden Patterns (Forensic Analysis)

Section 11 Genre Section 3--Knowledge: Its Acquisition, Representation, and Use

References


How to Use This Guide

Data mining is much more than just trying stuff and hoping something good happens! Rather, data mining is the detection, characterization, and exploitation of actionable patterns in data.

This guide is a wide-ranging treatment of the practical aspects of data mining in the real world. It presents in a systematic way the analytic principles acquired by the author during his 40+ years as a practicing engineer, data miner, information scientist, and Professor of Computer Science.

This guide is not intended to be read and then put on the shelf. Rather, it is a working field manual, designed to serve as an on-the-job guidebook. It has been written specifically for IT consultants, professional data analysts, and sophisticated data owners who want to establish data mining projects but are not themselves data mining experts.

Most sections contain one or more case studies. These are synopses of data mining projects led by the author, and include project descriptions, the data mining methods used, challenges encountered, and the results obtained. When possible, numerical details are provided, grounding the presentation in specifics.

Also included are checklists that guide the reader through the practical considerations associated with each phase of the data mining process. These are working checklists: material the reader will want to carry into meetings with customers, planning discussions with management, technical planning meetings with senior scientists, and so on. The checklists lay out the questions to ask and the points to make, and explain the whats and whys: the lessons learned that are known to all seasoned experts but rarely written down.

While the treatment here is systematic, it is not formal: the reader will not encounter eclectic theorems, tables of equations, or detailed descriptions of algorithms. The "bit-level" mechanics of data mining techniques are well covered in the online literature, and freeware is available for many of them. A brief list of vendors and supported applications is provided below. The goal of this guide is to help the non-expert address practical questions like:

• What is data mining, and what problems does it address?

• How is a quantitative business case for a data mining project developed and assessed?

• What process model should be used to plan and execute a data mining project?

• What skill sets are needed for different types/phases of data mining projects?

• What data mining techniques exist, and what do they do? How do I decide which are needed/best for my problem?

• What are the common mistakes made during data mining projects, and how can they be avoided?

• How are data mining projects tracked and evaluated?

How This Guide Is Organized

The content of the guide is divided into two parts: Sections 1-8 and Sections 9-11.

The first eight sections constitute the bulk of the guide, and serve to ground the reader in the practice of data mining in the modern enterprise. These sections focus on the what, when, why, and how of data mining practice. Technical complexities are introduced only when they are essential to the treatment. This part of the guide should be read by everyone; later sections assume that the reader is familiar with the concepts and terms presented in these sections.

Section 1 (What Is Data Mining and What Can It Do?) is a data mining manifesto: it describes the mindset that characterizes the successful data mining practitioner. It delves into some philosophical issues underlying the practice (e.g., why is it essential that the data miner understand the difference between data and information?).

Section 2 (The Data Mining Process) provides a summary treatment of data mining as a six-step spiral process.

Sections 3-8 are each devoted to one step of the data mining process. Checklists, case studies, tables, and figures abound.

• Step 1--Problem Definition

• Step 2--Data Evaluation

• Step 3--Feature Extraction and Enhancement

• Step 4--Prototype Planning and Modeling

• Step 5--Model Evaluation

• Step 6--Implementation

The last three sections, 9-11, are devoted to specific categories of data mining practice, referred to here as genres. The data mining genres addressed are Section 9: Detecting and Characterizing Known Patterns (Supervised Learning), Section 10: Detecting, Characterizing, and Exploiting Hidden Patterns (Forensic Analysis), and Section 11: Knowledge: Its Acquisition, Representation, and Use.

It is hoped the reader will benefit from this rendition of the author's extensive experience in data mining/modeling, pattern processing, and automated decision support. He started this journey in 1979, and learned most of this material the hard way. By repeating his successes and avoiding his mistakes, you make his struggle worthwhile!

A Short History of Data Technology: Where Are We, and How Did We Get Here?

What follows is a brief account of the history of data technology along classical lines. We posit the existence of brief eras of five or ten years' duration through which the technology passed during its development. This background will help the reader understand the forces that have driven the development of current data mining techniques. The dates provided are approximate.

Era 1: Computing-Only Phase (1945-1955):

As originally conceived, computers were just that: machines for performing computation. Volumes of data might be input, but the answer tended to consist of just a few numbers. Early computers had nothing that we would call online storage.

Reliable, inexpensive mass storage devices did not exist. Data was not stored in the computer at all: it was input, transformed, and output. Computing was done to obtain answers, not to manage data.

Era 2: Offline Batch Storage (1955-1965):

Data was saved outside of the computer, on paper tape and cards, and read back in when needed. The use of online mass storage was not widespread, because it was expensive, slow, and unstable.

Era 3: Online Batch Storage (1965-1970):

With the invention of stable, cost-effective mass storage devices, everything changed.

Over time, the computer began to be viewed less as a machine for crunching numbers, and more as a device for storing them. Initially, the operating system's file management system was used to hold data in flat files: un-indexed lists or tables of data. As the need to search, sort, and process data grew, it became necessary to provide applications for organizing data into various types of business-specific hierarchies. These early databases organized data into tiered structures, allowing for rapid searching of records in the hierarchy.
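The difference between a flat file and a tiered structure can be sketched in a few lines of Python. This is an illustration only, not a model of any particular historical system: the record fields (region, store, SKU, quantity) and the function names are hypothetical, and nested dictionaries stand in for the tiers of an early hierarchical database.

```python
# Hypothetical inventory records; the field layout is an assumption made
# for illustration: (region, store, sku, quantity).
FLAT_FILE = [
    ("east", "store-1", "widget", 40),
    ("east", "store-2", "widget", 15),
    ("west", "store-7", "gadget", 22),
    ("west", "store-7", "widget", 8),
]


def flat_lookup(records, region, store, sku):
    """Scan an un-indexed list from the top until the record is found."""
    for rec_region, rec_store, rec_sku, qty in records:
        if (rec_region, rec_store, rec_sku) == (region, store, sku):
            return qty
    return None


def build_hierarchy(records):
    """Organize the same records into tiers: region -> store -> sku."""
    tree = {}
    for region, store, sku, qty in records:
        tree.setdefault(region, {}).setdefault(store, {})[sku] = qty
    return tree


def tiered_lookup(tree, region, store, sku):
    """Descend one tier per key instead of scanning every record."""
    return tree.get(region, {}).get(store, {}).get(sku)


HIERARCHY = build_hierarchy(FLAT_FILE)
print(flat_lookup(FLAT_FILE, "west", "store-7", "widget"))
print(tiered_lookup(HIERARCHY, "west", "store-7", "widget"))
```

Both lookups return the same quantity. The flat scan may have to touch every record, while the tiered lookup follows one branch per level, which is the property that made hierarchical organization attractive as data volumes grew.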

Data was stored on high-density media such as magnetic tape and magnetic drum.

Platter disc technology came into more general use, but was still slow and had low capacity.

Era 4: Online Databases (1970-1985):

Reliable, cost-effective online mass storage became widely available. Data was organized into domain-specific vertical structures, typically for a single part of an organization.

This allowed the development of stovepipe systems for focused applications. The use of Online Transaction Processing (OLTP) systems became widespread, supporting inventory, purchasing, sales, planning, etc. The focus of computing began to shift from raw computation to data processing: the ingestion, transformation, storage, and retrieval of bulk data.

However, there was an obvious shortcoming. The databases of functional organizations within an enterprise were developed to suit the needs of particular business units. They were not interoperable, making the preparation of an enterprise-wide data view very difficult. The difficulty of horizontal integration caused many to question whether the development of enterprise-wide databases was feasible.

Era 5: Enterprise Databases (1985-1995):

As the utility of automatic data storage became clear, organizations within businesses began to construct their own hierarchical databases. Soon, the repositories of corporate information on all aspects of a business grew to be large.

Increased processing power, widespread availability of reliable communication networks, and development of database technology allowed the horizontal integration of multiple vertical data stores into an enterprise-wide database. For the first time, a global view of an entire organization's data repository was accessible through a single portal.

Era 6: Data Warehouses and Data Marts (since 1995):

This brings us to the present. Mass storage and raw compute power have reached the point where virtually every data item generated by an enterprise can be saved.

Often, enterprise databases have become extremely large, architecturally complex, and volatile. Ultra-sophisticated data modeling tools have become available at the precise moment that competition for market share in many industries begins to peak. An appropriate environment for applying these tools to a cleansed, stable, offline repository was needed, and data warehouses were born. And, as data warehouses have grown large, the need to create architecturally compatible functional subsets, or data marts, has been recognized.

The immediate future is moving everything toward cloud computing. This will include the elimination of many local storage disks as data is pushed to a vast array of external servers accessible over the internet. Data mining in the cloud will continue to grow in importance as network connectivity and data accessibility become virtually infinite.

Data Mining Information Sources

Some feeling for the current interest in data mining can be gained by reviewing the following list of data mining companies, groups, publications, and products.

Data Mining Publications

Two Crows Corporation

Predictive and descriptive data mining models, courses and presentations.

• "Information Management." A newsletter web site on data mining papers, books and product reviews.

• "Searching for the Right Data Modeling Tool" by Terry Moriarty

• "Data Mining FAQs" by Jesus Mena

• "Data Mining & Pattern Discovery," Elder Research, Inc.

• "An Evaluation of High-end Data Mining Tools for Fraud Detection" by Dean W. Abbott, I.P. Matkovsky, and John F. Elder

• KDnuggets.com is a web site providing companies with data mining related products.

Data Mining Technology/Product Providers

SPSS Web Site

SPSS Products

General Data Mining Tools

The data mining tools in the following list are used for general types of data:

Data-Miner Software Kit--A comprehensive collection of programs for efficiently mining big data. It uses the techniques presented in Predictive Data Mining: A Practical Guide by Sholom M. Weiss and Nitin Indurkhya (Morgan Kaufmann).

RuleQuest.com--A rule-based system with subsystems that assist in data cleansing (GritBot) and in constructing classifiers (See5) in the form of decision trees and rulesets.

SAS

Weka 3 from the University of Waikato--A collection of machine learning algorithms for solving real-world data mining problems.

Tools for the Development of Bayesian Belief Networks

Netica--BBN software that is easy to use, and implements BBN learning from data. It has a nice user interface.

Hugin--Implements reasoning with continuous variables and has a nice user interface.

