Image Primer: February 2015

February 27, 2015

CRISP-DM Approach for Data Mining

The CRISP-DM approach is widely used in industry for data mining tasks. CRISP-DM stands for Cross-Industry Standard Process for Data Mining.

Advantages of the CRISP-DM approach:
1] It is neutral with respect to tools being used.
2] It provides a uniform framework for guidelines and experience documentation.
3] It is flexible enough to account for differences in business/agency problems as well as different types of data sets.
4] Standardizing the process makes the methodology easier for new users to adopt.

The overall flow of this process is shown in the following figure. The sequence of these steps is not strict, and moving back and forth between different phases is always required.

Figure 1: CRISP-DM Process Flow

There are six steps in this approach.

Business Understanding:

Understanding the project objectives and requirements

Defining the Data Mining problem

Designing a preliminary plan to achieve the objectives

Data Understanding:

Initial data collection and familiarization with data features

Identification of problems in data quality

Finding initial interesting insights of the data

Data Preparation:

Feature selection

Constructing the final data set from initial raw data

Transformation, formatting and cleaning the data

Modeling:

Selection and application of modeling techniques

Parameter calibration for the model

Assessing the model performance

Evaluation:

Thorough evaluation of model

Evaluation of all important business objectives & issues

Reviewing the process

Deployment:

Deploying the resulting model

Generating a report

Implementing a repeatable data scoring process

Planning monitoring and maintenance
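
To make the phases above more concrete, here is a minimal sketch in Python that walks through Data Understanding, Data Preparation, Modeling and Evaluation on a tiny made-up data set. The records, the `amount` and `label` field names, the train/test split and the single-threshold "model" are all assumptions chosen for illustration; they are not part of the CRISP-DM standard, which is neutral about tools and techniques.

```python
# Toy walk-through of four CRISP-DM phases on made-up data (all values hypothetical).
raw = [
    {"amount": 12.0, "label": 0},
    {"amount": 15.0, "label": 0},
    {"amount": None, "label": 0},   # data quality problem: missing value
    {"amount": 18.0, "label": 0},
    {"amount": 95.0, "label": 1},
    {"amount": 110.0, "label": 1},
    {"amount": 14.0, "label": 0},
    {"amount": 102.0, "label": 1},
]

# Data Understanding: identify problems in data quality.
n_missing = sum(1 for row in raw if row["amount"] is None)

# Data Preparation: clean the data, then construct the final data sets.
clean = [row for row in raw if row["amount"] is not None]
train, test = clean[:5], clean[5:]

# Modeling: calibrate one parameter (a threshold) on the training data.
def accuracy(rows, threshold):
    """Fraction of rows where 'amount > threshold' agrees with the label."""
    hits = sum(1 for r in rows if (r["amount"] > threshold) == bool(r["label"]))
    return hits / len(rows)

candidates = sorted(r["amount"] for r in train)
best_threshold = max(candidates, key=lambda t: accuracy(train, t))

# Evaluation: assess the model on data it was not calibrated on.
test_accuracy = accuracy(test, best_threshold)
print(n_missing, best_threshold, test_accuracy)
```

In a real project each phase would be far larger, and a poor evaluation result would typically send the work back to an earlier phase rather than forward to deployment.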

Sources:
[1] What is CRISP-DM methodology?
[2] CRISP-DM

February 01, 2015

Anomaly Detection in Data Mining

Finding anomalies (also known as outliers) is an important step in data mining. What is an anomaly? Anomalies are data points that are considerably different from the remainder of the data.

Anomaly detection is very useful in applications such as detecting fraudulent credit card transactions. In such a database, only a few data points (transactions) are fraudulent, and these need to be separated from the rest.

In model-based anomaly detection techniques, a model is built for the given data. For unsupervised models, anomalies are those points which distort the model or don't fit it well. For supervised models, anomalies are those data points which belong to some rare class.
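
As a rough illustration of the unsupervised case, the sketch below fits a very simple model (a single Gaussian described by its mean and standard deviation) and flags the points that fit it poorly. The data values, the function name `gaussian_anomalies` and the cut-off of 2 standard deviations are assumptions made for this example, not recommendations.

```python
import statistics

def gaussian_anomalies(values, z_cutoff=2.0):
    """Flag values lying more than z_cutoff standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / std > z_cutoff]

# Hypothetical measurements: seven similar values and one that does not fit.
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 25.0]
flagged = gaussian_anomalies(values)
print(flagged)
```

Note that the anomaly itself inflates the estimated mean and standard deviation, which is one way a point can "distort" the model; more robust estimates (for example, based on the median) reduce that effect.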

Commonly used anomaly detection techniques are:

1] Proximity-based: Points far away from other data points

2] Density-based: Points in regions of very low density

3] Pattern Matching: Finding atypical patterns

4] Probabilistic Approach: Points with a low probability with respect to a probability distribution model of the data

5] Statistical-based Likelihood Approach

6] Distance-based Approach

7] Clustering-Based Approaches: A data point is an anomaly if it does not strongly belong to any cluster
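
As one possible illustration of the proximity-based idea from the list above, the sketch below scores each point by the distance to its k-th nearest neighbour, so that points far away from the others receive a high score. The sample points, the function name `knn_distance_scores` and the choice of k = 2 are assumptions for this example; it requires Python 3.8 or later for `math.dist`.

```python
import math

def knn_distance_scores(points, k):
    """Return, for each point, the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: five points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

This brute-force version compares every pair of points, which is fine for small data sets but becomes slow as the number of points grows.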

References:

[1] Tan, P.-N., Steinbach, M., and Kumar, V. 2005. Introduction to Data Mining. Addison-Wesley.