February 27, 2015

CRISP-DM Approach for Data Mining

The CRISP-DM approach is widely used in industry for data mining tasks. CRISP-DM stands for Cross Industry Standard Process for Data Mining.

Advantages of the CRISP-DM approach:
1] It is neutral with respect to tools being used.
2] It provides a uniform framework for guidelines and experience documentation.
3] It is flexible enough to account for differences in business/agency problems as well as different types of data sets.
4] Standardizing the process makes the methodology easier for new users to adopt.


The overall chart for this process is shown in the following figure. The sequence of these steps is not strict; moving back and forth between different phases is always required.

Figure 1: CRISP-DM Process Flow



There are six steps in this approach.

Business Understanding: 
  • Understanding the project objectives and requirements
  • Defining the Data Mining problem
  • Designing a preliminary plan to achieve the objectives
Data Understanding: 
  • Initial data collection and familiarization with data features
  • Identification of problems in data quality 
  • Finding initial interesting insights in the data
Data Preparation: 
  • Feature selection
  • Constructing the final data set from initial raw data 
  • Transformation, formatting and cleaning the data
Modeling: 
  • Selection and application of modeling techniques 
  • Calibrating the model parameters
  • Assessing the model performance
Evaluation: 
  • Thorough evaluation of the model
  • Evaluation of all important business objectives & issues
  • Reviewing the process
Deployment:
  • Deploying the resulting model
  • Generating reports
  • Implementing a repeatable data scoring process
  • Planning monitoring and maintenance

Sources:
[1] What is CRISP-DM methodology? 
[2] CRISP-DM

February 01, 2015

Anomaly Detection in Data Mining

Finding anomalies (also known as outliers) is an important step in data mining. What is an anomaly? Anomalies are the data points that are considerably different from the remainder of the data.

Anomaly detection is very useful in applications such as credit card fraud detection. In such a database, only a few data points (transactions) are fraudulent, and these need to be separated from the rest.

In model-based anomaly detection techniques, a model is built for the given data. For unsupervised models, anomalies are those points which distort the model or don't fit it well. For supervised models, anomalies are those data points which belong to some rare class.
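As a rough illustration of the unsupervised case, here is a minimal sketch that assumes the "model" is a single Gaussian (mean and standard deviation) fitted to one numeric feature. The function name `gaussian_anomalies`, the threshold of 2.0, and the sample amounts are all hypothetical choices made for this example, not part of any particular library or of the referenced textbook.

```python
from statistics import fmean, pstdev


def gaussian_anomalies(values, threshold=2.0):
    """Return the values that fit a simple Gaussian model poorly.

    Assumption: the data is modeled by its mean and population standard
    deviation, and a point is called an anomaly when its z-score
    magnitude exceeds `threshold` (an illustrative cutoff).
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # All values are identical, so nothing deviates from the model.
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical transaction amounts; one value is far from the rest.
amounts = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 50.0]
print(gaussian_anomalies(amounts))
```

With this sample data only the value 50.0 is flagged. Note that a large outlier also inflates the fitted standard deviation, so on real data a more robust model (for example one based on the median) may be a better choice.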

Commonly used anomaly detection techniques are:
1] Proximity-based: Points far away from other data points
2] Density-based: Points in regions of very low density
3] Pattern Matching: Finding atypical patterns
4] Probabilistic Approach: Points with a low probability with respect to a probability distribution model of the data
5] Statistical-based Likelihood Approach
6] Distance-based Approach
7] Clustering-Based Approaches: A data point is an anomaly if it does not strongly belong to any cluster
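The proximity-based idea from the list above can be sketched as follows. This is one possible scoring scheme, assumed here for illustration: each point is scored by the distance to its k-th nearest neighbor, so points far away from the others receive a high score. The function name `kth_neighbor_distances`, the choice k=2, and the sample points are hypothetical.

```python
import math


def kth_neighbor_distances(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    A larger score means the point lies farther from the other data points.
    This brute-force version compares every pair, so it only suits small data.
    """
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical 2-D data: four points close together and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = kth_neighbor_distances(points, k=2)
most_anomalous = points[scores.index(max(scores))]
print(most_anomalous)
```

Here the point (10, 10) gets the highest score. How many of the top-scoring points to report as anomalies, and which k to use, depends on the data set and has to be chosen by the analyst.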

References:
[1] Tan, P.-N., Steinbach, M., and Kumar, V. 2005. Introduction to Data Mining. Addison-Wesley.