Introduction to Data Mining – Advanced Millennium Technologies

Data mining is the process of discovering patterns in large data sets using methods at the intersection of machine learning, statistics, and database systems. It is an interdisciplinary sub-field of computer science and statistics whose overall goal is to extract information (with intelligent methods) from a data set and transform it into a comprehensible structure for further use. Data mining is the analysis step of the “knowledge discovery in databases” process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

The term “data mining” is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword, frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large-scale) data analysis and analytics, or artificial intelligence and machine learning when referring to actual methods, are more appropriate.

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which a decision support system can then use to obtain more accurate prediction results. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall KDD process as additional steps.
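As a rough illustration of one of the pattern types named above (unusual records), here is a minimal Python sketch of anomaly detection using z-scores. The function name `find_anomalies`, the sample data, and the threshold of 2.0 are all assumptions chosen for the example; real systems typically use more robust methods.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`.

    The threshold of 2.0 is an illustrative assumption, not a standard.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily order counts with one unusual day.
daily_orders = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_anomalies(daily_orders))
```

A flagged record is only a candidate: it may be an interesting event or simply a data error, so it still needs interpretation.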

The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on a dataset (e.g., analyzing the effectiveness of a marketing campaign) regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover hidden patterns in a large volume of data.

The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

The knowledge discovery in databases (KDD) process is commonly defined with the stages:

Selection
Pre-processing
Transformation
Data mining
Interpretation/evaluation.

It exists, however, in many variations on this theme, such as the Cross-industry standard process for data mining (CRISP-DM), which defines six phases:

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.
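To make the five KDD stages listed above more concrete, the following Python sketch chains one small function per stage. Everything in it is hypothetical: the record fields (`age`, `product`, `store`), the function names, and the "most common product per age band" pattern are assumptions made for illustration, not part of any standard.

```python
from collections import Counter


def select(records, fields):
    """Selection: keep only the fields relevant to the question."""
    return [{f: r.get(f) for f in fields} for r in records]


def preprocess(records):
    """Pre-processing: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Transformation: bucket ages into decade bands (20, 30, ...)."""
    return [{**r, "age_band": (r["age"] // 10) * 10} for r in records]


def mine(records):
    """Data mining: find the most common product in each age band."""
    counts = {}
    for r in records:
        counts.setdefault(r["age_band"], Counter())[r["product"]] += 1
    return {band: c.most_common(1)[0][0] for band, c in counts.items()}


def interpret(patterns):
    """Interpretation/evaluation: turn patterns into readable findings."""
    return [
        f"Customers in their {band}s most often buy {product}"
        for band, product in sorted(patterns.items())
    ]


# Hypothetical raw records; one has a missing age.
raw = [
    {"id": 1, "age": 23, "product": "coffee", "store": "A"},
    {"id": 2, "age": 27, "product": "coffee", "store": "B"},
    {"id": 3, "age": 29, "product": "tea", "store": "A"},
    {"id": 4, "age": 34, "product": "tea", "store": "A"},
    {"id": 5, "age": None, "product": "tea", "store": "B"},
    {"id": 6, "age": 38, "product": "tea", "store": "B"},
]

findings = interpret(mine(transform(preprocess(select(raw, ["age", "product"])))))
for line in findings:
    print(line)
```

The point of the sketch is the ordering of the stages: the mining function only ever sees data that has already been selected, cleaned, and transformed.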

Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA. However, three to four times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.

Pre-processing:

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.
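The cleaning step described above can be sketched in a few lines of Python. This is a simplified illustration: it treats "noise" as any value outside a plausible range, which is an assumption made here, and the field names and ranges are hypothetical.

```python
def clean(records, valid_ranges):
    """Drop records with missing or out-of-range values.

    `valid_ranges` maps a field name to an inclusive (low, high) pair.
    Treating out-of-range values as noise is an assumption of this sketch.
    """
    cleaned = []
    for rec in records:
        # Remove observations with missing data.
        if any(rec.get(field) is None for field in valid_ranges):
            continue
        # Remove observations that look like noise.
        if any(
            not (low <= rec[field] <= high)
            for field, (low, high) in valid_ranges.items()
        ):
            continue
        cleaned.append(rec)
    return cleaned


# Hypothetical sensor readings.
readings = [
    {"temp_c": 21.5, "humidity": 40},
    {"temp_c": None, "humidity": 42},   # missing value
    {"temp_c": 999.0, "humidity": 41},  # implausible reading
    {"temp_c": 22.1, "humidity": 39},
]

ranges = {"temp_c": (-50.0, 60.0), "humidity": (0, 100)}
cleaned_readings = clean(readings, ranges)
print(cleaned_readings)
```

Dropping rows is the simplest policy; imputing missing values is a common alternative when too much data would otherwise be lost.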

Data mining:

Data mining involves six common classes of tasks:

Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering – The task of discovering groups and structures in the data that are in some way or another “similar”, without using known structures in the data.
Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as “legitimate” or as “spam”.
Regression – Attempts to find a function that models the data with the least error, that is, to estimate the relationships among data or datasets.
Summarization – Providing a more compact representation of the data set, including visualization and report generation.
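The market basket example from the list above can be sketched as follows. This is a minimal illustration restricted to item pairs, using the usual definitions of support (the fraction of baskets containing both items) and confidence (the fraction of baskets with the first item that also contain the second). The function name `pair_rules`, the baskets, and the thresholds are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical supermarket baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

rules = pair_rules(baskets, min_support=0.4, min_confidence=0.7)
for x, y, support, confidence in rules:
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

Counting every pair is fine for a toy data set; algorithms such as Apriori exist to prune the search when the number of items is large.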

Results validation:

Data mining can be misused unintentionally, producing results that appear significant but do not actually predict future behavior, cannot be reproduced on a new sample of data, and are therefore of little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, so a train/test split, when applicable at all, may not be sufficient to prevent it.
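A short calculation shows why investigating too many hypotheses is risky. The sketch below assumes the tests are independent and each uses the same significance level; under those assumptions the chance of at least one false positive is 1 - (1 - alpha)^m for m tests. The Bonferroni threshold shown is one common correction and is not something the text above prescribes.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / num_tests


for m in (1, 5, 20, 100):
    print(
        f"{m:>3} tests: chance of a spurious 'finding' = "
        f"{familywise_error_rate(m):.3f}, "
        f"Bonferroni per-test threshold = {bonferroni_threshold(m):.5f}"
    )
```

With 20 independent tests at the 0.05 level, the chance of at least one spurious result is already about 64%, which is why patterns found by broad searching need separate confirmation.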

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish “spam” from “legitimate” e-mails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.
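The spam example above can be sketched with a deliberately naive word-count classifier. The classifier, the function names, and the tiny e-mail sets are all hypothetical; the point is only the evaluation procedure, in which accuracy is measured on e-mails the model never saw during training.

```python
from collections import Counter


def train(messages):
    """Count in how many spam and legitimate messages each word appears."""
    spam_words, ham_words = Counter(), Counter()
    for text, label in messages:
        target = spam_words if label == "spam" else ham_words
        target.update(set(text.lower().split()))
    return spam_words, ham_words


def classify(model, text):
    """Label a message by whether its words lean toward spam overall."""
    spam_words, ham_words = model
    score = sum(spam_words[w] - ham_words[w] for w in set(text.lower().split()))
    return "spam" if score > 0 else "legitimate"


def accuracy(model, messages):
    """Fraction of messages whose predicted label matches the true label."""
    correct = sum(classify(model, text) == label for text, label in messages)
    return correct / len(messages)


# Hypothetical labelled e-mails, split into training and test sets.
training_set = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on monday", "legitimate"),
]
test_set = [
    ("claim your free money", "spam"),
    ("agenda for lunch meeting", "legitimate"),
    ("project update attached", "legitimate"),
    ("monday offer now", "spam"),
]

model = train(training_set)
print("training accuracy:", accuracy(model, training_set))
print("test accuracy:", accuracy(model, test_set))
```

In this toy run the model classifies every training e-mail correctly but misses one of the four test e-mails, which is the gap a held-out test set is meant to reveal.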

The above is a brief overview of data mining. Watch this space for more updates on the latest trends in technology.
