{"id":1676,"date":"2022-09-06T09:53:37","date_gmt":"2022-09-06T09:53:37","guid":{"rendered":"https:\/\/blog.amt.in\/?p=1676"},"modified":"2022-09-06T09:53:37","modified_gmt":"2022-09-06T09:53:37","slug":"introduction-to-data-mining","status":"publish","type":"post","link":"https:\/\/blog.amt.in\/index.php\/2022\/09\/06\/introduction-to-data-mining\/","title":{"rendered":"Introduction to Data Mining"},"content":{"rendered":"<p>Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary sub-field of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the &#8220;knowledge discovery in databases&#8221; process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.<\/p>\n<p>The term &#8220;data mining&#8221; is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application 
of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large-scale) data analysis and analytics \u2013 or, when referring to actual methods, artificial intelligence and machine learning \u2013 are more appropriate.<\/p>\n<p>The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. 
Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall KDD process as additional steps.<\/p>\n<p>The difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover hidden patterns in a large volume of data.<\/p>\n<p>The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.<\/p>\n<p>The knowledge discovery in databases (KDD) process is commonly defined with the stages:<\/p>\n<ol>\n<li>Selection<\/li>\n<li>Pre-processing<\/li>\n<li>Transformation<\/li>\n<li>Data mining<\/li>\n<li>Interpretation\/evaluation<\/li>\n<\/ol>\n<p>Many variations on this theme exist, however, such as the Cross-industry standard process for data mining (CRISP-DM), which defines six phases:<\/p>\n<ol>\n<li>Business understanding<\/li>\n<li>Data understanding<\/li>\n<li>Data preparation<\/li>\n<li>Modeling<\/li>\n<li>Evaluation<\/li>\n<li>Deployment<\/li>\n<\/ol>\n<p>or a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.<\/p>\n<p>Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA. 
However, 3\u20134 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.<\/p>\n<p><span id=\"Pre-processing\" class=\"mw-headline\">Pre-processing:<\/span><\/p>\n<p>Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.<\/p>\n<p><span id=\"Data_mining\" class=\"mw-headline\">Data mining:<\/span><\/p>\n<p>Data mining involves six common classes of tasks:<\/p>\n<ul>\n<li>Anomaly detection (outlier\/change\/deviation detection) \u2013 The identification of unusual data records that might be interesting, or of data errors that require further investigation.<\/li>\n<li>Association rule learning (dependency modeling) \u2013 Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. 
This is sometimes referred to as market basket analysis.<\/li>\n<li>Clustering \u2013 The task of discovering groups and structures in the data that are in some way or another &#8220;similar&#8221;, without using known structures in the data.<\/li>\n<li>Classification \u2013 The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as &#8220;legitimate&#8221; or as &#8220;spam&#8221;.<\/li>\n<li>Regression \u2013 Attempts to find a function that models the data with the least error, that is, for estimating the relationships among data or datasets.<\/li>\n<li>Summarization \u2013 Providing a more compact representation of the data set, including visualization and report generation.<\/li>\n<\/ul>\n<p><span id=\"Results_validation\" class=\"mw-headline\">Results validation:<\/span><\/p>\n<p>Data mining can unintentionally be misused, and can then produce results that appear to be significant but that do not actually predict future behavior, cannot be reproduced on a new sample of data, and are of little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process, and thus a train\/test split, when applicable at all, may not be sufficient to prevent this from happening.<\/p>\n<p>The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. 
It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish &#8220;spam&#8221; from &#8220;legitimate&#8221; e-mails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which the algorithm had <i>not<\/i> been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.<\/p>\n<p>The above is a brief overview of data mining. Watch this space for more updates on the latest trends in technology.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data mining is the process 
of<\/p>\n","protected":false},"author":1,"featured_media":1678,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[555,916,556],"tags":[558,917,559],"class_list":["post-1676","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-mining","category-data-vaildation","category-pre-processing","tag-data-mining","tag-data-validation","tag-pre-processing"],"_links":{"self":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1676","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/comments?post=1676"}],"version-history":[{"count":1,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1676\/revisions"}],"predecessor-version":[{"id":1677,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/posts\/1676\/revisions\/1677"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/media\/1678"}],"wp:attachment":[{"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/media?parent=1676"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/categories?post=1676"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.amt.in\/index.php\/wp-json\/wp\/v2\/tags?post=1676"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}