Thursday, April 4, 2019

Identifying Clusters in High Dimensional Data

(Ask those who remember, are mindful if you do not know). (Holy Quran, 643)

Removal Of Redundant Dimensions To Find Clusters In N-Dimensional Data Using Subspace Clustering

Abstract

Data mining has emerged as a powerful tool to extract knowledge from huge databases. Researchers have introduced several machine learning algorithms to explore databases and discover information, hidden patterns, and rules that were not known when the data was recorded. Owing to remarkable developments in storage capacity, processing power, and algorithmic tools, practitioners are developing new and improved algorithms and techniques in several areas of data mining to discover the rules and relationships among attributes in simple and complex higher dimensional databases. Furthermore, data mining is applied in a large variety of areas, ranging from banking to marketing, engineering to bioinformatics, and from investment to risk analysis and fraud detection. Practitioners are analyzing and implementing artificial neural network techniques for classification and regression problems because of their accuracy and efficiency. The aim of this short research project is to develop a way of identifying the clusters in high dimensional data, as well as the redundant dimensions that can create noise when identifying those clusters.
The technique used in this project relies on the strength of the projections of the data points along the dimensions: the intensity of projection along each dimension is used to find clusters and redundant dimensions in high dimensional data.

1 Introduction

In numerous scientific settings, engineering processes, and business applications, ranging from experimental sensor data and process control data to telecommunication traffic observation and financial transaction monitoring, huge amounts of high-dimensional measurement data are produced and stored. Whereas sensor equipment and large storage devices are getting cheaper day by day, data analysis tools and techniques lag behind. Clustering methods are common solutions to unsupervised learning problems, where neither expert knowledge nor helpful annotation for the data is available. In general, clustering groups the data objects so that similar objects end up together in clusters, whereas objects from different clusters are highly dissimilar. However, it is often observed that clustering discloses almost no structure even when it is known that there must be groups of similar objects. In many cases, the reason is that the cluster structure is induced by only some subsets of the space's dimensions, and the many additional dimensions contribute nothing but noise that hinders the discovery of the clusters within the data. As a solution to this problem, clustering algorithms are applied to the relevant subspaces only. Immediately, the new question is how to determine the relevant subspaces among the dimensions of the full space.
Being faced with the power set of the set of dimensions, a brute force trial of all subsets is infeasible because their number grows exponentially with the original dimensionality: a space with d dimensions has 2^d - 1 non-empty subsets of dimensions.

In high dimensional data, as the number of dimensions increases, visual perception and representation of the data become more difficult, and sometimes the increase in dimensions can create a bottleneck. More dimensions mean more visualization or representation problems in the data. As the dimensions are increased, the data within those dimensions seems to disperse towards the corners / dimensions. Subspace clustering addresses this by solving two problems in parallel. It solves the problem of finding the relevant subspaces, so that the remaining dimensions can be marked as redundant in high dimensional data. It likewise solves the problem of finding the cluster structures within the dataset that become apparent in these subspaces. Subspace clustering is an extension of traditional clustering that automatically finds the clusters present in subspaces of a high dimensional data space, which allows better clustering of the data points than the original space, and it works even when the curse of dimensionality occurs. Most clustering algorithms have been designed to discover clusters in the full dimensional space, so they are not effective in identifying clusters that exist within a subspace of the original data space. Most clustering algorithms also produce results that depend on the order in which the input records were processed [2].

Subspace clustering can identify the different clusters within subspaces that exist in huge amounts of sales data, and through it we can find which of the different attributes are related. This can be useful in promoting sales and in planning the inventory levels of different products.
It can also be used for finding the subspace clusters in spatial databases, and some useful decisions can be taken based on the subspace clusters identified [2]. The technique used here identifies the redundant dimensions, which create noise in the data, in order to identify the clusters. The first step consists of drawing or plotting the data points in all dimensions. In the second step, the projections of all data points along each dimension are plotted. In the third step, the unions of projections along each dimension are plotted using all possible combinations among all the dimensions, and finally the union of all projections along all dimensions is analyzed; it will show the contribution of each dimension in identifying the cluster, which will be represented by the weight of projection. If any given dimension contributes very little to building the weight of projection, that dimension can be considered redundant, which means this dimension is not so important for identifying the clusters in the given data. The details of this strategy will be covered in later chapters.

2 Data Mining

2.1 What is Data Mining?

Data mining is the process of analyzing data from different perspectives and summarizing it to obtain useful information. The information can be used for many useful purposes such as increasing revenue, cutting costs, etc. The data mining process also finds the hidden knowledge and relationships within the data that were not known while the data was being recorded. Describing the data is the first step in data mining, followed by summarizing its attributes (such as standard deviation, mean, etc.). After that, the data is reviewed using visual tools like charts and graphs, and then meaningful relations are determined. In the data mining process, the steps of collecting, exploring, and selecting the right data are critically important.
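
Returning briefly to the projection procedure outlined at the end of the introduction, the idea of scoring each dimension by the weight of its projection can be sketched as follows. This is only an illustration under stated assumptions: the text does not define the weight at this point, so the sketch substitutes a histogram-entropy measure (0 for an evenly spread, noise-like projection, approaching 1 when the projected points concentrate in a few intervals). The function names, the bin count, and the threshold of 0.1 are hypothetical, and the unions of projections over combinations of dimensions are not covered.

```python
import math


def projection_weight(values, bins=10):
    """Score how unevenly one dimension's projected values fill their range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant dimension cannot separate clusters, so weight 0.
        return 0.0
    counts = [0] * bins
    for v in values:
        # Map the value to a bin index; the maximum value falls in the last bin.
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c)
    # 0.0 = evenly spread (noise-like); near 1.0 = concentrated in few bins.
    return 1.0 - entropy / math.log(bins)


def projection_weights(points, bins=10):
    """Project the points onto each dimension and score every projection."""
    dims = len(points[0])
    return [projection_weight([p[d] for p in points], bins) for d in range(dims)]


def redundant_dimensions(points, threshold=0.1, bins=10):
    """Return indices of dimensions whose weight falls below the threshold."""
    weights = projection_weights(points, bins)
    return [d for d, w in enumerate(weights) if w < threshold]


# Toy data: dimension 0 has two tight groups, dimension 1 is evenly spread.
points = [(0.2 if i % 2 == 0 else 0.8, i / 99) for i in range(100)]
print(projection_weights(points))
print(redundant_dimensions(points))
```

On the toy data, the evenly spread dimension receives a weight close to zero and is the one flagged as redundant. The bin count must be at least 2, and both it and the threshold would need tuning on real data.
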
Users can analyze data from different dimensions, categorize it, and summarize it. Data mining finds the correlations or patterns among the fields in large databases. Data mining has great potential to help companies focus on the important information in their data stores. It can predict future trends and behaviors and allows businesses to make more proactive and knowledge-driven decisions. It can answer business questions that were traditionally too time consuming to resolve. It scours databases for hidden patterns, finding predictive information that experts may miss because it lies beyond their expectations. Data mining is normally used to transform data into information or knowledge. It is commonly used in a wide range of profiling practices such as marketing, fraud detection, and scientific discovery. Many companies already collect and refine their data. Data mining techniques can be implemented on existing platforms to enhance the value of information resources. Data mining tools can analyze massive databases to deliver answers to such questions. Some other terms have a meaning similar to data mining, such as knowledge mining, knowledge extraction, or pattern analysis. Data mining can also be treated as Knowledge Discovery from Data (KDD). Some people simply regard data mining as an essential step in knowledge discovery from large data.
The process of knowledge discovery from data contains the following steps.

* Data cleaning (removing noise and inconsistent data)
* Data integration (combining multiple data sources)
* Data selection (retrieving the data relevant to the analysis task from the database)
* Data transformation (transforming the data into forms appropriate for mining by performing summary or aggregation operations)
* Data mining (applying intelligent methods in order to extract data patterns)
* Pattern evaluation (identifying the truly interesting patterns representing knowledge, based on some measures)
* Knowledge representation (using representation techniques to present the mined knowledge to the user)

2.2 Data

Data can be any type of facts, text, images, or numbers that can be processed by a computer. Today's organizations are accumulating large and growing amounts of data in different formats and in different databases. It can include operational or transactional data, which covers costs, sales, inventory, payroll, and accounting. It can also include non-operational data such as industry sales and forecast data. It can also include metadata, which is data about the data itself, such as logical database design and data dictionary definitions.

2.3 Information

Information can be retrieved from the data via the patterns, associations, or relationships that may exist in the data. For example, retail point-of-sale transaction data can be analyzed to yield information about which products are being sold and when.

2.4 Knowledge

Knowledge can be retrieved from information via historical patterns and future trends. For example, analysis of retail supermarket sales data from the point of view of promotional efforts can provide knowledge of customers' buying behavior.
Hence the items that are most susceptible to promotional efforts can easily be determined by the manufacturer.

2.5 Data warehouse

Advances in data capture, processing power, data transmission, and storage technologies are enabling industry to integrate its various databases into data warehouses. The process of centralizing and retrieving the data is called data warehousing. Data warehousing is a new term, but the concept is a bit older. A data warehouse is storage of a massive amount of data in electronic form. Data warehousing is used to represent an ideal way of maintaining a central repository for all organizational data. The purpose of a data warehouse is to maximize user access and analysis. The data from different data sources are extracted, transformed, and then loaded into the data warehouse. Users / clients can generate different types of reports and can do business analysis by accessing the data warehouse.

Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It allows these organizations to evaluate associations between certain internal and external factors. Product positioning, price, or staff skills are examples of internal factors. Examples of external factors are economic indicators, customer demographics, and competition. It also allows them to calculate the impact on sales, corporate profits, and customer satisfaction. Furthermore, it allows them to drill down from summarized information to the detailed transactional data. Given databases of sufficient size and quality, data mining technology can generate new business opportunities through its capabilities.

Data mining usually automates the procedure of searching for predictive information in huge databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data very quickly. Targeted marketing is an example of a predictive problem.
Data mining uses data on previous promotional mailings to recognize the targets most likely to maximize the return on investment in future mailings. The tools used in data mining traverse huge databases and discover previously unseen patterns in a single step. Analysis of retail sales data to recognize seemingly unrelated products that are usually purchased together is an example of this. Other pattern discovery problems include identifying fraudulent credit card transactions and identifying anomalous data that could indicate data entry errors. When data mining tools are run on high performance parallel processing systems, they are able to analyze huge databases in a very short time. Faster processing means that users can experiment with the data in more detail in order to understand complex data. High speed and quick response make it practical for users to analyze huge amounts of data. Huge databases, in turn, give improved and better predictions.

2.6 Descriptive and Predictive Data Mining

Descriptive data mining aims to find patterns in the data that provide some information about what the data contains. It describes patterns in existing data and is generally used to create meaningful subgroups such as demographic clusters. Examples of such descriptions are summaries and visualization, clustering, and link analysis. Predictive data mining is used to forecast explicit values, based on patterns determined from known results. For example, from a database of clients who have already responded to a specific offer, a model can be built that predicts which prospects are most likely to respond to the same offer. The term is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest.
For example, a credit card company may want to engage in predictive data mining to derive a (trained) model, or set of models, that can quickly identify transactions with a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to determine the clusters or segments of customers), in which case drill-down descriptive and exploratory methods need to be applied. Predictive data mining is goal oriented. It can be decomposed into the following major tasks.

* Data preparation
* Data reduction
* Data modeling and prediction
* Case and solution analysis

2.7 Text Mining

Text mining, sometimes also called text data mining, is roughly equivalent to text analytics. Text mining is the process of deriving high quality information from text. High quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning. It usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interestingness. Text categorization, concept/entity extraction, text clustering, sentiment analysis, production of granular taxonomies, entity relation modeling, and document summarization are typical text mining tasks.

Text mining is also known as the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. Linking together the extracted information is the key element in creating new facts or new hypotheses to be examined further by more conventional means of experimentation.
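
The three stages named above (structuring the input text, deriving patterns, and evaluating the output) can be illustrated with a deliberately small sketch. It assumes that plain term counting stands in for pattern derivation, which is far simpler than the statistical pattern learning mentioned in the text; the function names, the stopword list, and the two sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Structure raw text into lowercase word tokens (a very crude parser)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Derive a simple pattern: how often each term occurs across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def top_terms(documents, k=3, stopwords=frozenset({"the", "in", "of", "a", "and"})):
    """Evaluate the output by keeping the k most frequent non-stopword terms."""
    counts = term_frequencies(documents)
    ranked = [(t, c) for t, c in counts.most_common() if t not in stopwords]
    return ranked[:k]


# Hypothetical two-document corpus, used only for illustration.
docs = [
    "Data mining finds patterns in data.",
    "Text mining finds patterns in text.",
]
print(top_terms(docs))
```

A real pipeline would replace the regular expression with proper parsing and add the linguistic features described above; the sketch only shows how the stages connect.
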
In text mining, the goal is to discover unknown information, something that no one yet knows and so could not yet have written down. The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts. Databases are designed and developed for programs to process automatically; text is written for people to read. Most researchers think that it will take a full-fledged simulation of how the mind works before programs that read the way people do can be written.

2.8 Web Mining

Web mining is the technique used to discover and extract information from web documents and services automatically. The interest of various research communities, the tremendous growth of information resources on the web, and the recent interest in e-commerce have made this a very large area of research. Web mining can usually be decomposed into the following subtasks.

* Resource finding: fetching the intended web documents.
* Information selection and pre-processing: automatically selecting and pre-processing specific information from the fetched web resources.
* Generalization: automatically discovering general patterns at individual web sites and across multiple web sites.
* Analysis: validation and interpretation of the mined patterns.

Web mining can be categorized into three main areas of interest, based on which part of the web is to be mined: web content mining, web structure mining, and web usage mining. Web content mining describes the discovery of useful information from web contents, data, and documents [10]. In the past, the internet consisted only of different types of services and data resources. But today most data is accessible over the internet; even digital libraries are available on the web. Web contents consist of several types of data including text, images, audio, video, metadata, and hyperlinks.
Most companies are trying to transform their business and services into electronic form and put them on the web. As a result, company databases that previously resided on legacy systems are now accessible over the web. Thus employees, business partners, and even end clients are able to access a company's databases over the web. Users access applications over the web through web interfaces, which is why most companies are trying to move their business onto the web: the internet can connect to any other computer anywhere in the world [11]. Some web contents are hidden and hence cannot be indexed. Dynamically generated data, such as the results of queries against a database, and private data fall in this area. Web contents include unstructured data such as free text, semi-structured data such as HTML, and fully structured data such as data in tables or database-generated web pages. However, unstructured text makes up most web content. The work on web content mining is mostly done from two points of view: the information retrieval (IR) view and the database (DB) view. From the IR view, web content mining assists and improves information finding or filtering for the user. From the DB view, web content mining models the data on the web and integrates it so that queries more sophisticated than keyword searches can be performed [10].

In web structure mining, we are more concerned with the structure of the hyperlinks within the web itself, which can be called the inter-document structure [10]. It is closely related to web usage mining [14]. Pattern detection and graph mining are essentially related to web structure mining. Link analysis techniques can be used to determine the patterns in the graph. Search engines like Google usually use web structure mining. For example, the links are mined and one can then determine the web pages that point to a particular web page.
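
Determining which pages point to a given page amounts to inverting the hyperlink graph. The sketch below assumes the links have already been extracted into a mapping from each page to the pages it links to; the function name and the four-page site are hypothetical.

```python
def inbound_links(outlinks):
    """Invert a page -> linked-pages mapping into page -> pages linking to it."""
    inbound = {page: set() for page in outlinks}
    for source, targets in outlinks.items():
        for target in targets:
            # setdefault also covers targets that never appear as a source.
            inbound.setdefault(target, set()).add(source)
    return inbound


# Hypothetical four-page site; the page names are made up for illustration.
web = {
    "home": ["products", "about"],
    "products": ["home", "contact"],
    "about": ["home", "contact"],
    "contact": ["home"],
}
links_in = inbound_links(web)
print(sorted(links_in["home"]))
```

Counting the entries in each set gives the number of inbound links per page, which is the simplest signal a link analysis can use.
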
When a string is searched, the web page with the largest number of links pointing to it may come first in the list. That is why web pages are listed based on rank, which is calculated from the rank of the web pages pointing to them [14]. Based on web structural data, web structure mining can be divided into two categories. The first kind of web structure mining deals with extracting patterns from the hyperlinks in the web. A hyperlink is a structural component that connects a web page to a different web page or location. The other kind of web structure mining deals with the document structure, using the tree-like structure to analyze and describe the HTML or XML tags within the web pages.

With the continuous growth of e-commerce, web services, and web applications, the volume of clickstream and user data collected by web-based organizations in their daily operations has increased. Organizations can analyze such data to determine the lifetime value of clients, design cross-marketing strategies, etc. [13]. Web usage mining deals with the data generated by users' clickstreams. Web usage data includes web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions, transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from interaction [10]. So web usage mining is the most important task of web mining [12]. Weblog databases can provide rich information about web dynamics. In web usage mining, web log records are mined to discover user access patterns, through which potential customers can be identified, the quality of internet services can be enhanced, and web server performance can be improved. Many techniques can be developed to implement web usage mining, but it is important to know that the success of such applications depends on what and how much valid and reliable knowledge can be discovered from the log data.
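
As a small illustration of discovering access patterns from log records, the sketch below counts how often users move directly from one page to another. It assumes the log has already been reduced to (user, page) pairs in time order, which real server logs are not; the record layout and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def page_transitions(log_records):
    """Count how often users move directly from one page to another."""
    pages_by_user = defaultdict(list)
    for user, page in log_records:  # records are assumed to be in time order
        pages_by_user[user].append(page)
    transitions = Counter()
    for pages in pages_by_user.values():
        # Pair each visited page with the page the same user visited next.
        transitions.update(zip(pages, pages[1:]))
    return transitions


# Hypothetical, already simplified log: (user id, requested page).
log = [
    ("u1", "/home"), ("u2", "/home"), ("u1", "/products"),
    ("u2", "/products"), ("u1", "/cart"), ("u2", "/home"),
]
print(page_transitions(log).most_common(1))
```

Frequent transitions of this kind are one simple form of the sequential patterns that web usage mining looks for.
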
Most often, the web logs are cleaned, condensed, and transformed before any useful and significant information is extracted from them. Web mining can be performed on web log records to find association patterns, sequential patterns, and trends in web access. The overall web usage mining process can be divided into three interdependent stages: data collection and pre-processing, pattern discovery, and pattern analysis [13]. In the data collection and pre-processing stage, the raw data is collected, cleaned, and transformed into a set of user transactions that represent the activities of each user during visits to the web site. In the pattern discovery stage, statistical, database, and machine learning operations are performed to retrieve hidden patterns representing the typical behavior of users, as well as summary statistics on web resources, sessions, and users.

3 Classification

3.1 What is Classification?

As the quantity and variety of available data increase, robust, efficient, and versatile data categorization techniques are needed for exploration [16]. Classification is a method of assigning class labels to patterns. It is a data mining methodology used to predict group membership for data instances. For example, one may want to use classification to predict whether the weather on a specific day will be sunny, cloudy, or rainy. The data mining techniques that are used to differentiate similar kinds of data objects / points from others are called clustering. It uses the attribute values found in the data of one class to distinguish it from other types or classes. Data classification is mainly concerned with the treatment of large datasets. In classification we build a model by analyzing the existing data, describing the characteristics of the various classes of data. We can use this model to predict the class/type of new data.
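
The idea of building a model from existing data and using it to predict the class of new data can be sketched with a nearest-centroid classifier. This particular method is an assumption chosen for brevity, not one named in the text: each class is described by the mean of its training points, and a new point receives the label of the closest mean. The weather measurements below are made up.

```python
import math
from collections import defaultdict


def build_model(training_data):
    """Describe each class by the mean (centroid) of its training points."""
    grouped = defaultdict(list)
    for features, label in training_data:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(model, features):
    """Assign the label of the closest class centroid."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Made-up observations: (humidity %, cloud cover %) -> weather label.
training = [
    ((30, 10), "sunny"), ((35, 20), "sunny"),
    ((60, 70), "cloudy"), ((65, 80), "cloudy"),
    ((90, 95), "rainy"), ((95, 100), "rainy"),
]
model = build_model(training)
print(predict(model, (33, 15)))
```

Here `build_model` plays the role of describing the characteristics of each class, and `predict` applies that description to an unseen observation. `math.dist` requires Python 3.8 or later.
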
Classification is a supervised machine learning procedure in which individual items are placed into groups based on quantitative information on one or more characteristics of the items. Decision trees and Bayesian networks are examples of classification methods. One type of classification is clustering. This is the process of finding similar data objects / points within the given dataset. The similarity can be defined in terms of distance measures or any other parameter, depending on the need and the given data.

Classification is an ancient term as well as a modern one, since the classification of animals, plants, and other physical objects is still valid today. Classification is a way of thinking about things rather than a study of the things themselves, so it draws its theory and application from the complete range of human experience and thought [18]. In a broader view, classification can cover medical patients grouped by disease, a set of images containing red roses from an image database, a set of documents describing classification from a document/text database, equipment malfunctions grouped by cause, and loan applicants grouped by their likelihood of payment, etc. In the last case, for example, the problem is to predict a new applicant's loan eligibility given old data about customers. There are many techniques for data categorization / classification. The most common are decision tree classifiers and Bayesian classifiers.

3.2 Types of Classification

There are two types of classification: supervised classification and unsupervised classification. Supervised learning is a machine learning technique for inferring a function from training data. The training data consists of pairs of input objects and their desired outputs. The output of the function can be a continuous value, in which case the task is called regression, or it can be a class label for the input object, in which case the task is called classification.
The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e., pairs of input and target output). To achieve this goal, the learner needs to generalize from the presented data to unseen situations in a reasonable way.

Unsupervised learning is a class of problems in machine learning in which one seeks to determine how the data are organized. It is distinguished from supervised learning in that the learner is given only unlabeled examples. Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also covers many other techniques that are used to summarize and explain key features of the data. One form of unsupervised learning is clustering, which will be covered in the next chapter. Blind source separation based on Independent Component Analysis is another example. Neural network models, adaptive resonance theory, and self-organizing maps are the most commonly used unsupervised learning algorithms. There are many techniques for implementing supervised classification. We will discuss the two most commonly used: decision tree classifiers and naive Bayesian classifiers.

3.2.1 Decision Trees Classifier

There are many alternatives for representing classifiers. The decision tree is probably the most widely used approach for this purpose. It is one of the most widely used supervised learning methods for data exploration. It is easy to use, can be represented as if-then-else statements/rules, and works well even with noisy data [16]. Decision trees use a tree-like graph or model of decisions and their possible consequences, including resource costs, chance event outcomes, and utilities. Decision trees are most commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal.
In machine learning and data mining, a decision tree is used as a predictive model, meaning a mapping from observations about an item to conclusions about its target value. More descriptive names for such tree models are classification tree or regression tree. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning, or simply decision trees. Decision trees are a simple but powerful form of multiple variable analysis [15]. Classification is done by tree-like structures that apply a different test criterion for a variable at each of the nodes. New leaves are generated based on the results of the tests at the nodes. A decision tree is a supervised learning system in which classification rules are constructed from the tree. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments. A decision tree tries to find a strong relationship between input and target values within the dataset [15].

In classification tasks, decision trees usually visualize the steps that should be taken to arrive at a classification. Every decision tree starts with a parent node called the root node, which is considered to be the parent of every other node. Each node in the tree evaluates an attribute in the data and decides which path to follow. Typically the decision test is a comparison of a value against some constant. Classification with a decision tree is done by traversing from the root node to a leaf node. Decision trees are able to represent and classify diverse types of data. The simplest form of data is numerical data, which is also the most familiar. Organizing nominal data is also required in many situations. Nominal quantities are normally represented by a discrete set of symbols.
For example, a weather condition can be described in either nominal or numeric fashion. Temperature can be quantified by saying that it is eleven degrees Celsius or fifty-two degrees Fahrenheit. The terms cool, mild, cold, warm, or hot can also be used. The former is a type of numeric data, while the latter is an example of nominal data. More precisely, the example of cool, mild, cold, warm, and hot is a special type of nominal data, known as ordinal data. Ordinal data usually carries an implicit assumption of ordered relationships among the values. In the weather example, purely nominal descriptions like rainy, overcast, and sunny can also be added. These values have no ordering or distance measures among each other.

Decision trees are trees in which each node is a question, each branch is an answer to a question, and each leaf is a result. Here is an example of a decision tree.

Roughly, the idea is that we make different decisions based on the number of items we have in stock. If we do not have much, we buy at any cost. If we have a lot of items, we only buy if the price is low. If there are fewer than 10 items in stock, buy all if the unit price is less than 10; otherwise buy only 10 items. If there are 10 to 40 items in stock, check the unit price: if it is less than 5, buy only 5 items; otherwise there is no need to buy anything expensive, since the stock is already good. If there are more than 40 items in stock, buy 5 if and only if the price is less than 2; otherwise there is no need to buy items that are too expensive. In this way decision trees help us make a decision at each level. Here is another example of a decision tree, representing the risk factor associated with rash driving.

The root node at the top of the tree structure shows the feature that is split first for the highest discrimination. The internal nodes show decision rules on one or more attributes, while the leaf nodes are class labels.
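
Both example trees reduce to nested if-then-else rules, which the following sketch writes out. The stock thresholds are taken from the description above; the driving-risk thresholds are the ones given in the description of that tree (under 20, over 30, and car type in between). The function names and return labels are hypothetical, and treating ages of exactly 20 and 30 as part of the middle band is an assumption, since the text leaves those boundaries open.

```python
def purchase_decision(stock_items, unit_price):
    """Stock example: decide how much to buy from stock level and unit price."""
    if stock_items < 10:
        return "buy all" if unit_price < 10 else "buy 10"
    if stock_items <= 40:
        return "buy 5" if unit_price < 5 else "buy nothing"
    return "buy 5" if unit_price < 2 else "buy nothing"


def driving_risk(age, car_type):
    """Rash-driving example: risk from age and, in the middle band, car type."""
    if age < 20:
        return "very high"
    if age > 30:
        return "very low"
    # Assumption: ages of exactly 20 and 30 are treated as the middle band.
    if car_type == "sports":
        return "high"
    if car_type == "family":
        return "low"
    # The text names only sports and family cars, so anything else is rejected.
    raise ValueError(f"unknown car type: {car_type!r}")


print(purchase_decision(stock_items=25, unit_price=4))
print(driving_risk(age=25, car_type="sports"))
```

Each `if` corresponds to an internal node of the tree and each returned label to a leaf, so classifying an input means following one path from the root to a leaf.
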
A person having age less than 20 has very high risk while a person having age greater than 30 has a very low risk. A middle category a person having age greater than 20 but less than 30 depend upon another attribute which is car type. If car type is of sports then there is again high risk pertain while if family car is used then there is low risk involved.In the field of sciences engineering and in the applied areas including business intelligence and data mining, many useful features are being introduced as the result of evolution of decision trees.* With the help of transformation in decision trees, the volume of data can be reduced into more compact form that preserves the major characteristicIdentifying Clusters in High Dimensional DataIdentifying Clusters in High Dimensional DataAsk those who remember, are mindful if you do not know). (Holy Quran, 643)Removal Of Redundant Dimensions To Find Clusters In N-Dimensional Data Using Subspace ClusteringAbstractThe data mining has emerged as a powerful tool to extract knowledge from huge databases. Researchers have introduced several machine learning algorithms to explore the databases to discover information, hidden patterns, and rules from the data which were not known at the data recording time. Due to th e remarkable developments in the storage capacities, processing and powerful algorithmic tools, practitioners are developing new and improved algorithms and techniques in several areas of data mining to discover the rules and relationship among the attributes in simple and complex higher dimensional databases. Furthermore data mining has its implementation in large variety of areas ranging from banking to marketing, engineering to bioinformatics and from investment to risk analysis and fraud detection. Practitioners are analyzing and implementing the techniques of artificial neural networks for classification and regression problems because of accuracy, efficiency. 
The aim of this short research project is to develop a way of identifying the clusters in high dimensional data, as well as the redundant dimensions that create noise and hinder the identification of those clusters. The technique used in this project relies on the strength of the projections of the data points along the dimensions: the intensity of projection along each dimension is measured in order to find clusters and redundant dimensions in high dimensional data.

1 Introduction

In numerous scientific settings, engineering processes, and business applications, ranging from experimental sensor data and process control data to telecommunication traffic observation and financial transaction monitoring, huge amounts of high-dimensional measurement data are produced and stored. While sensor equipment and large storage devices are getting cheaper day by day, data analysis tools and techniques lag behind. Clustering methods are common solutions to unsupervised learning problems, where neither expert knowledge nor helpful annotation is available for the data. In general, clustering groups the data objects so that similar objects end up together in a cluster, whereas objects from different clusters are highly dissimilar. However, it is often observed that clustering discloses almost no structure even when it is known that there must be groups of similar objects. In many cases, the reason is that the cluster structure is determined by only some subsets of the space's dimensions, and the many additional dimensions contribute nothing but noise that hinders the discovery of the clusters within the data. As a solution to this problem, clustering algorithms are applied to the relevant subspaces only. Immediately, the new question is how to determine the relevant subspaces among the dimensions of the full space.
Faced with the power set of the set of dimensions, a brute-force trial of all subsets is infeasible, because their number grows exponentially with the original dimensionality.

In high dimensional data, as the number of dimensions increases, visualizing and representing the data becomes more difficult, and at some point the added dimensions create a bottleneck. More dimensions mean more visualization and representation problems. As the dimensions increase, the data within those dimensions appears to disperse towards the corners of the space. Subspace clustering addresses this by solving two problems in parallel. It addresses the problem of identifying the relevant subspaces, so that the remaining dimensions can be marked as redundant in high dimensional data. It also addresses the problem of finding the cluster structures within the dataset that become apparent in these subspaces. Subspace clustering is an extension of traditional clustering that automatically finds the clusters present in subspaces of a high dimensional data space. These subspaces allow the data points to be clustered better than in the original space, and the approach works even when the curse of dimensionality occurs. Most clustering algorithms have been designed to discover clusters in the full dimensional space, so they are not effective at identifying clusters that exist within a subspace of the original data space. Many clustering algorithms also produce results that depend on the order in which the input records were processed [2].

Subspace clustering can identify the different clusters within subspaces of a huge amount of sales data, and through it we can find which of the different attributes are related. This can be useful in promoting sales and in planning the inventory levels of different products. It can also be used for finding subspace clusters in spatial databases, and useful decisions can be taken based on the subspace clusters identified [2].
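To make the earlier remark about brute-force search concrete, here is a minimal Python sketch that counts and enumerates the candidate subspaces of a d-dimensional space. The function names count_subspaces and enumerate_subspaces are hypothetical and chosen only for illustration; the sketch simply shows the 2^d - 1 growth of non-empty dimension subsets and is not part of the technique described in this essay.

```python
from itertools import combinations


def count_subspaces(num_dims: int) -> int:
    """Number of non-empty subsets of num_dims dimensions: 2**d - 1."""
    return 2 ** num_dims - 1


def enumerate_subspaces(num_dims: int):
    """Yield every non-empty subset of dimension indices, smallest subsets first."""
    dims = range(num_dims)
    for size in range(1, num_dims + 1):
        yield from combinations(dims, size)


if __name__ == "__main__":
    # For 3 dimensions the subsets can still be listed by hand.
    print(list(enumerate_subspaces(3)))
    # For larger d the count alone shows why exhaustive search is hopeless.
    for d in (10, 20, 50):
        print(d, count_subspaces(d))
```

With only 20 dimensions there are already more than a million candidate subspaces, so running a clustering algorithm on each of them is not practical.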
The technique used here for identifying the redundant dimensions that create noise in the data, and thereby for identifying the clusters, starts by plotting the data points in all dimensions. In the second step, the projections of all data points along each dimension are plotted. In the third step, the unions of the projections are plotted for all possible combinations of dimensions, and finally the union of all projections along all dimensions is plotted and analyzed. This shows the contribution of each dimension to identifying the clusters, which is represented by the weight of projection. If a given dimension contributes very little to the weight of projection, that dimension can be considered redundant, which means it is not important for identifying the clusters in the given data. The details of this strategy will be covered in later chapters.

2 Data Mining

2.1 What is Data Mining?

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. The information can be used for many purposes, such as increasing revenue or cutting costs. The data mining process also finds hidden knowledge and relationships within the data that were not known when the data was recorded. Describing the data is the first step in data mining, followed by summarizing its attributes (such as mean and standard deviation). After that, the data is reviewed using visual tools such as charts and graphs, and then meaningful relations are determined. In the data mining process, the steps of collecting, exploring, and selecting the right data are critically important. Users can analyze data from different dimensions, categorize it, and summarize it. Data mining finds the correlations or patterns among the fields in large databases.

Data mining has great potential to help companies focus on the most important information in their data warehouse.
It can predict future trends and behaviors and allows businesses to make more proactive, knowledge-driven decisions. It can answer business questions that were traditionally too time consuming to resolve. It scours databases for hidden patterns, finding predictive information that experts may miss because it lies beyond their expectations. Data mining is normally used to transform data into information or knowledge. It is commonly used in a wide range of profiling practices such as marketing, fraud detection, and scientific discovery. Many companies already collect and refine their data. Data mining techniques can be implemented on existing platforms to enhance the value of information resources. Data mining tools can analyze massive databases to deliver answers to such questions.

Some other terms carry a meaning similar to data mining, such as knowledge mining, knowledge extraction, or pattern analysis. Data mining can also be treated as Knowledge Discovery from Data (KDD). Some people simply regard data mining as an essential step in knowledge discovery from large data. The process of knowledge discovery from data contains the following steps.

* Data cleaning (removing noise and inconsistent data)
* Data integration (combining multiple data sources)
* Data selection (retrieving the data relevant to the analysis task from the database)
* Data transformation (transforming the data into forms appropriate for mining by performing summary or aggregation operations)
* Data mining (applying intelligent methods in order to extract data patterns)
* Pattern evaluation (identifying the truly interesting patterns representing knowledge, based on some measures)
* Knowledge representation (using representation techniques to present the mined knowledge to the user)

2.2 Data

Data can be any kind of fact, text, image, or number that can be processed by a computer.
Today's organizations are accumulating large and growing amounts of data in different formats and in different databases. This can include operational or transactional data, such as costs, sales, inventory, payroll, and accounting. It can also include nonoperational data, such as industry sales and forecast data. It can also include metadata, which is data about the data itself, such as logical database design and data dictionary definitions.

2.3 Information

Information can be retrieved from data via the patterns, associations, or relationships that exist in the data. For example, retail point-of-sale transaction data can be analyzed to yield information about which products are being sold and when.

2.4 Knowledge

Knowledge can be retrieved from information via historical patterns and future trends. For example, analyzing retail supermarket sales data in the light of promotional efforts can provide knowledge of customer buying behavior. Hence a manufacturer can easily determine which items are most susceptible to promotional efforts.

2.5 Data warehouse

Advances in data capture, processing power, data transmission, and storage technologies are enabling organizations to integrate their various databases into data warehouses. The process of centralizing and retrieving the data is called data warehousing. Data warehousing is a new term, but the concept is somewhat older. A data warehouse is a store of a massive amount of data in electronic form. Data warehousing represents an ideal way of maintaining a central repository for all organizational data. The purpose of a data warehouse is to maximize user access and analysis. Data from different sources is extracted, transformed, and then loaded into the data warehouse.
Users and clients can generate different types of reports and carry out business analysis by accessing the data warehouse.

Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It allows these organizations to evaluate associations between certain internal and external factors. Product positioning, price, and staff skills are examples of internal factors. Examples of external factors are economic indicators, customer demographics, and competition. It also allows them to calculate the impact on sales, corporate profits, and customer satisfaction. Furthermore, it allows them to drill down from summary information to the detailed transactional data. Given databases of sufficient size and quality, data mining technology can generate new business opportunities.

Data mining automates the search for predictive information in huge databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data very quickly. Targeted marketing is an example of a predictive problem. Data mining uses data on previous promotional mailings to recognize the targets most likely to maximize return on investment in future mailings. Data mining tools also traverse huge databases and discover previously unseen patterns in a single step. An example is the analysis of retail sales data to recognize apparently unrelated products that are usually purchased together. Other pattern discovery problems include identifying fraudulent credit card transactions and identifying irregular data that could indicate data entry errors. When data mining tools are run on high-performance parallel processing systems, they are able to analyze huge databases in very little time.
Faster processing means that users can experiment with more detailed models to understand complex data. High speed and quick response make it practical for users to examine huge amounts of data. Huge databases, in turn, yield improved predictions.

2.6 Descriptive and Predictive Data Mining

Descriptive data mining aims to find patterns in the data that provide some information about what the data contains. It describes patterns in existing data and is generally used to create meaningful subgroups such as demographic clusters. Descriptions take the form of, for example, summaries and visualization, clustering, and link analysis. Predictive data mining is used to forecast explicit values, based on patterns determined from known results. For example, from a database of clients who have already responded to a specific offer, a model can be built that predicts which prospects are most likely to respond to the same offer. The term is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or a set of models, that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models that can quickly identify transactions with a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g. determining the clusters or segments of customers), in which case drill-down descriptive and exploratory methods need to be applied. Predictive data mining is goal oriented. It can be decomposed into the following major tasks.

* Data Preparation
* Data Reduction
* Data Modeling and Prediction
* Case and Solution Analysis

2.7 Text Mining

Text mining is sometimes also called text data mining, and is more or less equivalent to text analytics. Text mining is the process of extracting or deriving high quality information from text.
High quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features, the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interestingness. Text mining tasks include text categorization, concept/entity extraction, text clustering, sentiment analysis, production of rough taxonomies, entity relation modeling, and document summarization.

Text mining is also described as the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. Linking the extracted information together is the key element in creating new facts or new hypotheses to be examined further by more conventional means of experimentation. In text mining, the goal is to discover unknown information: something that no one yet knows and so could not yet have written down. The difference between ordinary data mining and text mining is that in text mining the patterns are retrieved from natural language text instead of from structured databases of facts. Databases are designed for programs to process automatically; text is written for people to read. Most researchers think that a full-fledged simulation of how the brain works will be needed before programs that read the way people do can be written.

2.8 Web Mining

Web mining is the technique used to automatically extract and discover information from web documents and services.
The interest of various research communities, the tremendous growth of information resources on the Web, and the recent interest in e-commerce have made this a very large area of research. Web mining can usually be decomposed into the following subtasks.

* Resource finding: fetching the intended web documents.
* Information selection and pre-processing: automatically selecting and preprocessing specific information from the fetched web resources.
* Generalization: automatically discovering general patterns at individual websites and across multiple websites.
* Analysis: validation and explanation of the mined patterns.

Web mining can be categorized into three main areas of interest, based on which part of the Web is to be mined: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining describes the discovery of useful information from web contents, data, and documents [10]. In the past, the internet consisted only of different types of services and data resources. Today most data is available over the internet, and even digital libraries are available on the Web. Web content consists of several types of data, including text, images, audio, video, metadata, and hyperlinks. Most companies are trying to transform their business and services into electronic form and put them on the Web. As a result, company databases that previously resided on legacy systems are now accessible over the Web. Thus employees, business partners, and even end clients are able to access the company's databases over the Web. Users access applications through their web interfaces, which is why most companies are trying to move their business onto the Web: the internet is capable of connecting to any other computer anywhere in the world [11]. Some web content is hidden and hence cannot be indexed. Data generated dynamically from the results of database queries, as well as private data, falls into this area.
Unstructured data such as free text, semi-structured data such as HTML, and fully structured data such as tables or database-generated web pages can all be considered in this category. However, unstructured text makes up most web content. Work on web content mining is mostly done from two points of view: the information retrieval (IR) view and the database (DB) view. From the IR view, web content mining assists and improves information finding or filtering for the user. From the DB view, web content mining models the data on the Web and integrates it so that queries more sophisticated than keyword searches can be performed [10].

In Web Structure Mining, we are more concerned with the structure of the hyperlinks within the Web itself, which can be called the inter-document structure [10]. It is closely related to web usage mining [14]. Pattern detection and graph mining are essentially related to web structure mining. Link analysis techniques can be used to determine the patterns in the graph. Search engines such as Google typically use web structure mining. For example, the links are mined, and one can then determine the web pages that point to a particular web page. When a string is searched, the web page with the largest number of links pointing to it may come first in the list. That is why web pages are listed by rank, where the rank of a page is calculated from the ranks of the web pages pointing to it [14]. Based on the structural data used, web structure mining can be divided into two categories. The first kind deals with extracting patterns from the hyperlinks in the Web. A hyperlink is a structural component that connects a web page to a different web page or a different location.
The other kind of web structure mining deals with the document structure, using a tree-like structure to analyze and describe the HTML or XML tags within the web pages.

With the continuous growth of e-commerce, web services, and web applications, the volume of clickstream and user data collected by web-based organizations in their daily operations has increased. Organizations can analyze such data to determine the lifetime value of clients, design cross-marketing strategies, and so on [13]. Web usage mining deals with the data generated by users' clickstreams. Web usage data includes web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions, transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data produced by interaction [10]. Web usage mining is therefore regarded as the most important task of web mining [12]. Weblog databases can provide rich information about web dynamics. In web usage mining, web log records are mined to discover user access patterns, through which potential customers can be identified, the quality of internet services can be enhanced, and web server performance can be improved. Many techniques can be developed to implement web usage mining, but it is important to know that the success of such applications depends on what and how much valid and reliable knowledge can be discovered from the log data. Most often, the web logs are cleaned, condensed, and transformed before any useful and significant information is extracted from them. Web mining can be performed on web log records to find association patterns, sequential patterns, and trends in web access. The overall web usage mining process can be divided into three inter-dependent stages: data collection and pre-processing, pattern discovery, and pattern analysis [13].
In the data collection and pre-processing stage, the raw data is collected, cleaned, and transformed into a set of user transactions that represent the activities of each user during visits to the web site. In the pattern discovery stage, statistical, database, and machine learning operations are performed to retrieve hidden patterns representing the typical behavior of users, as well as summary statistics on web resources, sessions, and users.

3 Classification

3.1 What is Classification?

As the quantity and variety of available data increase, robust, efficient, and versatile data categorization techniques are needed for exploration [16]. Classification is a method of assigning class labels to patterns. It is a data mining methodology used to predict group membership for data instances. For example, one may want to use classification to guess whether the weather on a specific day will be sunny, cloudy, or rainy. The data mining techniques used to separate similar data objects or points from the others are called clustering. Clustering uses the attribute values found in the data of one class to distinguish it from other types or classes. Data classification is mainly concerned with the treatment of large datasets. In classification, we build a model by analyzing the existing data and describing the characteristics of the various classes of data. We can then use this model to predict the class or type of new data. Classification is a supervised machine learning procedure in which individual items are placed in a group based on quantitative information about one or more characteristics of the items. Decision trees and Bayesian networks are examples of classification methods. In the broader sense used here, clustering is one type of classification. It is the process of finding similar data objects or points within the given dataset.
This similarity can be defined in terms of distance measures or any other parameter, depending on the need and the given data.

Classification is an ancient term as well as a modern one, since the classification of animals, plants, and other physical objects is still valid today. Classification is a way of thinking about things rather than a study of the things themselves, so it draws its theory and application from the complete range of human experience and thought [18]. In the bigger picture, classification can cover medical patients grouped by disease, the set of images containing a red rose in an image database, the set of documents describing classification in a document or text database, equipment malfunctions grouped by cause, and loan applicants grouped by their likelihood of payment. In the last case, for example, the problem is to predict a new applicant's loan eligibility given old data about customers. There are many techniques used for data categorization and classification. The most common are decision tree classifiers and Bayesian classifiers.

3.2 Types of Classification

There are two types of classification: supervised classification and unsupervised classification. Supervised learning is a machine learning technique for discovering a function from training data. The training data consists of pairs of input objects and their desired outputs. The output of the function can be a continuous value, in which case the task is called regression, or it can be a class label for the input object, in which case the task is called classification. The task of the supervised learner is to predict the value of the function for any valid input object after having seen a number of training examples (i.e. pairs of input and target output). To achieve this goal, the learner needs to generalize from the presented data to unseen situations in a meaningful way.

Unsupervised learning is a class of machine learning problems in which one seeks to determine how the data are organized.
It is distinguished from supervised learning in that the learner is given only unlabeled examples. Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also covers many other techniques that are used to summarize and explain key features of the data. One form of unsupervised learning is clustering, which will be covered in the next chapter. Blind source separation based on Independent Component Analysis is another example. Neural network models, adaptive resonance theory, and self-organizing maps are among the most commonly used unsupervised learning algorithms. There are many techniques for implementing supervised classification. We will discuss the two that are most commonly used: decision tree classifiers and Naive Bayesian classifiers.

3.2.1 Decision Trees Classifier

There are many alternative ways to represent classifiers. The decision tree is probably the most widely used approach for this purpose. It is one of the most widely used supervised learning methods for data exploration. It is easy to use, can be represented as if-then-else statements or rules, and works well on noisy data too [16]. A decision tree uses a tree-like graph or model of decisions and their possible consequences, including resource costs, chance event outcomes, and utilities. Decision trees are most commonly used in decision analysis and operations research, to help identify the strategy most likely to reach a target. In machine learning and data mining, a decision tree is used as a predictive model, that is, a mapping from observations about an item to conclusions about its target value. More descriptive names for such tree models are classification tree and regression tree. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications.
The machine learning technique for inducing a decision tree from data is called decision tree learning, or simply decision trees. Decision trees are a simple but powerful form of multiple variable analysis [15]. Classification is done by tree-like structures that have different test criteria for a variable at each of the nodes. New leaves are generated based on the results of the tests at the nodes. A decision tree is a supervised learning system in which classification rules are constructed from the tree. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments. A decision tree tries to find a strong relationship between input and target values within the dataset [15].

In classification tasks, decision trees normally visualize the steps that should be taken to reach a classification. Every decision tree starts with a node called the root node, which is considered to be the parent of every other node. Each node in the tree evaluates an attribute in the data and decides which path to follow. Typically the decision test is a comparison of a value against some constant. Classification with a decision tree is done by traversing from the root node down to a leaf node. Decision trees are able to represent and classify diverse types of data. The simplest form of data is numerical data, which is also the most familiar. Organizing nominal data is also required in many situations. Nominal quantities are normally represented by a discrete set of symbols. For example, a weather condition can be described in either nominal or numeric fashion. Temperature can be quantified by saying that it is eleven degrees Celsius or fifty-two degrees Fahrenheit. The terms cool, mild, cold, warm, or hot can also be used. The former is a type of numeric data, while the latter is an example of nominal data.
More precisely, the example of cool, mild, cold, warm, and hot is a special type of nominal data, known as ordinal data. Ordinal data carries an implicit assumption of ordered relationships among the values. In the weather example, purely nominal descriptions such as rainy, overcast, and sunny can also be added. These values have no ordering relationships or distance measures among each other.

Decision trees are trees in which each node is a question, each branch is an answer to a question, and each leaf is a result. Here is an example of a decision tree.

Roughly, the idea is that we make different purchasing decisions based on the number of items we have in stock. If we do not have much, we buy at any cost. If we have a lot of items, we only buy if the item is inexpensive. If there are fewer than 10 items in stock, then buy all if the unit price is less than 10, otherwise buy only 10 items. If we have 10 to 40 items in stock, then check the unit price: if it is less than 5, buy only 5 items, otherwise there is no need to buy anything expensive since the stock is already good. If we have more than 40 items in stock, then buy 5 if and only if the price is less than 2, otherwise there is no need to buy overly expensive items. In this way decision trees help us make a decision at each level. Here is another example of a decision tree, representing the risk factor associated with rash driving.

The root node at the top of the tree structure shows the feature that is split first for highest discrimination. The internal nodes show decision rules on one or more attributes, while the leaf nodes are class labels. A person younger than 20 has a very high risk, while a person older than 30 has a very low risk. For the middle category, a person older than 20 but younger than 30, the risk depends on another attribute, which is car type.
If the car is a sports car, there is again high risk involved, while if a family car is used, there is low risk involved.

In the sciences, in engineering, and in applied areas including business intelligence and data mining, many useful features are being introduced as a result of the evolution of decision trees.

* With the help of transformation in decision trees, the volume of data can be reduced into a more compact form that preserves the major characteristic
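The two example trees described earlier in this section (stock purchasing and driving risk) can be sketched as nested if-then-else rules, which is the rule form that decision trees are said to support. This is a minimal Python sketch: the function names, the string labels for the leaves, and the handling of boundary values (an age of exactly 20 or 30 is treated as the middle category, and car types other than "sports" and "family" are rejected) are assumptions made for illustration, since the text does not specify them.

```python
def stock_decision(stock_items: int, unit_price: float) -> str:
    """Purchasing tree: the root tests the stock level, the next level tests the price."""
    if stock_items < 10:
        # Low stock: buy in either case, the price only decides how much.
        return "buy all" if unit_price < 10 else "buy 10"
    if stock_items <= 40:
        # Medium stock (10 to 40 items): buy a little, and only if cheap.
        return "buy 5" if unit_price < 5 else "buy nothing"
    # High stock (more than 40 items): buy only if very cheap.
    return "buy 5" if unit_price < 2 else "buy nothing"


def driving_risk(age: int, car_type: str) -> str:
    """Risk tree: the root splits on age, the middle branch splits on car type."""
    if age < 20:
        return "very high"
    if age > 30:
        return "very low"
    # Middle age category (assumed here to include exactly 20 and 30).
    if car_type == "sports":
        return "high"
    if car_type == "family":
        return "low"
    # The text only describes sports and family cars.
    raise ValueError(f"unknown car type: {car_type!r}")


if __name__ == "__main__":
    print(stock_decision(5, 12))       # low stock, expensive item
    print(driving_risk(25, "sports"))  # middle age category, sports car
```

Each call follows one path from the root to a leaf: every if statement is an internal node, and every returned string is a leaf label.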
