Review of the Popular Association-Mining Algorithms - Apriori

Summary
The paper "Review of the Popular Association-Mining Algorithms - Apriori" highlights that algorithm that can mine classification over large databases with great efficiency and scalability is naive Bayes. This is due to its simplicity and efficiency in handling huge data sets…
Introduction

Data mining is the science of extracting important information from databases or data sets (Gaber 2010). The term came into use in the 1990s, although the concept behind it existed decades earlier. It covers areas such as data management, machine learning, statistics, artificial intelligence and pattern recognition (Gupta 2011). This paper reviews five association rule mining algorithms in order to determine which is best suited to mining association rules over distributed databases: Apriori, AprioriTid, Apriori Hybrid, Count Distribution and the Fast Distributed Mining (FDM) algorithm.

Association rule mining is an important data mining technique. It extracts frequent patterns, interesting correlations and associations from the item sets in a transaction database. It is a well-researched technique that aids the discovery of interesting relationships between the variables present in databases (Benoit 2002). Interestingness measures are used to identify strong rules in large databases. The main problem in developing association rules is identifying those whose confidence and support exceed the user-specified minimum confidence and minimum support (Arockiam et al. 2012).

The support (S) of an association rule is the percentage of records containing X∪Y out of the total number of records in the database. For example, a support of 10% means that 1 out of every 10 transactions in the database represents the purchase of that item set (Arockiam et al. 2012):

support(X→Y) = support count of (X∪Y) / total number of transactions in D

The confidence (C) of an association rule is the percentage of transactions containing X∪Y out of the transactions containing X. For example, a confidence of 80% signifies that Y is present in 80% of the transactions containing X (Arockiam et al. 2012):

confidence(X→Y) = support(X∪Y) / support(X)

Example: five baskets drawn from five items, A to E:

BASKET(ABC), BASKET(ACD), BASKET(BCD), BASKET(ADE), BASKET(BCE)

Rule       Support   Confidence
A→B        1/5       1/3
C→A        2/5       2/4
A→C        2/5       2/3
B∧C→D      1/5       1/3

The short sketch below shows how these figures are computed.
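As an illustration, here is a minimal Python sketch that computes the support and confidence figures in the table above from the five baskets. The paper specifies no implementation, so the code and names are illustrative only:

```python
# The five baskets from the example above.
transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    x = set(antecedent)
    return support(x | set(consequent)) / support(x)

for lhs, rhs in [("A", "B"), ("C", "A"), ("A", "C"), ("BC", "D")]:
    print(f"{lhs} -> {rhs}: support {support(set(lhs) | set(rhs)):.2f}, "
          f"confidence {confidence(lhs, rhs):.2f}")
```

Running this reproduces the fractions in the table, e.g. support 2/5 and confidence 2/4 for C→A.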
Most algorithms perform well when data is in a centralised environment; in a distributed database, mining useful patterns is more complicated. Making it feasible would require merging the data sets from the distributed databases, which is very costly. For geographically distributed datasets, therefore, an algorithm with minimal communication costs is most favourable. Most existing distributed and parallel association rule mining algorithms employ a kernel based on the Apriori algorithm.

Apriori algorithm

The name Apriori derives from the Latin 'a priori', meaning 'from what comes before'. The algorithm, which locates the frequent item sets L in a database D, was developed by Agrawal and Srikant in 1994.

Apriori principle: if an item set appears frequently, then all of its subsets also appear frequently, and vice versa. This holds because the support measure has the anti-monotone property: the support of an item set never exceeds the support of any of its subsets. The anti-monotone property also applies to the confidence of rules obtained from the same item set. Apriori performs poorly on large datasets and requires a lot of memory during candidate set generation, which translates into higher cost.

How the algorithm works
Apriori makes use of the downward closure property and searches breadth first: it starts at the bottom of the item-set lattice and moves upward one level at a time.

Essential steps
1. Candidate generation: the underlying logic of Apriori candidate generation is that if an item set X reaches the minimum support, so do all of its subsets.
2. Pruning: removal of the item sets that do not occur frequently, so that they are omitted from support counting. The candidates discarded are those that fail to reach the minimum support (Wets 2003).

Procedure for the discovery of large item sets
Multiple passes are carried out over the data. The first pass counts the support of the individual items; each subsequent pass generates candidates and verifies their actual support. The process terminates once the available large item sets are exhausted. To locate the large k-item sets, the large (k-1)-item sets are combined to create the candidates, and any candidate with an infrequent subset is deleted.

Merits
It performs well on small datasets.
It is less complex, since it treats items equally based only on their presence or absence.

Demerits
It is time consuming and occupies more memory and space during candidate generation.
Multiple scans have to be run to acquire the required candidate sets.
It is inefficient on large datasets.
The uniform minimum support threshold causes problems with infrequent but interesting patterns.

A minimal sketch of these steps appears below.
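The following Python sketch is an illustrative reconstruction of the level-wise candidate generation and pruning described above; it is not code from the paper, and the function and variable names are illustrative:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise (breadth-first) search for frequent item sets.

    transactions: list of sets of items; min_support: minimum fraction.
    Returns a dict mapping each frequent item set (frozenset) to its support.
    """
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: count individual items and keep the frequent singletons.
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]): support(frozenset([i]))
                for i in items if support(frozenset([i])) >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join the frequent (k-1)-item sets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: discard any candidate with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count support and keep only the frequent candidates.
        frequent = {c: support(c) for c in candidates
                    if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result
```

On the five-basket example with a minimum support of 0.4, this returns the frequent singletons plus the pairs {A,C}, {A,D}, {B,C} and {C,D}; the candidate {A,C,D} survives pruning but fails the support count.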
AprioriTid algorithm

How the algorithm works
AprioriTid is similar to Apriori in that it generates candidate item sets level by level, but unlike Apriori it reads the database only once. After the first pass, a derived set C is generated and used in place of the database for counting the support of candidate item sets (Wets 2003). Each member of the set C contains the TID of a transaction, hence the name AprioriTid. Here TID is the unique identifier tied to each transaction, I is the set of items, D is the number of transactions in the database, and T represents the transactions.

Merits
It performs very well when the set C fits in the available memory.

Demerits
Its performance suffers on large problems when the set C does not fit in the available memory.
The candidate item sets generated after the initial pass can be as large as the database itself, which translates into more cost and time.

Apriori Hybrid algorithm
In the initial passes, Apriori performs better than AprioriTid, but that advantage drops considerably in the passes that follow. For this reason a more powerful algorithm, Apriori Hybrid, combines the two: Apriori is preferred in the initial passes and AprioriTid is favoured in the later passes (Wets 2003). The switch between the two consumes a considerable amount of time, but it guarantees more efficiency in terms of cost and performance.

Merits
It performs better than either Apriori or AprioriTid alone.
It is more cost efficient.

Demerits
The switch from Apriori to AprioriTid introduces an extra cost.

Mining association rules over distributed databases
Unfortunately, if the Apriori algorithm is adapted directly to distributed databases, the performance of frequent item set generation is not significantly enhanced. The challenges include synchronisation, workload balancing, communication minimisation, disk I/O minimisation and data decomposition. For an efficient data decomposition technique, the algorithm must ensure suitable load balancing and minimal communication. Since the data sets in a distributed database are not in a single location, the algorithm has to initiate external communication, which accounts for a lower or higher cost in the association rule mining (Cheung 1996). Mining association rules at distributed sites is more cost effective than gathering the data in a centralised location. In addition, distributed databases exhibit data skewness, which the chosen algorithm should handle. Two algorithms that attempt to solve these problems are discussed below.

Count Distribution algorithm
This algorithm applies the sequential Apriori algorithm under the general assumption that the data sets are horizontally partitioned across the available distributed sites. Each site generates its local candidate sets and then determines the counts of the generated candidates (Zaki & Pan 2002).

Merits
It uses a simple communication scheme for the count exchange.

Demerits
It generates a large number of candidate sets and has high communication overhead.
It uses the system memory inefficiently and is therefore more costly to run.

Fast Distributed Mining (FDM) algorithm
At every site, FDM determines the local support counts and prunes the item sets that occur infrequently. Then, instead of all the local counts being broadcast as in Count Distribution, they are sent to a polling site (Zaki & Pan 2002).

Merits
Compared with Count Distribution, FDM has reduced communication overheads and produces fewer candidate sets.

Demerits
Its optimisation mechanism creates additional computational cost.
The message size increases when numerous sites must receive requests from each polling site.

The proposed EMatrix algorithm
The proposed algorithm is EMatrix (Efficiency Matrix), which aims to offer maximum efficiency. The generation of frequent item sets follows the flow shown in Figure A (flowchart not reproduced here). The steps for association rule generation using the frequent item sets are:
1. Select the frequent item sets.
2. Determine the non-empty subsets.
3. Calculate the confidence.
4. Lastly, determine the rules for decision making.

The proposed algorithm assumes that the data sets are partitioned horizontally across the available sites and employs the sequential Apriori algorithm. To develop the EMatrix, the transaction database is first converted into a binary matrix. Then, by applying the Apriori-Gen function, candidates are generated through a sequence of iterations, and in each iteration only the frequent sets are identified. This reduces the complexity that emerges during the generation of association rules. All the partitions have to be scanned once to convert them to EMatrix form.

Advantages
It saves memory, since it uses the locally built EMatrix instead of scanning the database repeatedly.
It reduces communication and overheads.
The messages transmitted over the network are reduced.

A sketch interpreting this binary-matrix construction appears below.
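The paper does not spell out the EMatrix data structure, so the following Python sketch is only one interpretation of the description above: each partition's transaction database becomes a binary matrix, and the support of a candidate is read off the matrix instead of rescanning the raw database.

```python
# Interpretation of the EMatrix idea described above: represent a
# partition as a binary matrix (rows = transactions, columns = items),
# then count support by ANDing the relevant item columns.

transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]

items = sorted({i for t in transactions for i in t})

# Build the binary matrix once; afterwards the raw database is not rescanned.
matrix = {item: [1 if item in t else 0 for t in transactions]
          for item in items}

def support(itemset):
    """Support of an item set computed column-wise from the binary matrix."""
    rows = len(transactions)
    count = sum(1 for r in range(rows) if all(matrix[i][r] for i in itemset))
    return count / rows

print(support({"B", "C"}))  # 0.6 in the basket example
```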
Conclusion on association rule mining
From the above analysis, it is clear that the proposed algorithm should reduce the communication overheads as well as the number of candidate sets generated while mining association rules over distributed databases. This would reduce the computational cost associated with association rule mining and improve overall efficiency.

Examples of the above algorithms in different scenarios
The Fast Distributed Mining algorithm suits the following scenario: there are three branches, and each branch extracts the data for an item (such as milk) and sends it to headquarters. It suits this scenario better than Count Distribution because it has reduced communication overheads and produces fewer candidate sets.
The Count Distribution algorithm suits the following scenario: there are three branches, and each branch extracts data and sends only the counts for an item (such as milk) to headquarters. This is because it uses a simple communication scheme for the count exchange.
The proposed EMatrix algorithm suits the following scenario: there are three branches and no headquarters, and each branch extracts data and sends only the counts for an item (such as milk) to the other branches. It is better suited than both Count Distribution and Fast Distributed Mining because it reduces the number of messages that need to be transmitted over the network, reducing the overall computational cost.

Review of popular classification-mining algorithms

J48 (C4.5)
J48 is similar to ID3 in that it uses the concept of information entropy to derive decision trees from a training data set. The data set is classified in a way that creates a representation of the features or attributes in the sample. It employs two heuristic criteria for ranking possible tests: information gain (which reduces the total entropy of the subsets) and the default gain ratio (which relates the split information to the information gain). To reduce overfitting, the initial tree is then pruned (Manolopoulos 2006); cases that appear infrequently are removed from the observed events in that process. The decision tree used in this method provides an efficient means of displaying a large amount of data accurately.

Naive Bayes (also called simple Bayes or independence Bayes)
This classifier operates on the assumption that the presence, or complete absence, of a particular feature is unrelated to the presence of any other feature. It therefore needs no network structure to form the classification, since every attribute is conditionally independent given the class. The computational cost is greatly reduced, because only the class distributions are counted. Furthermore, it is relatively easy to construct and interpret, and hence readily applicable to huge data sets (Kantardzic 2003). It stands out among the oldest algorithms used for formal classification, and despite its simplicity it has proved highly effective. It is applied in areas such as spam filtering and text classification. A small sketch of the counting involved appears below.
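To illustrate the point that only class distributions are counted, here is a minimal Python sketch of a categorical naive Bayes classifier. It is an illustrative reconstruction (the paper names no implementation), and the add-one smoothing is an added assumption:

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, label) pairs.

    Training is nothing more than counting class frequencies and,
    per class, how often each feature takes each value; this is why
    the method stays cheap even on huge data sets.
    """
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (feature, label) -> value counts
    for features, label in examples:
        for f, v in features.items():
            value_counts[(f, label)][v] += 1
    return class_counts, value_counts

def predict(features, class_counts, value_counts):
    """Pick the label maximising P(label) * product of P(value | label)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, c in class_counts.items():
        score = c / total
        for f, v in features.items():
            counts = value_counts[(f, label)]
            # Add-one (Laplace) smoothing so an unseen value does not
            # force the whole score to zero -- an added assumption here.
            score *= (counts[v] + 1) / (c + len(counts) + 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny usage example with hypothetical weather records.
examples = [({"outlook": "sunny", "windy": "no"}, "play"),
            ({"outlook": "rainy", "windy": "yes"}, "stay"),
            ({"outlook": "sunny", "windy": "yes"}, "play"),
            ({"outlook": "rainy", "windy": "no"}, "stay")]
model = train_naive_bayes(examples)
print(predict({"outlook": "sunny", "windy": "yes"}, *model))  # -> "play"
```

A real deployment would work with log-probabilities to avoid numeric underflow when there are many features.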
k-nearest neighbours (kNN)
This algorithm works efficiently with noisy data, since it essentially averages over the k nearest neighbours. It is composed of three elements: a set of labelled objects, a distance measure between them, and the constant k. In choosing the distance measure, it is highly important to pick one under which smaller distances indicate objects from the same class (Wang 2006). One can use the cosine measure (for document classification) or the Euclidean distance (where the attributes are numerous). With reference to the query point, greater weight is assigned to the closer neighbours. However, the method has a limitation: the smaller the value of k, the more sensitive the method is to noise (Hellerstein & Stonebraker 2005). On the other hand, if k is set too high, the specificity of the neighbourhood is compromised by the inclusion of irrelevant points. kNN is thus better suited to situations with multi-modal classes, and to objects that carry a large number of class labels. It is sometimes ranked higher than SVM in terms of efficiency. A minimal sketch of the classification step appears below.
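Here is a minimal Python sketch of the classification step just described, using Euclidean distance and an unweighted majority vote; the distance-weighted voting mentioned above would replace the vote with weights such as 1/distance. The data points are hypothetical:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, labelled_points, k):
    """Majority vote among the k labelled points closest to `query`.

    labelled_points: list of (feature_vector, label) pairs.
    """
    neighbours = sorted(labelled_points,
                        key=lambda p: euclidean(query, p[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Example: two noisy clusters. A small k follows local noise; a larger
# k smooths it out, as noted above.
data = [((1.0, 1.1), "x"), ((0.9, 1.0), "x"), ((1.2, 0.8), "x"),
        ((3.0, 3.1), "o"), ((2.9, 3.2), "o"), ((3.1, 2.8), "o")]
print(knn_predict((1.1, 0.9), data, k=3))  # -> "x"
```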
Neural networks
Neural networks are an important classification tool, since they provide a viable alternative to conventional classification methods. Their main strength is that they are data driven and self-adaptive, adjusting to different situations without major changes to the underlying model (Gurney 1997).

Support vector machines
The support vector machine (SVM) is a newer classification method that supports both linear and non-linear data. It remains one of the most accurate and stable of the popular algorithms. Its theoretical foundation is well established, it is comparatively user friendly, and it is not strongly affected by the dimensionality of the data. Non-linear mapping is used to transform the data into a higher dimension, which is crucial for locating a hyperplane that separates the data of different classes (Cristianini & Shawe-Taylor 2000). In so doing, it finds the best function for classifying training data comprising two classes. Because many hyperplanes can separate two classes, SVM resolves the ambiguity by margin maximisation, so that few candidates qualify as the SVM solution. Even though it strives for the best classification of the training data, it remains flexible in the sense that future data is not tied down to this outcome.

Conclusion on classification algorithms
From the above analysis, the algorithm that can mine classifications over large databases with great efficiency and scalability is naive Bayes. This is due to its simplicity and its efficiency in handling the huge data sets that are common when working with large databases.

Reference List

Arockiam, L., Baskar, S. S., & Jeyasimman, L. 2012, Importance of association rules in data mining: A review. International Journal of Soft Computing, 7(3), 135-140.
Benoît, G. 2002, Data mining. Annual Review of Information Science and Technology, 36, 265-310.
Cheung, D. W. et al. 1996, Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engineering, 8(6), 911-922.
Cristianini, N., & Shawe-Taylor, J. 2000, An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge: Cambridge University Press.
Gaber, M. M. 2010, Scientific Data Mining and Knowledge Discovery: Principles and Foundations. Heidelberg: Springer.
Gupta, G. K. 2011, Introduction to Data Mining with Case Studies. New Delhi: Prentice-Hall of India.
Gurney, K. 1997, An Introduction to Neural Networks. London: UCL Press.
Hellerstein, J. M., & Stonebraker, M. 2005, Readings in Database Systems. Cambridge, Mass.: MIT Press.
Kantardzic, M. 2003, Data Mining: Concepts, Models, Methods, and Algorithms. Hoboken, NJ: Wiley-Interscience.
Manolopoulos, Y. 2006, R-trees: Theory and Applications. London: Springer.
Wang, J. 2006, Encyclopedia of Data Warehousing and Mining. Hershey, PA: Idea Group Reference.
Wets, G. 2003, Identifying decision structures underlying activity patterns: An exploration of data mining algorithms. Transportation Research Record, 1718, 1-9.
Zaki, M. J. & Pan, Y. 2002, Introduction: Recent developments in parallel and distributed data mining. Distributed and Parallel Databases, 11(2), 123-127.