Association Rule Mining - Apriori Algorithm Case Study

Association Rule Mining Association Rule Mining is the most important and highly researched technique associated with Data Mining. The objective is to extract relevant correlations and recurring patterns amongst the group of items in the transaction databases or repositories of data. The rules have vast application in a number of areas such as inventory management, risk protection and telecommunication networks [5]. As the name implies, Association rule mining seeks to find out association rules that satisfy the already defined minimum level of support and confidence extracted from a predefined database. Overall, one problem is usually divided into two sub problems. The first problem deals with finding sets of those items which have occurrences exceeding a predefined threshold level in the database. In other words finding frequent or large item sets [5]. The second problem is to create association rules from those large sets of items keeping the confidence level to be minimal. The remaining rules are generated by removing previous items and adding them to the consequent. Besides this, the confidence level of new rules is browsed to find out the “interestingness” amongst them. This process occurs in iteration till all the previous items have been browsed. The second sub problem seems fairly simple compared with the first one. As a result, the researchers have focused on the first sub problem [5]. The first sub-problem can be further broken into two sub problems: generating candidate sets of large items and generating sets of frequent items. Those sets of items which have the level of support exceeding a certain threshold are known as large or frequent sets of items. Those sets of items that are likely to become large or frequent in future are known as candidate sets of items [5]. The algorithms may generate extremely large number of association rules, very often in thousands or even in millions. The association rules themselves are fairly large. As a result it becomes almost impossible for users to understand and validate the complex association rules. This limits the usefulness of the association mining rules [2]. One of the most well known algorithm for generating association mining rules is the Apriori algorithm. This paper discusses the algorithm in detail and further considers the issues involved in it. It then identifies some of the ways in which this can be improved. At the next stage it explains how support, confidence and lift are used in determining the usefulness of a rule . Apriori Algorithm In the field of Computer Science, particularly in Data Mining, Apriori is a well known algorithm for learning rules of association in the databases. It is intended to operate on databases which contain transactions. Examples of such type of databases would be collection of items bought or details of which websites were visited frequently. Other algorithms are also designed for finding association rules which have no transaction data [1]. In order for an Apriori algorithm to work for Association Rule Mining, sets of set of items (for example sets of retail transaction where each set contains the list of items purchased by an individual) must be known. The algorithm then attempts to find subsets of which are common in at least a minimum number C of the sets of items. Apriori uses bottom up approach and extends the frequent subsets one item at a time. This step is known as candidate generation. All candidates are tested against the data. The algorithm terminates when no further successful extensions are found [1]. Apriori makes use of breadth-first search method and involves a tree structure to count candidate item sets efficiently. It ‘generates candidate item sets of length k from item sets of length k-1’. Then it prunes all the candidates which have a non frequent sub pattern. The candidate set contains all frequent item sets each of length k. It then scans the transaction database to determine the frequent item sets among the candidates [3]. However, Apriori while it has been used since a long time, it has a number of efficiency concerns. As a result other algorithms have spawned off. Example of Apriori Algorithm In a large superstore, sales are tracked by their item numbers and it is useful to find out what items are typically purchased together. This helps in planning future inventory levels and generating effective sales promotion offers. Apriori is an effective way to build a list of item pairs which are frequent. Consider a database with transaction records which has the following sets: {A,B,C,D},{B,C,D},{B,C},{A,B,D},{A,B,C,D},{B,D} Each alphabet represents an actual item. At the firs step, Apriori counts up the frequencies, called supports, of all the items in the dataset. The resulting table looks like this: Item Support A 3 B 6 C 4 D 5 We can define a minimum support level to qualify as "frequent," which depends on the context. For this case, let min support = 3. Therefore, all are frequent. The next step is to generate a list of all 2-pairs of the frequent items. Had any of the above items not been frequent, they wouldnt have been included as a possible member of possible 2-item pairs. In this way, Apriori prunes the tree of all possible sets. In the next step, we define a minimum support level to qualify as frequent. This depends on the context. In this example we shall define the minimum support level to be 3. So all items so far are frequent. The next step involves generating a list of 2-pairs of all the frequent items. If any item had less than the minimum support value, it would not have been placed in the list. This step is known as pruning of the tree. Item Support {A,B} 3 {A,C} 2 {A,D} 3 {B,C} 4 {B,D} 5 {C,D} 3 This is counting up the occurrences of each of those pairs in the database. Since minsup=3, we dont need to generate 3-sets involving {1,3}. This is due to the fact that since theyre not frequent, no supersets of them can possibly be frequent. Keep going: Since the minimum support level is set to be 3, we do not need to generate 3-sets which involve {A,C}. Since they are not frequent, no supersets involving any of them can be frequent. Generating the 3-paired sets results in: Item Support {A,B,D} 3 {B,C,D} 3 In this step {A,B,C} and {A,C,D} were not generated. The 4-set involves only {A,B,C,D} and its subset {A,C} has been ruled out as non frequent [3]. Issues with Apriori Algorithm The primary issue involved in a basic Apriori Algorithm exists in the candidate generation process. In this process the candidate records are generated. Since in a large database there might be millions of candidate records, it can take huge computational power and time to generate them. The time and power spent can be worth if all the candidate records are to be used subsequently in the process. However, most of the candidate records are excluded from the process if they happen to be infrequent, i.e. they have a level of support lower than the minimum support level defined in the system. Kotsiantis and Kanellopoulos [1] suggest four ways in which the computational cost and time involved can be reduced: 1. Reducing the number of traversals in the database. This means that the number of iterations is reduced. 2. Sampling the database. This means that a sample of database is scanned to determine association rules rather than scanning the entire database to determine the rules. 3. Adding extra constraints on the pattern structure. This means that the pattern structure cannot be kept simple to map association between items. It could also involve other factors such as time, location etc. This will improve association. 4. Through parallelization. This means that the computation be simultaneously done on multiple computers or multiple processors. This will lower the time to reach the output but will involve investment in higher technology. Role of Support: Support is the basic element in an apriori algorithm. Support of a rule is defined as “a measure of how frequently the items involved in it occur together” [2]. In simpler words, support is the level of frequency associated with each pair in the data set. At the first step support is the frequency of each individual item. At the next step, support is the frequency of two-paired items and in subsequent iterations the frequency of all n-paired items. It is thus the most basic characteristic in the Apriori algorithm to determine association. Without support, the Apriori algorithm cannot work and neither can any association rules be found out. The level of minimum support has to be kept at a level which shall ensure that algorithm results in useful association rules. Setting up a support level too high may result in certain items being excluded from the dataset. Whereas, setting up a level too low may result in huge computations and wastage of time and computational power. Role of Confidence: Confidence of a rule identifies the probability of both the antecedent and consequent appearing in the same transaction. Confidence is merely the conditional probability of the consequent, given that the antecedent has occurred. For example, item A might appear in 100 transactions and 80 out of those might include item B. The confidence would be: A implies B with 80% confidence. ‘Confidence is the ratio of the rule support to the number of transactions that include the antecedent’. To express it in terms of probability, confidence can be written as: confidence (A implies B) = P (B/A), which is equal to P(A, B) / P(A) [2] Whereas support is useful in determining the frequency of each pair, confidence on the other hand signifies the strength of the pair. It determines whether the pair generated can be considered to be worthy of trust and reliable. Since a lot of subsequent decision making is to be done on the association rules, it is important that confidence level must be accurate. Role of Lift: Support and confidence are both involved in determining if the rule is valid. However, there are cases when both of these measure might be high and yet they still produce a rule which is not useful. This is explained through the example below: In a super store, confidence level of items A and B is 75% and the support level is 30%. The numbers make it sound as if the rules are quite accurate and in most conditions they might be. It has both high support and high confidence. But, what happens if the customers in store buy B 90% of the time? In that case, the customers for item A are less likely to buy B than customers in general. Hence, it is evident that a third measure is needed to evaluate the quality of the rule. Lift, as the name implies, indicates the ‘strength of a rule over random co-occurrence of the co-occurrence of the antecedent and the consequent, given their individual support’.[2] It provides information about what improvement has taken place, the change in probability of the consequent with the antecedent given. Lift is defined as: (Rule Support) /(Support(Antecedent) * Support(Consequent)) [2]. This can also be stated as the confidence of the pair of items divided by the support of the pair. So in our example, assuming that 40% of the customers buy A, the improvement would be: 30% / (40% * 90%), which is 0.83 – giving an improvement of less than 1. Any rule with an improvement of less than 1 does not indicate a real useful combination, no matter how high the support and confidence level may be. Lift, as a result, serves as a corrective mechanism for a rule. It seeks to ensure that the rule is based on valid association rather than random occurrences. References: 1. Kotsiantis, S & Kanellopoulos, D. Association Rules Mining: A Recent Overview, [online]. Available at http://www.math.upatras.gr/~esdlab/en/members/kotsiantis/association%20rules%20kotsiantis.pdf. [Accessed: 12 April 2009]. 2. Apriori. Oracle Data Mining Concepts. [e-book]. Part Number B28129-04. Available at http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/algo_apriori.htm. [Accessed: 12 April 2009]. 3. Wardz. 2008. How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets. [Online]. (Updated March 18 2009). Available at: http://wardselitelimo.com/?p=720. [Accessed 12 April 2009]. 4. Anita Wasiewska. Apriori Algorithm. Lecture Notes. [Online]. Available at: http://www.cs.sunysb.edu/~cse634/lecture_notes/07apriori.pdf. [Accessed 12 April 2009]. 5. Agarwal R, Srikant R. Fast Algorithms for Mining Association Rules. IBM Almaden Research Company. [Online]. Available at: http://ifsc.ualr.edu/xwxu/Teaching/SciDB/vldb94Apriori.pdf]. [Accessed 12 April 2009]. 6. R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proc. of the Fourth International Conference on Foundations of Data Organization and Algorithms, Chicago, October 1993. 7. G. Piatestsky-Shapiro, editor. Knowledge Discovery in Databases. AAAI/MIT Press, 1991. 8. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In KDD-94: AAAI Workshop on Knowledge Discovery in Databases, July 1994 9. M. Houtsma and A. Swami. Set-oriented mining of association rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993. 10. Agarwal, R. Aggarwal, C. and Prasad V., A tree projection algorithm for generation of frequent itemsets. In J. Parallel and Distributed Computing, 2000. 11. Agarwal, R. Aggarwal, C. and Prasad V., A tree projection algorithm for generation of frequent itemsets. In J. Parallel and Distributed Computing, 2000. 12. Cheung, D., Han, J., Ng, V., Fu, A. and Fu, Y. (1996), A fast distributed algorithm for mining association rules, in `Proc. of 1996 Intl. Conf. on Parallel and Distributed Information Systems, Miami Beach, Florida, pp. 31 - 44. 13. Hegland, M., Algorithms for Association Rules, Lecture Notes in Computer Science, Volume 2600, Jan 2003, Pages 226 – 234. 14. Descriptive Data Mining Models. Oracle Data Mining Concepts. [e-book]. Part B10698-01. Available at http://www.stanford.edu/dept/itss/docs/oracle/10g/datamine.101/b10698/4descrip.htm. [Accessed: 12 April 2009]. 15. Ye, N., The Handbook of Data Mining. Human Factors and Ergonomics. [online]. Available at: http://books.google.com.pk/books?id=tABuaqVTf3MC&pg=PA28&lpg=PA28&dq=support+in+apriori+algorithm&source=bl&ots=Xr4yyKfek9&sig=0_a1poB0R4YpfGWv3upFSbMvl88&hl=en&ei=OLzhSfHED5bqlQeV2bHgDg&sa=X&oi=book_result&ct=result&resnum=9#PPP1,M1. [Accessed 12 April 2009]. 16. Wu, X., Nie, Y., Implementation Issues for Reliable A Priori Shortest Path Problem. 2009. [online]. Avaliable at: http://pubsindex.trb.org/document/view/default.asp?lbid=881986. [Accessed 12 April 2009]. Read More

Association Rule Mining - Apriori Algorithm - Case Study Example

Extract of sample "Association Rule Mining - Apriori Algorithm"

CHECK THESE SAMPLES OF Association Rule Mining - Apriori Algorithm

Data Mining Techniques and DNA/bio-data analysis

Data Mining: Concepts and Techniques

Data Mining Theory

A Systematic Approach to Cost-Based Optimization in Data Mining Environment

Data Mining Process and Algorithms

Divisions and Categories of Business Intelligence Technologies

Waikato Environment for Knowledge Analysis

Review of the Popular Association-Mining Algorithms - Apriori