A Systematic Approach to Cost-Based Optimization in Data Mining Environment

1. Introduction

Data mining is commonly recognized as an interactive and iterative process. Developing a Knowledge Discovery and Data Mining System (KDDMS) to support that process has been one of the long-term aims of the field. Extensive research has been done to give database support to mining operations. However, these efforts have mostly concentrated on mining a single data set, even though users frequently have to work with multiple data sets acquired from various data sources. In such cases it is essential for the KDD process to compare the patterns found in different data sets and to understand how they relate to each other. A KDDMS that handles multiple data sets therefore has to support complex queries, which calls for new functionality and optimizations, particularly for frequent item set mining.

Faster response to queries is the prime purpose of query optimization. A semantic optimizer knows the data better than the user does. It can therefore replace the user’s query with another query that produces the same result more efficiently, because less work is needed to retrieve the selected result tuples from the database. Most advanced query optimizers select a single “best” plan at compile time to execute a given query (Ramakrishnan and Gehrke, 2000). The execution cost of each alternative plan is estimated, and the plan with the lowest overall cost is selected. Conventionally, the cost is estimated from average statistics over the whole data, since the aim is to identify a single plan for all of the data.
Nevertheless, significant statistical variation between data subsets may lead to poor query execution performance (Christodoulakis, 1984). The basic disadvantage is the very coarse optimization granularity: just one execution plan is selected for the entire data. This “monolithic” approach misses important opportunities for effective query optimization (Ramakrishnan and Gehrke, 2000). The research problem is therefore to improve cost-based optimization in data mining for patterns, in single and multiple databases, and the present study focuses on the cost-based optimization of data mining queries.

2. Topics covered

Numerous research papers have been published in the areas of data mining, data warehousing and query optimization techniques. However, past research does not clearly specify the conditions under which a particular kind of query optimizer is likely to be preferable to the others. According to Yu and Sun (n.d.), rules are deduced from the restriction clauses of the queries received at the database and from the results they generate. Two syntactically distinct queries that generate the same result can also differ in cost. Ullman (1988) explained the principle of semantic query optimization, which uses semantic rules to rewrite a query into an equivalent but less expensive query in order to minimize the cost of query evaluation.
Subramanian and Venkataraman (n.d.) proposed an architecture for processing complex decision support queries that involve various heterogeneous data sources. They introduced the concept of transient views and formulated a cost-based algorithm that takes a query plan as input and produces an optimized “covering plan” by reducing the redundancies in the original input plan. According to Berchtold and Bohm (2001), extracting all objects that satisfy a given query over multiple attributes is a standard query processing problem found in almost all database systems. In particular, the problem arises in feature-based extraction of data from multiple databases. According to Babu, Bizarro, and Dewitt (2005), multiple database systems employ a query optimizer to determine the most efficient strategy, known as the plan, for the execution of declarative queries. Query optimizers sometimes make inefficient decisions because their compile-time cost models use inaccurate estimates of numerous parameters. Falout, Barber, and Flicker (2000) estimated the cost function for task allocation as the sum of the cost of inter-processor communication and the cost of processing, and observed that these two costs are measured in different units. Chen, Zhou and Wang (2003) explained that multiple query processing takes numerous queries as its input, optimizes them as a whole, and then produces a plan for the execution of the multiple queries.

3. Findings by others

The performance of a relational database depends significantly on query optimization, particularly when complex SQL statements are executed.
The query optimizer determines the best plan for executing each query. For instance, it chooses whether or not to use indexes for a particular query, and which join technique to use when joining multiple tables. Such decisions have a great impact on the performance of SQL queries. Query optimization is therefore a vital technology for all applications, from operational systems to data warehouse systems and from analytical systems to content management systems.

Data warehouses and data mining have become the common basis for integrating and assessing data in contemporary organizations. Applications based on data mining are used to analyze data not only at the operational level but also at the strategic level. This involves techniques such as data mining and online analytical processing (OLAP). Moreover, various tools are employed to pre-process and integrate data from various sources. A great deal of work has been done on data mining and data warehousing as well as on their optimization.

Query processing and query optimization work together in the execution of any type of query. Query processing relates to the execution of the query, that is, to the activities associated with extracting data from a data warehouse. It determines what data is to be retrieved but not how the data manager looks it up in the database. Query optimization, on the other hand, concerns the efficiency of the query. The optimization process generates the candidate strategies and plans for query execution and then selects the best execution plan among them. In short, query optimization is the process of choosing the most efficient plan for query evaluation from the various strategies available.
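The selection step described above can be illustrated with a minimal Python sketch. It assumes that each candidate plan already carries an estimated I/O cost and CPU cost expressed in the same abstract unit. The `Plan` class, the plan names and the cost figures are hypothetical and serve only to show the "pick the cheapest alternative" idea, not how any particular optimizer is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan with hypothetical cost estimates."""
    name: str
    io_cost: float   # estimated disk I/O cost (abstract units)
    cpu_cost: float  # estimated CPU cost (same abstract units)

    @property
    def total_cost(self) -> float:
        return self.io_cost + self.cpu_cost


def choose_plan(plans):
    """Return the candidate plan with the lowest estimated total cost."""
    if not plans:
        raise ValueError("at least one candidate plan is required")
    return min(plans, key=lambda plan: plan.total_cost)


# Illustrative candidates only; the numbers are made up.
candidates = [
    Plan("full scan + hash join", io_cost=1200.0, cpu_cost=300.0),
    Plan("index scan + nested-loop join", io_cost=400.0, cpu_cost=650.0),
    Plan("index scan + merge join", io_cost=500.0, cpu_cost=350.0),
]

best = choose_plan(candidates)
print(best.name, best.total_cost)
```

In a real system the hard part is producing the cost estimates; the final comparison is as simple as shown here.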
The purpose of query optimization is to reduce the response time of the query while making the best use of the server’s resources, by reducing the time spent on disk input/output, network traffic and CPU processing. This objective is also served by understanding the logical and the physical structure of the data warehouse. Most advanced database systems employ a query optimizer to determine the plan, that is, the most efficient strategy for executing declarative SQL queries. Query optimization is essential because the cost of the best plan can differ significantly from the cost of a randomly selected execution strategy. Query optimizers play a critical role in the decision-support queries used in data mining applications. Cost-based query optimization is itself computationally quite expensive in terms of the resources and time consumed to determine the most efficient plan. Understanding and characterizing query optimizers, with the ultimate goal of improving their performance, is therefore a significant issue in the database and data mining research literature.

The cost of the execution plan of a particular query is a function of numerous parameters, including the content and structure of the database, the system configuration and the settings of the engine. For a query on a particular database and system configuration, the optimizer’s choice of plan is a function of the selectivities of the base relations that participate in the query, that is, the estimated number of rows of each relation that contribute to the final result. Varying the selectivities of one or more base relations yields the selectivity space with respect to these relations.
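To make the dependence on selectivity concrete, the following Python sketch compares two access paths under a deliberately simplified cost model. The model is an assumption made for illustration: a full scan reads every page once, and an unclustered index scan costs one index descent plus one page read per matching row. The function names and the default index height are hypothetical.

```python
def full_scan_cost(table_pages: int) -> float:
    """Assumed cost of a full scan: one read per table page."""
    return float(table_pages)


def index_scan_cost(table_rows: int, selectivity: float,
                    index_height: int = 3) -> float:
    """Assumed cost of an unclustered index scan.

    One index descent plus one page read per matching row.
    """
    return index_height + selectivity * table_rows


def cheaper_access_path(table_rows: int, table_pages: int,
                        selectivity: float) -> str:
    """Pick the cheaper access path for a given predicate selectivity."""
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must lie in [0, 1]")
    scan = full_scan_cost(table_pages)
    index = index_scan_cost(table_rows, selectivity)
    return "index scan" if index < scan else "full scan"


# Same table, two different selectivities, two different choices.
print(cheaper_access_path(100_000, 1_000, 0.001))
print(cheaper_access_path(100_000, 1_000, 0.05))
```

Under these assumptions the index wins when few rows qualify and the full scan wins once the predicate matches a larger fraction of the table, which is why a single plan chosen from average statistics can be poor for some parts of the selectivity space.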
For an SQL database system, the main elements of the query evaluation component are (1) the query optimizer and (2) the query execution engine. The query optimizer generates the input for the query execution engine: it takes a parsed representation of an SQL query and generates an efficient execution plan for it from the collection of possible execution plans. One important aspect of query optimization is finding an expression that is equivalent to a given expression but more efficient to execute. The other is selecting the detailed strategy for processing the query. The job of the query optimizer is computationally quite challenging, since there can be numerous possible execution plans for a particular SQL query. The large number of possible plans, which differ in access methods, join operators, join order and so on, makes query optimization a difficult problem, and industrial-strength query optimizers have their own customized methods for finding the most efficient plan.

In a multi-user environment, a database system usually receives multiple queries simultaneously, so that numerous queries are executed at the same time on different processors. On the basis of query dependency, the execution of multiple queries falls into two categories: (a) multiple dependent queries and (b) independent queries. At present, data mining is an active area of database research, in which information is extracted from large amounts of accumulated data for use in other purposes.

4. Limitations and Problems Identified by Other Researchers

The basic objective of data mining is to discover useful patterns in large databases or warehouses. One of the common data mining techniques is sequential pattern discovery (Agrawal and Srikant, 1995). Informally, sequential patterns are those that occur most frequently in sequences of item sets. Conventionally, sequential mining algorithms find all patterns whose support is greater than a threshold given by the user. Some of these algorithms let users provide time constraints (Srikant and Agrawal, 1996) that are applied when checking whether a given source sequence contains a given pattern. From the user’s perspective, data mining can be regarded as an iterative and interactive process of advanced querying: the user requests a class of patterns from a specified source data set, the system selects a suitable data mining algorithm, and the discovered patterns are returned to the user (Han, Lakshmanan and Ng, 1999). A user who interacts with a data mining system has to provide many constraints on the patterns to be searched for. In general, however, it is not trivial to determine a set of constraints that yields the desired set of patterns. Users therefore tend to execute a series of similar data mining queries before they find what they were looking for. Data mining algorithms require long processing times, which makes this kind of interaction difficult. A possible solution is to exploit the materialized results of past queries when executing a new query (Baralis and Psaila, 1999).
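One simple case of such reuse can be sketched in Python. The sketch assumes that the only constraint on a query is a minimum support threshold and that a past result is stored as a mapping from pattern to support. Both the storage layout and the function name are hypothetical. Under these assumptions, a new query with an equal or higher threshold can be answered by filtering the stored result, whereas a lower threshold requires mining again because the stored result may be missing patterns.

```python
def answer_from_materialized(materialized, old_min_support, new_min_support):
    """Answer a new support-threshold query from a stored result, if possible.

    `materialized` maps each pattern (a tuple of items) to its support and is
    assumed to hold every pattern with support >= old_min_support.
    Returns the filtered patterns, or None when the stored result cannot be
    reused and a full mining run is needed.
    """
    if new_min_support < old_min_support:
        return None  # stored result may be incomplete for the new query
    return {pattern: support
            for pattern, support in materialized.items()
            if support >= new_min_support}


# Hypothetical stored result of a past query mined at min support 0.10.
stored = {("a",): 0.40, ("a", "b"): 0.25, ("b", "c"): 0.12}

print(answer_from_materialized(stored, 0.10, 0.20))
print(answer_from_materialized(stored, 0.10, 0.05))
```

The filtering pass touches only the stored patterns, not the source data, which is where the potential saving in response time comes from.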
A data mining system should thus be able to identify which materialized query results can be used to answer the current query, and to select from them the most efficient one, that is, the one with the shortest response time. Equivalence, dominance and inclusion have been identified as three significant relationships between any two data mining queries. These relationships refer to the results of the queries rather than to their syntax. They matter because they represent situations in which one query can be answered efficiently from the results of another data mining query, without running any actual data mining process (Baralis and Psaila, 1999). Although these relationships were presented in the context of association rules, they are general relationships that apply to numerous pattern types and constraint models.

Users generally assume that information can always be extracted efficiently provided that the underlying database has been designed properly. Nevertheless, query processing and optimization have certain limitations. Data mining queries written in SQL offer little of the information required to estimate query performance. Designing an effective query execution strategy requires internal knowledge of the database structure, the distribution of the data and the semantic query optimization plan. This is practical only when naming conventions are defined properly during database development, so that the semantic query optimizer can use referential integrity and indexing to construct a new query.

5. Gaps in the Analysis Conducted
Past research on exploiting materialized patterns emphasized finding those data mining queries whose materialized results can be used to answer a new query. Previous studies have shown empirically that the materialized results of an earlier query can be very useful for obtaining the results of another query efficiently, without executing a complete data mining algorithm. However, they did not specifically address the issue of estimating the cost of executing a query that uses the materialized results of another query. Nor did they address cost estimation for selecting the best query answering strategy when several plans are applicable.

6. Contribution to the body of knowledge that is relevant to the research problem

In this study, we discuss cost-based sequential pattern query optimization using the materialized results of previous sequential pattern queries. The optimization objective we address is reducing the execution time of a given query. Wojciechowski (2001) identified situations in which one sequential pattern query can be answered using the results of another sequential pattern query executed in the past. This study contributes strategies that a data mining query optimizer can use to exploit the materialized results of previously executed data mining queries. To this end, we present cost functions for query answering algorithms that exploit materialized patterns. These cost functions are then used to select an optimal query execution strategy, with respect to query execution time, when several suitable materialized sets of patterns are available.

7. How it compares and contrasts with the position developed by other researchers

Much research has been done to offer database support for data mining operations. Han et al. (1996), Meo et al. (1996) and Imielinski and Virmani (1999) proposed extensions of database query languages to support data mining operations. Sarawagi, Thomas and Agrawal (1998) and Chaudhuri et al. (1999) worked on implementing the Apriori association mining algorithm and on constructing decision trees on a database system, respectively. Law et al. (2003) used user-defined functions (UDFs) to express data mining operations. All of these efforts, however, concentrate on mining a single data set under relatively simple conditions. Numerous constrained frequent item set mining algorithms have been formulated to exploit extra conditions and reduce the search space (Bucila et al., 2002). Such algorithms cannot answer our queries efficiently, because the conditions in our queries relate to a collection of (in)frequent item sets and cannot be used directly by their methods to prune the search space. In our study, we aim to formulate a systematic approach for determining efficient strategies to answer these queries. Kramer et al. (2001) studied the problem of generalized inductive query evaluation. Although their queries deal with multiple data sets, they concentrated on the algorithmic aspects of implementing the version space tree and on answering queries with general monotone and anti-monotone predicates. By comparison, we focus on answering queries that involve frequency predicates as efficiently as possible.
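The cost-based selection among materialized pattern sets outlined in section 6 can be sketched as follows. This Python example is only an illustration under stated assumptions: queries are constrained by a minimum support threshold alone, filtering a stored set costs a fixed amount per stored pattern, and a full mining run costs a fixed amount per source sequence. The `MaterializedSet` class and both cost constants are hypothetical placeholders and are not the cost functions developed in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterializedSet:
    """A stored query result (hypothetical bookkeeping record)."""
    name: str
    min_support: float   # threshold used when the set was mined
    pattern_count: int   # number of stored patterns


# Hypothetical unit costs; real values would have to be measured.
FILTER_COST_PER_PATTERN = 1.0
MINING_COST_PER_SEQUENCE = 50.0


def estimate_costs(new_min_support, source_sequences, stored_sets):
    """Estimate the cost of every applicable answering strategy."""
    costs = {"full mining": source_sequences * MINING_COST_PER_SEQUENCE}
    for stored in stored_sets:
        # A stored set is usable only if it was mined with a threshold
        # no higher than the one the new query asks for.
        if stored.min_support <= new_min_support:
            costs[f"filter {stored.name}"] = (
                stored.pattern_count * FILTER_COST_PER_PATTERN
            )
    return costs


def choose_strategy(new_min_support, source_sequences, stored_sets):
    """Return the name of the strategy with the lowest estimated cost."""
    costs = estimate_costs(new_min_support, source_sequences, stored_sets)
    return min(costs, key=costs.get)


stored_sets = [
    MaterializedSet("q1", min_support=0.05, pattern_count=20_000),
    MaterializedSet("q2", min_support=0.10, pattern_count=4_000),
    MaterializedSet("q3", min_support=0.30, pattern_count=500),
]

print(choose_strategy(0.20, 10_000, stored_sets))
print(choose_strategy(0.01, 10_000, stored_sets))
```

With these made-up figures, a query at threshold 0.20 is answered by filtering the smaller of the two usable sets, and a query at threshold 0.01 falls back to full mining because no stored set is complete for it.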
In our study, we will formulate a table-based approach for producing efficient query plans. Our study also differs considerably from the research on query flocks (Srikant et al., 1997): although that work targets complex query conditions, it permits only a single predicate involving frequency, over a single data set. Studies on multi-relational data mining (Blockeel and Sebag, 2003; Džeroski, 2003) have concentrated on designing efficient algorithms for mining a single data set that is materialized in a database system as a multi-relation. Numerous researchers have also developed techniques for mining the differences between data sets, or contrast sets (Bay and Pazzani, 2001). They aim to create efficient algorithms for finding such differences, and they mainly consider two data sets at a time. In contrast, we present a general framework that enables users to analyze and compare patterns across multiple data sets. Furthermore, because our techniques are the components of a query optimization scheme, users should be able to identify new algorithms or techniques that can speed up their operations.

8. Conclusion

This literature review concerns the power of query optimization and the systematic approach to cost-based optimization in a data mining environment. From it we draw the following conclusions:

1. Stefan Berchtold and Christian Bohm (2001) explained that a standard query processing challenge is extracting all objects that satisfy a single query involving multiple attributes. This problem arises in any database system, particularly in feature-based retrieval of data from multiple databases.
2. Babu, Bizarro, and Dewitt (2005) discussed that multi-database systems employ a query optimizer to determine the most effective strategy, known as the plan, for executing declarative queries. For a query on a specific database and system configuration, the optimizer’s choice of plan is a function of the selectivities of the base relations that participate in the query. Query optimizers often make ineffective decisions because their compile-time cost models use inaccurate estimates of various parameters.
3. Falout, Barber, and Flicker (2000) estimated the cost function for task allocation as the sum of the processing cost and the inter-processor communication cost, and concluded that the two are in fact measured in different units.
4. Hong Chen, Sheng Zhou, and Shan Wang (2003) explained that multiple query processing takes many queries as input, optimizes them as a whole, and then develops a strategy for the execution of the multiple queries.

References

Agrawal, R. and Srikant, R. (1995). Mining Sequential Patterns. Proc. 11th ICDE Conf.
Babu, S., Bizarro, P. and Dewitt, D. (2005). Proc. of ACM SIGMOD Int. Conf. on Management of Data, June 2005.
Bay, S. D. and Pazzani, M. J. (2001). Detecting group differences: Mining contrast sets. Data Min. Knowl. Discov., 5(3):213–246.
Berchtold, S. and Bohm, C. (2001). University of Munich, Oettinge Str., Germany.
Blockeel, H. and Sebag, M. (2003). Scalability and efficiency in multi-relational data mining. SIGKDD Explor. Newsl., 5(1):17–30.
Bucila, C., Gehrke, J., Kifer, D. and White, W. (2002). DualMiner: a dual-pruning algorithm for itemsets with constraints. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 42–51.
Chaudhuri, S., Fayyad, U. M., and Bernhardt, J. (1999). Scalable classification over SQL databases.
In Proceedings of the 15th International Conference on Data Engineering, 23-26 March 1999, Sydney, Australia, pages 470–479. IEEE Computer Society.
Chen, H., Zhou, S., and Wang, S. (2003). School of Information, Renmin University of China, ICMD.
Christodoulakis, S. (1984). Implications of certain assumptions in database performance evaluation. TODS, 9(2):163–186.
Džeroski, S. (2003). Multi-relational data mining: an introduction. SIGKDD Explor. Newsl., 5(1):1–16.
Falout, C., Barber, R. and Flicker, M. (2000). Journal of Intelligent Information Systems.
Han, J., Lakshmanan, L. and Ng, R. (1999). Constraint-Based Multidimensional Data Mining. IEEE Computer, 32(8).
Han, J. et al. (1996). DMQL: A data mining query language for relational databases. In Proc. 1996 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD 96), pages 27–33, Montreal, Canada.
Imielinski, T. and Virmani, A. (1999). MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:393–408.
Kramer, S., De Raedt, L., and Helma, C. (2001). Molecular feature mining in HIV data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 136–143.
Law, Y., Luo, C., Wang, H. and Zaniolo, C. (2003). ATLAS: a Turing-complete extension of SQL for data mining applications and streams. In Posters of the 2003 ACM SIGMOD international conference on Management of data.
Lopez, C., Vicente, P. and Bote, G. (2005). University of Extremadura, Badajoz, Spain. Proc. of ACM SIGMOD Int. Conf. on Management of Data, June.
Meo, R., Psaila, G. and Ceri, S. (1996). A new SQL-like operator for mining association rules. In Proc. of International Conference on Very Large Data Bases (VLDB), pages 122–133, Bombay, India.
Ramakrishnan, R. and Gehrke, J. (2000). Database Management Systems. McGraw-Hill Higher Education.
Sarawagi, S., Thomas, S. and Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data.
Srikant, R. and Agrawal, R. (1996). Mining Sequential Patterns: Generalizations and Performance Improvements. Proc. of the 5th EDBT Conference.
Srikant, R., Vu, Q. and Agrawal, R. (1997). Mining association rules with item constraints. In David Heckerman, Heikki Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, editors, Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, KDD, pages 67–73.
Subramanian, S. and Venkataraman, S. (n.d.). Cost-based optimization of decision support queries using transient-views. ACM SIGMOD international conference on Management of data.
Ullman, J. D. (1988). Principles of Database and Knowledge-base Systems, I, II. Palo Alto, CA: Computer Science Press.
Wojciechowski, M. (2001). Interactive Constraint-Based Sequential Pattern Mining. Proc. of the 5th ADBIS Conference.
Yu, T.C. and Sun, W. (n.d.). Automatic knowledge acquisition and maintenance for semantic query optimization. IEEE Transactions on Knowledge and Data Engineering.
