
Data Analytics and Business Intelligence - Assignment Example


Summary

The author of the "Data Analytics and Business Intelligence" paper explains the concepts of support, confidence, and lift, and how to use them to identify interesting association rules…


Subject: Mathematics
Type: Assignment
Level: Undergraduate
Pages: 8 (2000 words)

Extract of sample "Data Analytics and Business Intelligence"

STUDENT ID NUMBER:

UNIVERSITY OF CANBERRA
FACULTY OF EDUCATION, SCIENCE, TECHNOLOGY & MATHEMATICS

ASSIGNMENT 2, WINTER TERM, 2014
DATA ANALYTICS & BUSINESS INTELLIGENCE PG (Advanced)

PART A

This part consists of three short answer questions, and does not require the use of computers. Answer all THREE questions.

QUESTION A1 (6 marks)

Partitioning a dataset, from which we want to build a model, into training and testing (or training/validation/testing) datasets.

In analysing a data set there are often four values of particular interest: the minimum, the maximum, the mean, and the standard deviation. Together the minimum and the maximum give the range of the values; the mean is the average value of the data set, which often gives insight into what a typical value is; and the standard deviation shows how tightly the values cluster around the average.

Training set: this is used to build a model, by fitting the linear regression model to the dataset.

Validation set: this dataset is often used to fine-tune models.

QUESTION A2 (7 marks)

To identify interesting association rules, the concepts of support, confidence and lift are used.

(1) Explain the three terms:

Support: the proportion of transactions in the dataset that contain the items in the rule.

Confidence: how often the items in Y appear in the transactions that contain X.

Lift: a standard measure of how well a model performs when classifying data. It is the ratio of the target response to the average response.

Lift = target response / average response

(2) Explain how you would use the concepts of support, confidence and lift to identify interesting rules.

QUESTION A3 (7 marks)

Cluster analysis is a data mining technique used to divide data into meaningful groups.

(a) Describe (in your own words) the steps that the k-means algorithm goes through to generate clusters.

An important feature of the decomposition process is that it makes it possible for the samples in the original signal to be reordered.
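The three measures from Question A2 can be illustrated with a small worked example. This is a minimal sketch, assuming the usual association-rule formulation for a rule X -> Y, in which lift is the confidence of the rule divided by the support of Y (the "target response" over the "average response"). The basket data and the function names are hypothetical and are not part of the assignment.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Of the transactions that contain X, the fraction that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of X -> Y relative to how often Y occurs overall."""
    return confidence(transactions, x, y) / support(transactions, y)


# Hypothetical market-basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
rule_lift = lift(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence, rule_lift)
```

Under this formulation a lift above 1 suggests X and Y occur together more often than chance would predict, and a lift below 1 suggests less often. Rules are commonly filtered first by minimum support and confidence and then ranked by lift.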
The re-ordering of these samples is supposed to follow a specific pattern, which is determined by the binary equivalent of each sample. The algorithm rearranges the order of the N time domain samples by counting in binary with the bits flipped from left to right.

(b) Describe any pre-processing steps you might need to complete before generating a k-means cluster analysis (using the Euclidean distance measure).

The raw data is placed in the k-means such that the data associated with low frequencies is put at the centre, while the data with high frequencies is placed around the centre.

(c) Imagine you are performing a k-means cluster analysis using Rattle. Describe the steps you would go through to determine the optimal number of clusters.

The process of cluster analysis from the separate single data goes through three main loops, which are nested. The outermost loop runs through the log2 N stages, while the middle loop moves through each of the individual frequency spectra in the stage currently being worked on. The innermost loop uses the butterfly diagram mentioned earlier to calculate the points in each frequency spectrum. The three loops are the three main stages that constitute the transformation of given data from the time domain into the frequency domain and vice versa.

(d) Describe the measures / characteristics you would use to evaluate a k-means cluster analysis you have created in Rattle.

The window function works by assigning a weighting coefficient to each of the input samples so that the samples which cause leakage are reduced. For instance, the samples at the beginning and end of the sampling period are reduced to zero, which removes the discontinuities in the periodic sampled signal.
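For reference, the k-means procedure that Question A3 asks about can be sketched as follows. This is a minimal illustration, not Rattle's implementation. It assumes that pre-processing means standardising each variable so that no single variable dominates the Euclidean distance, and that candidate values of k are compared by the within-cluster sum of squares. The seeding rule (the first k rows) and the sample data are hypothetical choices made to keep the example deterministic.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def kmeans(rows, k, max_iter=100):
    """Alternate assignment and centroid updates until labels stop changing."""
    centroids = [list(r) for r in rows[:k]]  # assumption: seed with first k rows
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [mean(c) for c in zip(*members)]
    # Within-cluster sum of squares, used to compare values of k.
    wss = sum(math.dist(r, centroids[lab]) ** 2 for r, lab in zip(rows, labels))
    return labels, centroids, wss


# Hypothetical data: two variables on very different scales.
data = [[1.0, 100.0], [1.2, 110.0], [0.8, 90.0],
        [5.0, 500.0], [5.2, 520.0], [4.8, 480.0]]
scaled = standardize(data)
labels, centroids, wss_k2 = kmeans(scaled, k=2)
_, _, wss_k1 = kmeans(scaled, k=1)
print(labels, wss_k1, wss_k2)
```

Plotting the within-cluster sum of squares against k and looking for the point where the decrease levels off is one common way to choose the number of clusters.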
The type of window to use for this purpose depends on the frequency content of the signal. Another error that can occur in a spectrum, and which can be resolved by the use of algorithms, is known as scallop loss. It is caused by the discrete nature of the frequency spectrum, in which the signal is displayed as amplitude levels at equally spaced discrete bins. If the signal frequency coincides with the centre of one of the discrete frequency bins, the correct peak level is displayed. If, however, the signal frequency is not at the centre of a frequency bin, then a reduced level is displayed, causing an error which can be up to 3 dB.

PART B

This part consists of two practical data analysis questions, which should be answered using software. Answer BOTH questions.

QUESTION B1 (6 + 7 + 7 = 20 marks)

Data mining techniques have been widely applied in the medical domain to assist in the diagnosis of various medical conditions. Researchers wish to be able to classify tissue samples taken from tumours as either benign or malignant samples. The following dataset scores each tissue sample on 10 characteristics. These characteristics have been established as differing between benign and malignant samples. Each sample is scored on a scale of 1 to 10 (with 1 being the closest to benign, and 10 being the most malignant) for each characteristic. No single characteristic or pattern of characteristics has been identified that can distinguish between benign and malignant samples. A neural network is a good candidate technique for identifying the complex relationship between the 10 characteristics and the actual classification of the tumour (benign or malignant).

The dataset can be found in the file CancerWisconsin.csv on Moodle.
The variables in the file are as follows:

Data Description

Variable Name                  Values
Clump Thickness                1-10
Uniformity of Cell Size        1-10
Uniformity of Cell Shape       1-10
Marginal Adhesion              1-10
Single Epithelial Cell Size    1-10
Bare Nuclei                    1-10
Bland Chromatin                1-10
Normal Nucleoli                1-10
Mitoses                        1-10
Class                          Benign or Malignant

(a) Load the CancerWisconsin.csv dataset into Rattle. Set Class as the target variable. Partition your data using the default settings. Create a neural network, leaving the number of hidden layer nodes at the default value of 10. Record the performance of the network for the validation partition (choosing appropriate measures from the Evaluate tab, and the validation radio button). Create another neural network model, this time with the number of hidden layer nodes set to 5. Record the performance of the network. Cut and paste the performance measures into your Word document. Comment on the differences in the performance between the two models, and explain the likely cause(s) of any differences.

(b) Comment on the false positive and false negative rate (as provided by the Error Matrix) of the model with the best performance from part (a) above. Comment on whether it is most important to minimise the false positive or the false negative rate for this particular dataset.

(c) Experiment with different numbers of hidden layer nodes to identify the optimum size of the hidden layer. How many samples are required for a network of the size you have determined is optimal? Calculate the number of samples required, and provide your working.

#number of hidden samples nn.sizes

Distributions -> Bar Plot). Comment on whether the distribution of the target variable may have affected the ratio between false positives and false negatives in the pruned decision tree created in (d). Explain why there is / is not an effect.

(f) The bank would like to minimise the number of false negatives produced by the model. We will set the Loss Matrix parameter and generate a new tree.
Set the Loss Matrix parameter to ‘0,2,1,0’. Leave the Complexity parameter as for your pruned tree. Press ‘Execute’. Obtain the Error Matrix for this tree using the validation data partition. Compare the Error Matrix with that of the pruned tree in part (e). Using your knowledge of how the Loss Matrix parameter works, explain the reasons for any differences you observe.

(g) Create a Random Forest model (leave the tuning parameters at their default values). Compare the performance of the Random Forest model and your pruned decision tree. Using your knowledge of the Random Forest algorithm, explain any difference in performance you observe.

Random Forests are built using tree methods. From the plot above, 0.82 was found to be the optimal cutoff, which maximizes the balanced accuracy rate. This takes the Random Forest's predicted rate to 94.1%, with a specificity of 94.05%. With this cutoff point, our random forest reaches a prediction accuracy of 93.6% and a sensitivity of 71.1%. This gives reasonable true negative and true positive predictions.
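The accuracy, sensitivity, specificity and balanced accuracy figures discussed above, and the false positive and false negative rates asked about in Part B, all come from the four cells of an error matrix. The sketch below shows the standard definitions, with each rate taken per actual class. The counts are hypothetical and are not the results from this assignment, and the function name is an assumption made for illustration.

```python
def error_matrix_rates(tp, fp, tn, fn):
    """Derive common performance measures from error matrix counts.

    tp / fn: actual positives predicted positive / negative.
    tn / fp: actual negatives predicted negative / positive.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts, for illustration only.
rates = error_matrix_rates(tp=40, fp=5, tn=95, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Because balanced accuracy averages sensitivity and specificity, it is less affected than plain accuracy by an uneven split between the two classes of the target variable.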

