
Practical Data Mining Using C Language - Lab Report Example


Extract of sample "Practical Data Mining Using C Language"

Practical Data Mining Using C Language
Name
Institution of Affiliation
Date

By definition, data mining is the practice of examining large databases in order to generate new information. Put differently, it is a technique for breaking large data sets into smaller sections and applying a set of algorithms to extract meaningful data that can be used to analyze a system. Data mining can uncover and analyze relationships that had not previously been identified. The analysis is carried out from several perspectives so that it yields a comprehensive, structured report on the data set. Hence, the initial stage of data mining is to prepare data sets that can be easily understood and processed to extract relevant information and relationships. The methods applied to the discovered relationships and patterns find practical use in a number of fields, including machine learning, artificial intelligence, database systems, and statistics.

The core source of data in data mining is a database containing a mixture of data of different types. That data is extracted, and models are applied to it so that the types can be isolated and analyzed individually. Many data mining techniques exist today, including classification, clustering, regression, anomaly detection, association rules, reinforcement learning, structured prediction, feature engineering, and summarization. Any of these techniques can be used on its own or in combination with others to analyze a data set.

Data mining is commonly identified as the fourth step of the Knowledge Discovery in Databases (KDD) process. In the approach taken here, a decision tree is first generated from the data set from which information is to be extracted.
Hence, it is imperative to construct a decision tree first. The tree is built by a well-defined algorithm; common choices include the ID3 algorithm and the C4.5 algorithm. Since the assignment requires ID3, a brief description follows. ID3 is a straightforward decision-tree learning algorithm suited to data sets whose attributes are well defined and whose records belong to clearly distinguished classes. Starting from the root node, it builds the tree top-down, recursively partitioning the input data set. At every node it chooses the attribute that best classifies the data, which in ID3 is the attribute with the highest information gain.

The code written for this assignment is in C and prints its output to the console; the listing in the appendix covers loading and parsing the data set on which ID3 operates. Its functionality is as follows.

Step 1: The program includes the standard I/O, standard library, and string headers, which provide the functions it needs to read string data from the file. It then sets the line buffer size, BUFFER_SIZE, to 1024 * 1024 bytes (1 MiB) so that the buffer can hold the longest line in the file. The fields, which also act as column headers, correspond to the columns of the data table; the report gives their number as 22 (22 attributes; counting the class label, the field offsets in the appendix run from 0 to 22). These values are established at the start of the program as constants, spanning lines 17 to 39 of the code, as commented.

Step 2: The next step is to read the CSV file as input. The file is accessed through a FILE pointer and is opened in read mode, so the program extracts the data without changing the file, and the raw data remains intact. The file is read by the loadFile function, which takes the open CSV file as a parameter and returns a value of type long, since the file could contain a very large number of lines.
The function is declared as long loadFile(FILE *pFile, long *errcount). The second parameter, errcount, points to a counter of the errors found in the file read through the pointer pFile. loadFile is defined from line 45 to line 77 of the code.

Step 3: This step loads the values of each line into the program for analysis. It is performed by the loadValues function, which is declared static and returns an int: RET_OK on success or RET_FAIL on failure. Both functions are called and coordinated by main, which runs from line 121 to line 146.

The analysis process is described in more detail below. In step 1, three header files are included. stdio.h lets the program perform input and output and declares the file-handling functions fopen() and fclose(). stdlib.h provides general-purpose standard library utilities. string.h declares the text-processing functions, such as strlen(), that operate on null-terminated character arrays; they are needed because C has no built-in string type.

The next section of code defines the field offsets as constants, since they are not expected to change while the program runs. Because they are used repeatedly, each is given a fixed name and a fixed value that serves as an index. The 22 attribute fields and the class label are indexed in this way, and pointers to the field values are stored in pFields, an array of char pointers. The next step is the call to loadFile, which takes two arguments.
The file pointer, pFile, refers to the open source file containing the data to be analyzed, while errcount points to the count of errors identified in the file. The function returns a long because the number of lines processed could be very large, and the return type must be able to hold it. The provided CSV file was originally named mushroom.csv and was renamed pFile.csv for consistency with the camel-case variable name. Note, however, that in the appended code pFile is only the name of the FILE pointer parameter; the path of the file to open is taken from the command line (argv[1]), so the program does not depend on the file's name. If the file cannot be found at the given path, fopen() returns NULL and the program reports an error.

The function itself is defined later in the program, where the flow and control of the data are handled. Within loadFile, the buffer that temporarily holds each line of the file and the line counter lineno are declared locally. The buffer is made large so that the program can handle long lines. A conditional statement checks that a valid file pointer was passed; if not, RET_FAIL is returned to indicate a failed read attempt. Otherwise the file is read line by line in a loop, with lineno incremented for each line. Each line is buffered, and once it has been parsed, selected fields are printed to the terminal. The loadValues function processes each line in turn. It scans the line character by character, looking for special characters, and uses a counter (fld) to record how many fields it has found.
One of the characters processed is the double quote, which is compared against the variable ch that holds the current character. Each double quote toggles an in-quote flag, so that a delimiter inside a quoted field is not treated as a field separator; outside quotes, each delimiter is overwritten with a null terminator and the next field pointer is set. The scan continues to the end of the line, and loadFile repeats it for every line until the end of the file. Finally, main ties the functions described above together. The code is straightforward and is commented where clarification may be needed. It is listed in the appendix below.

Appendix

    #include <errno.h>
    #include <libgen.h>   /* basename() (POSIX) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    // adjust BUFFER_SIZE to suit longest line
    #define BUFFER_SIZE (1024 * 1024)
    // class label plus 22 attributes (offsets 0 to 22 below); the report text cites 22
    #define NUM_FIELDS 23
    #define MAXERRS 5
    #define RET_OK 0
    #define RET_FAIL 1
    #define FALSE 0
    #define TRUE 1

    // char* array will point to fields
    char *pFields[NUM_FIELDS];

    // field offsets into pFields array (column names adapted to valid C identifiers):
    #define CLASS 0
    #define CAP_SHAPE 1
    #define CAP_SURFACE 2
    #define CAP_COLOR 3
    #define BRUISES 4
    #define ODOR 5
    #define GILL_ATTACHMENT 6
    #define GILL_SPACING 7
    #define GILL_SIZE 8
    #define GILL_COLOR 9
    #define STALK_SHAPE 10
    #define STALK_ROOT 11
    #define STALK_SURFACE_ABOVE_RING 12
    #define STALK_SURFACE_BELOW_RING 13
    #define STALK_COLOR_ABOVE_RING 14
    #define STALK_COLOR_BELOW_RING 15
    #define VEIL_TYPE 16
    #define VEIL_COLOR 17
    #define RING_NUMBER 18
    #define RING_TYPE 19
    #define SPORE_PRINT_COLOR 20
    #define POPULATION 21
    #define HABITAT 22

    long loadFile(FILE *pFile, long *errcount);
    static int loadValues(char *line, long lineno);
    static char delim;

    long loadFile(FILE *pFile, long *errcount){
        static char sInputBuf[BUFFER_SIZE];
        long lineno = 0L;

        if(pFile == NULL)
            return RET_FAIL;
        // load each line into the static buffer until end of file
        while(fgets(sInputBuf, sizeof sInputBuf, pFile) != NULL){
            // skip first line (headers)
            if(++lineno == 1)
                continue;
            // jump over empty lines
            if(sInputBuf[0] == '\n' || sInputBuf[0] == '\r' || sInputBuf[0] == '\0')
                continue;
            // set pFields array pointers to null-terminated string fields in sInputBuf
            if(loadValues(sInputBuf, lineno) == RET_FAIL){
                (*errcount)++;
                if(*errcount > MAXERRS)
                    break;
            } else {
                // On return pFields array pointers point to loaded fields,
                // ready for load into DB or whatever.
                // Fields can be accessed via pFields, e.g.
                printf("Class=%s, cap-shape=%s, cap-surface=%s\n",
                       pFields[CLASS], pFields[CAP_SHAPE], pFields[CAP_SURFACE]);
            }
        }
        return lineno;
    }

    static int loadValues(char *line, long lineno){
        char *cptr;
        int fld = 0;
        int inquote = FALSE;
        char ch;
        size_t len;

        if(line == NULL)
            return RET_FAIL;
        // strip trailing "\r\n" or "\n"
        len = strlen(line);
        while(len > 0 && (line[len-1] == '\r' || line[len-1] == '\n'))
            line[--len] = '\0';

        cptr = line;
        pFields[fld] = cptr;
        while((ch = *cptr) != '\0'){
            if(ch == '"'){
                if(!inquote)
                    pFields[fld] = cptr + 1;
                else
                    *cptr = '\0';   // zero out " and jump over it
                inquote = !inquote;
            } else if(ch == delim && !inquote){
                *cptr = '\0';       // end of field, null terminate it
                if(++fld >= NUM_FIELDS){
                    // check before writing so pFields is never indexed out of bounds
                    fprintf(stderr, "Expected field count (%d) exceeded on line %ld\n", NUM_FIELDS, lineno);
                    return RET_FAIL;
                }
                pFields[fld] = cptr + 1;
            }
            cptr++;
        }
        if(fld < NUM_FIELDS - 1){
            fprintf(stderr, "Expected field count (%d) not reached on line %ld\n", NUM_FIELDS, lineno);
            return RET_FAIL;
        }
        return RET_OK;
    }

    int main(int argc, char **argv){
        FILE *fp;
        long errcount = 0L;
        long lines = 0L;

        if(argc != 3){
            printf("Usage: %s csvfilepath delimiter\n", basename(argv[0]));
            return RET_FAIL;
        }
        // argv[1] holds the path to the csv file, argv[2] the delimiter
        if((delim = argv[2][0]) == '\0'){
            fprintf(stderr, "delimiter must be specified\n");
            return RET_FAIL;
        }
        fp = fopen(argv[1], "r");   // open the csv file in read mode
        if(fp == NULL){
            fprintf(stderr, "Error opening file: %d\n", errno);
            return RET_FAIL;
        }
        lines = loadFile(fp, &errcount);
        fclose(fp);
        printf("Processed %ld lines, encountered %ld error(s)\n", lines, errcount);
        if(errcount > 0)
            return RET_FAIL;
        return RET_OK;
    }
Practical Data Mining Using C Language Lab Report. https://studentshare.org/logic-programming/2067497-practical-data-mining-project
