Email Classification Data Set Intelligent System Assignment Example | Topics and Well Written Essays

Email Classification Data Set

Literature Review

A data set can be defined as a collection of data in tabular form, in which each column represents a particular variable and each row corresponds to one record (datum) in the study, listing a value for each of the variables. Various features are used to describe the structure and properties of a data set. These include the number and types of its attributes or variables, together with the statistical measures (e.g. standard deviation) that can be applied to them (Hirsh, 1990). Classification rules for email messages induced by machine learning systems are judged on two distinct criteria: classification accuracy on an independent test set, and classification complexity, which the machine learning community recommends keeping low so that end users can retrieve information easily and quickly (Cohen, 1996).

Research has indicated that simple rules are likely to achieve very high accuracy on a number of datasets. According to Shavlik et al. (1991), the accuracy of perceptrons is often hardly distinguishable from that of more complicated learning algorithms. Two other studies, Rendell and Seshu (1990) and Mingers (1989), support the same finding. These studies explain that many real-world datasets are easy to learn, and that accuracy hardly ever decreases as pruning becomes increasingly severe. Against this background, artificial neural networks are computing techniques modeled on how the human brain performs computations; they fit non-linear functions and recognize patterns.
In recent times, neural networks have been applied in the financial, telecommunications, transportation, electronics, entertainment, aerospace, banking, defense, manufacturing and robotics industries (Minsky and Papert, 1969). With the increasing rate of information exchange over the internet, interest has grown in how such information should be managed, given the massive volumes received and delivered. For instance, information industries such as media companies have demanded automated classification of email and other electronic messages into more user-specific and user-friendly packaging, in addition to the extraction of information from chronologically ordered electronic message streams.

This paper identifies a severe lack of benchmark collections, which has obstructed the study of these problems and the development of appropriate strategies and lasting remedies for this drawback in electronic information management. This research introduces a data set, the Enron corpus, which will be analyzed for its suitability for email folder classification, and tentatively provides results of state-of-the-art classification under a range of conditions. Although the need to classify such information is not new, the technique for doing so is the concern here. Previous work on classifying electronic mail was based on applications that included priority filtering of messages, assignment of messages to user-created folders, text classification, and security measures aimed at identifying spam content carried within messages (data packets) passed over communication protocols.
This project, however, focuses on the particular problem of assigning messages to user folders based on the users' own strategies for classifying folders; in this case the users are students. The project also considers three kinds of email content in classification, i.e. unstructured text, categorical text, and numeric data, with consideration of the relationships among them.

Project preview

This case study is an investigation into 100 students who use electronic mail in their daily operations and class assignments. In total, these students collectively had 500,000 email messages. To reach a sound conclusion, unnecessary folders, such as computer-generated folders (e.g. "my documents") that only contained duplicates of already stored messages, were deleted or ignored. This step was recommended to avoid the confusion that self-organized or computer-generated folders would otherwise have caused with respect to the project's goal of analyzing the suitability and flexibility of the corpus data set for exploring appropriate classification of email messages. Among the outstanding features of the Enron corpus are its large number of users (100 students), messages (500,000 emails) and folders containing the classified messages, which make it suitable for evaluating email classification methods. Additionally, the Enron corpus was found to have a large number of threads, which makes it possible to test email analysis methods that use thread information.

2. Neural Network design

The artificial neural network (ANN) has been the preferred technique for classification, including in previous efforts to classify electronic mail into spam and non-spam, because its structure and functioning closely resemble those of the human brain. For this project, the main focus was on the artificial neural network and its training using the back propagation algorithm. For a clear understanding of the ANN layer architecture used in categorizing the students' email, below is a diagrammatic representation equivalent to the target categories used in classification.

Back propagation was applied to adjust the artificial neural network model in two ways: discriminatively, based on the performance of the network, with the aim of minimizing the error between the required classification targets and the obtained targets; and generatively, to give the network the capability of reproducing the original information, with the intention of minimizing the error between the model-generated data and the original data. As this description indicates, the project was based on a neural network design with further consideration of a Deep Belief Network (DBN) model, which allows the network to generate visible activations from the states of the hidden units through a restricted Boltzmann machine (RBM) located between any two consecutive layers of the network (refer to the diagram of the five layers of the ANN shown above). Below are simplified diagrammatic representations of the differences between RBM, DBN and ANN with respect to data visibility between two layers of the network.
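To make the discriminative use of back propagation described above more concrete, the following is a minimal sketch in Python, not the project's actual network. It assumes a tiny network with two inputs, two hidden units and one output, sigmoid activations and a squared-error loss; the toy data, the initial weights, the learning rate and all function names are illustrative assumptions. The generative (RBM/DBN) part is not covered.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return (hidden activations, output) for one input vector."""
    # Each weight list holds one weight per input plus a trailing bias.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws[:-1], x)) + ws[-1])
              for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out[:-1], hidden)) + w_out[-1])
    return hidden, output

def mean_squared_error(data, w_hidden, w_out):
    return sum((forward(x, w_hidden, w_out)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, w_hidden, w_out, lr=0.5, epochs=500):
    """Online back propagation; the weight lists are updated in place."""
    for _ in range(epochs):
        for x, target in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Error signal at the output (constant factor 2 folded into lr).
            delta_out = (output - target) * output * (1.0 - output)
            # Propagate the error back through the output weights
            # before those weights are changed.
            delta_hidden = [delta_out * w_out[j] * h * (1.0 - h)
                            for j, h in enumerate(hidden)]
            for j, h in enumerate(hidden):
                w_out[j] -= lr * delta_out * h
            w_out[-1] -= lr * delta_out
            for j, ws in enumerate(w_hidden):
                for i, xi in enumerate(x):
                    ws[i] -= lr * delta_hidden[j] * xi
                ws[-1] -= lr * delta_hidden[j]
    return w_hidden, w_out

# Hypothetical toy data: two binary features per message, target 1 if the
# message belongs to a given folder (here simply the logical OR of the features).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w_hidden = [[0.5, -0.4, 0.1], [-0.3, 0.6, -0.2]]  # arbitrary starting weights
w_out = [0.3, -0.2, 0.1]

loss_before = mean_squared_error(data, w_hidden, w_out)
train(data, w_hidden, w_out)
loss_after = mean_squared_error(data, w_hidden, w_out)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The sketch only shows the error between required and obtained targets shrinking as the weights are adjusted; a real email classifier would use feature vectors extracted from the messages and one output per target folder.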
Dataset processing sequence

For the 100 students identified as the sample population, Enron datasets were gathered and organized using a Cognitive Assistant that learnt and organized all data within the project. Each student had a folder containing folders by category, split by year of study and further into sub-folders representing subjects, units or courses. Only the necessary folders were considered, for accuracy and consistency during pre-processing. This classification is as shown below.

As shown above, the dataset was organized with each student (i.e. Mike, Juliet and Shali) under Folder One, which represents a sub-directory of the email directory. Folder Two represents the categories or classes of folders in which each student's data is stored. Folder Three is a sub-folder of Folder Two that divides the emails further by the specific year of study of each student. The last level of folder is classification by unit or course of study. This classification scheme helps achieve the main goal of the project. A closer look at the architecture shows a message structure composed of two sections, which include a message header consisting of dates, titles, attachments and time of delivery.

3. Neural network performance

Evaluating the ANN on email classification required adjusting all layers of the ANN, as shown in the architecture above, to achieve no or minimal training error. This called for a series of steps, including the extraction of email features; the resulting feature vectors were then applied to the artificial neural network system, as illustrated below. The system was evaluated by applying each email in the test set to the ANN classifier and then comparing the classifier's decision with the actual class label, to determine whether the classification was error-free.
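The folder hierarchy and the header section described above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions rather than the project's pre-processing code: the path layout (student/category/year/course/message), the folder names, the sample message and the function names are all hypothetical.

```python
from email import message_from_string
from pathlib import PurePosixPath

def folder_labels(path):
    """Split a student/category/year/course/message path into named levels."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError("expected student/category/year/course/message")
    student, category, year, course = parts[:4]
    return {"student": student, "category": category,
            "year": year, "course": course}

def split_message(raw):
    """Separate selected header fields from the body of a raw message."""
    msg = message_from_string(raw)
    header = {"date": msg["Date"], "subject": msg["Subject"]}
    return header, msg.get_payload()

# Hypothetical sample message and path, made up for illustration.
raw = (
    "Date: Mon, 5 Mar 2012 10:15:00 +0000\n"
    "Subject: Assignment 2 deadline\n"
    "\n"
    "Please submit by Friday.\n"
)
labels = folder_labels("Mike/coursework/year2/databases/1.txt")
header, body = split_message(raw)
print(labels)
print(header, body.strip())
```

The course-level label would serve as the target class, while the header fields and body text would be the inputs to feature extraction.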
Mathematically, the goals of simplicity and precision in email classification are expressed as follows, where the selection error is the misclassified fraction for a given student (e.g. Mike):

\[ \text{Error in student precision} = \frac{\text{Misclassified e-mails}}{\text{Total e-mails per student}} \times 100\% \]

\[ \text{Accuracy of student selection} = (1 - \text{selection error}) \times 100\% \]

Experimental deductions

In summary, for this case study a deep artificial neural network classification technique was the center of analysis for email and folder classification, with DBN and RBM as the major training algorithms. The Enron corpus was preferred because of its unique features, which had profound advantages for the case study. These characteristics included its widespread use by most institutions, which makes it acceptable to users; its resistance to distortions that would have resulted from noise during data transmission; its rich datasets; and its ease of learning, as shown during the training phases. Its level of accuracy (which was also noted to be high) depended on the users, with 100% accuracy attainable, whereas other techniques rated slightly lower and could not reach 100% precision.

Data collected for accuracy analysis in a noisy environment for SVM, RB K-NN, NBM, GB K-NN and IBk, among five randomly selected students, were as in the table below:

Student    RB      RB K-NN   GB K-NN   IBk     SVM
Mike       77.56   44.78     73.42     20.78   69.54
Juliet     30.54   40.32     64.37     33.51   55.40
Shali      51.89   65.05     79.02     67.19   73.97
Rachael    82.63   41.89     49.38     28.63   39.22
James      41.22   46.40     75.56     48.31   54.98
MEAN       56.77   47.69     68.35     39.68   58.62

From the table above, the GB K-NN (Gaussian) classifier had the highest mean accuracy (68.35), ahead of the SVM (58.62), RB (56.77), RB K-NN (47.69) and IBk (39.68) classifiers, when tested using balanced training sets with a Gaussian distribution. Considering the increasing use of electronic communication, the rate of email use continues to advance.
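The error formula and the accuracy table in this section can be checked with a short Python script. The sketch below assumes that accuracy is the complement of the error percentage, and the counts in the usage lines (5 misclassified out of 50) are made up for illustration; the table values are copied from the text, and the function names are hypothetical.

```python
from statistics import mean

def error_percent(misclassified, total):
    """Misclassified e-mails as a percentage of a student's total e-mails."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * misclassified / total

def accuracy_percent(misclassified, total):
    """Assumed complement of the error percentage."""
    return 100.0 - error_percent(misclassified, total)

CLASSIFIERS = ["RB", "RB K-NN", "GB K-NN", "IBk", "SVM"]
ACCURACY = {
    "Mike":    [77.56, 44.78, 73.42, 20.78, 69.54],
    "Juliet":  [30.54, 40.32, 64.37, 33.51, 55.40],
    "Shali":   [51.89, 65.05, 79.02, 67.19, 73.97],
    "Rachael": [82.63, 41.89, 49.38, 28.63, 39.22],
    "James":   [41.22, 46.40, 75.56, 48.31, 54.98],
}

# Mean accuracy per classifier (one column of the table at a time).
column_means = {
    name: round(mean(row[i] for row in ACCURACY.values()), 2)
    for i, name in enumerate(CLASSIFIERS)
}
best = max(column_means, key=column_means.get)

print(error_percent(5, 50), accuracy_percent(5, 50))  # hypothetical counts
print(column_means)
print("highest mean accuracy:", best)
```

Recomputing the column means this way reproduces the MEAN row of the table and makes it easy to see which classifier ranks highest on average.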
The Enron corpus email database, as explained above, has enabled and will continue to enable a large, worldwide pool of shared experience and techniques in managing email messages. For instance, the actual time (in seconds) taken by RF compared with Stacking and Boosted DTs in experimentation was as below:

Dataset     RF              Stacking         Boosted DTs
Lingspam    15.77 seconds   366.77 seconds   88.20 seconds
PU          6.92 seconds    56.45 seconds    26.52 seconds

Further research has shown that, under standard evaluation techniques, random training/test set split methods are the least appropriate, because of the time-dependent nature of the data they handle. Step-incremental time-based split methods have therefore been proposed, despite their complexity, to provide more realistic evaluation sequences and to allow examination of the statistical significance of the results for all folders and sub-folders. Comparative illustration:

Student    RF     DT     NB     SVM
Mike       4.21   1.78   0.42   9.54
Juliet     4.54   0.32   0.37   55.40
Shali      4.89   0.05   0.02   3.97
Rachael    5.68   1.89   0.38   39.22
James      5.42   1.40   0.56   5.98
MEAN       4.95   1.02   0.35   22.82

RF still has the advantage of a small number of trees and random characteristics, unlike the complex SVM.

Conclusion

In both supervised and semi-supervised training settings, RF gave outstanding classification performance compared with other algorithms such as SVM, NB and DT (Crawford et al., 2002). It was also deduced that filing email into defined folders is not an easy task: automatic classification depends heavily on the user's classification style, i.e. use of topic, sender, etc.

References

Hirsh, H. (1990). Learning from Data with Bounded Inconsistency. In B. W. Porter & R. J. Mooney (Eds.), Proceedings of the Seventh International Conference on Machine Learning (pp. 32-39). Morgan Kaufmann.

Rendell, L. and Seshu, R. (1990). Learning Hard Concepts through Constructive Induction. Computational Intelligence, 6, 247-270.

Shavlik, J., Mooney, R. J. and Towell, G. (1991). Symbolic and Neural Learning Algorithms: An Experimental Comparison. Machine Learning, 6, 111-143.

Awad, W. A. and Elseuofi, S. M. (2011). Machine learning methods for E-mail classification. International Journal of Computer Applications (0975-8887), 16(1), February 2011.

Mingers, J. (1989). An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4(2), 227-243.

Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, expanded edition.

Crawford, E., Koprinska, I. and Patrick, J. (2002). A multi-learner approach to e-mail classification. In Proc. 7th Australasian Document Computing Symposium (ADCS).

Cohen, W. (1996). Learning rules that classify e-mail. In Proc. AAAI Symposium on Machine Learning in Information Access, pp. 18-25.
