Email Classification Data Set Intelligent System - Assignment Example

Summary
The paper "Email Classification Data Set Intelligent System" discusses the project based on Neural Network design with further consideration of the Deep Belief Network model that made it possible for the network to generate visible activations based upon the hidden units states…

Extract of sample "Email Classification Data Set Intelligent System"

Email Classification Data Set

Literature Review

A data set can be defined as a collection of data in tabular form, in which each column represents a particular variable and each row corresponds to one member of the data set, listing a value (datum) for each variable. Various features are used to describe the structure and properties of a data set. These include the number and types of attributes or variables, together with the statistical measures (e.g. standard deviation) that can be applied to them (Hirsh, 1990).

Classification rules for email messages induced by machine learning systems are judged on two distinct criteria: classification accuracy on an independent test set, and classification complexity, which the machine learning community recommends keeping low so that end users can retrieve information easily and quickly (Cohen, 1996). Research has indicated that simple rules are likely to achieve very high accuracy on a number of datasets. According to Shavlik et al. (1991), the accuracy of perceptrons is often hardly distinguishable from that of more complicated learning algorithms. Two other studies, Rendell and Seshu (1990) and Mingers (1989), support the same conclusion. These studies suggest that many real-world datasets are relatively easy to learn, and that accuracy rarely decreases as pruning becomes increasingly severe.

Against this background, artificial neural networks are computing techniques modeled on how the human brain performs computations; they fit non-linear functions and recognize patterns.
In recent times, artificial neural networks have been applied in the financial, telecommunications, transportation, electronics, entertainment, aerospace, banking, defense, manufacturing and robotics industries (Minsky and Papert, 1969). As the rate of information exchange over the internet has increased, so has interest in how such information should be managed, given the sheer volume received and delivered. For instance, information industries such as media companies have demanded automated classification of email and other electronic messages into more user-specific and user-friendly packaging, as well as extraction of information from chronologically ordered streams of electronic messages.

This paper identifies an enormous lack of benchmark collections, which has obstructed the study of these problems and the development of strategies and remedies that could form lasting solutions to this drawback in electronic information management. This research presents a new benchmark collection, the Enron corpus, analyzes its suitability for email folder classification, and provides tentative results for state-of-the-art classification under a range of conditions. Although the idea of classifying such information, and the need for it, is not new, the technique used to do so is the concern here. Previous work on classifying electronic mail was based on applications that included priority filtering of messages, assignment of messages to user-created folders, text classification, and security measures aimed at identifying spam content carried within messages (data packets) passed over such communication protocols.
This project, however, focuses on the particular problem of assigning messages to user folders according to the users' own strategies for classifying folders; the users in this case are students. The project also considered three types of email content in the classification, namely unstructured text, categorical text, and numeric data, along with their relationships in general.

Project preview

This case study investigates 100 students who use electronic mail in their daily operations and class work assignments. In total, these students collectively had 500,000 email messages. To reach sound conclusions, unnecessary folders, such as computer-generated folders (for example, "my documents") that only contained duplicates of already stored messages, were deleted or ignored. This step was taken to avoid the confusion that self-organized or computer-generated folders would otherwise have caused with respect to the project's goal of analyzing the suitability and flexibility of the corpus data set for exploring appropriate classification of email messages. Among the outstanding features of the Enron corpus are its large number of users (100 students), messages (500,000 emails) and folders containing the messages, which make it suitable for evaluating email classification methods. Additionally, the Enron corpus was found to contain a large number of threads, which makes it possible to test email analysis methods that use thread information.

2. Neural Network design

Artificial neural networks (ANNs) have been the preferred technique for classification, including in previous efforts to classify email messages as spam or non-spam, because of the close similarity between their structure and the functioning of the human brain. For this project, the main focus was on research into ANNs and their training using the back propagation algorithm.

[Figure: architecture of the ANN layers used in categorizing the students' emails, with outputs corresponding to the target categories used in classification]

The back propagation algorithm was applied to adjust the ANN model according to the performance of the network. Two kinds of model were involved: discriminative models, which aim to minimize the error between the required classification targets and the obtained targets, and generative models, which give the network the capability to generate the original data and aim to minimize the error between the model-generated data and the original data.

From the description above, this project was based on a neural network design with further consideration of the Deep Belief Network (DBN) model, which made it possible for the network to generate visible activations from the states of the hidden units by means of a restricted Boltzmann machine (RBM) located between any two consecutive layers of the network (refer to the five-layer ANN diagram above).

[Figure: simplified diagrams of the differences between RBM, DBN and ANN in data visibility between two layers of the network]
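To make the RBM idea above more concrete, the following is a minimal sketch of how visible activations can be generated from hidden unit states. It is not the project's implementation: the class name `TinyRBM`, the layer sizes, the random untrained weights and the binary feature vector are all assumptions made for illustration, and no training step (such as contrastive divergence) is included.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Hypothetical minimal restricted Boltzmann machine between two layers."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        # weights[i][j] connects visible unit i to hidden unit j (untrained).
        self.weights = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                        for _ in range(n_visible)]
        self.visible_bias = [0.0] * n_visible
        self.hidden_bias = [0.0] * n_hidden

    def hidden_probs(self, visible):
        """P(h_j = 1 | v) for every hidden unit j."""
        return [
            sigmoid(self.hidden_bias[j]
                    + sum(v * self.weights[i][j] for i, v in enumerate(visible)))
            for j in range(len(self.hidden_bias))
        ]

    def visible_probs(self, hidden):
        """P(v_i = 1 | h): visible activations generated from hidden states."""
        return [
            sigmoid(self.visible_bias[i]
                    + sum(h * self.weights[i][j] for j, h in enumerate(hidden)))
            for i in range(len(self.visible_bias))
        ]

    def sample(self, probs):
        """Draw binary states from per-unit Bernoulli probabilities."""
        return [1 if self.rng.random() < p else 0 for p in probs]

    def reconstruct(self, visible):
        """One up-down pass: v -> sampled h -> generated visible activations."""
        hidden_states = self.sample(self.hidden_probs(visible))
        return self.visible_probs(hidden_states)


rbm = TinyRBM(n_visible=6, n_hidden=3)
email_features = [1, 0, 1, 1, 0, 0]  # made-up binary word-presence features
reconstruction = rbm.reconstruct(email_features)
print([round(p, 3) for p in reconstruction])
```

In a DBN, several such machines would be stacked so that the hidden layer of one serves as the visible layer of the next; this sketch covers a single pair of layers only.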
Dataset processing sequence

For the 100 students identified as the sample population, Enron datasets were gathered and organized using a Cognitive Assistant that learnt and organized all data within the project. Each student in the study had a folder containing further folders by year-of-study category, split in turn into sub-folders representing subjects, units or courses. Only necessary folders were considered in this study, for accuracy and consistency during pre-processing.

[Figure: folder hierarchy used to organize the dataset]

The dataset was organized with each student (i.e. Mike, Juliet and Shali) under Folder One, which is a sub-directory of the email directory. Folder Two represents the categories or classes of folders in which the data under each student's docket is stored. Folder Three is a sub-folder of Folder Two that divides the emails further by specific year of study for each student. The last level of folder is classification by unit or course of study. This classification scheme helps achieve the main goal of the project. A closer look at the architecture shows a message structure composed of two sections, one of which is the message header, consisting of dates, titles, attachments and time of delivery.

3. Neural network performance

Evaluating the ANN on the classification of emails required adjusting all layers of the ANN in the architecture above to achieve minimal or no training error. This called for a series of steps, including extraction of email features; the resulting feature vectors were then applied to the ANN system.

[Figure: email feature extraction feeding the ANN system]

The system was evaluated by applying each email in the test set to the ANN classifier and then comparing the classifier's decision with the actual class label, with the aim of error-free classification.
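The folder hierarchy and message structure described above can be sketched roughly as follows. The exact directory layout (`email/<student>/<category>/<year>/<course>/<file>`), the function names, and the sample message are hypothetical. Treating everything after the header as the message body is likewise an assumption, since the text names only the header section. The header parsing uses Python's standard `email` package.

```python
from email import message_from_string
from pathlib import PurePosixPath


def labels_from_path(path):
    """Split a message path into the four folder levels.

    Assumed (hypothetical) layout:
    email/<student>/<category>/<year>/<course>/<message file>
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 6 or parts[0] != "email":
        raise ValueError(f"unexpected folder layout: {path}")
    _, student, category, year, course, _ = parts
    return {"student": student, "category": category,
            "year": year, "course": course}


def header_fields(raw_message):
    """Extract a few header fields and the remaining text from a raw message."""
    msg = message_from_string(raw_message)
    return {
        "date": msg["Date"],
        "subject": msg["Subject"],
        "sender": msg["From"],
        "body": msg.get_payload().strip(),
    }


# Made-up message used only to exercise the two helpers.
raw = (
    "From: tutor@example.edu\n"
    "Date: Mon, 03 Mar 2014 09:15:00 +0000\n"
    "Subject: Assignment 2 deadline\n"
    "\n"
    "Please submit by Friday.\n"
)
print(labels_from_path("email/Mike/coursework/year2/databases/0001.txt"))
print(header_fields(raw))
```

The course-level folder name would serve as the target class for a message, while the header fields and body text would be the raw material for feature extraction.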
Mathematically, the goals of simplicity and precision in email classification are expressed as follows.

$$\text{Error in student precision} = \frac{\text{Misclassified e-mail}}{\text{Total e-mail per student}} \times 100\%$$

$$\text{Accuracy of student selection} = \left(1 - \text{student selection error, e.g. Mike's selection error}\right) \times 100\%$$

Experimental deductions

In summary, for this case study a deep artificial neural network classification technique was the center of analysis for email and folder classification, with DBN and RBM as the major training algorithms. The Enron corpus was preferred because of unique features that offered profound advantages to the case study. These included its widespread use by institutions, which makes it acceptable to users; its resistance to distortions that would result from noise during data transmission; its rich datasets; and its ease of learning, as shown during the training phases. Its accuracy (which was noted to be very high) depended on the users, with 100% accuracy attainable, whereas other techniques rated slightly lower and could not reach 100% precision.

Data collected for accuracy analysis in a noisy environment for SVM, RB K-NN, NBM, GB K-NN, and IBk, among five randomly selected students, were as in the table below:

Student   RB      RB K-NN   GB K-NN   IBk     SVM
Mike      77.56   44.78     73.42     20.78   69.54
Juliet    30.54   40.32     64.37     33.51   55.40
Shali     51.89   65.05     79.02     67.19   73.97
Rachael   82.63   41.89     49.38     28.63   39.22
James     41.22   46.40     75.56     48.31   54.98
MEAN      56.77   47.69     68.35     39.68   58.62

From the table above, it is evident that the RB and GB K-NN (Gaussian) classifiers performed considerably better than the NBM and SVM (Random) classifiers when tested using balanced training sets with a Gaussian distribution.

With the increasing use of electronic communications, email use continues to grow.
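The per-student error measure given earlier in this section (misclassified emails over total emails per student, times 100%) can be computed with a few lines of code. This is a sketch under stated assumptions: the record format, the function name `per_student_error` and the sample data are invented, and accuracy is taken to be the complement of the error.

```python
from collections import defaultdict


def per_student_error(records):
    """Return {student: (error %, accuracy %)} from (student, true, predicted) rows."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for student, true_folder, predicted_folder in records:
        totals[student] += 1
        if predicted_folder != true_folder:
            wrong[student] += 1
    results = {}
    for student, total in totals.items():
        error = 100.0 * wrong[student] / total
        # Assumption: accuracy is the complement of the error percentage.
        results[student] = (error, 100.0 - error)
    return results


# Made-up classifier decisions: (student, actual folder, predicted folder).
records = [
    ("Mike", "databases", "databases"),
    ("Mike", "networks", "databases"),
    ("Mike", "networks", "networks"),
    ("Mike", "maths", "maths"),
    ("Juliet", "maths", "maths"),
    ("Juliet", "physics", "maths"),
]
for student, (error, accuracy) in per_student_error(records).items():
    print(f"{student}: error {error:.2f}%, accuracy {accuracy:.2f}%")
```

Averaging the per-student figures across students would give a summary row comparable to the MEAN row in the accuracy table.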
The Enron corpus email database, as explained above, has enabled and will continue to enable a large, worldwide pool of shared experience and techniques for managing email messages. For instance, the actual time (in seconds) taken by RF compared with Stacking and Boosted DTs in experimentation was as follows:

Dataset    RF      Stacking   Boosted DTs
Lingspam   15.77   366.77     88.20
PU         6.92    56.45      26.52

Further research has shown that, under standard evaluation techniques, random training/test set split methods are the least appropriate because of the time-dependent nature of the data they handle. Step-incremental time-based split methods have therefore been proposed, despite their complexity, to provide more realistic evaluation sequences and to allow effective examination of the statistical significance of the outputs of all folders and their sub-folders.

Comparative illustration:

Student   RF     DT     NB     SVM
Mike      4.21   1.78   0.42   9.54
Juliet    4.54   0.32   0.37   55.40
Shali     4.89   0.05   0.02   3.97
Rachael   5.68   1.89   0.38   39.22
James     5.42   1.40   0.56   5.98
MEAN      4.95   1.02   0.35   22.82

RF still has the advantage of a small number of trees and its random characteristics, unlike the complex SVM.

Conclusion

In both supervised and semi-supervised training settings, RF gave outstanding classification performance compared with other algorithms such as SVM, NB and DT (Crawford et al., 2002). It was also found that filing email into defined folders is not an easy task: automatic classification depends heavily on the user's classification style, e.g. by topic, sender, etc.

References

Hirsh, H. (1990). Learning from Data with Bounded Inconsistency. In B. W. Porter & R. J. Mooney (Eds.), Proceedings of the Seventh International Conference on Machine Learning (pp. 32-39). Morgan Kaufmann.
Rendell, L. and Seshu, R. (1990). Learning Hard Concepts through Constructive Induction. Computational Intelligence, 6, 247-270.
Shavlik, J., Mooney, R. J. and Towell, G. (1991). Symbolic and Neural Learning Algorithms: An Experimental Comparison. Machine Learning, 6, 111-143.
Awad, W. A. and Elseuofi, S. M. (2011). Machine learning methods for E-mail classification. International Journal of Computer Applications (0975-8887), 16(1).
Mingers, J. (1989). An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning, 4(2), 227-243.
Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press, expanded edition.
Crawford, E., Koprinska, I. and Patrick, J. (2002). A multi-learner approach to e-mail classification. In Proc. 7th Australasian Document Computing Symposium (ADCS).
Cohen, W. (1996). Learning rules that classify e-mail. In Proc. AAAI Symposium on Machine Learning in Information Access, pp. 18-25.
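The step-incremental time-based split mentioned in the discussion of evaluation methods can be illustrated with a short sketch. The essay does not specify the procedure in detail, so this is one plausible reading: messages are sorted by date, and at each step the classifier is trained on everything seen so far and tested on the next block. The function name, the `step` parameter and the message records are hypothetical.

```python
from datetime import datetime


def incremental_time_splits(messages, step):
    """Yield (train, test) pairs: train on all earlier messages, test on the next `step`."""
    ordered = sorted(messages, key=lambda m: m["date"])
    for start in range(step, len(ordered), step):
        yield ordered[:start], ordered[start:start + step]


# Made-up messages, deliberately out of chronological order.
messages = [
    {"date": datetime(2014, 1, day), "folder": folder}
    for day, folder in [(5, "maths"), (1, "databases"), (3, "maths"),
                        (2, "networks"), (4, "databases"), (6, "networks")]
]
for train, test in incremental_time_splits(messages, step=2):
    print(len(train), "train ->", [m["date"].day for m in test])
```

Unlike a random split, every test message here is later than every training message, which respects the time-dependent nature of email data.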
