Web Content Mining
1. Introduction

Content mining is the process of searching for and retrieving information relevant to a topic of interest. The term became connected with the Web following last decade’s massive spread of information resources and the emergence of a number of somewhat orthogonal retrieval methods. While the vast majority of Web information to date is text based, media-based information, such as images and audio/video recordings, is becoming increasingly available. From a theoretical standpoint, information retrieval (IR) is the science of searching for documents, for information within documents, and for metadata about documents (Wikipedia.org). An information retrieval process involves a query (an expression about the topic under search), a search method, and one or more target documents that potentially contain the sought information. IR has deep theoretical Computer Science roots, but it also draws on what has become a daily activity of tens of millions of people: seeking information of interest on the Web, for all kinds of purposes.

Focusing on text as the currently most available and most sought form of content, search methods, along with their underlying methodologies and supporting implementations, vary greatly in structure. While search engines adopt the least structured methods, portals and directories attempt a fully structured, taxonomy-based approach. Wikis have evolved as a semi-structured approach that gained popularity as an effective means of topic-based information retrieval. Section 2 of this paper first examines search engine techniques, advantages, and drawbacks. Section 3 discusses IR through directories and portals. Section 4 focuses on the background and usability of wikis (specifically Wikipedia), a state-of-the-art technique. A conclusion is presented in Section 5.

2. Search Engines

Search engines are the earliest and most popular way of searching Web information. There are two basic types of engines: 1) directory based and 2) crawler based. The first type relies on manual or automated URL submissions, which may be human-reviewed. The second adopts sophisticated mathematical algorithms for launching Internet agents that follow each and every link in every discovered page. In both cases, pages (and not whole sites) are indexed into a relational database, where they can later be searched by keywords. Web pages (the information resources or documents) can also be searched using more advanced methods that involve Boolean expressions to specify inclusions and/or exclusions, as well as to indicate terms required to appear in results. Not all search engines accept identical means of advanced search. Typically, most users repeatedly try simple keywords until they get an interesting set of documents. No matter how sophisticated the search algorithms are, the searcher’s sifting through text within billions of indexed pages is a highly nondeterministic process.

There are also commercial factors involved in the relative ranking of pages. Many search engine optimization (SEO) firms now specialize in maintaining a good rank in search results for their clients’ sites. It is no simple matter for a submitted site or page to actually be found through a search engine. SEO has become a discipline that involves statistical, marketing, and financial aspects. Search engines typically produce highly unstructured results.
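To make the indexing and Boolean querying described above concrete, the following is a minimal, hypothetical Python sketch (the pages, URLs, and query terms are invented for illustration). It builds a simple inverted index over a handful of pages and answers a keyword query with optional excluded terms; production engines add crawling, ranking, and database-backed storage on top of this basic idea.

```python
from collections import defaultdict

# A toy corpus: page URL -> page text (purely illustrative data).
pages = {
    "http://example.org/a": "web content mining and information retrieval",
    "http://example.org/b": "search engines index pages by keywords",
    "http://example.org/c": "wikis support collaborative content editing",
}

# Build an inverted index: term -> set of pages containing that term.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

def search(required, excluded=()):
    """Return pages containing all required terms and none of the excluded ones."""
    results = None
    for term in required:
        hits = index.get(term.lower(), set())
        results = hits if results is None else results & hits
    results = results or set()
    for term in excluded:
        results -= index.get(term.lower(), set())
    return sorted(results)

# Roughly analogous to the Boolean query: content AND mining NOT wikis
print(search(["content", "mining"], excluded=["wikis"]))
```

The inverted index is what allows a keyword query to be answered without scanning the full text of every indexed page.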
There is often further mining within a candidate resulting page to find the relevant information. There is also a very large variation in the number of results a user can expect (from zero to over a million), not to mention the overlap among resulting pages, since many of them may be sourced from the same site. Search engines do not effectively understand us, due to the current lack of ways to precisely indicate the context within which we seek a piece of information. This is a main driver of the emerging Semantic Web technology, where search would not be simply syntax driven. Search engines, though, will remain the most popular search technique due to their simplicity, ubiquity, and suitability for certain situations. The development and eventual maturity of other models will help in recognizing which IR instances they are most suited for. In the writer’s viewpoint, search engines are most effective in cases such as the following:
- The user is looking for the home page of a corporate supplier or provider.
- The user is unsure of the commonly used name of the topic being searched.
- The user needs a broad perspective about the topic being searched.
- The user needs a variety in the nature of resulting pages (commercial, informational, academic, etc.).
It is outside the scope of this paper to delve into the specifics and properties of existing search engines, because the author chose to focus the discussion on wikis (Section 4) as a mediatory solution for information retrieval.

3. Portals, Directories, and Online Libraries

At the other extreme, online directories are the most structured method of presenting information for retrieval. A directory is a highly organized, taxonomy-based front end of a relational database whose records follow a hierarchical categorization of topics. An example is a Yellow Pages directory. A portal is a more general term that has become associated with the Web. It is similar in many ways to a directory, but is more general and is meant to cover all possible topics (unless otherwise desired). It initially emerged as a search engine companion when Yahoo.com was first launched. An Internet Web portal attempts to classify all possible information into a fixed hierarchical taxonomy. Looking through a portal requires the user to become familiar with the way it classifies information, which may not always fit the user’s conception. Although its presentation is very well structured, the terminal links often lead to few, and not necessarily relevant, pages. One may attribute that to the difficulty of classifying all areas of knowledge under one hierarchy. As well, the fact that entries are manually added may limit the number of documents under the terminal links. Portals have proven more effective for use within corporate intranets, where the scope of information to be disseminated is well defined and easier to structure (a small illustrative sketch of such a taxonomy follows below).

The most structured of all would naturally be a library. Online libraries have been created out of conventional ones by OCR-scanning library content and posting it online using well-known library indexing techniques. OCR (Optical Character Recognition) allows the scanned document text to be recognized as text, and not just as part of an image, hence enabling the search of text strings within a document. Online libraries that specialize in one or more recognizable knowledge areas have also been implemented, where books and other forms of publications are posted in PDF format.
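In contrast to keyword search, the taxonomy-based organization of directories and portals described above can be sketched as a fixed category tree. The following minimal Python example (category names and URLs are invented for illustration) files pages under a hierarchical taxonomy and retrieves them by category path; real portals back this with relational databases and editorial review.

```python
# A toy fixed taxonomy: nested categories ending in lists of page URLs.
taxonomy = {
    "Computers": {
        "Internet": {
            "Searching": ["http://example.org/engines", "http://example.org/tips"],
            "Wikis": ["http://example.org/wikipedia"],
        },
        "Software": {
            "Databases": ["http://example.org/sql"],
        },
    },
}

def lookup(path):
    """Walk a category path such as ['Computers', 'Internet', 'Wikis']."""
    node = taxonomy
    for category in path:
        if not isinstance(node, dict) or category not in node:
            return []  # the path does not exist in the hierarchy
        node = node[category]
    # A terminal node is a list of pages; an inner node lists its sub-categories.
    return node if isinstance(node, list) else sorted(node)

print(lookup(["Computers", "Internet", "Wikis"]))  # pages filed under the category
print(lookup(["Computers", "Internet"]))           # sub-categories at this level
```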
A good example is Safarionline.com in IT, where a user can search for books by a limited variety of well-known keys (e.g., title, author, ISBN …), browse through tables of contents and individual chapters, and download chapters of choice. It can be foreseen that this model could potentially cover all conventional publications, where information retrieval is most and best structured. However, the Web content a user may be seeking is far wider in scope than formal publications, so its mining cannot simply be replaced by online libraries and structured directories/portals.

4. Wikis

A wiki is a collection of web pages designed to enable anyone who can use a simple markup language, similar to HTML, to contribute content (Mader and Stewart, 2007). Wikis are designed to create a collaborative environment where contributors can review, edit, and add content. Although they may contain a portal of categories, wikis appear to the user to have a flat structure, with each page containing an article about a specific topic with relevant links and references. While they do offer search capability, the user typically needs to search for the topic name rather than arbitrary text within the topic article. Wikis are gaining recognition as effective knowledge-based IR systems that make for an intermediary environment between unstructured and fully structured systems (Milne, 2008).

There are both public and corporate or community-driven wikis. The most well-known and fully established public (Internet) wiki is Wikipedia.org. It is very unlikely at this time that a user will type in a topic name without finding a full, comprehensive article about it (the site currently contains more than 2 million topics). That includes names of entities and corporations, and not only objects and terms. It has become more common to see startup businesses, especially in IT, making their case (presenting their brand) in a wiki page article that carries their commercial name as a topic title. The wiki page then acts like a white paper, objectively explaining the techniques used within the company’s products and what distinguishes them from other similar products.

4.1 Wiki Characteristics

Wikis have been characterized as follows (Leuf and Cunningham, 2001):
- All users are invited to create new pages or to edit any existing pages, using a common web browser without any add-ons.
- Wikis promote meaningful topic associations between different pages by making page link creation almost intuitively easy.
- A wiki is not a carefully/centrally crafted site for visitors; it seeks the involvement of users/visitors in an ongoing process of creation, collaboration, and refinement that constantly changes and improves a topic page’s quality.

The following is quoted from Nickull et al. (2009): “In Wikipedia, rather than one authority (typically a committee of scholars) centrally defining all subjects and content, people all over the world who are interested in a certain topic can collaborate asynchronously to create a living, breathing work. Wikipedia combines the collaborative aspects of wiki sites (websites that let visitors add, remove, edit, and change content) with the presentation of authoritative content built on rich hyperlinks between subjects to facilitate ultra-fast cross-references of facts and claims. Wikipedia does have editors, but everyone is welcome to edit. Volunteers emerge over time, editing and re-editing articles that interest them. Consistency and quality improve as more people participate, though the content isn’t always perfect when first published.
Anonymous visitors often make edits to correct typos or other minor errors. Defending the site against vandals (or just people with agendas) can be a challenge, especially on controversial topics, but so far the site seems to have held up. Wikipedia’s openness allows it to cover nearly anything, which has created some complications as editors deleted pages they didn’t consider worthy of inclusion. It’s always a conversation.”

With so much said by the above two references, the spirit of a wiki’s contribution to semi-structured information retrieval clearly identifies itself as distinct and promising compared to other known Web IR systems.

4.2 Methodology, Theory, and Related Concepts

Wikipedia is therefore not a formal structure and does not involve a static taxonomy. In the sense of its semantic orientation, it has been suggested that wikis relate to knowledge-based IR concepts known from ontology and the Semantic Web (Milne, 2008). The author might add that its relation-oriented, topic-based architecture has some similarity to the Darwin Information Typing Architecture (DITA).

Ontology is a computer science term with origins in philosophy. It is a knowledge representation scheme that defines the basic terms and relations comprising the vocabulary of a topic area. It provides the means for the explicit specification of a shared conceptualization underlying knowledge representation in a knowledge base (Gomez-Perez et al., 2005). Ontology has a web representation language called OWL (Web Ontology Language), an XML-based language with primitives to define classes, restrictions (constraints), and properties. OWL builds on another XML-based markup language called RDF (Resource Description Framework).

In relation to ontology, the Semantic Web is considered to be the future vision of the current Web, initiated by Tim Berners-Lee. It represents knowledge with RDF elements using relations between information items such as “includes,” “describes,” and “wrote” (Daconta et al., 2003). These relations, which are not currently reflected on the Web, are formal primitives whose aim is to make the Web better understand the context of our queries and come up with better-tuned pages or documents (a minimal illustrative sketch of such relation triples follows at the end of this subsection).

DITA is yet another, less theoretical, XML vocabulary that relies on topic elements to represent information. A topic can be of one of three types: reference, task, or concept. Each of these types can be further specialized or extended. Topics are written neatly and separately, are linked by relation tables, and can be re-used in multiple documents (Linton and Bruski, 2006). There is potential for DITA to become a solid semantic-oriented platform for constructing wikis.

Finally, the W3C Web 2.0 architecture refers to a second-generation Web that facilitates, among other things, secure collaboration, information sharing, and wikis. Web 2.0 information architecture models promote the idea of working backwards from practice to theory to develop useful sets of patterns. Looking throughout the recent past and identifying useful and re-usable information aspects potentially leads to reliable and realistic models describing how discovered patterns fit together (Nickull et al., 2009).

Having touched upon the most closely related topics in knowledge representation, the author cannot directly connect wikis with any one particular theory or methodology. Wikis are, more simply, collaborative and shared databases of topics.
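To illustrate the relation-oriented representation discussed in this subsection, the following minimal Python sketch (resources and relation names are invented for illustration) stores statements as subject–predicate–object triples, echoing RDF’s conceptual model, and answers simple pattern queries; actual Semantic Web applications would use RDF/OWL tooling and formal vocabularies rather than plain tuples.

```python
# Toy statements in subject-predicate-object form, echoing RDF's triple model.
triples = {
    ("Wikipedia", "includes", "ArticleOnOntology"),
    ("ArticleOnOntology", "describes", "Ontology"),
    ("Gomez-Perez", "wrote", "OntologicalEngineering"),
}

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in sorted(triples)
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# What does Wikipedia include?
print(match(subject="Wikipedia", predicate="includes"))
# Which statements use the "describes" relation?
print(match(predicate="describes"))
```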
It is the author’s opinion, however, that the future of wikis, as both general and specialized online open encyclopedias, is likely to be impacted by one or more of the described technologies.

4.3 An Article’s Lifecycle

The lifecycle of an article on Wikipedia.org has been clearly described in (Ayers et al., 2008):
- Article birth: An Internet user searches for a topic and does not find it. The user decides the topic is a good candidate for introduction, so the user authors the article and saves it to Wikipedia.
- Deletion: Another, more literate user reads the article and finds it un-encyclopedic or of poor quality. This user nominates the article for deletion by tagging it with a red bar.
- Maintenance tagging: A yellow tag is used by a reviewing visitor to indicate that an article needs maintenance, which may be corrections, additions, references, or any other type of editorial work.
- Editing: Visitors who feel they have something to add to improve the content of an existing article may just do that. WikiProject maintains a list of new articles; this is where experts may want to look for articles needing an expert’s review or improvement.
- Merging: A frequent reader might feel that two or more articles relate closely enough that they would be better merged into one consolidated article. A procedure exists for the reader to accomplish that.
- Categorization: Although not widely publicized on the site’s front page (home page), Wikipedia.org does have a portal-like categorical index. There is also a procedure for a user to propose a better category for a newly created article that has been assigned to an unsuitable one.
- Incoming links: Wikilinks pointing to the new article should be added to appropriate articles. There is also a procedure for the article writer to accomplish this.
- Renaming: Provisions exist to rename articles, which is referred to as an “article move.”

There is no particular order that the above actions must follow, and not all (or any) of them are bound to take place. However, what these actions show is the strong likelihood that an article will be collaboratively tuned and improved over a period of time until it becomes a stable topic. It is also likely that, over broader cycles, interested editors may find it necessary to review an article that happens to contain outdated content.

4.4 Trustworthiness

Since wikis are collaboratively edited by public contributors, their trustworthiness can be questioned. There are two counter-arguments here:
1. Like open source software, wiki content gets read and reviewed by a massive number of users. It is likely that commonly visible errors or malicious content will be corrected (Encyclopedia Britannica, 2007).
2. Unless peer-reviewed and published by a recognizable publishing house, all Internet content is subject to the same issue. How could a seriously written and reviewed wiki not compare favorably to millions of chaotic blog entries and other arbitrarily written Internet content?

It is the author’s opinion that while no content is 100% free of errors, trustworthiness is a function of the amount of exposure, reviewing, and recognition that content gains over time. There are no significant Wikipedia.org scams known to date, for example. In the end, the reader should always take content judiciously and apply it with due care and verification according to its intended use.
And if we think of the utility of publicly edited Internet content, one could hardly find a match for a full-length article with references being generated on the spot in response to entering an encyclopedia-like term or name (only with a much broader scope). We feel that wikis (and more specifically Wikipedia) contain solid mechanisms for being a self-organizing system through open community collaborative interaction.

5. Conclusion

This paper considered the vital topic of information retrieval on the Internet, also commonly known as Web content mining. The paper classified IR systems according to their structure. A discussion was presented for the two extremes: the highly unstructured search engine approach and the well-structured directory approach. Use cases and limitations were illustrated for each. The paper then focused on the most recent approach, the semi-structured wiki. As the most common example, the paper discussed the theory and practice of Wikipedia.org. Evidence was presented for the potential and suitability of this approach for taking advantage of emerging future Web technologies. It was also argued that the collaborative editing environment, as compared to central, authoritative, encyclopedia-like editing, works very well for the accuracy expected from a public editing domain like the Internet, and for the accuracy typically sought by Internet users. Wikis are one of the items expected to receive special attention during the development of the currently still loosely defined Web 2.0, which is meant to be more user-context centric by enhancing the semantic cognition ability of Web IR systems.

6. References

Mader and Stewart, “Wikipatterns,” John Wiley and Sons, 2007
D. Milne, “Knowledge-based Information Retrieval with Wikipedia,” Google Tech Talks (http://www.youtube.com/watch?v=NFCZuzA4cFc), 2008
http://www.wikipedia.org
“Wiki,” Encyclopedia Britannica, 2007
B. Leuf and W. Cunningham, “The Wiki Way: Quick Collaboration on the Web,” Addison Wesley, 2001
D. Nickull, D. Hinchcliffe, and J. Governor, “Web 2.0 Architectures,” O’Reilly, 2009
A. Gomez-Perez, M. Fernandez-Lopez, and O. Corcho, “Ontological Engineering,” Springer Verlag, 2005
M.C. Daconta, L.J. Obrst, and K.T. Smith, “The Semantic Web,” John Wiley, 2003
J. Linton and K. Bruski, “Introduction to DITA: A User Guide to the Darwin Information Typing Architecture,” Comtech Services Inc., 2006
T. O’Reilly, “What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software,” 2005
P. Ayers, C. Matthews, and B. Yates, “How Wikipedia Works,” No Starch Press, 2008