Difference between Web Content, Web Structure, and Web Usage Mining

Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, and website usage logs. It aims to discover useful and interesting patterns in large data sets, building on classic data mining; on the web, those data sets are often large enough to count as big data. Web data includes content, documents, structure, and user profiles. Web mining can be viewed from two perspectives: process-based and data-driven. In practice, it typically involves several steps: collecting data, selecting and preprocessing the data, discovering knowledge, and analyzing the results.

The internet has become a crucial part of our lives, so techniques for extracting knowledge from the web are an active area of research. In web mining, at least one of structure data or usage (web log) data is used in the mining process, with or without other types of web data. In general, web mining tasks can be classified into three categories:

Web content mining
Web structure mining
Web usage mining

All three categories focus on discovering implicit, previously unknown, and potentially useful information from the web, but each focuses on a different kind of web object. The sections below briefly cover each category.

What is Web Content Mining?

Web content mining is the mining of useful data, information, and knowledge from the content of web pages. It scans and mines the text, images, and groups of web pages that match the content of an input, for example when a search engine displays a list of results.

Web content mining differs from traditional data mining because web data are mainly semi-structured or unstructured, while data mining deals primarily with structured data. It also differs from text mining because of the semi-structured nature of the web, while text mining focuses on unstructured text. Thus, web content mining requires creative applications of data mining and text mining techniques as well as its own unique approaches.

In the past few years, there has been a rapid expansion of activity in the web content mining area. This is not surprising given the phenomenal growth of web content and the significant economic benefit of such mining. However, due to the heterogeneity and lack of structure of web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. Web content mining can be divided into two approaches:

1. Agent-based Approach

This approach involves intelligent systems. It aims to improve information finding and filtering and usually relies on autonomous agents that can identify relevant websites. It can be divided into the following three categories:

Intelligent Search Agents: These agents search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information.
Information Filtering or Categorization: These agents use information retrieval techniques and the characteristics of open hypertext web documents to automatically retrieve, filter, and categorize them.
Personalized Web Agents: These agents learn user preferences and discover web information based on the preferences of other users with similar interests.

2. Data-based Approach

The data-based approach organizes the semi-structured data on the internet into structured data. It aims to model web data in a more structured form so that standard database querying mechanisms and data mining applications can be used to analyze it.

Web Content Mining Challenges

Web content mining faces the following challenges, along with approaches to addressing them:

Data Extraction: Extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are used to solve this problem.
Web Information Integration and Schema Matching: Although the Web contains a huge amount of data, each website (or even page) represents similar information differently. Identifying or matching semantically similar data is an important problem with many practical applications.
Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs, and chat rooms. Mining opinions is of great importance for marketing intelligence and product benchmarking.
Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time-consuming. The main application is to synthesize and organize the pieces of information on the web to give the user a coherent picture of the topic domain. Some existing methods exploit the information redundancy of the web for this purpose.
Segmenting Web pages and detecting noise: In many web applications, one only wants the main content of the web page, without advertisements, navigation links, or copyright notices. Automatically segmenting web pages to extract their main content is an interesting problem.

What is Web Structure Mining?

The challenge for Web structure mining is to deal with the structure of the hyperlinks within the web itself. Link analysis is an old area of research. However, with the growing interest in Web mining, the research of structure analysis has increased. These efforts resulted in a newly emerging research area called Link Mining, which is located at the intersection of the work in link analysis, hypertext, web mining, relational learning, inductive logic programming, and graph mining.

Web structure mining uses graph theory to analyze a website's node and connection structure. According to the type of web structural data, web structure mining can be divided into two kinds:

Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects the web page to a different location.
Mining the document structure: analysis of the tree-like structure of page structures to describe HTML or XML tag usage.
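
The second kind of mining, describing tag usage in a page's tree structure, can be sketched with Python's standard `html.parser` module. This is only an illustrative sketch: the `TagCounter` class, the `describe_structure` helper, the short list of void tags, and the sample page are assumptions made for this example, not part of any particular web structure mining tool.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags and tracks the maximum nesting depth of a page."""

    # Void elements have no closing tag, so they must not change the depth.
    # This is a partial list chosen for the example.
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in self.VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS:
            self.depth = max(0, self.depth - 1)


def describe_structure(html):
    """Return (tag usage counts, maximum tree depth) for an HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tag_counts, parser.max_depth


# Hypothetical page used only for illustration.
SAMPLE_PAGE = (
    "<html><body><h1>Title</h1>"
    "<ul><li><a href='/a'>A</a></li><li><a href='/b'>B</a></li></ul>"
    "</body></html>"
)
tag_counts, max_depth = describe_structure(SAMPLE_PAGE)
print(tag_counts.most_common(), max_depth)
```

For the sample page, the counter reports two `a` tags and two `li` tags, and the deepest element sits five levels down the tree.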

The web contains a variety of objects with almost no unifying structure, with differences in authoring style and content much greater than in traditional collections of text documents. The objects in the WWW are web pages, and the links between them are in-links, out-links, and co-citations (two pages linked to by the same page). Attributes include HTML tags, word occurrences, and anchor texts. Web structure mining uses the following terminology:

Web graph: a directed graph representing the web.
Node: a web page in the graph.
Edge: a hyperlink between pages.
In-degree: the number of links pointing to a particular node.
Out-degree: the number of links leaving a particular node.
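
As a small illustration of these terms, the sketch below computes the in-degree and out-degree of every node from a list of directed edges. The page names and the `degree_counts` helper are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical web graph: each (source, target) pair is one hyperlink (edge).
HYPERLINKS = [
    ("home", "products"),
    ("home", "about"),
    ("products", "home"),
    ("about", "home"),
    ("blog", "home"),
]


def degree_counts(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1  # one more link leaving the source
        in_degree[target] += 1   # one more link pointing to the target
    return dict(in_degree), dict(out_degree)


in_degree, out_degree = degree_counts(HYPERLINKS)
print(in_degree)
print(out_degree)
```

In this toy graph, "home" has an in-degree of 3 and an out-degree of 2, while "blog" has no incoming links at all, so it does not appear in `in_degree`.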

One example of a web structure mining technique is the PageRank algorithm, which Google uses to rank search results. A page's rank is determined by the number and quality of the links pointing to it.
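
The idea that rank depends on both the number and the quality of incoming links can be sketched as a simple power iteration. This is a simplified textbook formulation, not a description of how any search engine implements ranking in production. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the toy graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for page, targets in links.items():
            if targets:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link structure used only for illustration.
LINKS = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "blog": ["home"],
}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph "home" ends up with the highest rank because three pages link to it, and "blog" ends up with the lowest because nothing links to it.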

Link mining has reshaped some traditional data mining tasks. Below we summarize the link mining tasks that are applicable in web structure mining:

Link-based Classification: The most recent extension of a classic data mining task to linked domains. The task is to predict the category of a web page based on the words that occur on the page, links between pages, anchor text, HTML tags, and other attributes found on the page.
Link-based Cluster Analysis: The data is segmented into groups, where similar objects are grouped together, and dissimilar objects are grouped into different groups. Unlike the previous task, link-based cluster analysis is unsupervised and can be used to discover hidden patterns from data.
Link Type: There is a wide range of tasks concerning predicting the existence of links, such as predicting the type of link between two entities or predicting the purpose of a link.
Link Strength: Links could be associated with weights.
Link Cardinality: The main task is to predict the number of links between objects. Page categorization is used for:
- Finding related pages.
- Finding duplicated websites and measuring the similarity between them.

What is Web Usage Mining?

Web usage mining focuses on techniques that can predict the behavior of users while they interact with the WWW. It discovers user navigation patterns from web data, extracting useful information from the secondary data generated by users' interactions while surfing the web. Web usage mining collects data from web log records to discover user access patterns of web pages. Several research projects and commercial tools analyze those patterns for different purposes. The resulting insights can be used for personalization, system improvement, site modification, business intelligence, and usage characterization.
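
Before any patterns can be mined, raw log records have to be turned into user sessions. The sketch below shows one possible way to do this, assuming log lines in the Common Log Format written by many web servers. Identifying a user by host address and starting a new session after 30 minutes of inactivity are simplifying assumptions, and the sample log lines and helper names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical log lines used only for illustration.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /home HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:10 +0000] "GET /products HTTP/1.1" 200 1045',
    '10.0.0.2 - - [10/Oct/2023:13:57:00 +0000] "GET /home HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:15:30:00 +0000] "GET /home HTTP/1.1" 200 2326',
]


def parse_line(line):
    """Return (host, timestamp, path), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return match["host"], timestamp, match["path"]


def build_sessions(lines, gap=timedelta(minutes=30)):
    """Group page requests per host and split them at long inactivity gaps."""
    by_host = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is not None:
            host, timestamp, path = record
            by_host[host].append((timestamp, path))

    sessions = []
    for host, visits in by_host.items():
        visits.sort()
        current = [visits[0][1]]
        for (prev_time, _), (time, path) in zip(visits, visits[1:]):
            if time - prev_time > gap:
                # Too much idle time: close this session and start a new one.
                sessions.append((host, current))
                current = []
            current.append(path)
        sessions.append((host, current))
    return sessions


sessions = build_sessions(SAMPLE_LOG)
for host, pages in sessions:
    print(host, pages)
```

Each resulting session is an ordered list of pages visited by one host, which is the kind of input the pattern discovery techniques operate on.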

The only information left behind by many users visiting a website is the path through the pages they have accessed. Most web information retrieval tools use only textual information and ignore link information that could be very valuable. In general, four kinds of data mining techniques are applied in web usage mining to discover user navigation patterns:

1. Association Rule Mining

Association rule mining is the most basic data mining method and the one used most often in web usage mining. It helps a website organize its content more efficiently or recommend products for effective cross-selling.

These rules are statements of the form X => Y, where X and Y are sets of items that occur in a series of transactions. The rule X => Y states that transactions that contain the items in X are likely to contain the items in Y as well. In web usage mining, association rules are used to find relationships between pages that frequently appear together in user sessions.
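
A rule X => Y is usually judged by its support (the fraction of sessions containing both X and Y) and its confidence (the fraction of sessions containing X that also contain Y). The sketch below computes both for single-page rules over a handful of sessions. The sessions, the thresholds, and the helper names are assumptions made for illustration. It checks every pair of pages directly rather than using a full algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical sessions: each one is the set of pages a user visited.
PAGE_SETS = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "about"},
    {"products", "cart"},
    {"home", "products", "cart", "checkout"},
]


def support(itemset, sessions):
    """Fraction of sessions that contain every page in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= session for session in sessions) / len(sessions)


def pair_rules(sessions, min_support=0.4, min_confidence=0.7):
    """Return rules (x, y, support, confidence) for single pages x => y."""
    pages = sorted(set().union(*sessions))
    rules = []
    for a, b in combinations(pages, 2):
        pair_support = support({a, b}, sessions)
        if pair_support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = pair_support / support({x}, sessions)
            if confidence >= min_confidence:
                rules.append((x, y, pair_support, confidence))
    return rules


rules = pair_rules(PAGE_SETS)
for x, y, s, c in rules:
    print(f"{x} => {y}  support={s:.2f}  confidence={c:.2f}")
```

With these sessions, every visit to "cart" also includes "products", so the rule cart => products has a confidence of 1.0, while home => cart falls below the confidence threshold and is dropped.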

2. Sequential Patterns

Sequential pattern mining discovers frequent subsequences in a large volume of sequential data. In web usage mining, sequential patterns are used to find navigation paths that frequently appear in user sessions. Sequential patterns may look like association rules, but they also take time into account: the order in which events occurred is part of the pattern. Algorithms that extract association rules can be adapted to generate sequential patterns. Two types of algorithms are used for sequential pattern mining.

The first type is based on association rule mining. Many common sequential pattern mining algorithms are adaptations of association rule mining algorithms. For example, GSP and AprioriAll are two extensions of the Apriori algorithm, which is used to extract association rules. However, some researchers believe that algorithms based on association rule mining do not perform well enough when mining long sequential patterns.
The second type uses tree structures and Markov chains to represent navigation patterns. For example, one of these algorithms, WAP-mine, uses a tree structure called the WAP-tree to explore web access patterns. Evaluation results show that it performs better than algorithms such as GSP.
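
To show how order changes the result, the sketch below counts ordered page paths across sessions. It is deliberately much simpler than GSP or WAP-mine: it only considers contiguous paths of a fixed length, and the sessions, the `frequent_paths` helper, and the thresholds are assumptions made for this example.

```python
from collections import Counter

# Hypothetical sessions: each one is the ordered list of pages visited.
ORDERED_SESSIONS = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "cart"],
    ["products", "home", "about"],
    ["home", "products", "about"],
]


def frequent_paths(sessions, length=2, min_sessions=2):
    """Return contiguous paths of `length` pages seen in enough sessions."""
    counts = Counter()
    for pages in sessions:
        # A set ensures each path is counted at most once per session.
        paths = {
            tuple(pages[i:i + length])
            for i in range(len(pages) - length + 1)
        }
        counts.update(paths)
    return {path: n for path, n in counts.items() if n >= min_sessions}


patterns = frequent_paths(ORDERED_SESSIONS)
for path, n in sorted(patterns.items(), key=lambda item: -item[1]):
    print(" -> ".join(path), n)
```

Here the path home -> products occurs in three sessions, but the reversed path products -> home occurs in only one and is not reported, which is exactly the distinction an unordered association rule would miss.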

3. Clustering

Clustering techniques identify groups of similar items in large volumes of data, based on distance functions that measure the degree of similarity between items. In web usage mining, clustering is used to group similar sessions. What matters in this kind of analysis is the contrast between groups of users and individual users. Two kinds of clustering are of interest in this area: user clustering and page clustering.

Clustering of user records is commonly used in web mining and web analytics tasks. The knowledge derived from clustering is also used for market segmentation in e-commerce. Different methods and techniques are used for clustering, including:

Using a similarity graph and the amount of time spent viewing a page to estimate the similarity of sessions.
Using genetic algorithms and user feedback.
Clustering matrix.
K-means algorithm, which is the most classic clustering method.
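
As an illustration of the last method, the sketch below runs a minimal K-means over sessions represented as binary page vectors (1 if the page was visited, 0 otherwise). The page list, the sessions, the fixed initial centroids, and the helper names are assumptions made for this example. Practical implementations usually choose initial centroids randomly and stop when assignments no longer change.

```python
# Hypothetical pages and sessions used only for illustration.
PAGES = ["home", "products", "cart", "blog", "about"]
VISITS = [
    ["home", "products", "cart"],
    ["products", "cart"],
    ["home", "blog", "about"],
    ["blog", "about"],
    ["home", "products"],
]


def to_vector(session):
    """Encode a session as 1.0/0.0 flags, one per known page."""
    return [1.0 if page in session else 0.0 for page in PAGES]


def squared_distance(a, b):
    """Squared Euclidean distance, used here as the distance function."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, initial_centroids, iterations=10):
    """Minimal K-means: alternate assignment and centroid update steps."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


vectors = [to_vector(session) for session in VISITS]
labels, centroids = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)
```

The shopping-oriented sessions end up in one cluster and the blog-oriented sessions in the other.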

In other clustering methods, frequent patterns are first extracted from the users' sessions using association rules. These patterns are then used to construct a graph in which the nodes are the visited pages and the edges connect two or more pages. If the connected pages occur together in an extracted pattern, a weight is assigned to the edge to show the strength of the relationship between the nodes. Finally, the graph is recursively partitioned until groups of user behavior are detected.

4. Classification Mining

Discovering classification rules allows one to develop a profile of items belonging to a particular group according to their common attributes. This profile can then be used to classify new data items added to the database. In web mining, classification techniques allow one to develop a profile of clients who access particular server files, based on demographic information available about those clients or on their navigation patterns.

Advantages

Web usage mining has many advantages, making this technology attractive to corporations, including government agencies.

This technology has enabled e-commerce to do personalized marketing, resulting in higher trade volumes. Government agencies are using this technology to classify threats and fight against terrorism.
Companies can establish better customer relationships by understanding customers' needs better and reacting to them faster. They can increase profitability through targeted pricing based on the profiles created. They can even identify customers who might defect to a competitor and try to retain them with promotional offers, thus reducing the risk of losing customers.
More benefits of web usage mining, particularly for personalization, are outlined in specific frameworks such as the probabilistic latent semantic analysis model, which reveals additional features of user behavior and access patterns. This helps provide the user with more relevant content through collaborative recommendations.
There are also elements unique to web usage mining that show the technology's benefits. These include the way semantic knowledge is applied when interpreting, analyzing and reasoning about usage patterns during the mining phase.

Disadvantages

Web usage mining by itself does not create issues, but when used on data of personal nature, this technology might cause concerns.

The most criticized ethical issue involving web usage mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without the individual's knowledge or consent. The obtained data will be analyzed, made anonymous, and then clustered to form anonymous profiles.
These applications de-individualize users by judging them by their mouse clicks rather than by identifying information. De-individualization, in general, can be defined as a tendency to judge and treat people based on group characteristics instead of on their own individual characteristics and merits.
The companies collecting the data for a specific purpose might use the data for totally different purposes, violating the user's interests.

Web Usage Mining Applications

The main objective of web usage mining is to collect data about users' navigation patterns. This information can be used to improve websites from the user's point of view. There are three main applications of this kind of mining:

1. Personalization of web content

Web usage mining techniques can be used to personalize content for web users. For example, a user's behavior can be predicted on the fly by comparing the current navigation pattern with patterns extracted from the log files. Recommendation systems, a practical application in this area, suggest links that direct users to the pages they are likely to prefer. Some sites also organize and present their product catalogs based on the predicted interests of a specific user.

2. Prefetching and caching

The results of web usage mining can be used to improve the performance of web servers and web-based applications. Web usage mining can inform prefetching and caching strategies and thus reduce the response time of web servers.
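
One way such a strategy could work is to predict the page a user is most likely to request next, so the server can fetch or cache it ahead of time. The sketch below uses first-order transition counts between pages, in the spirit of a Markov chain. The sessions and the `build_transitions` and `predict_next` helpers are hypothetical, and a real server would also need to weigh the cost of fetching pages that are never requested.

```python
from collections import Counter, defaultdict

# Hypothetical sessions: each one is the ordered list of pages visited.
PAST_SESSIONS = [
    ["home", "products", "cart"],
    ["home", "products", "about"],
    ["home", "about"],
    ["products", "cart"],
]


def build_transitions(sessions):
    """Count how often each page is followed directly by each other page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, page):
    """Return the most frequent follower of `page`, or None if unknown."""
    followers = transitions.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


transitions = build_transitions(PAST_SESSIONS)
print(predict_next(transitions, "home"))      # candidate page to prefetch
print(predict_next(transitions, "checkout"))  # no history for this page
```

In these sessions "home" is followed by "products" twice and by "about" once, so "products" is the prefetch candidate after a request for "home".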

3. Improvement of Web site design

Usability is one of the most important issues in designing and implementing websites. The results of web usage mining can help improve the design of websites. Adaptive websites are one application of this type of mining: website content and structure are dynamically reorganized based on data derived from user behavior on these sites.

Difference between Web Content, Web Structure, and Web Usage Mining

The following table summarizes the differences between web content, web structure, and web usage mining:

Terms	Web Content (IR View)	Web Content (DB View)	Web Structure	Web Usage
View of data	Unstructured; Structured	Semi-structured; Website as DB	Link structure	Interactivity
Main data	Text documents; Hypertext documents	Hypertext documents	Link structure	Server logs; Browser logs
Method	Machine learning; Statistical (including NLP)	Proprietary algorithms; Association rules	Proprietary algorithms	Machine learning; Statistical; Association rules
Representation	Bag of words, n-gram terms; Phrases, concepts, or ontology; Relational	Edge-labeled graph; Relational	Graph	Relational table; Graph
Application categories	Categorization; Clustering; Finding extraction rules; Finding patterns in text	Finding frequent substructures; Website schema discovery	Categorization; Clustering	Site construction; Adaptation and management