Web crawler Java

The web crawler is basically a program that is mainly used for navigating to the web and finding new or updated pages for indexing. The crawler begins with a wide range of seed websites or popular URLs and searches depth and breadth to extract hyperlinks.

The web crawler should be kind and robust. Here, kindness means that it respects the rules set by robots.txt and avoids frequent website visits. The robust means the ability to avoid spider webs and other malicious behavior.

These are the following steps to create a web crawler:

In the first step, we first pick a URL from the frontier.
Fetch the HTML code of that URL.
Get the links to the other URLs by parsing the HTML code.
Check whether the URL is already crawled before or not. We also check whether we have seen the same content before or not. If both the condition doesn't match, we add them to the index.
For each extracted URL, verify that whether they agree to be checked(robots.txt, crawling frequency)

We use jsoup, i.e., Java HTML parsing library by adding the following dependency in our POM.xml file.

<dependency> 
            <groupId>org.jsoup</groupId> 
            <artifactId>jsoup</artifactId> 
            <version>1.10.2</version> 
</dependency> 

Let's start with the basic code of a web crawler and understand how it works:

WebCrawlerExample.java

// import required classes and packages
packagejavaTpoint.javacodes;
// import classes available in jsoup
importorg.jsoup.Jsoup; 
importorg.jsoup.nodes.Document; 
importorg.jsoup.nodes.Element; 
importorg.jsoup.select.Elements; 
// import exception and collection classes  
importjava.io.IOException; 
importjava.util.HashSet; 
// create WebCrawlerExample to understand the working of it and how we can implement it in Java
publicclassWebCrawlerExample { 
	// create set that will store links
privateHashSet<String>urlLink; 
    // initialize set using constructor
publicWebCrawlerExample() { 
	urlLink = newHashSet<String>(); 
    } 
    // create getPageLink() method that finds all the page link in the given URL
publicvoidgetPageLinks(String URL) {
	
        // we use the conditional statement to check whether we have already crawled the URL or not.
if (!urlLink.contains(URL)) { 
try { 
                // if the URL is not present in the set, we add it to the set
if (urlLink.add(URL)) { 
System.out.println(URL); 
                } 
                // fetch the HTML code of the given URL by using the connect() and get() method and store the result in Document
                Document doc = Jsoup.connect(URL).get();
                // we use the select() method to parse the HTML code for extracting links of other URLs and store them into Elements  
                Elements availableLinksOnPage = doc.select("a[href]"); 
                // for each extracted URL, we repeat process 
for (Element ele : availableLinksOnPage) { 
	// call getPageLinks() method and pass the extracted URL to it as an argument
getPageLinks(ele.attr("abs:href")); 
                } 
            } 
            // handle exception
catch (IOException e) { 
	// print exception messages
System.err.println("For '" + URL + "': " + e.getMessage()); 
            } 
        } 
    } 
    // main() method start
publicstaticvoid main(String[] args) { 
	WebCrawlerExampleobj = newWebCrawlerExample();
	
	// pick a URL from the frontier and call the getPageLinks()method
	obj.getPageLinks("https://www.javatpoint.com/digital-electronics"); 
    } 
} 

Output:

Let's do some modifications to the above code by setting the link of depth extraction. The only difference between the code of the previous one and the current one is that it crawl the URL until a specified depth. The getPageLink() method takes an integer argument that represents the depth of the link.

WebCrawlerExampleWithDepth.java

// import required classes and packages
packagejavaTpoint.javacodes;
//import classes available in jsoup
importorg.jsoup.Jsoup; 
importorg.jsoup.nodes.Document; 
importorg.jsoup.nodes.Element; 
importorg.jsoup.select.Elements; 

//import exception and collection classes  
importjava.io.IOException; 
importjava.util.HashSet; 
publicclassWebCrawlerExampleWithDepth { 
	// initialize MAX_DEPTH variable with final value
privatestaticfinalintMAX_DEPTH = 2;
    // create set that will store links
privateHashSet<String>urlLinks; 
    // initialize set using constructor
publicWebCrawlerExampleWithDepth() { 
	urlLinks = newHashSet<>(); 
    } 
    // create method that finds all the page link in the given URL 
publicvoidgetPageLinks(String URL, int depth) { 
	
	//we use the conditional statement to check whether we have already crawled the URL or not.
	// we also check whether the depth reaches to MAX_DEPTH or not
if ((!urlLinks.contains(URL) && (depth <MAX_DEPTH))) { 
System.out.println(">> Depth: " + depth + " [" + URL + "]"); 
            // use try catch block for recursive process
try { 
	// if the URL is not present in the set, we add it to the set
	urlLinks.add(URL); 
	// fetch the HTML code of the given URL by using the connect() and get() method and store the result in Document
                Document doc = Jsoup.connect(URL).get(); 

                // we use the select() method to parse the HTML code for extracting links of other URLs and store them into Elements
                Elements availableLinksOnPage = doc.select("a[href]"); 
                // increase depth
depth++; 
                // for each extracted URL, we repeat above process
for (Element page : availableLinksOnPage) { 
	
	// call getPageLinks() method and pass the extracted URL to it as an argument
getPageLinks(page.attr("abs:href"), depth); 
                } 
            } 
            // handle exception
catch (IOException e) { 
	// print exception messages
	System.err.println("For '" + URL + "': " + e.getMessage()); 
            } 
        } 
    } 
    // main() method start
publicstaticvoid main(String[] args) { 
	// create instance of the WebCrawlerExampleWithDepth class
	WebCrawlerExampleWithDepthobj = newWebCrawlerExampleWithDepth ();
	
	// pick a URL from the frontier and call the getPageLinks()method and pass 0 as starting depth
	obj.getPageLinks("https://www.javatpoint.com/digital-electronics/", 0); 
    } 
} 

Output:

Difference Between Data Crawling and Data Scraping

Data crawling and Data scrapping both are two important concepts of data processing.

Data crawling means dealing with large data sets where we develop our own crawler that crawl to the deepest of web pages.

Data scrapping means retrieving data/information from any source.

Data Scraping	Data Crawling
Data scrapping extracts data not only from the web but also from any source.	Data crawling extracts data only from the web.
In data scrapping, duplication is not necessarily a part.	In data crawling, duplication is an essential part.
It can be done at any scale, i.e., small or large.	It is mostly done on a large scale.
It requires both the crawl parser and agent.	It requires only a crawl agent.

Let's take one more example to crawl articles using a web crawler in Java.

ExtractArticlesExample.java

// import required classes and packages
packagejavaTpoint.javacodes;
// //import classes available in jsoup
importorg.jsoup.Jsoup;
importorg.jsoup.nodes.Document;
importorg.jsoup.nodes.Element;
importorg.jsoup.select.Elements;
//import exception, FileWriter and collection classes  
importjava.io.FileWriter;
importjava.io.IOException;
importjava.util.ArrayList;
importjava.util.HashSet;
importjava.util.Iterator;
importjava.util.List;
// create ExtractArticlesExample to understand how we can extract articles
public class ExtractArticlesExample {
	// initialize MAX_DEPTH variable with final value
private static final int MAX_DEPTH = 2;
	
	// create set and nested list for storing links and articles 
privateHashSet<String>urlLinks;
private List<List<String>> articles;
    // initialize set and list
publicExtractArticlesExample() {
	urlLinks = new HashSet<>();
articles = new ArrayList<>();
    }
    //get all URLs that start with "https://www.javatpoint.com/" and add them to the set
public void getPageLinks(String URL, int depth) {
	
	//we use the conditional statement to check whether we have already crawled the URL or not.
	// we also check whether the depth reaches to MAX_DEPTH or not
if (urlLinks.size() != 50 && !urlLinks.contains(URL) && (depth < MAX_DEPTH) && (URL.startsWith("http://www.javatpoint.com") || URL.startsWith("https://www.javatpoint.com"))){ 
System.out.println(">> Depth: " + depth + " [" + URL + "]"); 

            // use try catch block for recursive process
try { 
	// if the URL is not present in the set, we add it to the set
	urlLinks.add(URL); 
	
	// fetch the HTML code of the given URL by using the connect() and get() method and store the result in Document
                Document doc = Jsoup.connect(URL).get(); 

                // we use the select() method to parse the HTML code for extracting links of other URLs and store them into Elements
                Elements availableLinksOnPage = doc.select("a[href]"); 

                // increase depth
depth++; 

                // for each extracted URL, we repeat above process
for (Element ele : availableLinksOnPage) { 
	if(ele.attr("abs:href").startsWith("http://www.javatpoint.com") || ele.attr("abs:href").startsWith("https://www.javatpoint.com")) {
		// call getPageLinks() method and pass the extracted URL to it as an argument
getPageLinks(ele.attr("abs:href"), depth);
	}
                } 
            } 
            // handle exception
catch (IOException e) { 
	// print exception messages
	System.err.println("For '" + URL + "': " + e.getMessage()); 
            } 
        } 
    }

    //Connect to each link saved in the article and find all the articles in the page
public void getArticles() {
	Iterator<String>i = urlLinks.iterator();
while (i.hasNext()) {
	
	
		// create variable doc that store document data
		Document doc;
		
		// we put the recursive code in a try-catch block
try {
	
doc = Jsoup.connect(i.next()).get();
                Elements availableArticleLinks = doc.select("a[href]");


for (Element ele : availableArticleLinks) {
	
                    //we get only those article's  title which contain java 8
                    // use matches() and regx method to check whether text contains Java 8 or not
	if (ele.text().contains("python")) {
		System.out.println(ele.text());
		// create temp list that stores articles
		ArrayList<String> temp = new ArrayList<>();
		temp.add(ele.text()); //get title of the article
		temp.add(ele.attr("abs:href")); //get the URL of the article
                        // add article list in the nested article list
		articles.add(temp);
                    }
                }
            } 
            // handle exception 
catch (IOException e) {
	// show error message 
System.err.println(e.getMessage());
            }
	}
    }

    // create writeToFile() method to write data into file
public void writeToFile(String fName) {
	// declare variable of type FileWriter
FileWriterwr;

        //use try-catch block to write data into file
try {
	// initialize FileWriter for fName
wr = new FileWriter(fName);

for(inti = 0; i<articles.size(); i++) {
	
	try {
                    String article = "- Title: " + articles.get(i).get(0) + " (link: " + articles.get(i).get(1) + ")\n";

                    // show the article and save it to the specified file
System.out.println(article);
wr.write(article);

	}catch (IOException e) {
System.err.println(e.getMessage());
                }
            }
            // close FileWriter class object
wr.close();
        } catch (IOException e) {
System.err.println(e.getMessage());
        }
    }
    // main() method start
public static void main(String[] args) {
	// create instance of the ExtractArticlesExample class
	ExtractArticlesExampleobj = new ExtractArticlesExample();
	
	// call getPageLinks() method to get all the page links of the specified URL
	obj.getPageLinks("http://www.javatpoint.com", 0);

        // call getArticles() method to find all the articles
	obj.getArticles();

        // call writeToFile() method to write all the articles in the specified file
	obj.writeToFile("Web Crawler Example");
    }
}

Output: