Automating Data Collection Using HtmlUnit
The secret behind any successful business operation is effective data collection and planning business activities based on that data. In both cases it is important to have an automated data collection system that avoids manual effort and errors and speeds up the collection. If the data is available through an API or web service, it can be retrieved with SOAP or REST calls. But if the data has to be extracted from web pages, we need a headless browser to load and parse the page and extract the data.
HtmlUnit is a Java library that provides the functionality of a headless browser and can simulate the behavior of the major browsers (IE, Firefox, Chrome, etc.). We can also configure it to act like a mobile device browser. It has an active developer community and frequent releases that add new features.
In this blog post I will show how HtmlUnit can be used from Java to collect data from the internet, along with possible use cases where HtmlUnit can help in gathering the required data.
Where it can be used
We can use HtmlUnit to collect various types of data available on the internet and use it to build web applications or to support business decisions. Some of the applications where we can use HtmlUnit are listed below.
- Find the products and catalogs on a merchant site. It can also be used to find the price of the same or similar products on a competitor's website, which helps in deciding the price of a product.
- Get the weather data available on the internet and use it for further processing. This can also support decision-making in business processes.
- Get the scores of live matches and build a web application over them.
- Get ticket availability from a site and use that data to build a web application.
- Get the latest news from various sites and build an application over it.
- Find the best offers available on various sites for any particular product.
- It can also be used in an IoT application to collect data from various sensors and devices and build a web application over it.
- It can also be used as a web crawler to index the required web pages.
- It can also be used for regression testing of our web application. This can be done by simply loading the required page in HtmlUnit and verifying the content of the page.
All of the above cases assume that the user knows what data to collect and where it comes from. Some of this data may also be available through an API; in such cases the developer has to decide whether to get the data from the API or by browsing the web page. Getting data through an API is usually the better choice, but browsing the web page can be advantageous if the page gives additional information that is not available in the API response. It is worth noting that HtmlUnit can also send the HTTP request and read the data from an API, but whether to use HtmlUnit for that purpose is again a decision for the developer.
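As a rough illustration of that last point, here is a minimal sketch of reading a raw API response through HtmlUnit. It assumes HtmlUnit 2.x, where the classes live in the com.gargoylesoftware.htmlunit package (newer 3.x and 4.x releases use org.htmlunit instead). The class name ApiFetchSketch and the endpoint URL are made up for this example.

```java
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ApiFetchSketch {

    // Fetches the raw response body (for example JSON) from the given URL.
    static String fetchBody(WebClient webClient, String url) {
        try {
            // getPage returns a generic Page; an API response need not be an HtmlPage.
            Page page = webClient.getPage(url);
            return page.getWebResponse().getContentAsString();
        } catch (IOException e) {
            // getPage declares IOException; rethrow unchecked to keep callers short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoint, used only for illustration.
        String url = "https://api.example.com/weather?city=London";
        try (WebClient webClient = new WebClient()) {
            System.out.println(fetchBody(webClient, url));
        }
    }
}
```

The body comes back as a plain string, so parsing it (for example as JSON) is left to whichever library the application already uses.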
How it can be used
Before we start implementing data collection using HtmlUnit, it is important to identify two things.
- What data we want to extract and
- The source of that data (the web address where the data is available)
Once these two things are settled, extracting the data with HtmlUnit is straightforward. Here are the steps.
- Download the latest HtmlUnit jar package from here
- Unzip the package and place all the jars in any suitable location. Change the project settings to add all the libraries (extracted from the zip) to the project. That is all that is required to use HtmlUnit.
- Now we need to create an instance of the WebClient class. This class represents a browser. You can also call the overloaded constructor to specify the browser version.
WebClient webClient = new WebClient();
- Optional: We may then adjust the WebClient settings depending on our requirements. I explain some of the common settings later in this post.
- The next step is to load the page in the WebClient.
HtmlPage page = webClient.getPage("http://www.google.com");
- After loading the page, we can access any element on it, whether it is a text element holding the information we need or an anchor element used to navigate to the next page.
DomNode title = page.querySelector("title"); // Get the title element
String strTitle = title.getTextContent(); // Get the title of the page
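Put together, the steps above might look like the following sketch. It assumes HtmlUnit 2.x (package com.gargoylesoftware.htmlunit; 3.x and later use org.htmlunit). The class name PageScrapeSketch and its helper methods are hypothetical names chosen for this example, not part of HtmlUnit.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class PageScrapeSketch {

    // Loads the given URL and returns it as an HTML page.
    static HtmlPage load(WebClient webClient, String url) {
        try {
            return webClient.getPage(url);
        } catch (IOException e) {
            // getPage declares IOException; rethrow unchecked to keep callers short.
            throw new UncheckedIOException(e);
        }
    }

    // Returns the text of the <title> element, or an empty string if there is none.
    static String titleOf(HtmlPage page) {
        DomNode title = page.querySelector("title");
        return title == null ? "" : title.getTextContent().trim();
    }

    // Returns the href attribute of every anchor on the page, in document order.
    static List<String> linksOf(HtmlPage page) {
        List<String> links = new ArrayList<>();
        for (HtmlAnchor anchor : page.getAnchors()) {
            links.add(anchor.getHrefAttribute());
        }
        return links;
    }

    public static void main(String[] args) {
        // WebClient is AutoCloseable, so try-with-resources closes it when we are done.
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = load(webClient, "http://www.google.com");
            System.out.println("Title: " + titleOf(page));
            linksOf(page).forEach(System.out::println);
        }
    }
}
```

Each link returned by linksOf can be passed back to load to move on to the next page. Real-world pages with heavy JavaScript may raise script errors under the default settings, so the optional settings step is often needed in practice.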
That is all we need to do to extract data from web pages with HtmlUnit. HtmlUnit also provides various settings so that we can configure it to suit our needs. I have listed some of the major ones below.
- You can set any browser version to simulate the behavior of any particular browser. Also you can create your own browser version object with custom version number and user agent.
- The library also supports cookies. You can enable or disable them as per your need.
- The WebClient options also allow us to enable or disable insecure SSL, redirection of web pages, the popup blocker, native ActiveX components, applets, etc.
- It also provides settings to throw exception or continue in case of any JS or CSS error on the page.
- Since HtmlUnit is a headless browser, alert messages are never shown on screen. To overcome this, HtmlUnit provides a handler that captures the alert messages.
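A few of these settings are sketched below, again assuming HtmlUnit 2.x (package com.gargoylesoftware.htmlunit). The class name WebClientConfigSketch, its methods and the URL in main are hypothetical. The option values are only one possible choice and should be adapted to the site being read.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.CollectingAlertHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebClientOptions;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class WebClientConfigSketch {

    // Builds a WebClient that simulates Chrome and keeps going on faulty pages.
    static WebClient newConfiguredClient() {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);

        // Cookies are switched on or off through the cookie manager.
        webClient.getCookieManager().setCookiesEnabled(true);

        WebClientOptions options = webClient.getOptions();
        options.setUseInsecureSSL(false);                    // reject invalid certificates
        options.setRedirectEnabled(true);                    // follow HTTP redirects
        options.setPopupBlockerEnabled(true);                // window.open() has no effect
        options.setThrowExceptionOnScriptError(false);       // continue on JavaScript errors
        options.setThrowExceptionOnFailingStatusCode(false); // continue on 4xx/5xx responses
        return webClient;
    }

    // Loads the URL and returns the messages of all alert() calls made by the page.
    static List<String> alertsFrom(WebClient webClient, String url) {
        List<String> alerts = new ArrayList<>();
        // CollectingAlertHandler appends every alert message to the given list.
        webClient.setAlertHandler(new CollectingAlertHandler(alerts));
        try {
            webClient.getPage(url);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return alerts;
    }

    public static void main(String[] args) {
        try (WebClient webClient = newConfiguredClient()) {
            // Hypothetical URL, used only for illustration.
            alertsFrom(webClient, "http://www.example.com").forEach(System.out::println);
        }
    }
}
```

Keeping the configuration in one factory method means every page load in the application runs with the same browser version and error handling.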
All in all, HtmlUnit is a very flexible headless browser which we can configure as per our requirement.
A lot of data is already available on the internet, but finding the right source has become an important and critical task for any business operation. Once the data to be captured has been decided and its source identified, it is simple to grab the data using HtmlUnit. Being lightweight, easy to use, configurable, well supported by its developers, frequently updated and free of cost, HtmlUnit is a strong choice for data collection in Java-based applications.