In this post on Jsoup, I will explain with an example how to parse HTML data.
The HTML data can be present in a local file, in a String, or at a URL.
Jsoup provides overloaded “parse” methods to read HTML data from each of these locations.
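The String case is the simplest of the three, so here is a minimal sketch of it. It uses the single-argument Jsoup.parse(String) overload; the class name JsoupStringDemo and the helper titleOf are hypothetical names chosen only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JsoupStringDemo {
    // Hypothetical helper: parses HTML held in a String and returns its title.
    // Document.title() returns an empty string when there is no <title> element.
    static String titleOf(String html) {
        Document document = Jsoup.parse(html);
        return document.title();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Input1</title></head>"
                + "<body><p>Input1</p></body></html>";
        System.out.println(titleOf(html)); // prints "Input1"
    }
}
```

No charset or timeout is needed here because the HTML is already decoded and in memory.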
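For our example we will use a local file named “Input1.html”. The main code opens it by relative path, so the file must be in the working directory of the Java program (being on the classpath alone is not enough). Its content is shown below.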
Input1.html
<html>
<head>
<title>Input1</title>
</head>
<body>
<p>Input1</p>
</body>
</html>
In our example we will parse the above file and also the HTML data present at the URL “http://www.google.com”.
Below is the main code showing how to parse HTML documents.
Main Code
1  import java.io.File;
2  import java.io.IOException;
3  import java.net.URL;
4
5  import org.jsoup.Jsoup;
6  import org.jsoup.nodes.Document;
7
8  public class JsoupDemo1 {
9      public static void main(String[] args) throws IOException {
10         File file = new File("Input1.html");
11         Document document = Jsoup.parse(file, "UTF-8"); // parse the local file as UTF-8
12         System.out.println(document.title());
13
14         URL url = new URL("http://www.google.com");
15         document = Jsoup.parse(url, 10000); // fetch and parse the URL, 10-second timeout
16         System.out.println(document.title());
17     }
18 }
As you can see in the above code, at lines 11 and 15 we call two different overloaded versions of the static “parse” method of the Jsoup class: one takes a File and a charset name, the other takes a URL and a timeout in milliseconds.
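Jsoup also has a “parse” overload that takes an InputStream, a charset name and a base URI. That can help when Input1.html is packaged on the classpath instead of being stored in the working directory. The sketch below assumes the file sits at the root of the classpath; the class name JsoupStreamDemo and the helper parseStream are hypothetical names for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JsoupStreamDemo {
    // Hypothetical helper: parses HTML from any InputStream, decoding it as UTF-8.
    // The empty base URI means relative links in the document stay unresolved.
    static Document parseStream(InputStream in) {
        try {
            return Jsoup.parse(in, "UTF-8", "");
        } catch (IOException e) {
            // Rethrown unchecked so callers do not have to declare IOException.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Input1.html is at the root of the classpath.
        try (InputStream in = JsoupStreamDemo.class.getResourceAsStream("/Input1.html")) {
            if (in == null) {
                System.err.println("Input1.html was not found on the classpath");
                return;
            }
            System.out.println(parseStream(in).title());
        }
    }
}
```

getResourceAsStream returns null when the resource is missing, which is why the sketch checks for it before parsing.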
The “parse” method returns an instance of the Document class, which represents the parsed HTML document.
Once the document is parsed, we print its title to the console.
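The title is only one thing a Document exposes. As a small sketch of reading something else from it, the code below pulls the text of the first paragraph with select("p"). The class name JsoupDocumentDemo and the helper firstParagraphText are hypothetical, and the HTML is inlined as a String to keep the example self-contained.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupDocumentDemo {
    // Hypothetical helper: returns the text of the first <p> element,
    // or an empty string when the document has no paragraph.
    static String firstParagraphText(Document document) {
        Element paragraph = document.select("p").first(); // null if nothing matched
        return paragraph == null ? "" : paragraph.text();
    }

    public static void main(String[] args) {
        // Same content as Input1.html, inlined for the sketch.
        Document document = Jsoup.parse(
                "<html><head><title>Input1</title></head>"
                + "<body><p>Input1</p></body></html>");
        System.out.println("title: " + document.title());
        System.out.println("paragraph: " + firstParagraphText(document));
    }
}
```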
In this way we can parse HTML content using the Jsoup library.
The output of the main code is shown below.
Output
Input1
Google