Validating HTML documents against a whitelist of HTML tags

You may have a requirement where your application accepts HTML as input from the user, and you have to make sure that the input contains only those tags that your application allows.

In this post under Jsoup, I will show how to implement this requirement.

First we have to come up with the list of HTML tags that the application allows. This list is called a whitelist, and jsoup represents it with the Safelist class. For our example, we allow only “p” and “span”.

The code below implements the requirement.

Main


1  import org.jsoup.Jsoup;
2  import org.jsoup.safety.Safelist;
3  
4  public class JsoupDemo5 {
5      public static void main(String[] args) {
6          String validHtml = "<p>Welcome To Valid HTML</p>\r\n" + 
7                  "<span>Welcome To Valid HTML</span>\r\n"; 
8          
9          String inValidHtml = "<p>Welcome To In Valid HTML</p>\r\n" + 
10                 "<script>Welcome To In Valid HTML</script>\r\n"; 
11         
12         Safelist safelist = new Safelist();
13         safelist.addTags("p", "span");
14         
15         System.out.println("Is 'validHTML' Valid: " + Jsoup.isValid(validHtml, safelist));
16         System.out.println("Is 'inValidHTML' Valid: " + Jsoup.isValid(inValidHtml, safelist));
17     }
18 }

In the above code, we have two inputs. The first, “validHtml”, contains HTML that uses only the “p” and “span” tags. The second, “inValidHtml”, contains HTML that uses the “p” and “script” tags.

Since our whitelist allows only the “p” and “span” tags, “validHtml” will be validated successfully, whereas validation of “inValidHtml” fails because of the “script” tag.

We create the whitelist of HTML tags with the help of jsoup's Safelist class, as shown at lines 12 and 13. We create an instance of Safelist and add the “p” and “span” tags to it.
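As a minimal sketch of the same step, the safelist can also be built in one expression, because addTags returns the Safelist it was called on. The class name SafelistSketch and the helper methods buildSafelist and isAllowed are made up for this illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistSketch {
    // Hypothetical helper: Safelist.none() starts with no tags allowed,
    // then addTags permits "p" and "span", the same two tags as in the example.
    static Safelist buildSafelist() {
        return Safelist.none().addTags("p", "span");
    }

    // Hypothetical helper: true only if every tag in the fragment is on the safelist.
    static boolean isAllowed(String html) {
        return Jsoup.isValid(html, buildSafelist());
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("<span>Hello</span>")); // span is allowed
        System.out.println(isAllowed("<b>Hello</b>"));       // b is not on the safelist
    }
}
```

Starting from Safelist.none() is equivalent to new Safelist() here; it only makes the "nothing is allowed until added" starting point explicit.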

Next we use Jsoup's static method “isValid” to find out which of “validHtml” and “inValidHtml” is valid. The method returns true if the input is valid and false otherwise. Refer to lines 15 and 16.

At line 15, we call “isValid” and pass “validHtml” and the safelist instance as parameters. Since “validHtml” contains only allowed tags, the return value is true.

At line 16, we call “isValid” again and pass “inValidHtml” and the safelist instance as parameters. Since “inValidHtml” contains a “script” tag, which is not in the safelist, the return value is false.
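One detail worth illustrating: “isValid” checks attributes as well as tags, so an allowed tag that carries an attribute not on the safelist also makes the input invalid. The sketch below assumes we want to permit the “class” attribute on “span”; the class and method names are hypothetical and exist only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class AttributeValidationSketch {
    // Safelist with tags only: no attributes are permitted on "p" or "span".
    static boolean validWithTagsOnly(String html) {
        Safelist tagsOnly = new Safelist().addTags("p", "span");
        return Jsoup.isValid(html, tagsOnly);
    }

    // Same tags, but "class" is additionally permitted on "span".
    static boolean validWithClassAllowed(String html) {
        Safelist withClass = new Safelist()
                .addTags("p", "span")
                .addAttributes("span", "class");
        return Jsoup.isValid(html, withClass);
    }

    public static void main(String[] args) {
        String html = "<span class=\"note\">Hello</span>";
        System.out.println(validWithTagsOnly(html));     // class attribute is not allowed
        System.out.println(validWithClassAllowed(html)); // class attribute is allowed
    }
}
```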

In this way we can validate HTML data entered by the user.
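Validation only answers yes or no. If the application should keep the input and drop the disallowed parts rather than reject it, Jsoup's “clean” method can be used with the same safelist. The following is a rough sketch under that assumption; the class name SanitizeSketch and the method sanitize are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SanitizeSketch {
    // Same safelist as in the example: only "p" and "span" are allowed.
    private static final Safelist SAFELIST = new Safelist().addTags("p", "span");

    // Hypothetical helper: returns the input with every tag that is not
    // on the safelist removed (for "script", its content is dropped too).
    static String sanitize(String userHtml) {
        return Jsoup.clean(userHtml, SAFELIST);
    }

    public static void main(String[] args) {
        String input = "<p>Welcome</p><script>alert(1)</script>";
        System.out.println("Valid:   " + Jsoup.isValid(input, SAFELIST));
        System.out.println("Cleaned: " + sanitize(input));
    }
}
```

A common pattern is to use “isValid” to give feedback in the user interface and to store or display only the output of “clean”.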

Below is the output for our example

Output

Is 'validHTML' Valid: true
Is 'inValidHTML' Valid: false
