Clearing HTML content of invalid or blacklisted tags

In this Jsoup post, I will show how to remove invalid or blacklisted tags from HTML content entered by a user.

In our example, the HTML content contains one invalid tag, “script”, and our whitelist (i.e., the list of allowed tags) contains the “p” and “span” tags.

When we run our code, Jsoup scans the HTML content against the whitelist and removes any tag that is not listed in it.

Below is the Java code that shows how we can achieve this.

Main Class


1  import org.jsoup.Jsoup;
2  import org.jsoup.safety.Safelist;
3  
4  public class JsoupDemo6 {
5      public static void main(String[] args) {
6          String inValidHtml = "<p>Welcome To In Valid HTML</p>\r\n" + 
7                  "<script>Welcome To In Valid HTML</script>\r\n"; 
8          
9          Safelist safelist = new Safelist();
10         safelist.addTags("p", "span");
11         
12         String result = Jsoup.clean(inValidHtml, safelist);
13         System.out.println(result);
14     }
15 }

The variable “inValidHtml” contains the HTML content. Please note that the input should be a body fragment, i.e., the HTML that would normally appear inside the “body” tag, rather than a complete document.

At lines 9 and 10, we create an instance of the “Safelist” class and add the “p” and “span” tags to it. This “Safelist” instance represents the whitelist (older Jsoup releases named this class “Whitelist”).
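A safelist can restrict attributes as well as tags. The following is a minimal sketch of that idea, assuming a Jsoup version that ships the “Safelist” class; the class name “SafelistAttributesDemo”, the helper method “clean” and the sample markup are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistAttributesDemo {
    // Allows only <p> and <span>, and on <span> only the "class" attribute.
    static String clean(String bodyHtml) {
        Safelist safelist = new Safelist()
                .addTags("p", "span")
                .addAttributes("span", "class");
        return Jsoup.clean(bodyHtml, safelist);
    }

    public static void main(String[] args) {
        String html = "<span class=\"note\" onclick=\"doSomething()\">Hello</span>";
        // "class" is on the safelist and is kept; "onclick" is not and is dropped.
        System.out.println(clean(html));
    }
}
```

Both “addTags” and “addAttributes” return the same “Safelist” instance, so the calls can be chained as shown.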

At line 12, we call Jsoup’s static method “clean”, passing it the HTML content and the “Safelist” instance.

The return value is a String with invalid tags removed.
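It helps to see what “removed” means for different tags. As far as I can tell, for an ordinary disallowed tag such as “b”, Jsoup drops the tag itself but keeps the text inside it, while for “script” the script body is dropped along with the tag. The sketch below tries both cases; the class name “RemovedTagsDemo” and the sample inputs are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class RemovedTagsDemo {
    // Same safelist as in the main example: only <p> and <span> are allowed.
    static String clean(String bodyHtml) {
        Safelist safelist = new Safelist().addTags("p", "span");
        return Jsoup.clean(bodyHtml, safelist);
    }

    public static void main(String[] args) {
        // <b> is not allowed: the tag goes away, its text stays inside <p>.
        System.out.println(clean("<p>Hello <b>world</b></p>"));

        // <script> is not allowed: the tag and the script body are both removed.
        System.out.println(clean("<p>Hi</p><script>alert(1)</script>"));
    }
}
```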

Below is the output

Output


<p>Welcome To In Valid HTML</p>
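If we only want to know whether the input already conforms to the whitelist, without modifying it, Jsoup also offers the static method “isValid”. The following is a small sketch under the same assumptions as the main example; the class name “SafelistValidDemo” and the helper “isAllowed” are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistValidDemo {
    // Returns true only if the body fragment uses nothing outside the safelist.
    static boolean isAllowed(String bodyHtml) {
        Safelist safelist = new Safelist().addTags("p", "span");
        return Jsoup.isValid(bodyHtml, safelist);
    }

    public static void main(String[] args) {
        // Only an allowed tag is present.
        System.out.println(isAllowed("<p>Welcome</p>"));

        // Contains a <script> tag, which is not on the safelist.
        System.out.println(isAllowed("<p>Welcome</p><script>Welcome</script>"));
    }
}
```

A check like this can be used to reject the input outright instead of silently cleaning it.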
