JSoup – Entity exigency

If you’re using JSoup to parse and render HTML documents that could contain a handful of exotic characters, you may not like the entity escaping choices it offers you. Fortunately it’s pretty easy to tweak it to suit your needs.

A quirk of HTML is its ability to include characters in a document which don’t actually exist in the character set it’s delivered in. For example, the following piece of text contains a number of characters that do not exist in the common ISO-8859-1 or US-ASCII character sets:-

[Image: HTML text containing entities]

Yet, rendered in HTML using entity names in place of the plain characters…

<html>
    <head>
        <title>JSoup Entities</title>
    </head>

    <body>
        <p style="font-size: 12px; font-family: Verdana; color: darkblue;">
            &lsquo;The price is &ndash; 100&euro;&rsquo;
        </p>
    </body>
</html>

…most browsers will correctly display them, even when the document was rendered and delivered using a character set that doesn’t feature them.

Well, they will, provided the document hasn’t been run through a default configuration of JSoup that is.

Famine

The JSoup open source library might not be a life-saver but it can certainly be a huge time saver if your Java project requires you to read, modify or generate HTML documents. Unfortunately though, when it parses an HTML document such as the example above it converts the entity codes into their real character equivalents, and then decides afresh how to write each character when it regenerates the HTML. So unless it has an entity code configured for a particular character, that character gets written straight out to the HTML as-is. Worse still, if any of those characters don’t exist in the encoding you’re using, you’ll end up with the usual Java ‘unknown character’ question-marks:-

[Image: HTML text with the entities dropped by JSoup]
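The question-marks themselves come from Java’s own charset handling rather than anything JSoup-specific. Here is a minimal sketch using only the JDK, assuming the page ends up being written in ISO-8859-1; the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UnmappableDemo {

    // Encodes the text in the given charset and decodes it again, which is
    // roughly what happens when a page is written out and later read back.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String text = "\u2018The price is \u2013 100\u20AC\u2019";

        // The quotes, the dash and the euro sign have no ISO-8859-1 byte,
        // so String.getBytes substitutes '?' for each of them.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent every character, so the text survives intact.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
    }
}
```

Once a character has been replaced like this there is no way to recover it from the bytes, which is why the fix has to happen before the document is encoded.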

One way around this is to use an encoding that contains all the characters you need, such as UTF-8:-

<html>
    <head>
        <title>JSoup Entities</title>
    </head>

    <body>
        <p style="font-size: 12px; font-family: Verdana; color: darkblue;">
            ‘The price is – 100€’
        </p>
    </body>
</html>

But you’ll need to ensure your web server declares that encoding in its HTTP headers, for that page or for all your pages if you’re switching them all. Otherwise the best you’ll get is a chance match with whatever encoding your user’s browser falls back on, or no match at all if your web server is already sending a character encoding header that differs from what you’re now using.
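To see what a mismatch looks like, here is a small JDK-only sketch that writes text as UTF-8 and then reads the bytes back as ISO-8859-1. It assumes a strict ISO-8859-1 decode (real browsers often substitute windows-1252, so the exact garbage differs), and the class name is purely illustrative:

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {

    // Simulates a page saved as UTF-8 but declared (or guessed) as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The euro sign is three bytes in UTF-8 (E2 82 AC), so it comes
        // back as three unrelated characters instead of one.
        String garbled = misread("\u20AC");
        System.out.println(garbled.length());

        // Plain ASCII is encoded identically in both, so it is unaffected.
        System.out.println(misread("The price is 100"));
    }
}
```

Plain ASCII text hides the problem completely, which is why a mismatch like this tends to go unnoticed until the first exotic character turns up.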

Feast

Some would say that using the right character set and configuring your web server correctly is the right way to handle this, and with something as broadly supported as UTF-8 you’re unlikely to come across users with browsers that won’t handle it. Unfortunately the right solution isn’t always a luxury we can immediately afford.

Fortunately JSoup offers a solution that will get your entity codes back for you, though you mightn’t entirely like it. Before calling the html() method to render your document you can set a number of rendering options on its OutputSettings object. One of these options is the escapeMode for rendering entities rather than plain characters. There are three options, defined in the Entities.EscapeMode enumeration:-

  • base: Use a basic set of HTML entity names.
  • extended: Use a much wider set of known HTML entity names.
  • xhtml: Use the minimal set of entities which will work with XHTML documents.

It uses base by default, and if we change that to extended before rendering our document…

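import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
Document jsoupDoc = Jsoup.parse(sourceHtml);
jsoupDoc.outputSettings().escapeMode(EscapeMode.extended);
String htmlDoc = jsoupDoc.html();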

…it fills in the missing entity codes for us:-

<html>
    <head>
        <title>JSoup Entities</title>
    </head>

    <body>
        <p style="font-size&colon; 12px&semi; font-family&colon; Verdana&semi; color&colon; darkblue&semi;">
            &lsquo;The price is &ndash; 100&euro;&rsquo;
        </p>
    </body>
</html>

Unfortunately, as you can see above, extended in this case means really extended. We end up with entity codes all over the place for simple symbols like the colon and semi-colon in our CSS styles.

Buffet

Out of the box JSoup only offers us these feast or famine options; you can’t directly extend the set of entity codes with a few ‘exotic’ characters without also entity-encoding lots of simple ones. But with open source, ‘out of the box’ is never the end of the road. And in this case we don’t have far to go to modify the set of entity codes it uses, as they’re defined in a couple of property files: entities-base.properties for base and entities-full.properties for extended.

These property files are shipped in jsoup.jar and live under the org/jsoup/nodes folder. Now, you might bridle at the thought of cracking the JAR file and changing them in place, especially if you’re using a tool like Maven which will keep pulling the original out of its central repository. Instead you can extract org/jsoup/nodes/entities-base.properties and override the shipped version by placing your copy on your application’s classpath ahead of jsoup.jar. With a web application, for example, we just need to put it in WEB-INF/classes/org/jsoup/nodes so that it gets picked up in preference to the standard one.
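If you want to check that your copy really is being seen, you can ask the classloader to list every copy of the file it can find. This is only a sketch: the class name is hypothetical, it uses the thread’s context classloader (which may not be the one JSoup itself loads the file through), and it assumes the listing order mirrors the lookup order, which is the usual behaviour but worth confirming in your own container.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

final class EntityFileLocator {

    static final String RESOURCE = "org/jsoup/nodes/entities-base.properties";

    // Lists every copy of the named resource visible to the classloader.
    static List<URL> copiesOnClasspath(String resource) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> copies = copiesOnClasspath(RESOURCE);
        if (copies.isEmpty()) {
            System.out.println("No copy of " + RESOURCE + " found");
        }
        // With the override in place we would expect two entries, ours first.
        for (URL copy : copies) {
            System.out.println(copy);
        }
    }
}
```

Run from inside the web application, a URL pointing into WEB-INF/classes ahead of the one pointing into jsoup.jar suggests the override is in the right place.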

We can then edit this copy and add any additional entity codes we want:-

ndash=02013
rsquo=02019
lsquo=02018
euro=020AC

And our generated HTML will contain the extra codes we’ve added:-

<html>
    <head>
        <title>JSoup Entities</title>
    </head>

    <body>
        <p style="font-size: 12px; font-family: Verdana; color: darkblue;">
            &lsquo;The price is &ndash; 100&euro;&rsquo;
        </p>
    </body>
</html>
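Each of the property lines we added pairs an entity name with a character’s code point written in hexadecimal. To make that mapping concrete, here is a JDK-only sketch that reads the same four lines and escapes just those characters. It illustrates the idea rather than JSoup’s actual implementation, the class and method names are invented, and it assumes every value is a hexadecimal code point that fits in a single Java char:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class EntityTableSketch {

    // Parses "name=hex" lines into a character -> entity name lookup.
    static Map<Character, String> load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Character, String> byChar = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            // Assumption: the value is a hex code point, e.g. 02013 -> U+2013.
            char c = (char) Integer.parseInt(props.getProperty(name), 16);
            byChar.put(c, name);
        }
        return byChar;
    }

    // Replaces only the characters present in the table; the rest pass through.
    static String escape(String text, Map<Character, String> byChar) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String name = byChar.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> table =
                load("ndash=02013\nrsquo=02019\nlsquo=02018\neuro=020AC\n");
        System.out.println(escape("\u2018The price is \u2013 100\u20AC\u2019", table));
        System.out.println(escape("color: darkblue;", table));
    }
}
```

Because the colon and semicolon are not in the table, the CSS in the style attribute is left alone, which is exactly the middle ground the two built-in modes don’t give us.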

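Obviously we need to keep an eye on this ‘tweak’ whenever we upgrade our JSoup version, since our copy replaces the shipped file and a new release could change that file’s contents, format or location. But it does offer a quick and easy way of adjusting JSoup’s entity escaping if we need to.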
