Validating a HTML page with Java


This is a quick article with some hints about validating HTML pages in Java.
I first found JCabi’s library for validating documents against the W3C validators. This library submits an HTML document to the online W3C validator. However, I ran into several issues getting it to work. I could fix some of them, but not all. My unit tests passed in Eclipse, but not in Maven. And depending on the JDK I was using, it did not always work (I got errors with XPath factories).

I then found an open source project on GitHub, w3cValidator.
It looked promising, but the license was ambiguous. Besides, I realized it was not a good idea to automatically submit too many pages to the online W3C validator. I then checked how I could install a W3C validator locally. I did manage to install it on Ubuntu, but only after a lot of effort. Realistically, automating such an installation on Travis CI was not an option. So, I dropped the idea of relying on W3C validators for good. I had to find another library.

I tried JTidy, but I found it quite limited. It accepted documents that were valid XML but invalid HTML.
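
To make the difference between "valid XML" and "valid HTML" concrete, here is a minimal sketch that uses only the JDK's XML parser (it does not involve JTidy, and the class name `WellFormedCheck` and the sample markup are made up for illustration). It only checks well-formedness, which is exactly why it cannot tell whether a page is valid HTML.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class WellFormedCheck {

    /** Returns true if the markup parses as XML; says nothing about HTML validity. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to System.err
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XML, yet <li> has no list parent and <title> sits in <body>
        String wellFormedButInvalidHtml =
                "<html><body><li>orphan item</li><title>late title</title></body></html>";
        // Not well-formed XML: <p> is never closed
        String notWellFormed = "<html><body><p>text</body></html>";

        System.out.println(isWellFormed(wellFormedButInvalidHtml));
        System.out.println(isWellFormed(notWellFormed));
    }
}
```

The first sample passes the XML check even though an HTML checker should reject it, so a tool that reasons mostly in terms of XML can let such pages through.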

I eventually discovered the Nu Html Checker.
This project describes itself as follows…

The Nu Html Checker (v.Nu) is a name for the backend of,, and the HTML5 facet of the legacy W3C Validator.

It is written in Java. It is available in Maven repositories. And there is no license issue for my project. It looked great, except that there is no documentation about how to use it as a library. Instead, it is meant to be run as a stand-alone Java process.

I could have used the project’s main class to validate my HTML pages, but it is full of System.exit() statements, which would kill the JVM running my tests. So, not a good idea. By digging through the classes, I finally extracted a piece of code to use it as a library. You can find it in this Gist.

I duplicated it here, in this article.
First, update your POM with the following dependency.


And here is a Java snippet to validate the content of an HTML page.

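import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import nu.validator.messages.MessageEmitter;
import nu.validator.messages.MessageEmitterAdapter;
import nu.validator.messages.TextMessageEmitter;
import nu.validator.servlet.imagereview.ImageCollector;
import nu.validator.source.SourceCode;
import nu.validator.validation.SimpleDocumentValidator;
import nu.validator.xml.SystemErrErrorHandler;

import org.xml.sax.InputSource;

/**
 * Verifies that an HTML content is valid.
 * @param htmlContent the HTML content
 * @return true if it is valid, false otherwise
 * @throws Exception if the validator cannot be set up or the content cannot be read
 */
public boolean validateHtml( String htmlContent ) throws Exception {

	InputStream in = new ByteArrayInputStream( htmlContent.getBytes( StandardCharsets.UTF_8 ));
	ByteArrayOutputStream out = new ByteArrayOutputStream();

	// Collect the checker's messages in memory instead of printing them
	SourceCode sourceCode = new SourceCode();
	ImageCollector imageCollector = new ImageCollector(sourceCode);
	boolean showSource = false;
	MessageEmitter emitter = new TextMessageEmitter( out, false );
	MessageEmitterAdapter errorHandler = new MessageEmitterAdapter( sourceCode, showSource, imageCollector, 0, false, emitter );
	errorHandler.setErrorsOnly( true );

	// Problems while loading the schema go to System.err
	SimpleDocumentValidator validator = new SimpleDocumentValidator();
	validator.setUpMainSchema( "", new SystemErrErrorHandler());
	validator.setUpValidatorAndParsers( errorHandler, true, false );
	validator.checkHtmlInputSource( new InputSource( in ));

	// The page is valid when the checker reported no error
	return 0 == errorHandler.getErrors();
}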
And it works.

I also thought about one last option this morning.
Now that I have found the Nu checker, I will not switch to it, but I think it could work too. It relies on the JDK validator: an HTML page written as XML can be validated against the HTML XML schema. In some way, I think this is what JTidy does, but I am not sure.
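
As a rough sketch of that idea, and not something I have run against real pages, the JDK's `javax.xml.validation` API can check an XML document against an XML Schema and report each problem with its position. The schema below (`TOY_XSD`) is a deliberately tiny, hypothetical stand-in; a real setup would load an actual XHTML schema instead, and the class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, for illustration only.
final class JdkSchemaValidation {

    // Toy schema (an assumption, NOT a real XHTML schema): html > head > title, then body.
    static final String TOY_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='html'><xs:complexType><xs:sequence>"
        + "<xs:element name='head'><xs:complexType><xs:sequence>"
        + "<xs:element name='title' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:element name='body' type='xs:string'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns one message per problem found; an empty list means the document is valid. */
    static List<String> validate(String xsd, String document) {
        List<String> problems = new ArrayList<>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* warnings are ignored here */ }
                // Recoverable errors are recorded and validation continues
                public void error(SAXParseException e) { problems.add(describe(e)); }
                // Fatal errors stop validation; they are recorded in the catch block below
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(document)));
        } catch (SAXParseException e) {
            problems.add(describe(e));
        } catch (SAXException | IOException e) {
            problems.add(String.valueOf(e.getMessage()));
        }
        return problems;
    }

    private static String describe(SAXParseException e) {
        return "line " + e.getLineNumber() + ", column " + e.getColumnNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) {
        String valid = "<html><head><title>Hello</title></head><body>Hi</body></html>";
        String missingTitle = "<html><head></head><body>Hi</body></html>";
        System.out.println(validate(TOY_XSD, valid));
        validate(TOY_XSD, missingTitle).forEach(System.out::println);
    }
}
```

Keep in mind that this approach can only work for pages that are well-formed XML to begin with, which is a stronger requirement than what HTML parsers accept.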

Anyway, I hope this article has given you some ideas.

6 thoughts on “Validating a HTML page with Java”

  1. Hi, I am trying to implement the checker to validate all my pages, but I want to send error messages to the console. Any idea?

    1. I have not tried it, but I guess that by replacing SystemErrErrorHandler by your own error handler, you could redirect error messages wherever you want.

      1. I was just thinking about what I did a couple of weeks ago and thought I would leave my MyErrorHandler here for other people.
        Just call: new MyErrorHandler(htmlContent)

        private final static class MyErrorHandler implements ErrorHandler {
        	private boolean encounteredError;
        	private final String text;
        	MyErrorHandler(final String text) {
        		this.text = Objects.requireNonNull(text);
        	}
        	public void error(final SAXParseException exception) {
        		handle(exception);
        	}
        	public void fatalError(final SAXParseException exception) {
        		handle(exception);
        	}
        	public void warning(final SAXParseException exception) {
        		handle(exception);
        	}
        	private void handle(final SAXParseException exception) {
        		// Non-short-circuit "|" so that every error is logged, not only the first one
        		encounteredError = encounteredError
        				| handleHtmlError(exception, encounteredError, text);
        	}
        }
        // LOGGER is assumed to be a java.util.logging.Logger of the enclosing class
        static boolean handleHtmlError(final SAXException exception,
        		final boolean encounteredError, final String text) {
        	if (!encounteredError) {
        		LOGGER.warning(String
        				.format("Broken XHTML:%n================================================ BEGIN ==============================================%n%s%n ========================================================== END ==========================================================%n",
        						text));
        	}
        	if (Objects.requireNonNull(exception) instanceof SAXParseException) {
        		final SAXParseException sax = (SAXParseException) exception;
        		LOGGER.warning(String.format(
        				"Parser detected error: %s Line: %s Col: %s%n",
        				sax.getMessage(), sax.getLineNumber(),
        				sax.getColumnNumber()));
        	} else {
        		LOGGER.warning(String.format("Parser detected error: %s%n",
        				exception.getMessage()));
        	}
        	return true;
        }
  2. Very nice! I am looking for some more …..

    Any idea how I can get only the valid HTML from a given “htmlContent”?

    Example….my htmlContent looks like….

    some. some text text

    As you can see I have the last three orphan tags which should go away and return value would be only

    some. some text text

    Is there a way in the API to get the expected result stated above?

    Your help is greatly appreciated.

    1. If I understand correctly, you only want to extract the valid part of a given html snippet? Sorry, I have no idea. I do not think there is any built-in solution for this. Even IDE editors (such as Eclipse) only give the position of invalid parts. With such a solution, you would have to perform the extraction yourself.

      Maybe you can achieve the same thing with nuChecker.
      I do not remember whether errors come with their positions.
