Validating an HTML page with Java

Hi,

This is a quick article to give you some hints about validating HTML pages in Java.
I first found JCabi’s library to validate documents against the W3C validators. This library submits an HTML document to the online W3C validator. However, I ran into several issues getting it to work. I could fix some of them, but not all. My unit tests passed in Eclipse, but not in Maven. And depending on the JDK I was using, they did not always pass (some errors with XPath factories).

I then found an open source project on GitHub, w3cValidator.
It looked promising, but there was an ambiguity with the license. Besides, I realized it was not a good idea to automatically submit too many pages to the online W3C validator. I then checked how I could install a W3C validator locally. I did manage to install it on Ubuntu, but only after a lot of effort. Realistically, it was not possible to automate such an installation on Travis CI. So, I dropped the idea of relying on W3C validators for good. I had to find a library instead.

I tried JTidy, but I found it quite limited. It accepted documents that were valid XML but invalid HTML.

I eventually discovered the Nu Html Checker.
This project describes itself as follows…

The Nu Html Checker (v.Nu) is a name for the backend of html5.validator.nu, validator.w3.org/nu, and the HTML5 facet of the legacy W3C Validator.

It is written in Java. It is available in Maven repositories. And it raises no license issue for my project. It looked great, except there is no documentation about how to use it as a library. Instead, it is meant to be run as a stand-alone Java process.

I could have used the project’s main class to validate my HTML pages, but it is full of System.exit() statements, and such a call would stop the JVM that runs my tests. So, not a good idea. By digging through the classes, I finally extracted a piece of code that uses it as a library. You can find it in this Gist.

I duplicated it here, in this article.
First, update your POM with the following dependency.

<dependency>
	<groupId>nu.validator</groupId>
	<artifactId>validator</artifactId>
	<version>15.3.14</version>
	<scope>test</scope>
	<exclusions>
		<exclusion>
			<groupId>org.eclipse.jetty</groupId>
			<artifactId>*</artifactId>
		</exclusion>
	</exclusions>
</dependency>

And here is a Java snippet to validate the content of an HTML page.

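import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import nu.validator.messages.MessageEmitter;
import nu.validator.messages.MessageEmitterAdapter;
import nu.validator.messages.TextMessageEmitter;
import nu.validator.servlet.imagereview.ImageCollector;
import nu.validator.source.SourceCode;
import nu.validator.validation.SimpleDocumentValidator;
import nu.validator.xml.SystemErrErrorHandler;

import org.xml.sax.InputSource;

/**
 * Verifies that an HTML content is valid.
 * @param htmlContent the HTML content
 * @return true if it is valid, false otherwise
 * @throws Exception if the validator cannot be set up or the content cannot be checked
 */
public boolean validateHtml( String htmlContent ) throws Exception {

	InputStream in = new ByteArrayInputStream( htmlContent.getBytes( StandardCharsets.UTF_8 ));
	// The validation messages are written as text into this buffer
	ByteArrayOutputStream out = new ByteArrayOutputStream();

	SourceCode sourceCode = new SourceCode();
	ImageCollector imageCollector = new ImageCollector(sourceCode);
	boolean showSource = false;
	MessageEmitter emitter = new TextMessageEmitter( out, false );
	MessageEmitterAdapter errorHandler = new MessageEmitterAdapter( sourceCode, showSource, imageCollector, 0, false, emitter );
	// Only errors are reported
	errorHandler.setErrorsOnly( true );

	SimpleDocumentValidator validator = new SimpleDocumentValidator();
	validator.setUpMainSchema( "http://s.validator.nu/html5-rdfalite.rnc", new SystemErrErrorHandler());
	validator.setUpValidatorAndParsers( errorHandler, true, false );
	validator.checkHtmlInputSource( new InputSource( in ));

	// The page is valid when no error was reported
	return 0 == errorHandler.getErrors();
}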

And it works.

I also thought about a last option this morning.
Now that I have found the Nu checker, I will not switch to it, but I think it could work too. It relies on the JDK validator. Validating an HTML page could be done by validating it as XML against HTML’s XML schema. In some way, I think it is what JTidy does, but I am not sure.
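
As a rough illustration of this last idea, here is a minimal sketch based on the javax.xml.validation API that ships with the JDK. The class name JdkSchemaCheck and the tiny TOY_SCHEMA constant are made up for the example. A real check would have to load an actual XHTML schema instead (which one, and from where, is left open here), and this approach could only work for pages that are well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

import org.xml.sax.SAXException;

class JdkSchemaCheck {

	// Hypothetical toy schema, for illustration only: an "html" element
	// that contains a "head" followed by a "body". This is NOT the real HTML schema.
	static final String TOY_SCHEMA =
		"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
		+ "<xs:element name='html'>"
		+ "<xs:complexType><xs:sequence>"
		+ "<xs:element name='head' type='xs:string'/>"
		+ "<xs:element name='body' type='xs:string'/>"
		+ "</xs:sequence></xs:complexType>"
		+ "</xs:element>"
		+ "</xs:schema>";

	/**
	 * Validates an XML content against an XML schema with the JDK validator.
	 * @param xmlContent the XML content to check
	 * @param xsdContent the XML schema, as a string
	 * @return true if the content matches the schema, false otherwise
	 * @throws SAXException if the schema itself cannot be compiled
	 * @throws IOException if the content cannot be read
	 */
	static boolean isValid( String xmlContent, String xsdContent )
	throws SAXException, IOException {

		// Compile the schema first, so that a broken schema is not mistaken for an invalid page
		SchemaFactory factory = SchemaFactory.newInstance( XMLConstants.W3C_XML_SCHEMA_NS_URI );
		Schema schema = factory.newSchema( new StreamSource( new StringReader( xsdContent )));

		Validator validator = schema.newValidator();
		try {
			validator.validate( new StreamSource( new StringReader( xmlContent )));
			return true;

		} catch( SAXException e ) {
			// The content is either not well-formed or it does not match the schema
			return false;
		}
	}
}
```

With the toy schema, JdkSchemaCheck.isValid( "<html><head>t</head><body>b</body></html>", JdkSchemaCheck.TOY_SCHEMA ) is accepted, while a document without its head element is rejected. Unlike the Nu checker, this only verifies what the schema describes, so it is probably closer to the "valid XML" kind of check than to real HTML validation.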

Anyway, I hope this article has given you some ideas.

