attoparser is a Java parser for XML and HTML markup.
It is a SAX-style event-based parser (although it does not implement the SAX standard),
and it can also act as a DOM-style parser.
Its goals are set out in the feature table below.
It can also stand in for a standard XML parser, provided you need neither DTD/Schema validation nor entity resolution.
To use it, first create an implementation of IAttoHandler, usually by extending one of its predefined abstract implementations:
    public class MyHandler extends AbstractStandardMarkupAttoHandler {

        /*
         * Provide implementations for the events you are interested in.
         */

    }
Then simply execute the parser using your handler:
    final Reader documentReader = ...;

    final IAttoParser parser = new MarkupAttoParser(); // this is thread-safe and can be reused
    final IAttoHandler handler = new MyHandler();

    parser.parse(documentReader, handler);
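The `documentReader` above can be any `java.io.Reader`. As a minimal sketch, the following shows two common ways of obtaining one using only the JDK. The `DocumentReaders` class and its method names are hypothetical helpers, not part of attoparser, and the code assumes Java 8 or newer rather than the Java SE 5.0 baseline. `readAll` merely stands in for a consumer such as the parser, so that the example runs without the library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper class; not part of attoparser. */
final class DocumentReaders {

    private DocumentReaders() {
    }

    /** Wraps markup that is already in memory. */
    static Reader fromString(String markup) {
        return new StringReader(markup);
    }

    /** Opens a file with an explicit charset instead of the platform default. */
    static Reader fromFile(Path file) {
        try {
            return Files.newBufferedReader(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Stand-in for a consumer such as a parser: reads the whole document
     * and closes the reader when done.
     */
    static String readAll(Reader reader) {
        StringBuilder content = new StringBuilder();
        char[] chunk = new char[1024];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(chunk)) != -1) {
                content.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }
}
```

With these helpers, `DocumentReaders.fromString("<p>hi</p>")` could be passed as the `documentReader` argument. Whether the parser closes the reader is not stated here, so closing it in the calling code is the safer assumption.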
Java-based | Requires Java SE 5.0 or newer. |
Easy to deploy | attoparser is just a .jar library with no additional dependencies. There is no need to worry about which versions of SAX, DOM or any other XML-related standards your JDK build includes. |
Light | attoparser's single .jar file weighs only about 85 Kbytes. |
Event-based (SAX style) | attoparser offers an event-based interface: it calls methods on a user-provided handler class that implements a specific interface, usually by extending one of the provided abstract classes, which offer different levels of event detail. This is equivalent to implementing the ContentHandler interface when using standard SAX parsers. |
HTML-specific intelligence | attoparser offers specific intelligence in order to correctly parse HTML markup. For example: it can report an <img src="..."> element as a standalone element even if it is not minimized (<img src="..." />) and it has no closing tag. |
Optional DOM-style | attoparser also offers a prebuilt handler class that translates parsing events into a fully-featured attoDOM (attoparser-customized Document Object Model) tree of nodes, which can be modified and written back to markup if needed. |
Optional well-formedness | Users are not restricted to parsing only well-formed markup (from an XML standpoint). attoparser can be configured to ignore well-formedness rules such as tag balancing, attribute values delimited by quotes, correct XML/XHTML/HTML prolog specification, etc. This makes attoparser especially well-suited for parsing HTML code. |
Small memory footprint | Unless specifically required by the user's handler implementation, attoparser avoids copying the document contents in memory by always working with the original char[] buffer, providing (offset,len) pairs for delimiting event artifacts. |
Full event location | Each event artifact (and attoDOM node) provides its location in the original document as a line and column number. |
Several levels of detail | Users can specify the level of detail they need for their events by choosing a specific abstract base class for their handler implementations. For example, a user who is not interested in delimiting element (tag) names or attributes can choose a detail level that ignores tag contents, resulting in a performance improvement. |
Document reconstruction | attoparser takes all the required measures to ensure that, when needed, the original markup will be completely reconstructable after parsing. No single character or artifact is ignored or left out of event reporting at the most detailed level. This is a useful feature when the parser is used for processing templates. |
No escaping/unescaping | No text escaping or unescaping is applied to parsed artifacts, and no entity substitution (e.g. &aacute; to á) is performed, so users can apply their own rules where required. This frees the parser from making possibly invalid assumptions about markup due to differences between XML and HTML escaping rules, and also allows a complete reconstruction of the original document after parsing, if needed. |
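Several rows of the table (event-based reporting, the small memory footprint, full event location, and document reconstruction without unescaping) can be illustrated with a toy example. The sketch below is not attoparser's implementation or API: `SliceHandler`, `SliceScanner`, and `RecordingHandler` are hypothetical names, and the scanner only splits input into `<...>` tags and text runs, assuming no `>` occurs inside attribute values. It shows events delivered as (offset, len) slices of the original `char[]` together with a 1-based line and column, and a handler that rebuilds the input exactly from those slices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; not attoparser's actual API. */
interface SliceHandler {
    void onTag(char[] buffer, int offset, int len, int line, int col);
    void onText(char[] buffer, int offset, int len, int line, int col);
}

/** Toy scanner: reports "<...>" tags and text runs without copying the buffer. */
final class SliceScanner {

    private SliceScanner() {
    }

    static void scan(char[] buf, SliceHandler handler) {
        int line = 1;
        int col = 1; // 1-based position of buf[i]
        int i = 0;
        while (i < buf.length) {
            int start = i;
            int startLine = line;
            int startCol = col;
            boolean tag = buf[i] == '<';
            if (tag) {
                // A tag runs up to and including the next '>' (or to end of input).
                while (i < buf.length && buf[i] != '>') {
                    i++;
                }
                if (i < buf.length) {
                    i++;
                }
            } else {
                // A text run stops just before the next '<'.
                while (i < buf.length && buf[i] != '<') {
                    i++;
                }
            }
            // Advance the line/column counters over the consumed characters.
            for (int k = start; k < i; k++) {
                if (buf[k] == '\n') {
                    line++;
                    col = 1;
                } else {
                    col++;
                }
            }
            if (tag) {
                handler.onTag(buf, start, i - start, startLine, startCol);
            } else {
                handler.onText(buf, start, i - start, startLine, startCol);
            }
        }
    }
}

/** Logs each event and rebuilds the document from the reported slices. */
final class RecordingHandler implements SliceHandler {
    final StringBuilder rebuilt = new StringBuilder();
    final List<String> log = new ArrayList<String>();

    public void onTag(char[] buffer, int offset, int len, int line, int col) {
        record("TAG", buffer, offset, len, line, col);
    }

    public void onText(char[] buffer, int offset, int len, int line, int col) {
        record("TEXT", buffer, offset, len, line, col);
    }

    private void record(String kind, char[] buffer, int offset, int len, int line, int col) {
        // The only copy happens here, because this handler chooses to keep the text.
        rebuilt.append(buffer, offset, len);
        log.add(kind + "@" + line + ":" + col + " len=" + len);
    }
}
```

Calling `SliceScanner.scan(input.toCharArray(), handler)` with a `RecordingHandler` leaves `handler.rebuilt` equal to the input, including any entity such as `&aacute;`, because every character belongs to exactly one reported slice and nothing is unescaped along the way.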
attoparser is Open Source Software, and it is distributed under the terms of the Apache License 2.0.
attoparser is stable and production-ready. Current version is 2.0.7.RELEASE.