Package org.attoparser.select Description
Handlers for filtering a part or several parts of markup during parsing
in a fast and efficient way.
Handler Implementations
There are two main handlers (implementations of IMarkupHandler) for
markup selection in this package:
BlockSelectorMarkupHandler
-
For selecting entire blocks of markup (i.e.
elements and all the nodes in their subtrees). This can be used, for example, for extracting
fragments of markup during the parsing of the document, so that discarded markup
never reaches higher layers of the document processing infrastructure.
NodeSelectorMarkupHandler
-
For selecting only specific nodes in markup (i.e. not including their subtrees). This can be used
for modifying specific tags in markup during parsing, for example by
adding attributes to them that are not present in the original parsed markup.
Markup Selector Syntax
Markup selectors used by handlers in this package use a specific syntax with features borrowed from
XPath, CSS and jQuery selectors, in order to provide ease of use for most users. There are often several
ways to express the same selector, depending on the user's preferences.
For example, all the following equivalent selectors will select every <div> with class
content, in any position in markup:
//div[class='content']
//div[@class='content']
div[class='content']
div[@class='content']
//div.content
div.content
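As a rough illustration of why the six spellings above are interchangeable, the following Java sketch rewrites each of them into one canonical form. The class SelectorNormalizerSketch and its normalize method are hypothetical and not part of this package. The sketch only understands an element name followed by a single class shorthand or a single attribute test, and it assumes HTML mode for the .content shorthand.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.attoparser.select.
final class SelectorNormalizerSketch {

    // Optional "//", an element name, then either ".cls" or
    // "[attr='v']" / "[@attr='v']" (single or double quotes).
    private static final Pattern SIMPLE = Pattern.compile(
            "^(?://)?([A-Za-z][\\w-]*)(?:\\.([\\w-]+)|\\[@?([\\w:-]+)=(['\"])(.*?)\\4\\])$");

    // Rewrites a supported selector as //name[attr='value'].
    static String normalize(String selector) {
        Matcher m = SIMPLE.matcher(selector.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported selector in this sketch: " + selector);
        }
        String name = m.group(1);
        boolean classShorthand = m.group(2) != null;
        String attr = classShorthand ? "class" : m.group(3);
        String value = classShorthand ? m.group(2) : m.group(5);
        return "//" + name + "[" + attr + "='" + value + "']";
    }
}
```

Calling normalize on any of the six selectors listed above yields the same string, which is one way to see that they describe the same selection.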
These are the different operations this syntax allows:
Basic selectors
- x
//x
-
Both are equivalent, and mean children of the current node with name x, at any depth in
markup. If a reference resolver is being used, they will also be equivalent to
%x (see below).
- /x
-
Means direct children of the current node with name x.
- x/y
-
Means direct children with name y of elements with name x, where the parent
x elements can be at any level in markup.
- x//y
-
Means children (at any level) with name y of elements with name x, where the parent
x elements can also be at any level in markup.
- text()
comment()
cdata()
doctype()
xmldecl()
procinstr()
-
These can be used like x (in the same places) but instead of selecting elements (i.e. tags)
will select, respectively: text nodes, comments, CDATA sections, DOCTYPE clauses, XML Declarations and
Processing Instructions.
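The difference between the / and // separators described above can be sketched with a small path matcher. This is not how the handlers in this package are implemented; PathMatchSketch is a hypothetical class that takes the list of element names from the document root down to a candidate element, treats the document root as the current node, and supports element names only (no attributes, indexes or node-type selectors).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.attoparser.select.
final class PathMatchSketch {

    // True if the element at the end of 'path' is selected by 'selector'.
    static boolean matches(String selector, List<String> path) {
        return match(parse(selector), 0, path, 0);
    }

    // Splits a selector into {axis, name} steps. A selector without a
    // leading slash behaves like one starting with "//".
    private static List<String[]> parse(String selector) {
        String s = selector.startsWith("/") ? selector : "//" + selector;
        List<String[]> steps = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            String axis = s.startsWith("//", i) ? "//" : "/";
            i += axis.length();
            int end = s.indexOf('/', i);
            if (end < 0) {
                end = s.length();
            }
            steps.add(new String[] { axis, s.substring(i, end) });
            i = end;
        }
        return steps;
    }

    // Step 'step' must match at index 'start' ("/") or at any index
    // from 'start' onwards ("//"); the last step must consume the path.
    private static boolean match(List<String[]> steps, int step, List<String> path, int start) {
        if (step == steps.size()) {
            return start == path.size();
        }
        boolean anyDepth = steps.get(step)[0].equals("//");
        String name = steps.get(step)[1];
        int last = anyDepth ? path.size() - 1 : start;
        for (int i = start; i <= last && i < path.size(); i++) {
            if (path.get(i).equals(name) && match(steps, step + 1, path, i + 1)) {
                return true;
            }
        }
        return false;
    }
}
```

For a p element at html/body/div/span/p, this sketch accepts p, //p, span/p and div//p, and rejects /p and div/p.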
Attribute matching
- x[z='v']
x[z="v"]
x[@z='v']
x[@z="v"]
-
All four are equivalent, and mean elements with name x and an attribute called z with value
v. Note attribute values can be surrounded by single or double quotes, and attribute names
can be specified with a leading @ (as in XPath) or without it (more similar to jQuery). For
the sake of simplicity, only the single-quoted, no-@ syntax will be used for the rest of
the examples below.
- [z='v']
//[z='v']
-
Means any elements with an attribute called z with value v.
- x[z]
-
Means elements with name x and an attribute called z, with any value.
- x[!z]
-
Means elements with name x and no attribute called z.
- x[z1='v1' and z2='v2']
-
Means elements with name x and attributes z1 and z2 with values
v1 and v2, respectively.
- x[z1='v1' or z2='v2']
-
Means elements with name x and either an attribute z1 with value
v1 or an attribute z2 with value v2.
- x[z1='v1' and (z2='v2' or z3='v3')]
-
Selects according to the specified complex attribute expression. As can be seen, these expressions
can be parenthesized to control the evaluation order.
- x[z!='v']
x[z^='v']
x[z$='v']
x[z*='v']
-
Similar to x[z='v'] but applying different operators to attribute matching instead of
equality (=). Respectively: not equal (!=),
starts with (^=), ends with ($=) and
contains (*=).
- x.z
x[class='z']
-
When parsing in HTML mode (and only then), these two selectors are completely equivalent. In addition,
in this case the selector looks for an x element that has the z class, taking into account that
HTML's class attribute allows several classes to be specified, separated by white space. So
something like <x class="z y w"> will be matched by this selector.
- x#z
x[id='z']
-
When parsing in HTML mode (and only then), these two selectors will be completely equivalent, matching those
x elements that have an ID with value z.
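The attribute operators and the HTML-mode class shorthand described above can be sketched as plain string tests. AttributeMatchSketch is a hypothetical class, not part of this package. It assumes that a missing attribute fails every operator, including !=, which the text above does not specify. It also keeps plain equality and class-token matching as two separate methods, whereas in HTML mode a class='z' test would behave like hasClass.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper, not part of org.attoparser.select.
final class AttributeMatchSketch {

    // Applies one operator (=, !=, ^=, $=, *=) to a single attribute.
    // Assumption of this sketch: a missing attribute never matches.
    static boolean matches(Map<String, String> attrs, String name, String op, String value) {
        String actual = attrs.get(name);
        if (actual == null) {
            return false;
        }
        switch (op) {
            case "=":  return actual.equals(value);
            case "!=": return !actual.equals(value);
            case "^=": return actual.startsWith(value);
            case "$=": return actual.endsWith(value);
            case "*=": return actual.contains(value);
            default:
                throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    // HTML-mode reading of x.z: the class attribute is treated as a
    // white-space-separated list of class names.
    static boolean hasClass(Map<String, String> attrs, String cls) {
        String actual = attrs.get("class");
        return actual != null
                && Arrays.asList(actual.trim().split("\\s+")).contains(cls);
    }
}
```

With attributes class="z y w", hasClass accepts z, y and w but rejects a partial name such as zy, which is the behavior the x.z description calls for.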
Index-based matching
- x[i]
-
Means elements with name x at index i among their siblings.
A sibling here is a child node of the same parent element that matches the same
conditions (in this case "having x as name"). Note that indexes start at
0.
- x[z='v'][i]
-
Means elements with name x, attribute z with value v, and positioned at
index i among their siblings (same name, same attribute with that value).
- x[even()]
x[odd()]
-
Means elements with name x at an even (or odd) index among their siblings.
Note that even includes index 0.
- x[>i]
x[<i]
-
Means elements with name x at an index greater (or less) than i
among their siblings.
- text()[i]
comment()[>i]
-
Applies the specified index-based matching operations to nodes of types other than elements: texts,
comments, CDATA sections, etc.
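The index operations above all reduce to a test on a 0-based position within the list of siblings that already satisfy the rest of the selector. The following sketch shows that reading. IndexMatchSketch is a hypothetical class, and the caller is assumed to have already filtered the siblings by name and attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of org.attoparser.select.
final class IndexMatchSketch {

    // Turns the text between the brackets ("2", "even()", "odd()",
    // ">1", "<3") into a test on a 0-based sibling index.
    static IntPredicate parse(String expr) {
        String e = expr.trim();
        if (e.equals("even()")) {
            return i -> i % 2 == 0; // index 0 counts as even
        }
        if (e.equals("odd()")) {
            return i -> i % 2 == 1;
        }
        if (e.startsWith(">")) {
            int lower = Integer.parseInt(e.substring(1).trim());
            return i -> i > lower;
        }
        if (e.startsWith("<")) {
            int upper = Integer.parseInt(e.substring(1).trim());
            return i -> i < upper;
        }
        int exact = Integer.parseInt(e);
        return i -> i == exact;
    }

    // Keeps the siblings whose position passes the index test.
    // 'matchingSiblings' must already satisfy the rest of the selector.
    static <T> List<T> select(List<T> matchingSiblings, String expr) {
        IntPredicate keep = parse(expr);
        List<T> out = new ArrayList<>();
        for (int i = 0; i < matchingSiblings.size(); i++) {
            if (keep.test(i)) {
                out.add(matchingSiblings.get(i));
            }
        }
        return out;
    }
}
```

Because the index is counted among matching siblings only, x[z='v'][0] picks the first x with that attribute value, even if other x elements come before it under the same parent.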
Reference-based matching
- x%ref
-
Means elements with name x that match the markup selector reference
with value ref. These markup selector references usually have a user-defined
meaning and are resolved to a markup selector without references by means of an instance of the
IMarkupSelectorReferenceResolver interface passed to the selecting
markup handlers (BlockSelectorMarkupHandler and NodeSelectorMarkupHandler) during construction.
For example, a reference resolver could be
configured that converts (resolves) %someref into
div[class='someref' or id='someref']. Also, the
Thymeleaf template engine uses this mechanism
for resolving %fragmentName (or simply fragmentName, as explained below) into
//[th:fragment='fragmentName' or data-th-fragment='fragmentName'].
- %ref
-
Means any elements (whichever the name) matching reference with value ref.
- ref
-
Equivalent to %ref. When a markup selector reference resolver has been configured,
ref can mean both "element with name ref" and
"element matching reference ref" (both will match).
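The two resolutions used as examples above can be sketched as simple string-building functions. This sketch does not use the IMarkupSelectorReferenceResolver interface itself: ReferenceResolverSketch is a hypothetical class, java.util.function.Function stands in for the resolver, and the expand helper only handles a bare %ref selector.

```java
import java.util.function.Function;

// Hypothetical helper, not part of org.attoparser.select.
// Function<String, String> stands in for a reference resolver.
final class ReferenceResolverSketch {

    // Resolves "someref" into div[class='someref' or id='someref'].
    static final Function<String, String> CLASS_OR_ID =
            ref -> "div[class='" + ref + "' or id='" + ref + "']";

    // Resolves a fragment name the way the Thymeleaf example above
    // describes it.
    static final Function<String, String> FRAGMENT =
            ref -> "//[th:fragment='" + ref + "' or data-th-fragment='" + ref + "']";

    // Expands a bare "%ref" selector; any other selector is returned
    // unchanged by this sketch.
    static String expand(String selector, Function<String, String> resolver) {
        return selector.startsWith("%")
                ? resolver.apply(selector.substring(1))
                : selector;
    }
}
```

The result of expand contains no references, so it can be read with the basic and attribute-matching rules described earlier in this section.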