org.archive.wayback.util.htmllex
Class ContextAwareLexer
java.lang.Object
org.archive.wayback.util.htmllex.NodeUtils
org.archive.wayback.util.htmllex.ContextAwareLexer
public class ContextAwareLexer
extends NodeUtils
The Lexer that comes with htmlparser does not handle non-escaped HTML
entities within SCRIPT tags. By default, a script body that contains them
can cause the lexer to skip over a large part of the document. Technically,
such content isn't legit HTML, but of course, folks do stuff like that all
the time. So, this class uses a ParseContext object, passed in at
construction. It observes the SCRIPT and STYLE tags, records what it sees as
properties on the ParseContext, and uses that state information to perform a
parseCDATA() call instead of a nextNode() call at the right time, to try to
keep the SAX parsing in sync with the document.
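The idea described above can be illustrated with a small, self-contained sketch. This is not the Wayback or htmlparser implementation: the class ScriptAwareLexerSketch, its nested Context class, the rawTextTag field, and the tokenize() helper are all hypothetical names invented for illustration, and tokens are plain strings rather than org.htmlparser.Node objects. It only shows the general technique of tracking SCRIPT/STYLE state in a context object and switching to a raw-text read while inside such an element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the Wayback or htmlparser code.
class ScriptAwareLexerSketch {
    // Stand-in for the state that the description says is kept on ParseContext.
    static final class Context {
        String rawTextTag = null; // "script" or "style" while inside one, else null
    }

    private final String html;
    private final Context context;
    private int pos = 0;

    ScriptAwareLexerSketch(String html, Context context) {
        this.html = html;
        this.context = context;
    }

    /** Returns the next token (a tag or a run of text), or null at end of input. */
    String nextNode() {
        if (pos >= html.length()) return null;
        // Inside SCRIPT/STYLE: read raw text instead of lexing markup.
        return context.rawTextTag != null ? readRawText() : readMarkup();
    }

    // Consumes everything up to the matching close tag as a single text token.
    private String readRawText() {
        String close = "</" + context.rawTextTag;
        int end = indexOfIgnoreCase(html, close, pos);
        if (end < 0) end = html.length();
        context.rawTextTag = null;
        if (end == pos) return readMarkup(); // empty body: go straight to the close tag
        String text = html.substring(pos, end);
        pos = end;
        return text;
    }

    // Ordinary lexing: a tag runs from '<' to '>', text runs up to the next '<'.
    private String readMarkup() {
        if (html.charAt(pos) == '<') {
            int end = html.indexOf('>', pos);
            end = (end < 0) ? html.length() : end + 1;
            String tag = html.substring(pos, end);
            pos = end;
            String name = tagName(tag);
            if (name.equals("script") || name.equals("style")) {
                context.rawTextTag = name; // remember that raw text follows
            }
            return tag;
        }
        int end = html.indexOf('<', pos);
        if (end < 0) end = html.length();
        String text = html.substring(pos, end);
        pos = end;
        return text;
    }

    // Lower-cased element name of an opening tag; empty for close tags.
    private static String tagName(String tag) {
        int i = 1;
        while (i < tag.length() && Character.isLetterOrDigit(tag.charAt(i))) i++;
        return tag.substring(1, i).toLowerCase(Locale.ROOT);
    }

    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    // Convenience helper: collects every token until nextNode() returns null.
    static List<String> tokenize(String html) {
        ScriptAwareLexerSketch lexer = new ScriptAwareLexerSketch(html, new Context());
        List<String> nodes = new ArrayList<>();
        for (String n = lexer.nextNode(); n != null; n = lexer.nextNode()) nodes.add(n);
        return nodes;
    }
}
```

With this sketch, tokenize("<script>if (a < b) { run(); }</script>") yields three tokens: the opening tag, the whole script body as one text token, and the closing tag. A lexer without the context check would treat the "<" inside the script as the start of a tag.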
Author:
brad
Method Summary
org.htmlparser.Node  nextNode()
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
ContextAwareLexer
public ContextAwareLexer(org.htmlparser.lexer.Lexer lexer,
                         ParseContext context)
nextNode
public org.htmlparser.Node nextNode()
                            throws org.htmlparser.util.ParserException
Throws:
org.htmlparser.util.ParserException
Copyright © 2005-2011 Internet Archive. All Rights Reserved.