org.archive.wayback.util.htmllex
Class ContextAwareLexer

java.lang.Object
  extended by org.archive.wayback.util.htmllex.NodeUtils
      extended by org.archive.wayback.util.htmllex.ContextAwareLexer

public class ContextAwareLexer
extends NodeUtils

The Lexer that comes with htmlparser does not handle non-escaped HTML entities within SCRIPT tags. By default, script content of that kind can cause the lexer to skip over a large part of the document. Technically, such content isn't legit HTML, but of course, folks do stuff like that all the time.

So, this class takes a ParseContext object, passed in at construction, and observes the SCRIPT and STYLE tags as they go by. It sets properties on the ParseContext when it sees those tags, and uses that state information to perform a parseCDATA() call instead of a nextNode() call at the right time, to try to keep the SAX parsing in sync with the document.
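The state-switching idea described above can be illustrated with a small, self-contained sketch. This is not the ContextAwareLexer implementation and does not use htmlparser or ParseContext: the class ScriptAwareTokenizer, its rawTextTag field, and its methods are hypothetical names invented for this illustration, and the sample input with an unescaped '<' inside a script is an assumed example. It only shows the general technique of remembering that a SCRIPT or STYLE open tag was just seen and reading the following content as one raw block rather than lexing it as markup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of wayback or htmlparser.
final class ScriptAwareTokenizer {
    private final String html;
    private int pos = 0;
    // Plays the role of the parse context: name of the raw-text element
    // whose open tag was just returned ("script" or "style"), else null.
    private String rawTextTag = null;

    ScriptAwareTokenizer(String html) {
        this.html = html;
    }

    /** Returns the next token (a tag or a run of text), or null at end of input. */
    String nextToken() {
        if (pos >= html.length()) {
            return null;
        }
        if (rawTextTag != null) {
            // "CDATA mode": take everything up to the matching close tag as
            // one text token, so a stray '<' in the body is not read as a tag.
            int end = indexOfIgnoreCase(html, "</" + rawTextTag, pos);
            if (end < 0) {
                end = html.length();
            }
            rawTextTag = null;
            if (end > pos) {
                String body = html.substring(pos, end);
                pos = end;
                return body;
            }
            // Empty body: fall through to normal tokenizing.
        }
        if (html.charAt(pos) == '<') {
            int gt = html.indexOf('>', pos);
            int end = (gt < 0) ? html.length() : gt + 1;
            String tag = html.substring(pos, end);
            pos = end;
            // Record state for the NEXT call (simplified prefix match).
            if (tag.regionMatches(true, 0, "<script", 0, 7)) {
                rawTextTag = "script";
            } else if (tag.regionMatches(true, 0, "<style", 0, 6)) {
                rawTextTag = "style";
            }
            return tag;
        }
        int lt = html.indexOf('<', pos);
        int end = (lt < 0) ? html.length() : lt;
        String text = html.substring(pos, end);
        pos = end;
        return text;
    }

    /** Convenience helper: collects all tokens of a document into a list. */
    static List<String> tokenize(String html) {
        ScriptAwareTokenizer t = new ScriptAwareTokenizer(html);
        List<String> out = new ArrayList<>();
        for (String tok = t.nextToken(); tok != null; tok = t.nextToken()) {
            out.add(tok);
        }
        return out;
    }

    // Case-insensitive indexOf that never changes string length.
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling ScriptAwareTokenizer.tokenize on a document whose script body contains a bare '<' returns the whole script body as a single text token, and the tags after the closing script tag are still tokenized normally. The real class delegates to the htmlparser Lexer and keeps its state in the ParseContext rather than in a private field.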

Author:
brad

Field Summary
 
Fields inherited from class org.archive.wayback.util.htmllex.NodeUtils
SCRIPT_TAG_NAME, STYLE_TAG_NAME
 
Constructor Summary
ContextAwareLexer(org.htmlparser.lexer.Lexer lexer, ParseContext context)
           
 
Method Summary
 org.htmlparser.Node nextNode()
           
 
Methods inherited from class org.archive.wayback.util.htmllex.NodeUtils
isCloseTagNodeNamed, isNonEmptyOpenTagNodeNamed, isOpenTagNodeNamed, isRemarkNode, isTagNode, isTagNodeNamed, isTextNode
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ContextAwareLexer

public ContextAwareLexer(org.htmlparser.lexer.Lexer lexer,
                         ParseContext context)

Method Detail

nextNode

public org.htmlparser.Node nextNode()
                             throws org.htmlparser.util.ParserException
Throws:
org.htmlparser.util.ParserException


Copyright © 2005-2011 Internet Archive. All Rights Reserved.