org.archive.wayback.replay
Class TextDocument

java.lang.Object
  extended by org.archive.wayback.replay.TextDocument

public class TextDocument
extends java.lang.Object

Wraps the functionality for converting a Resource (an InputStream plus HTTP headers) into a StringBuilder, applying several common URL resolution methods to that StringBuilder, inserting arbitrary Strings into the page, and then converting the page back to a byte array.

Author:
brad

Field Summary
 java.lang.StringBuilder sb
          the internal StringBuilder
 
Constructor Summary
TextDocument(Resource resource, CaptureSearchResult result, ResultURIConverter uriConverter)
           
 
Method Summary
 byte[] getBytes()
           
 java.lang.String getCharSet()
           
protected  java.lang.String getCharsetFromBytes(Resource resource)
          Attempts to figure out the character set of the document using the excellent juniversalchardet library.
protected  java.lang.String getCharsetFromHeaders(Resource resource)
          Attempt to determine the character encoding of the document from the "charset=" parameter of the Content-Type HTTP header.
protected  java.lang.String getCharsetFromMeta(Resource resource)
          Attempt to find a META tag in the HTML that hints at the character set used to write the document.
 java.lang.String getJSIncludeString(java.lang.String jsUrl)
           
protected  java.lang.String guessCharset()
          Use META tags, byte-character-detection, HTTP headers, hope, and prayer to figure out what character encoding is being used for the document.
 java.lang.String includeJspString(java.lang.String jspPath, javax.servlet.http.HttpServletRequest httpRequest, javax.servlet.http.HttpServletResponse httpResponse, WaybackRequest wbRequest, CaptureSearchResults results, CaptureSearchResult result, Resource resource)
           
 void insertAtEndOfBody(java.lang.String toInsert)
           
 void insertAtStartOfBody(java.lang.String toInsert)
           
 void insertAtStartOfHead(java.lang.String toInsert)
           
 void readFully()
          Read bytes from the input stream, using a best guess for the character encoding.
 void readFully(java.lang.String charSet)
           
 void resolveAllPageUrls()
          Update all URLs inside the page, so they resolve correctly to absolute URLs within the Wayback service.
 void resolveASXRefUrls()
           
 void resolveCSSUrls()
           
 void resolvePageUrls()
          Update URLs inside the page, so those URLs which must be correct at page load time resolve correctly to absolute URLs.
 void setCharSet(java.lang.String charSet)
           
 void stripHTML()
           
 void writeToOutputStream(java.io.OutputStream os)
          Write the contents of the page to the client.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

sb

public java.lang.StringBuilder sb
the internal StringBuilder

Constructor Detail

TextDocument

public TextDocument(Resource resource,
                    CaptureSearchResult result,
                    ResultURIConverter uriConverter)
Parameters:
resource -
result -
uriConverter -
Method Detail

getCharsetFromHeaders

protected java.lang.String getCharsetFromHeaders(Resource resource)
                                          throws java.io.IOException
Attempt to determine the character encoding of the document from the "charset=" parameter of the Content-Type HTTP header.

Parameters:
resource -
Returns:
String character set found or null if the header was not present
Throws:
java.io.IOException
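The header lookup described above can be illustrated with a minimal, standalone sketch. It is not the Wayback implementation: the class name ContentTypeCharsetSketch and the method charsetFromContentType are hypothetical, and the method takes the raw Content-Type value as a String instead of a Resource.

```java
import java.util.Locale;

// Hypothetical helper, not part of org.archive.wayback.
final class ContentTypeCharsetSketch {

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFromContentType("text/html"));
    }
}
```

Returning null when no charset parameter is found mirrors the documented contract, so a caller can fall through to another detection strategy.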

getCharsetFromMeta

protected java.lang.String getCharsetFromMeta(Resource resource)
                                       throws java.io.IOException
Attempt to find a META tag in the HTML that hints at the character set used to write the document.

Parameters:
resource -
Returns:
String character set found from META tags in the HTML
Throws:
java.io.IOException
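One plausible way to find such a hint is a case-insensitive regular expression over the first part of the document. The sketch below is an assumption about the general technique, not the method's actual code: MetaCharsetSketch, charsetFromMeta and the htmlPrefix parameter (the start of the page, already decoded leniently into a String) are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.archive.wayback.
final class MetaCharsetSketch {

    // Matches charset=VALUE inside a <meta ...> tag. This covers both
    // <meta charset="x"> and <meta http-equiv=... content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the first charset named in a META tag, or null if none is found. */
    static String charsetFromMeta(String htmlPrefix) {
        Matcher m = META_CHARSET.matcher(htmlPrefix);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><META HTTP-EQUIV=\"Content-Type\" "
                + "CONTENT=\"text/html; charset=Shift_JIS\"></head></html>";
        System.out.println(charsetFromMeta(html));
    }
}
```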

getCharsetFromBytes

protected java.lang.String getCharsetFromBytes(Resource resource)
                                        throws java.io.IOException
Attempts to figure out the character set of the document using the excellent juniversalchardet library.

Parameters:
resource -
Returns:
String character encoding found, or null if nothing looked good.
Throws:
java.io.IOException

guessCharset

protected java.lang.String guessCharset()
                                 throws java.io.IOException
Use META tags, byte-character-detection, HTTP headers, hope, and prayer to figure out what character encoding is being used for the document. If nothing else works, assumes UTF-8 for now.

Returns:
String charset for Resource
Throws:
java.io.IOException
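The fallback behaviour described above can be sketched as a chain of detectors in which the first usable answer wins and UTF-8 is the last resort. This is an illustration only: the class GuessCharsetSketch is hypothetical, the order in which the real method consults its sources is not stated on this page, and the Charset.isSupported check is an extra safeguard assumed here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper, not part of org.archive.wayback.
final class GuessCharsetSketch {

    static final String DEFAULT_CHARSET = "UTF-8";

    /** Returns the first supported charset a detector reports, else UTF-8. */
    static String guess(List<Supplier<String>> detectors) {
        for (Supplier<String> detector : detectors) {
            String candidate = detector.get();
            if (candidate != null && isSupported(candidate)) {
                return candidate;
            }
        }
        return DEFAULT_CHARSET;
    }

    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the header, META and byte-based detectors.
        Supplier<String> fromHeaders = () -> null;
        Supplier<String> fromMeta = () -> "ISO-8859-1";
        Supplier<String> fromBytes = () -> "UTF-16";
        System.out.println(guess(Arrays.asList(fromHeaders, fromMeta, fromBytes)));
    }
}
```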

resolvePageUrls

public void resolvePageUrls()
Update URLs inside the page, so those URLs which must be correct at page load time resolve correctly to absolute URLs. This means ensuring there is a BASE HREF tag, adding one if missing, and then resolving: FRAME-SRC, META-URL, LINK-HREF, SCRIPT-SRC tag-attribute pairs against either the existing BASE-HREF, or the page's absolute URL if it was missing.
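Resolving a reference against either an existing BASE HREF or the page's own URL can be illustrated with java.net.URI. The sketch is a simplified assumption about the resolution step only; BaseResolveSketch and its parameters are hypothetical names, and it does not locate or rewrite tags.

```java
import java.net.URI;

// Hypothetical helper, not part of org.archive.wayback.
final class BaseResolveSketch {

    /**
     * Resolves reference against baseHref when one exists,
     * otherwise against the absolute URL of the page itself.
     */
    static String resolve(String baseHref, String pageUrl, String reference) {
        String base = (baseHref != null) ? baseHref : pageUrl;
        // URI.create throws IllegalArgumentException for malformed input,
        // which real-world archived pages may well contain.
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/dir/page.html";
        System.out.println(resolve(null, page, "img/logo.png"));
        System.out.println(resolve("http://example.com/other/", page, "../style.css"));
    }
}
```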


resolveAllPageUrls

public void resolveAllPageUrls()
Update all URLs inside the page, so they resolve correctly to absolute URLs within the Wayback service.
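As a rough illustration of what rewriting every URL might involve, the sketch below makes each double-quoted href or src value absolute and prepends a replay prefix. Everything here is an assumption: ReplayRewriteSketch is a hypothetical class, the prefix layout (service URL plus a timestamp) is invented for the example, and the actual method presumably delegates URL construction to the ResultURIConverter passed to the constructor.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.archive.wayback.
final class ReplayRewriteSketch {

    // Only handles double-quoted href/src attribute values.
    private static final Pattern URL_ATTR = Pattern.compile(
            "\\b(href|src)\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Makes every href/src absolute and prefixes it with replayPrefix. */
    static String rewrite(String html, String pageUrl, String replayPrefix) {
        URI base = URI.create(pageUrl);
        Matcher m = URL_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Throws IllegalArgumentException if the value is not a valid URI.
            String absolute = base.resolve(m.group(2)).toString();
            String replacement = m.group(1) + "=\"" + replayPrefix + absolute + "\"";
            // quoteReplacement keeps '$' and '\' in URLs from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
                "<a href=\"about.html\">About</a>",
                "http://example.com/index.html",
                "http://wayback.example.org/20090101000000/"));
    }
}
```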


resolveCSSUrls

public void resolveCSSUrls()

resolveASXRefUrls

public void resolveASXRefUrls()

stripHTML

public void stripHTML()

readFully

public void readFully(java.lang.String charSet)
               throws java.io.IOException
Parameters:
charSet -
Throws:
java.io.IOException
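Reading a stream into a StringBuilder with a named character set could look roughly like the following. This is a standalone sketch, not the class's own code: ReadFullySketch is a hypothetical name, and the method takes the InputStream directly instead of reading it from the wrapped Resource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of org.archive.wayback.
final class ReadFullySketch {

    /** Decodes the whole stream using charSet and returns the characters. */
    static StringBuilder readFully(InputStream in, String charSet) throws IOException {
        StringBuilder sb = new StringBuilder();
        // An unknown charset name raises UnsupportedEncodingException, an IOException.
        try (Reader reader = new InputStreamReader(in, charSet)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readFully(new ByteArrayInputStream(raw), "ISO-8859-1"));
    }
}
```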

readFully

public void readFully()
               throws java.io.IOException
Read bytes from the input stream, using a best guess for the character encoding.

Throws:
java.io.IOException

getBytes

public byte[] getBytes()
                throws java.io.UnsupportedEncodingException
Returns:
raw bytes contained in internal StringBuilder
Throws:
java.io.UnsupportedEncodingException
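The declared UnsupportedEncodingException suggests the conversion goes through String.getBytes(String). A minimal sketch under that assumption follows; GetBytesSketch and toBytes are hypothetical names, and the character set is passed in explicitly instead of being read from the document's charSet property.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of org.archive.wayback.
final class GetBytesSketch {

    /** Encodes the buffer's characters using the named character set. */
    static byte[] toBytes(StringBuilder sb, String charSet)
            throws UnsupportedEncodingException {
        return sb.toString().getBytes(charSet);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder("caf\u00e9");
        // The same text occupies a different number of bytes per encoding.
        System.out.println(toBytes(sb, "UTF-8").length);
        System.out.println(toBytes(sb, "ISO-8859-1").length);
    }
}
```

Because the byte length depends on the encoding, any Content-Length sent to the client would need to be computed from the encoded bytes, not from the StringBuilder's character count.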

writeToOutputStream

public void writeToOutputStream(java.io.OutputStream os)
                         throws java.io.IOException
Write the contents of the page to the client.

Parameters:
os -
Throws:
java.io.IOException

insertAtStartOfHead

public void insertAtStartOfHead(java.lang.String toInsert)
Parameters:
toInsert -

insertAtEndOfBody

public void insertAtEndOfBody(java.lang.String toInsert)
Parameters:
toInsert -

insertAtStartOfBody

public void insertAtStartOfBody(java.lang.String toInsert)
Parameters:
toInsert -
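The three insert methods above are undocumented here, so the following is only a guess at the general technique: locate the relevant tag in the StringBuilder and insert at the matching offset. InsertSketch is a hypothetical class, and its behaviour when a tag is missing (insert at the start, or append at the end) is an assumption, not documented behaviour of TextDocument.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of org.archive.wayback.
final class InsertSketch {

    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern BODY_OPEN =
            Pattern.compile("<body(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern BODY_CLOSE =
            Pattern.compile("</body\\s*>", Pattern.CASE_INSENSITIVE);

    static void insertAtStartOfHead(StringBuilder sb, String toInsert) {
        insertAfterFirst(sb, HEAD_OPEN, toInsert);
    }

    static void insertAtStartOfBody(StringBuilder sb, String toInsert) {
        insertAfterFirst(sb, BODY_OPEN, toInsert);
    }

    static void insertAtEndOfBody(StringBuilder sb, String toInsert) {
        Matcher m = BODY_CLOSE.matcher(sb);
        int offset = sb.length(); // assumed fallback: append when </body> is missing
        while (m.find()) {
            offset = m.start();   // keep the last closing tag
        }
        sb.insert(offset, toInsert);
    }

    private static void insertAfterFirst(StringBuilder sb, Pattern tag, String toInsert) {
        Matcher m = tag.matcher(sb);
        // Assumed fallback: insert at the very start when the tag is missing.
        int offset = m.find() ? m.end() : 0;
        sb.insert(offset, toInsert);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder(
                "<html><head><title>t</title></head><body><p>hi</p></body></html>");
        insertAtStartOfHead(sb, "<!--H-->");
        insertAtStartOfBody(sb, "<!--S-->");
        insertAtEndOfBody(sb, "<!--E-->");
        System.out.println(sb);
    }
}
```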

includeJspString

public java.lang.String includeJspString(java.lang.String jspPath,
                                         javax.servlet.http.HttpServletRequest httpRequest,
                                         javax.servlet.http.HttpServletResponse httpResponse,
                                         WaybackRequest wbRequest,
                                         CaptureSearchResults results,
                                         CaptureSearchResult result,
                                         Resource resource)
                                  throws javax.servlet.ServletException,
                                         java.io.IOException
Parameters:
jspPath -
httpRequest -
httpResponse -
wbRequest -
results -
result -
resource -
Returns:
Throws:
java.io.IOException
javax.servlet.ServletException

getJSIncludeString

public java.lang.String getJSIncludeString(java.lang.String jsUrl)
Parameters:
jsUrl -
Returns:

getCharSet

public java.lang.String getCharSet()
Returns:
the charSet

setCharSet

public void setCharSet(java.lang.String charSet)
Parameters:
charSet - the charSet to set


Copyright © 2005-2009 Internet Archive. All Rights Reserved.