Abstract RequestHandler implementation which performs the minimal behavior
for self registration with a RequestMapper, requiring subclasses to implement
only handleRequest().
Retains all information about a particular Wayback configuration
within a ServletContext, including holding references to the
implementation instances of the primary Wayback classes:
RequestParser
ResourceIndex (via WaybackCollection)
ResourceStore (via WaybackCollection)
QueryRenderer
ReplayDispatcher
ExceptionRenderer
ResultURIConverter
ServletRequestContext which proxies to an ARCRecordingProxy, and unwraps
the "application/x-arc-record" MIME response into the inner HTTP response,
sending all HTTP headers AS-IS, and the HTTP Entity.
Abstract implementation of the RequestParser interface, which provides some
convenience methods for accessing data in Maps, and also
allows for configuring maxRecords, and earliest and latest timestamp strings.
Result: this key being present indicates that this particular capture
was not actually stored, and that other values within this SearchResult
are actually values from a different record which *should* be identical
to this capture, had it been stored.
flag indicates that this document was NOT downloaded, but that the
origin server indicated that the document had not changed, based on
If-Modified-Since HTTP request headers.
Result: this key is present when the CAPTURE_DUPLICATE_ANNOTATION is also
present, with the value indicating the last date that was actually
stored for this duplicate.
Mapper which reads an identity Funky format CDX line, outputting:
key - canonicalized original URL + timestamp
val - everything else
input lines are a hybrid format:
ORIG_URL
DATE
'-' (literal)
MIME
HTTP_CODE
SHA1
REDIRECT
START_OFFSET
ARC_PREFIX (sans .arc.gz)
ROBOT_FLAG (combo of AIF - no: Archive,Index,Follow, or '-' if none)
Ex:
http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A
Need to:
.
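As a rough illustration of the hybrid input line layout listed above, the sketch below splits one line into its ten whitespace-separated fields. The class name HybridCdxLine and its field names are hypothetical, not part of the Wayback API, and URL canonicalization for the key is deliberately left out because the rules are not described here.

```java
/** Hypothetical holder for the ten fields of one hybrid-format input line. */
final class HybridCdxLine {
    final String origUrl;
    final String date;
    final String mime;
    final String httpCode;
    final String sha1;
    final String redirect;
    final String startOffset;
    final String arcPrefix;
    final String robotFlag;

    private HybridCdxLine(String[] f) {
        origUrl = f[0];
        date = f[1];
        // f[2] is the literal '-' placeholder and carries no data
        mime = f[3];
        httpCode = f[4];
        sha1 = f[5];
        redirect = f[6];
        startOffset = f[7];
        arcPrefix = f[8];
        robotFlag = f[9];
    }

    /** Splits a line on whitespace and checks the field count and the literal '-'. */
    static HybridCdxLine parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 10) {
            throw new IllegalArgumentException("expected 10 fields, got " + f.length);
        }
        if (!"-".equals(f[2])) {
            throw new IllegalArgumentException("third field must be a literal '-'");
        }
        return new HybridCdxLine(f);
    }
}
```

Parsing the example line above with HybridCdxLine.parse(...) would yield "text/html" as the MIME field and "foo" as the ARC prefix.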
Abstract class containing common methods for determining the character
encoding of a text Resource, most of which should be refactored into a
Util package.
Classic ReplayRenderer which uses a combination of server-side modification
and embedded javascript to rewrite URLs within an HTML page to make embedded
URLs point back to a specific ArchivalURL AccessPoint.
SearchResultFilter that abstracts multiple SearchResultFilters -- if all
filters return INCLUDE, then the result is included, but the first to
return ABORT or EXCLUDE short-circuits the rest.
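The short-circuit behavior described above can be sketched as follows. This is a minimal illustration only: FilterDecision, ResultFilter and CompositeFilter are hypothetical stand-ins, not the actual Wayback types or signatures.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decision values mirroring INCLUDE / EXCLUDE / ABORT. */
enum FilterDecision { INCLUDE, EXCLUDE, ABORT }

/** Hypothetical single-method filter interface. */
interface ResultFilter<T> {
    FilterDecision filter(T result);
}

/** Runs each filter in order; the first non-INCLUDE decision wins. */
final class CompositeFilter<T> implements ResultFilter<T> {
    private final List<ResultFilter<T>> filters;

    CompositeFilter(List<ResultFilter<T>> filters) {
        this.filters = new ArrayList<>(filters);
    }

    @Override
    public FilterDecision filter(T result) {
        for (ResultFilter<T> f : filters) {
            FilterDecision d = f.filter(result);
            if (d != FilterDecision.INCLUDE) {
                return d; // EXCLUDE or ABORT: later filters are not consulted
            }
        }
        return FilterDecision.INCLUDE;
    }
}
```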
Class that provides SearchResult Filtering based on multiple
ExclusionFilterFactory instances by returning a single composite
SearchResultFilter based on the results of each ExclusionFilter.
The Lexer that comes with htmlparser does not handle non-escaped HTML
entities within SCRIPT tags - by default, something like:
Can cause the lexer to skip over a large part of the document.
A CompositeSearchResultSource that automatically manages its list of sources
based on 3 configuration files, and a background thread:
Config 1: Mapping of ranges to hosts responsible for that range
this class is aware of the local host name, so uses this file
to determine which range(s) should be local
Config 2: Mapping of ranges to one or more MD5s that compose that range
when all of these MD5s have been copied local, this index
becomes active, and each request uses a composite of these
local files
Config 3: Mapping of MD5s to locations from which they can be retrieved
when a file that should be local is missing, these locations
will be used to retrieve a copy of that file
Background Thread: compares the current set of files to the various
configuration files, fetches any files that need to be local, and
updates the composite set searched when the correct set of
MD5s has been localized.
Lean and mean ParseEventHandler implementing current best-known server-side
HTML rewrite rules; it should be much faster than the fully configurable
version.
ServletRequestContext interface which uses a ResourceFileLocationDB to
reverse proxy an incoming HTTP request for a file by name to its actual
back-end location.
Subclass of File, which allows binary searching, returning Iterators
that allow scanning forwards and backwards through the (sorted) file starting
from a particular prefix.
FlatFile() -
Constructor for class org.archive.wayback.util.flatfile.FlatFile
Store this UIResults object in the given HttpServletRequest, then
forward the request to target, in this case, an image, html file, .jsp,
any file which can return a complete document.
Deprecated. Determine the correct ResultsPartitioner to use given the SearchResults
search range, and use that to break the SearchResults into partitions.
Simple worker, which gets tasks from an IndexQueue, in this case, the name
of ARC/WARC files to be indexed, retrieves the ARC/WARC location from a
ResourceFileLocationDB, creates the index, which is serialized into a file,
and then hands that file off to a ResourceIndex for merging, using an
IndexClient.
Tests if the String argument looks like it could be a legitimate
authority fragment of a URL, that is, is it an IP address, or, are the
characters legal in an authority, and does the string end with a legal
TLD.
RecordReader which reads pointers to actual files from an internal
LineRecordReader, producing a LineRecordReader for the files pointed to by
the actual input.
Class which starts a background thread that repeatedly scans an incoming
directory and merges files found therein (which are assumed to be in CDX
format) with a BDBIndex.
Alter the HTML document in page, updating URLs in the attrName attributes
of all tagName tags such that:
1) absolute URLs are prefixed with: wmPrefix + pageTS
2) server-relative URLs are prefixed with: wmPrefix + pageTS + (host of page)
3) path-relative URLs are prefixed with: wmPrefix + pageTS + (attribute URL
resolved against pageUrl)
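A minimal sketch of the three prefixing rules above, assuming that all three cases can be handled by resolving the attribute URL against pageUrl with java.net.URI and that a "/" separates pageTS from the resolved URL. The class and method names are hypothetical, and the real method operates on a whole HTML document rather than a single attribute value.

```java
import java.net.URI;

/** Hypothetical helper applying the prefixing rules to one attribute value. */
final class UrlRewriteSketch {

    /**
     * Resolves attrUrl against pageUrl, then prefixes the result.
     * Absolute URLs resolve to themselves, server-relative URLs pick up the
     * page's scheme and host, and path-relative URLs pick up the page's path.
     * Note: URI.create rejects URLs containing illegal characters (e.g. spaces).
     */
    static String rewrite(String wmPrefix, String pageTS, String pageUrl, String attrUrl) {
        URI resolved = URI.create(pageUrl).resolve(attrUrl);
        // The "/" separator is an assumption about the replay URL layout.
        return wmPrefix + pageTS + "/" + resolved;
    }
}
```

For instance, with pageUrl "http://www.example.com/a/b.html", the path-relative value "c.png" resolves to "http://www.example.com/a/c.png" before the prefix is applied.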
RequestParser which attempts to extract data from an HTML form, that is, from
HTTP GET request arguments containing a query, an optional count (results
per page), and an optional current page argument.
Common interface to decouple application-specific handlers from the
ParseEventDelegator object: Any object interested in registering for specific
low-level events can implement this interface, and can be added to the
ParseEventDelegator parserVisitors list, and it will be given an opportunity
to register with the ParseEventDelegator for specific events it is
interested in.
Class which allows matching based on:
a) one of several strings, any of which being found in the path causes a match
b) one of several strings, any of which being found in the query causes a match
c) one of several strings, *ALL* of which being found in the url causes a match
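The three criteria above might be combined as in the sketch below. Two assumptions are made that the description does not confirm: a URL matches if any one of the three criteria is satisfied, and an empty list means that criterion is not configured. UrlSubstringMatcher is a hypothetical name.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical matcher combining path-any, query-any and url-all criteria. */
final class UrlSubstringMatcher {
    private final List<String> pathAny;
    private final List<String> queryAny;
    private final List<String> urlAll;

    UrlSubstringMatcher(List<String> pathAny, List<String> queryAny, List<String> urlAll) {
        this.pathAny = new ArrayList<>(pathAny);
        this.queryAny = new ArrayList<>(queryAny);
        this.urlAll = new ArrayList<>(urlAll);
    }

    boolean matches(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String query = uri.getRawQuery() == null ? "" : uri.getRawQuery();
        return containsAny(path, pathAny)
                || containsAny(query, queryAny)
                // an empty "all" list is treated as unconfigured, not as a match
                || (!urlAll.isEmpty() && containsAll(url, urlAll));
    }

    private static boolean containsAny(String haystack, List<String> needles) {
        for (String n : needles) {
            if (haystack.contains(n)) return true;
        }
        return false;
    }

    private static boolean containsAll(String haystack, List<String> needles) {
        for (String n : needles) {
            if (!haystack.contains(n)) return false;
        }
        return true;
    }
}
```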
Brutally simple, barely functional class to allow simple recording of
millisecond level timing within a particular request, enabling rough logging
of the time spent in various parts of the handling of a WaybackRequest.
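One way such rough per-request timing could look is sketched below. RequestTimer and its methods are hypothetical and are not the class described above; the sketch only shows the idea of recording the milliseconds elapsed between named checkpoints.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-request timer recording elapsed milliseconds per phase. */
final class RequestTimer {
    private final Map<String, Long> elapsedMs = new LinkedHashMap<>();
    private long last = System.nanoTime();

    /** Records the time since the previous mark (or construction) under the given name. */
    void mark(String phase) {
        long now = System.nanoTime();
        elapsedMs.put(phase, (now - last) / 1_000_000L);
        last = now;
    }

    /** Returns the recorded milliseconds for a phase, or -1 if it was never marked. */
    long elapsed(String phase) {
        Long v = elapsedMs.get(phase);
        return v == null ? -1L : v;
    }

    /** Formats all phases in insertion order, e.g. for a debug log line. */
    String summary() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : elapsedMs.entrySet()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(e.getKey()).append('=').append(e.getValue()).append("ms");
        }
        return sb.toString();
    }
}
```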
Read the single Spring XML configuration file located at the specified
path, performing PropertyPlaceHolder interpolation, extracting all beans
which implement the RequestHandler interface, and construct a
RequestMapper for those RequestHandlers, on the specified ServletContext.
Containing object for data associated with one region (month/year/etc) in the
graph, including the:
label
highlighted value index
int array of values to graph within this region
the global max int value across all values in the overall graph
Called at webapp context initialization, to allow the RequestHandler to
register itself with the RequestMapper, which will delegate request
handling to the appropriate RequestHandler.
Render the contents of a WaybackException in either html, javascript, or
css format, depending on the guessed context, so errors in embedded
documents do not cause unneeded errors in the embedding document.
This class maintains a mapping of RequestHandlers and ShutDownListeners, to
allow (somewhat) efficient mapping and delegation of incoming requests to
the appropriate RequestHandler.
Class which repeatedly builds a ResourceFileList for a set of
ResourceFileSource objects, serializing them into files, and dropping them
into the incoming directory of a ResourceFileLocationDBUpdater.
CaptureSearchResult Filter that uses a LiveWebCache to retrieve robots.txt
documents from the live web, and filters SearchResults based on the rules
therein.
Class which parses a robots.txt file, storing the rules contained therein,
and then allows for testing if path/userAgent tuples are blocked by those
rules.
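A heavily simplified sketch of parsing robots.txt rules and testing path/userAgent pairs follows. It assumes only User-agent and Disallow lines matter, that rules match by path prefix, and that user agents match exactly with "*" as the fallback; the real RobotRules class may behave differently. SimpleRobotRules and isBlocked are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical, simplified robots.txt rule store (User-agent / Disallow only). */
final class SimpleRobotRules {
    private final Map<String, List<String>> disallows = new HashMap<>();

    void parse(String robotsTxt) {
        List<String> currentAgents = new ArrayList<>();
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // consecutive User-agent lines share one group of rules
                if (!lastWasAgent) currentAgents.clear();
                String ua = value.toLowerCase(Locale.ROOT);
                currentAgents.add(ua);
                disallows.computeIfAbsent(ua, k -> new ArrayList<>());
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // an empty Disallow value blocks nothing, so it is skipped
                if (field.equals("disallow") && !value.isEmpty()) {
                    for (String ua : currentAgents) disallows.get(ua).add(value);
                }
            }
        }
    }

    /** True if the path starts with a Disallow prefix for this agent (or for "*"). */
    boolean isBlocked(String path, String userAgent) {
        List<String> rules = disallows.get(userAgent.toLowerCase(Locale.ROOT));
        if (rules == null) rules = disallows.get("*");
        if (rules == null) return false;
        for (String prefix : rules) {
            if (path.startsWith(prefix)) return true;
        }
        return false;
    }
}
```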
RobotRules() -
Constructor for class org.archive.wayback.accesscontrol.robotstxt.RobotRules
ReplayDispatcher instance which uses a configurable ClosestResultSelector
to find the best result to show from a given set, and a list of
ReplayRendererSelector to determine how best to replay that result to a user.
Called before registerPortListener(), to enable the registration process
and subsequent handleRequest() calls to access the ServletContext, via
the getServletContext() method.
Single static method to read a Spring XML configuration, extract
RequestHandlers, and return a RequestMapper which delegates requests to
those RequestHandlers.
A class which assists in UI generation, primarily through Locale-aware
String formatting, and also helps in escaping (hopefully properly) Strings
for use in HTML.
Class which wraps functionality for converting a Resource (InputStream +
HTTP headers) into a StringBuilder, performing several common URL
resolution methods against that StringBuilder, inserting arbitrary Strings
into the page, and then converting the page back to a byte array.
Sad but needed subclass of the ArchiveReaderFactory, allows config of
timeouts for connect and reads on underlying HTTP connections, and overrides
the one getArchiveReader(URL,long) method to enable setting the timeouts.
ReplayRenderer implementation which returns the archive document as
pristinely as possible -- no modifications to response code, HTTP headers,
or original byte-stream.
Simple class which acts as the go-between between Java request handling code
and .jsp files which actually draw various forms of results for end user
consumption.
Constructor for "Replay" UIResults, where the request
successfully matched something from the index, the document was retrieved
from the ResourceStore, and is going to be shown to the user.
Takes an input URL String argument, downloads, stores in an ARCWriter,
and returns a FileRegion consisting of the compressed ARCRecord containing
the response, or a forged, "fake error response" ARCRecord which can be
used to send the content to an OutputStream.
Filter class that observes a stream of SearchResults, tracking for each
complete record a mapping of that record's Digest to:
Arc/Warc Filename
Arc/Warc offset
HTTP Response
MIME-Type
Redirect URL
If subsequent SearchResults are missing these fields ("-") and the Digest
field is in the map, then the SearchResult's missing fields are replaced with
the values from the previously seen record with the same digest, and an
additional annotation field is added.
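The digest-based fill-in described above can be sketched with plain Maps standing in for SearchResults. Every name here is hypothetical (the field keys, the "duplicate" annotation key, and the rule that a record counts as complete when its file field is not "-"); the real filter works on Wayback's own result types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: fills missing fields from an earlier record with the same digest. */
final class DigestDedupSketch {
    // Hypothetical keys for the five fields tracked per digest.
    static final String[] COPIED_FIELDS = {"file", "offset", "httpCode", "mime", "redirect"};
    static final String DUPLICATE_ANNOTATION = "duplicate";

    private final Map<String, Map<String, String>> seen = new HashMap<>();

    private static boolean isMissing(String v) {
        return v == null || v.equals("-");
    }

    /** Observes one result, modifying it in place if it is an unstored duplicate. */
    void observe(Map<String, String> result) {
        String digest = result.get("digest");
        if (digest == null) return;
        if (!isMissing(result.get("file"))) {
            // complete record: remember a copy of its values for later duplicates
            seen.put(digest, new HashMap<>(result));
            return;
        }
        Map<String, String> original = seen.get(digest);
        if (original == null) return; // nothing known for this digest yet
        for (String field : COPIED_FIELDS) {
            if (isMissing(result.get(field))) {
                result.put(field, original.get(field));
            }
        }
        result.put(DUPLICATE_ANNOTATION, "true");
    }
}
```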
Abstract subclass of BaseRequestParser, which allows retrieving
configured maxRecords, and earliest and latest timestamp config from a
delegate instance.
Produce a debug message to this class's logger, computing the time
taken to query the index, retrieve the resource (if a replay request)
and render the results to the client.
A set of Ziplines files, which are CDX files specially compressed into a
series of GZipMembers such that:
1) each member is exactly 128K, padded using a GZip comment header
2) each member contains complete lines: no line spans two GZip members
If the data put into these files is sorted, then the data within the files
can be uncompressed when needed, minimizing the total data to be uncompressed.
This SearchResultSource assumes a set of alphabetically partitioned Ziplined
CDX files, so that each file is sorted, and no regions overlap.