A ResourceIndex locates documents within a WaybackCollection through a single method:
public SearchResults query(final WaybackRequest request) throws ResourceIndexNotAvailableException, ResourceNotInArchiveException, BadQueryException, AccessControlException;
The ResourceIndex is responsible for deciding which SearchResults subclass, CaptureSearchResults or UrlSearchResults, is appropriate for the WaybackRequest argument, and for populating the returned SearchResults object with matching records.
When the request indicates that the user wishes to find specific captures of a single URL, a CaptureSearchResults object should be returned. When the request may return results for multiple URLs, for example a query attempting to locate all URLs beginning with a given prefix within the WaybackCollection, a UrlSearchResults object should be returned.
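The choice between the two result types can be illustrated with a minimal sketch. Every type and name below (QueryKind, SketchResults, SketchCaptureResults, SketchUrlResults, emptyResultsFor) is a hypothetical stand-in invented for this illustration and is not part of the Wayback API; a real ResourceIndex would inspect the WaybackRequest itself.

```java
final class ResultTypeSketch {
    /** Hypothetical summary of what a request is asking for. */
    enum QueryKind { CAPTURES_OF_ONE_URL, URLS_WITH_PREFIX }

    /** Hypothetical stand-ins for SearchResults and its two subclasses. */
    static class SketchResults { }
    static final class SketchCaptureResults extends SketchResults { }
    static final class SketchUrlResults extends SketchResults { }

    /** Picks the result container that matches the kind of query. */
    static SketchResults emptyResultsFor(QueryKind kind) {
        switch (kind) {
            case CAPTURES_OF_ONE_URL:
                // Captures of a single URL: one entry per capture.
                return new SketchCaptureResults();
            case URLS_WITH_PREFIX:
                // Possibly many URLs: one entry per matching URL.
                return new SketchUrlResults();
            default:
                throw new IllegalArgumentException("unknown query kind: " + kind);
        }
    }
}
```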
The LocalResourceIndex implementation assumes a local database of all documents within the WaybackCollection. The type of database is specified with the source property.
The following configuration is required for a LocalResourceIndex:
The following configurations are optional for LocalResourceIndexes:
For specific Spring configuration examples of these ResourceIndex options, please refer to the following files distributed within the wayback .war file:
The RemoteResourceIndex implementation asks an external Wayback installation to satisfy index requests. It can be useful for distributed installations, as well as for experimenting with new Wayback configurations and installations using an existing ResourceIndex. For example, a development system can be configured to use a production index remotely, minimizing the setup required to test new configurations.
The actual index must be stored on another Wayback installation, and is requested as XML through this implementation.
The following configuration is required for a RemoteResourceIndex:
The following configurations are optional for RemoteResourceIndexes:
For a Spring configuration example of this ResourceIndex option, please refer to the following files distributed within the wayback .war file:
Sometimes URLs found in the field can have multiple forms, for example:
http://www.example.com/img/foo.gif
http://www.example.com/docs/../img/foo.gif
are both valid representations of the exact same URL. Another, less certain example would be:
http://www.example.com/Interview.html
http://www.example.com/interview.html
which differ only in the capitalization of the letter "i". On some operating systems, these two URLs legitimately specify two distinct documents. On Windows platforms, they refer to the same document. If the document on a Windows web server is actually named "Interview.html", but a web designer creates a web page that refers to this document using the lowercase "interview.html", then the link will work, and neither the designer nor the site's visitors may ever notice the difference. The same link on a different operating system would probably fail (although some web server plugins and modules correct this problem transparently), and the web designer would probably notice and fix it. In practice, we have found that it is very rare for the two URLs above with different capitalization to refer to different documents, and they can be treated as equivalent in most situations.
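Both kinds of equivalence can be demonstrated with the standard java.net.URI class. This is only an illustrative sketch and does not claim to reproduce what Wayback itself does; the class and method names (UrlFormSketch, collapseDotSegments, foldCase) are made up for this example, and folding the whole URL to lower case rests on the assumption stated above that case differences rarely matter.

```java
import java.net.URI;
import java.util.Locale;

final class UrlFormSketch {
    /** Resolves "." and ".." path segments using java.net.URI.normalize(). */
    static String collapseDotSegments(String url) {
        return URI.create(url).normalize().toString();
    }

    /** Folds the URL to lower case so that case variants compare equal. */
    static String foldCase(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Prints http://www.example.com/img/foo.gif
        System.out.println(collapseDotSegments("http://www.example.com/docs/../img/foo.gif"));
        // Prints http://www.example.com/interview.html
        System.out.println(foldCase("http://www.example.com/Interview.html"));
    }
}
```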
Another example, which occurs far more often in the real world, involves web servers injecting a session ID into the paths of documents hosted on that web server. These session IDs allow the web server to track each user's state. Here are some example URLs demonstrating path session ID injection:
http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx
http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page2.aspx
http://www.example.com/(S(a63098d96360a63098d96360))/page3.aspx
In these examples, the first two URLs use one session ID, and the third uses a different session ID. If page3.aspx refers to page1.aspx using an anchor like this:
<a href="page1.aspx">page1</a>
and a user visiting page3.aspx clicks the link to page1, then the wayback will receive a request for the URL:
http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
If page1.aspx was captured using the other session ID, then the wayback will be unable to locate this document in the index, even though it was captured.
This session ID problem can be mitigated by canonicalizing the URLs as they are placed in the index, so that the index contains the following URLs instead of the original forms that the crawler captured:
http://www.example.com/page1.aspx
http://www.example.com/page2.aspx
http://www.example.com/page3.aspx
If the same canonicalization scheme is used to transform incoming requests before attempting to look up URLs in the index, then the software is able to locate and return the documents correctly.
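The following sketch shows the idea of applying one canonicalization function both when a capture is indexed and when a request is looked up. It is an illustration under stated assumptions, not the Wayback implementation: the class name, the in-memory map standing in for the index, the location strings, and the regular expression (which only recognizes the "(S(...))" path form shown above) are all invented for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class SessionIdCanonicalizerSketch {
    // Assumed pattern: a path segment of the form "(S(<letters and digits>))".
    private static final Pattern PATH_SESSION_ID =
            Pattern.compile("/\\(S\\([0-9A-Za-z]+\\)\\)(?=/)");

    /** Removes the session ID path segment, if one is present. */
    static String canonicalize(String url) {
        return PATH_SESSION_ID.matcher(url).replaceAll("");
    }

    // Toy index: canonical URL -> location of the capture (hypothetical format).
    private final Map<String, String> index = new HashMap<>();

    /** Index time: store the capture under its canonical URL. */
    void addCapture(String capturedUrl, String location) {
        index.put(canonicalize(capturedUrl), location);
    }

    /** Query time: canonicalize the request the same way before the lookup. */
    String lookup(String requestedUrl) {
        return index.get(canonicalize(requestedUrl));
    }
}
```

With this sketch, a capture added as http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx is found by a lookup for http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx, because both reduce to the same key.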
The Wayback currently includes only a single reference implementation of a canonicalization scheme, called AggressiveUrlCanonicalizer. This implementation provides the following canonicalization:
At the IA, we have recently switched to building CDX files using the -identity option on the arc-indexer and warc-indexer tools. The -identity option requires passing records through the url-client tool before sorting and merging into production CDX files. By keeping the original "identity" CDX files, we have been able to test various URL canonicalization strategies without the overhead of re-processing all the ARC/WARC source materials.
In upcoming wayback releases, we intend to provide more canonicalization implementations, including a configurable implementation that will allow broad customization capabilities.
We also intend to alter the format of wayback indexes significantly. Using this new format will be optional, but once indexes are created in the new format, other indexes with different canonicalization strategies can be built from them without requiring a complete reindex of the original ARC/WARC content.
The new format will also allow a degree of dynamic canonicalization at run-time, meaning different strategies can be tested using the same indexes, and site-specific canonicalization strategies may be possible.
We anticipate that allowing (advanced) users to easily change between canonicalization strategies within the same wayback session will promote better community understanding of the impacts of different strategies, and will enable the community to build a set of best practices for URL canonicalization.
Heritrix 1.12 and above can write WARC files that omit documents which have not changed since a previous visit. For specifics on activating these features, please refer to the Heritrix documentation. When Heritrix is using these features and notices that a document has not changed since the last time it was visited, it creates an abbreviated WARC record indicating that the document was retrieved but not stored. This abbreviated WARC record includes the SHA1 digest of the document.
The wayback uses these identical SHA1 digests to map the location (ARC/WARC + offset) of the original record that was stored to subsequent records that were not. When a request for a subsequent capture that was not stored is received by wayback, it will return the content of the previous stored record.
The matching of these digests occurs at query time, and is configured by setting the "dedupeRecords" option of the LocalResourceIndex to "true".
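The query-time matching can be pictured with the following sketch. It is not the LocalResourceIndex code; the Capture class, its fields, and the resolve method are hypothetical, and the sketch assumes the records for one URL arrive in capture-date order, with a null file marking a record whose content was not stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DigestDedupeSketch {
    /** Hypothetical stand-in for one index record; not a Wayback class. */
    static final class Capture {
        final String timestamp;
        final String digest;   // SHA1 digest of the document
        final String file;     // ARC/WARC file name, or null if content was not stored
        final long offset;     // offset of the record within the file

        Capture(String timestamp, String digest, String file, long offset) {
            this.timestamp = timestamp;
            this.digest = digest;
            this.file = file;
            this.offset = offset;
        }
    }

    /** Gives each unstored capture the location of the previous stored capture with the same digest. */
    static List<Capture> resolve(List<Capture> capturesInDateOrder) {
        Map<String, Capture> lastStoredByDigest = new HashMap<>();
        List<Capture> resolved = new ArrayList<>();
        for (Capture c : capturesInDateOrder) {
            if (c.file != null) {
                // Stored record: remember where this digest's content lives.
                lastStoredByDigest.put(c.digest, c);
                resolved.add(c);
            } else {
                Capture original = lastStoredByDigest.get(c.digest);
                // Keep the later timestamp, but point at the earlier stored content.
                resolved.add(original == null
                        ? c
                        : new Capture(c.timestamp, c.digest, original.file, original.offset));
            }
        }
        return resolved;
    }
}
```

An unstored capture with no earlier stored record of the same digest is passed through unchanged in this sketch; how a real index treats that case is not covered here.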