Releases
A full listing of changes and bug fixes is not currently available for
releases prior to 1.2.0.
Release 1.4.2
Features
-
Added exactSchemeOnly configuration to AccessPoint, allowing
explicit distinction between http:// and https://. (ACC-32)
-
Now times out requests to slow or non-responsive RemoteResourceIndex
and remote (HTTP 1.1) ResourceStore nodes. (ACC-38)
-
Experimental OpenSearchQuery .jsp implementations. (ACC-56)
-
FileProxyServlet now accepts an /OFFSET trailing path in addition to
the Content-Range HTTP header. (ACC-74)
-
warc-indexer now has an -all option to produce a CDX line for ALL
records, not just captures and revisits. (ACC-75)
-
Now includes file+offset for all records, keying off the mime-type
warc/revisit to determine revisits at query time. (ACC-76)
-
Allow prefixing of original HTTP headers with a fixed string.
(ACC-77)
-
Wayback now rewrites Content-Base HTTP headers. (ACC-78)
-
Timeline.jsp improvements which prevent Timeline from being severely
distorted on some pages.
-
Improvement to ArchivalUrl client-rewrite.js to preserve link text,
working around a bug in Internet Explorer.
Bug Fixes
-
All mime-types are now escaped to prevent spaces from getting into
the CDX files. (ACC-45)
-
Some CSS URLs were being rewritten twice. (ACC-53)
-
No longer writes the original page's Content-Length HTTP header to
the output. Original pages with a lower-case "l" in
"Content-length" returned the wrong length, truncating replayed
documents. As a result, some replayed pages lacked embedded
disclaimers and JavaScript rewriting of links and images.
(ACC-60)
-
Fixed severe problem with live web robots.txt retrieval where the
wrong offset was being written into the live web ResourceIndex.
(ACC-62)
-
Charset extraction from HTTP headers is now case-insensitive.
(ACC-63)
-
No longer adds content to HTML pages with FrameSet tags, as doing so
was breaking them. (ACC-65)
-
No longer sets GMT as the default timezone for the entire JVM. (ACC-70)
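To illustrate the case-insensitive charset extraction noted above (ACC-63), here is a minimal sketch. It is not Wayback's actual code: the class name CharsetSniffer, the method extractCharset, and the regular expression are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Wayback code base.
final class CharsetSniffer {
    // Matches "charset=VALUE" in any letter case, with optional quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, or null if the header declares none. */
    static String extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractCharset("text/html; CHARSET=ISO-8859-1"));
        System.out.println(extractCharset("text/html; charset=\"utf-8\""));
    }
}
```

The key point is the CASE_INSENSITIVE flag, so that "charset", "Charset", and "CHARSET" are all recognized.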
Release 1.4.1
Features
-
Index filter which allows including/excluding records based on the
HTTP response code field. (ACC-43)
-
Outputs log message instead of stack dump when failing to access
a Resource.
Bug Fixes
-
Some redirect records were not being located in the index due to bad
logic in the Duplicate record filter. (ACC-30)
-
Wayback was not throwing a NotInArchiveException when the
Self-Redirect replay filter removed all records. (unreported)
-
Location HTTP header values were not being escaped before being
placed in the CDX, causing some records to have too many columns.
(ACC-31)
-
Search Result summary counts were incorrect in Url Prefix
searches. (ACC-33)
-
Implemented NoCache.jsp, a replay insert which adds a
Cache-Control: no-cache HTTP header to all replayed
documents. (ACC-34)
-
Timeline.jsp was using the Request date, not the Capture date, which
caused the Proxy Mode Timeline to show the wrong date.
(ACC-36)
-
Advanced Search reference implementation .jsp was broken.
(ACC-37)
-
AnchorDate and AnchorWindow functionality is now disabled by
default, and can be enabled via configuration on an AccessPoint.
(ACC-46)
Release 1.4.0
Features
- @
Completely new implementation of ResourceStore classes,
including recursive local directory scanning, scanning multiple
local directories, an experimental remote directory scanning
capability, and groundwork for future support of both non ARC/WARC
file formats and large scale automatic indexing.
- @
Complete overhaul of the Replay system, allowing
jspInserts within ArchivalUrl, DomainPrefix, and Proxy replay
modes. Also includes groundwork for future fine-grained mime-type
and url-based Replay customizations.
-
Added capability to explicitly set Locale to use for an
AccessPoint, overriding the default behavior of using the user
agent's specified preferred language.
-
New flat file implementation of FileLocationDB. See
CDXCollection.xml within the .war file for an example usage.
-
AnchorDate feature, which tracks the date at which a user begins a
replay session. During this session, Wayback will always attempt to
remain near this date, preventing time-drift within a replay
session.
-
AnchorWindow feature, which allows users to specify a maximum time
window in either direction of the AnchorDate that they wish to view
replayed content. When a user has set this option, Wayback will not
display captures outside the specified window.
-
New command line tool location-db to create a location DB
offline, populating it with lines read from STDIN.
-
Added new AccessControlSettingOperation authentication control
component, allowing the configuration of the appropriate Exclusion
system per-request, as defined by arbitrary BooleanOperators. See
ComplexAccessPoint.xml within the .war file for an example usage.
-
Added .asx archival URL replay, which rewrites links inside
archived .asx files, attempting to make them point back into the
Wayback service.
-
Now accept "http:/" as identical to "http://" in the beginning of
a URL, working around a browser bug which stripped multiple "/"s in
URL paths.
- @
Refactoring of ResourceIndex interfaces, to allow for
future update-able ResourceIndex implementations beyond BDBIndex
based ResourceIndexes.
- *
Major internal refactoring of WaybackRequest object,
providing more stable get/set methods for accessing the standard
internal fields with type-safety.
- *
Major internal refactoring of SearchResults into
CaptureSearchResults and UrlSearchResults, which was previously
under-specified and often confusing. These new classes provide more
stable get/set methods for accessing the standard internal fields
with type-safety.
- *
Changed locations of replay, query, and exception .jsp
files within .war file to underneath WEB-INF, so they are not
directly accessible via HTTP.
-
German translation of default Wayback UI. Thanks Andreas!
-
Czech translation of default Wayback UI. Thanks Lukáš Matějka!
(ACC-29)
-
All threads now notified of shut downs, allowing resources to be
released cleanly.
- *
Refactor of all Request and Result related constants from
WaybackConstants to WaybackRequest and the *SearchResult(s)
classes.
- *
Refactor of the various UI*Results classes, which are used
by Query, Replay, and Exception .jsp files to access context
information into the single class, UIResults, which has a more
stable interface.
-
New AccessPoint.urlRoot optional configuration, enabling explicit
control over URLs generated for the UI.
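The "http:/" workaround listed above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class SchemeSlashFixer and the method fixSingleSlash are hypothetical names, and the real implementation may handle more cases.

```java
// Hypothetical helper, not part of the Wayback code base.
final class SchemeSlashFixer {
    private static final String BROKEN = "http:/";
    private static final String FIXED = "http://";

    /** Restores the second slash when a URL starts with "http:/" but not "http://". */
    static String fixSingleSlash(String url) {
        if (url.startsWith(BROKEN) && !url.startsWith(FIXED)) {
            return FIXED + url.substring(BROKEN.length());
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(fixSingleSlash("http:/example.org/page.html"));
        System.out.println(fixSingleSlash("http://example.org/page.html"));
    }
}
```

URLs that already carry both slashes, or that use another scheme, pass through unchanged.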
Bug Fixes
-
(ACC-24) Fixed bug in Proxy mode which prevented the correct number
of results from being returned from the index during Replay.
-
(ACC-21) fixed bug where some CSS import declarations were not
being correctly rewritten.
-
(ACC-26) fixed rare String OOB exception when marking up pages with
some forms of Javascript generated HTML.
-
(ACC-28) verifies that detected encoding is supported in local JVM
before attempting to decode a resource into a String.
-
(unreported) fixed declared page encoding of help, advanced search
and index page to UTF-8.
-
Explicitly set character encoding on returned documents, instead of
relying on Tomcat to return the correct encoding.
Migration notes to 1.4.0 from 1.2.X
Wayback 1.4.0 includes substantial code changes aimed at extending
current capabilities, enabling planned future features, and
stabilizing interfaces used in .jsp customizations. Since these
changes would already require a significant update of existing
customizations made to .jsp files, many non-vital cleanups to the
source tree were included. The goal of implementing all of these
features within this single release is to minimize future required
updates.
Below is a somewhat inclusive list of changes that will be required
when upgrading to Wayback 1.4.0 from 1.2.X, divided into two main
categories: changes required to Spring configuration, and changes
required for .jsp customizations. Depending on the scope of the
existing customizations in your installations, it may be simpler
to modify your existing customizations to conform to new interfaces
and packages, and in other cases, it may be simpler to begin with the
new reference implementations and modify them to meet your needs.
If there are changes not addressed here, or if you have questions
regarding specific issues when upgrading, please direct these
questions to the archive-access-discuss forum.
Spring upgrade information
New features with the @ mark indicate features that will directly
impact Spring XML configuration files used with 1.2.X.
- org.archive.wayback.resourcestore.http.FileLocationDB now:
org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB
- org.archive.wayback.resourcestore.http.FileLocationDBServlet now:
org.archive.wayback.resourcestore.locationdb.ResourceFileLocationDBServlet
- org.archive.wayback.resourcestore.http.ArcProxyServlet now:
org.archive.wayback.resourcestore.locationdb.FileProxyServlet
-
All ReplayUI implementations changed completely, now located in:
ArchivalUrlReplay.xml, DomainPrefixReplay.xml, ProxyReplay.xml.
Customizations to jspInserts should be straightforward on
inspecting these files.
- org.archive.wayback.resourcestore.Http11ResourceStore now:
org.archive.wayback.resourcestore.SimpleResourceStore. See
RemoteCollection.xml for configuration example.
-
The new automatic indexing is most simply upgraded by modifying
the new example in BDBCollection.xml with your custom paths.
.jsp upgrade information
New features with the * mark indicate features that will directly
impact customizations made to .jsp files used with 1.2.X. The bulk of
the changes fit three categories:
-
class name and package changes requiring import tag updates.
Please see .jsps in new distribution for updated packages.
-
.jsp path changes due to webapp directory tree cleanup. Again,
please see the current locations in the new distribution.
-
Java changes within .jsp files due to UIResults refactoring.
Previously each type of response page had a unique class used
to marshal context information to the .jsp files. These have all
been refactored into a single class,
org.archive.wayback.core.UIResults, which has methods to
access the appropriate data in each case. Additionally, many
convenience methods that were present on the various UI*Results
classes have been removed, since convenience methods are now
available on the core classes:
- WaybackRequest
- CaptureSearchResult
- CaptureSearchResults
- UrlSearchResult
- UrlSearchResults
As an example, the Timestamp class is no longer used in the .jsp
files, since all time information uses the Date class for
localization. All of the above classes now have methods to
directly return Dates.
For specific examples, please see the reference .jsp files
included with the new distribution.
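As a rough illustration of why plain Date objects suit localization, as described above, the sketch below formats a capture date for a given Locale. It makes no claims about the actual UIResults or CaptureSearchResult methods: the class CaptureDateFormatter, the method format, and the choice of UTC are assumptions for this example only.

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of the Wayback code base.
final class CaptureDateFormatter {
    /** Formats a capture date in the conventions of the given locale, in UTC. */
    static String format(Date captureDate, Locale locale) {
        DateFormat fmt = DateFormat.getDateTimeInstance(
                DateFormat.LONG, DateFormat.LONG, locale);
        // Assumption: display times in UTC rather than the JVM default zone.
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(captureDate);
    }

    public static void main(String[] args) {
        Date capture = new Date(0L); // stand-in for a date obtained from a search result
        System.out.println(format(capture, Locale.US));
        System.out.println(format(capture, Locale.GERMANY));
    }
}
```

In a .jsp, the Date passed in would come from whichever core class holds the capture, and the Locale from the AccessPoint or the request.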
Release 1.2.1
Features
-
Now explicitly sets the charset component of replayed HTML
page Content-Type HTTP headers in Archival URL mode. This
overrides Tomcat's default behavior of setting this value
to Tomcat's default character encoding when a document
does not set it explicitly. The original Content-Type HTTP
header value is now returned as the HTTP header
X-Wayback-Orig-Content-Type.
Bug Fixes
-
added getter/setter for replay image, css, javascript, and html
error handling .jsps
-
now returns "closest" indicator on XML query results, fixing problem
with WAXToolbar/Proxy mode. (ACC-11)
- auto-indexer now closes ARC/WARC files after indexing, fixing
out-of-filehandle problem. (ACC-12)
- location-client now syncs .warc and .warc.gz files with
locationDB, in addition to .arc and .arc.gz files. (ACC-13)
-
fixed problem which prevented captures archived after webapp was
deployed from being returned. Now captures up to the current moment
are returned. (ACC-14)
-
changed all .jsp files to return UTF-8. (ACC-18)
-
now sending correct end Date to remote NutchWAX index.
(ACC-20)
-
fixed String OOB exception when attempting to rewrite some CSS text.
(ACC-17)
-
now updates CSS "import 'URL';" and 'import "URL";' content.
Previously only updated "import url(URL);" content.
-
fixed Replay redirect loop when using RemoteResourceIndex.
(ACC-15)
Release 1.2.0
Features
-
now supports compressed and uncompressed ARC and WARC files.
-
initial revision of "deduplicated" WARC record handling, which
returns the last version that was actually stored when
subsequent captures are not saved because they have not changed.
-
now filters (literal) duplicate records from the ResourceIndex,
in case the same capture (url + date) appears twice, or in two
CDX files.
-
UrlCanonicalizer is now pluggable, current functionality is now
implemented in AggressiveUrlCanonicalizer. Added
IdentityUrlCanonicalizer, which performs no canonicalization.
- bin-search command line tool now outputs a single stream of
sorted results from multiple files, instead of returning matches
from each file sequentially.
-
extracted several replay features into separate jspInserts that
can now be mixed and matched.
-
now handles most text/css URL rewriting, both inside HTML pages,
and in externally linked .css files.
-
externalized comment embedded inside replayed HTML pages into
jspInsert: ArchiveComment.jsp.
-
non-javascript Archival URL replay mode, where all URL rewriting
occurs on the server. This includes a non-javascript
Timeline jspInsert.
-
added two-month timeline partition.
-
root page of webapp now lists access points, when users make
a request that does not specify one. Also, now access point
"slash-pages" are available "without the slash".
Bug Fixes
-
Now rewrite Location and Content-Base HTTP headers in non-HTML
Archival URL replayed documents.
-
now rewrites all background attributes found in returned
pages (archival URL mode only) instead of just on BODY tags.
-
now rewrites src attributes on INPUT tags.
-
command line tools now allow whitespace arguments, important for
tools accepting delimiter arguments.
-
replay URLs in query results now include non-standard ports, if
needed.
-
Timezone is now explicitly set to GMT/UTC, fixing a Calendar
result partitioning problem.
-
uncaught character-encoding exceptions now handled, plus
slightly improved detection of correct character encoding by
removing internal whitespace in declared encoding names.
-
archival URL parsing of query end-date now assumes latest
possible date given a partial end-date, instead of earliest
possible date.
-
re-implemented lost "closest" indicator for XML results.
-
now supports multiple auto index threads, one per ResourceStore,
and also multiple auto index merge threads, one per BDB
ResourceIndex.
-
fixed hard-coded maximum year issue.
-
reimplemented NotInArchive logging, which was lost in 1.0.0.
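The end-date fix listed above (a partial end-date is now taken as the latest possible date) can be sketched as follows. This is an illustration, not the Wayback source: the class PartialEndDate and the method padToLatest are hypothetical names, and the 14-digit yyyyMMddHHmmss timestamp layout and the restriction to even-length input are assumptions of this example.

```java
import java.time.YearMonth;

// Hypothetical helper, not part of the Wayback code base.
final class PartialEndDate {
    /**
     * Pads a partial timestamp (yyyy, yyyyMM, yyyyMMdd, ...) to the latest
     * 14-digit yyyyMMddHHmmss value that it could denote.
     */
    static String padToLatest(String partial) {
        int len = partial.length();
        if (!partial.matches("\\d+") || len < 4 || len > 14 || len % 2 != 0) {
            throw new IllegalArgumentException("Unsupported partial date: " + partial);
        }
        StringBuilder sb = new StringBuilder(partial);
        if (sb.length() == 4) {
            sb.append("12"); // latest month
        }
        if (sb.length() == 6) {
            int year = Integer.parseInt(sb.substring(0, 4));
            int month = Integer.parseInt(sb.substring(4, 6));
            // Last day of that month, accounting for leap years.
            sb.append(String.format("%02d", YearMonth.of(year, month).lengthOfMonth()));
        }
        if (sb.length() == 8) {
            sb.append("23"); // latest hour
        }
        if (sb.length() == 10) {
            sb.append("59"); // latest minute
        }
        if (sb.length() == 12) {
            sb.append("59"); // latest second
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(padToLatest("2007"));
        System.out.println(padToLatest("200802"));
    }
}
```

Under the earlier behavior described in the fix, an end-date of "2007" would have been read as the start of that year, excluding nearly all captures from 2007.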