org.archive.wayback.util.url
Class AggressiveUrlCanonicalizer
java.lang.Object
org.archive.wayback.util.url.AggressiveUrlCanonicalizer
- All Implemented Interfaces:
- UrlCanonicalizer
public class AggressiveUrlCanonicalizer
- extends Object
- implements UrlCanonicalizer
Performs the standard Heritrix URL canonicalization. Eventually this
should all be configurable, or should read the settings used by one or
more Heritrix crawlers, though doing so is hard.
- Version:
- $Date: 2010-09-29 05:28:38 +0700 (Wed, 29 Sep 2010) $, $Revision: 3262 $
- Author:
- brad
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
AggressiveUrlCanonicalizer
public AggressiveUrlCanonicalizer()
doStripRegexMatch
protected boolean doStripRegexMatch(StringBuilder url,
Matcher matcher)
- Runs a regex against a StringBuilder and removes group 1 if it matches.
Assumes the regex is written to strip part of the passed string, and
that on a match group 1 is the part to remove.
- Parameters:
url - URL to search in.
matcher - Matcher whose form yields a group to remove.
- Returns:
- true if the StringBuilder was modified
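The group-removal behaviour described above can be sketched as follows. This is a minimal illustration, not the Wayback source: the class name StripGroupSketch, the method name stripGroupOne, and the "www." pattern are all assumptions made for the example, and whether the real method uses a whole-string match is likewise assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the Wayback code base.
final class StripGroupSketch {

    // Removes capture group 1 from the buffer when the matcher matches the
    // whole buffer (an assumption). Returns true if the buffer was modified.
    static boolean stripGroupOne(StringBuilder url, Matcher matcher) {
        if (matcher != null && matcher.matches() && matcher.start(1) >= 0) {
            // start(1)/end(1) are offsets into the matched text, which is
            // the buffer itself, so they can be used directly for deletion.
            url.delete(matcher.start(1), matcher.end(1));
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed pattern: group 1 is a "www." label right after the scheme.
        Pattern stripWww = Pattern.compile("(?i)^https?://(www\\.).*$");
        StringBuilder url = new StringBuilder("http://www.example.com/index.html");
        boolean changed = stripGroupOne(url, stripWww.matcher(url));
        System.out.println(changed + " " + url); // true http://example.com/index.html
    }
}
```

The matcher is created over the same StringBuilder that is later edited, so it should be discarded (or reset) after the deletion rather than reused.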
urlStringToKey
public String urlStringToKey(String urlString)
throws org.apache.commons.httpclient.URIException
- Specified by:
urlStringToKey in interface UrlCanonicalizer
- Parameters:
urlString - String representation of a URL, in a form as close to the
original and as unchanged as possible.
- Returns:
- a lookup key appropriate for searching within a ResourceIndex.
- Throws:
org.apache.commons.httpclient.URIException - if the input url String is not a valid URL.
canonicalize
public String canonicalize(String url)
- Idempotent operation that determines the 'fuzziest' form of the url
argument. It is applied before records are added to the ResourceIndex
and before lookup. The current version is exactly the default found in
Heritrix. When the configuration system for Heritrix stabilizes,
hopefully this can use that system directly.
- Parameters:
url - URL to be canonicalized.
- Returns:
- canonicalized version of url argument.
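To illustrate why canonicalization must be idempotent and applied on both the write path and the lookup path, here is a small sketch. It does not reproduce the Heritrix rules: the class ToyCanonicalizerSketch, its two toy rules (lowercasing and stripping leading "www" labels), and the HashMap standing in for a ResourceIndex are all assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration class; the rules below are toy stand-ins,
// not the actual Heritrix defaults.
final class ToyCanonicalizerSketch {

    // Matches one or more leading "www." / "www2." style labels after the scheme.
    private static final Pattern LEADING_WWW =
            Pattern.compile("^(https?://)(?:www\\d*\\.)+");

    // Applies the toy rules. Running it on its own output changes nothing,
    // because lowercasing is stable and all leading www labels go at once.
    static String canonicalize(String url) {
        String lowered = url.toLowerCase(Locale.ROOT);
        return LEADING_WWW.matcher(lowered).replaceFirst("$1");
    }

    public static void main(String[] args) {
        // A plain map stands in for the index (an assumption for this sketch).
        Map<String, String> index = new HashMap<>();

        // Write path: canonicalize before adding the record.
        index.put(canonicalize("http://www.Example.com/Page"), "capture-1");

        // Lookup path: canonicalize the query the same way.
        String hit = index.get(canonicalize("HTTP://example.com/page"));
        System.out.println(hit); // capture-1

        // Idempotence check: a second pass leaves the key unchanged.
        String once = canonicalize("http://www.Example.com/Page");
        System.out.println(once.equals(canonicalize(once))); // true
    }
}
```

Because both paths go through the same function, URL variants that differ only in case or a leading "www." resolve to the same key in this sketch.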
main
public static void main(String[] args)
- Parameters:
args - program arguments
Copyright © 2005-2011 Internet Archive. All Rights Reserved.