org.archive.wayback.util.url
Class AggressiveUrlCanonicalizer

java.lang.Object
  extended by org.archive.wayback.util.url.AggressiveUrlCanonicalizer
All Implemented Interfaces:
UrlCanonicalizer

public class AggressiveUrlCanonicalizer
extends Object
implements UrlCanonicalizer

Performs the standard Heritrix URL canonicalization. The rules are currently fixed. Ideally they would be configurable, or read from the settings of one or more Heritrix crawlers, but that is hard to do.

Version:
$Date: 2010-09-29 05:28:38 +0700 (Wed, 29 Sep 2010) $, $Revision: 3262 $
Author:
brad

Constructor Summary
AggressiveUrlCanonicalizer()
           
 
Method Summary
 String canonicalize(String url)
          Idempotent operation that will determine the 'fuzziest' form of the url argument.
protected  boolean doStripRegexMatch(StringBuilder url, Matcher matcher)
          Run a regex against a StringBuilder, removing group 1 if it matches.
static void main(String[] args)
           
 String urlStringToKey(String urlString)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AggressiveUrlCanonicalizer

public AggressiveUrlCanonicalizer()
Method Detail

doStripRegexMatch

protected boolean doStripRegexMatch(StringBuilder url,
                                    Matcher matcher)
Runs a regex against a StringBuilder, removing group 1 if it matches. Assumes the regex is written to strip part of the passed string and that, on a match, group 1 is the part to remove.

Parameters:
url - URL to search in.
matcher - Matcher whose pattern captures the text to remove as group 1.
Returns:
true if the StringBuilder was modified
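
The strip-group-1 behavior described above can be sketched as follows. This is a minimal illustration, not the class's actual source: the class name StripGroupSketch, the helper stripGroupOne, and the "www." pattern are all hypothetical, and the use of Matcher.matches() (a full-string match) is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripGroupSketch {
    // Hypothetical stand-in for doStripRegexMatch: if the matcher matches the
    // whole input and group 1 took part in the match, delete group 1 in place.
    static boolean stripGroupOne(StringBuilder url, Matcher matcher) {
        if (matcher != null && matcher.matches() && matcher.start(1) >= 0) {
            url.delete(matcher.start(1), matcher.end(1));
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative pattern (an assumption): group 1 captures a leading "www.".
        Pattern stripWww = Pattern.compile("(?i)^https?://(www\\.)[^/]+.*$");
        StringBuilder url = new StringBuilder("http://www.example.com/index.html");
        boolean changed = stripGroupOne(url, stripWww.matcher(url));
        System.out.println(changed + " " + url);
    }
}
```

The matcher is created over the same StringBuilder that is then edited, so the group offsets are read before the deletion happens.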

urlStringToKey

public String urlStringToKey(String urlString)
                      throws org.apache.commons.httpclient.URIException
Specified by:
urlStringToKey in interface UrlCanonicalizer
Parameters:
urlString - String representation of a URL, as close to its original, unchanged form as possible.
Returns:
a lookup key appropriate for searching within a ResourceIndex.
Throws:
org.apache.commons.httpclient.URIException - if the input string is not a valid URL.

canonicalize

public String canonicalize(String url)
Idempotent operation that determines the 'fuzziest' form of the url argument. It is applied before records are added to the ResourceIndex and before lookup. The current version is exactly the Heritrix default. Once the Heritrix configuration system stabilizes, the hope is that this can use that system directly.

Parameters:
url - to be canonicalized.
Returns:
canonicalized version of url argument.
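
Idempotence here means that canonicalizing an already canonical URL leaves it unchanged. The sketch below shows one way to check that property for a single input. The toyCanonicalize function is a hypothetical stand-in (lower-casing and dropping one leading "www."), not the Heritrix rule set, and IdempotenceSketch and isIdempotentOn are illustrative names.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

class IdempotenceSketch {
    // Toy stand-in, NOT the Heritrix rules: lower-case the URL and drop
    // a single leading "www." that follows the scheme.
    static String toyCanonicalize(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.replaceFirst("^(https?://)www\\.", "$1");
    }

    // True when applying f a second time does not change the result for this input.
    static boolean isIdempotentOn(UnaryOperator<String> f, String input) {
        String once = f.apply(input);
        return once.equals(f.apply(once));
    }

    public static void main(String[] args) {
        String raw = "HTTP://WWW.Example.com/Path";
        System.out.println(toyCanonicalize(raw));
        System.out.println(isIdempotentOn(IdempotenceSketch::toyCanonicalize, raw));
    }
}
```

A check like this only covers the inputs it is given. The toy function, for example, fails it on a host that starts with "www.www.", which is the kind of case an index-and-lookup key function has to avoid.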

main

public static void main(String[] args)
Parameters:
args - program arguments


Copyright © 2005-2011 Internet Archive. All Rights Reserved.