org.archive.wayback.util.url
Class AggressiveUrlCanonicalizer

java.lang.Object
  extended by org.archive.wayback.util.url.AggressiveUrlCanonicalizer
All Implemented Interfaces:
UrlCanonicalizer

public class AggressiveUrlCanonicalizer
extends java.lang.Object
implements UrlCanonicalizer

Performs the standard Heritrix URL canonicalization. Eventually this should be fully configurable, or able to read the settings used by a Heritrix crawler (or by several crawlers), but that is hard to do.

Version:
$Date: 2009-07-17 17:14:42 -0700 (Fri, 17 Jul 2009) $, $Revision: 2771 $
Author:
brad

Constructor Summary
AggressiveUrlCanonicalizer()
           
 
Method Summary
 java.lang.String canonicalize(java.lang.String url)
          Idempotent operation that will determine the 'fuzziest' form of the url argument.
protected  boolean doStripRegexMatch(java.lang.StringBuilder url, java.util.regex.Matcher matcher)
          Run a regex against a StringBuilder, removing group 1 if it matches.
static void main(java.lang.String[] args)
           
 java.lang.String urlStringToKey(java.lang.String urlString)
          Returns the canonical string key for the URL argument.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AggressiveUrlCanonicalizer

public AggressiveUrlCanonicalizer()

Method Detail

doStripRegexMatch

protected boolean doStripRegexMatch(java.lang.StringBuilder url,
                                    java.util.regex.Matcher matcher)
Run a regex against a StringBuilder, removing group 1 if it matches. Assumes the regex is written to strip elements from the passed string, and that on a match group 1 is the text to remove.

Parameters:
url - Url to search in.
matcher - Matcher whose form yields a group to remove
Returns:
true if the StringBuilder was modified
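
The strip-on-match behavior described above can be sketched roughly as follows. This is a minimal illustration and not the Wayback source: the class name StripGroupSketch, the method name stripGroupOne, and the "www" pattern are hypothetical, and the sketch assumes the Matcher was created against the same StringBuilder it modifies and that the pattern defines a group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the documented behavior; not the library's code.
class StripGroupSketch {

    // Removes the text captured by group 1 when the whole input matches.
    // Returns true only if characters were actually deleted.
    static boolean stripGroupOne(StringBuilder url, Matcher matcher) {
        if (matcher != null && matcher.matches()) {
            // Read both offsets before mutating the buffer.
            int start = matcher.start(1);
            int end = matcher.end(1);
            if (end > start) {
                url.delete(start, end);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        StringBuilder url = new StringBuilder("http://www.example.com/index.html");
        // Assumed pattern: group 1 captures a leading "www." or "www2." label.
        Matcher m = Pattern.compile("^https?://(www[0-9]*\\.).*$").matcher(url);
        boolean changed = stripGroupOne(url, m);
        System.out.println(changed + " " + url); // prints "true http://example.com/index.html"
    }
}
```

The Matcher is stale after the deletion, so a caller applying several patterns in sequence would presumably reset it or create a new one for each pass.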

urlStringToKey

public java.lang.String urlStringToKey(java.lang.String urlString)
                                throws org.apache.commons.httpclient.URIException
Returns the canonical string key for the URL argument.

Specified by:
urlStringToKey in interface UrlCanonicalizer
Parameters:
urlString - URL to convert to a lookup key.
Returns:
String lookup key for URL argument.
Throws:
org.apache.commons.httpclient.URIException

canonicalize

public java.lang.String canonicalize(java.lang.String url)
Idempotent operation that determines the 'fuzziest' form of the url argument. It is applied before records are added to the ResourceIndex and again before lookup, so both sides use the same key. The current version matches the Heritrix default exactly. Once the Heritrix configuration system stabilizes, this could hopefully use that system directly.

Parameters:
url - to be canonicalized.
Returns:
canonicalized version of url argument.
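
To illustrate why idempotence matters when the same operation runs before both indexing and lookup, here is a small sketch. It does not reproduce the Heritrix rule set: ToyCanonicalIndex, toyCanonicalize, and the two rules it applies (lowercasing and dropping leading "www." labels) are assumptions made only for this example, and a plain HashMap stands in for the ResourceIndex.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy index keyed by a canonical URL form; all names here are hypothetical.
class ToyCanonicalIndex {

    // Assumed rules, far simpler than the real ones: trim, lowercase,
    // and drop every leading "www." label so a second pass changes nothing.
    static String toyCanonicalize(String url) {
        String s = url.trim().toLowerCase(Locale.ROOT);
        return s.replaceFirst("^(https?://)(?:www\\.)+", "$1");
    }

    private final Map<String, String> records = new HashMap<>();

    // Canonicalize before storing a record...
    void add(String url, String record) {
        records.put(toyCanonicalize(url), record);
    }

    // ...and again before lookup, so URL variants resolve to the same key.
    String lookup(String url) {
        return records.get(toyCanonicalize(url));
    }

    public static void main(String[] args) {
        String once = toyCanonicalize("HTTP://WWW.Example.com/Page");
        String twice = toyCanonicalize(once);
        System.out.println(once.equals(twice)); // prints "true"

        ToyCanonicalIndex index = new ToyCanonicalIndex();
        index.add("http://www.example.com/a", "capture-1");
        System.out.println(index.lookup("HTTP://example.com/A")); // prints "capture-1"
    }
}
```

Because applying the function twice gives the same result as applying it once, a key that was already canonicalized when stored can safely pass through the same function again at lookup time.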

main

public static void main(java.lang.String[] args)
Parameters:
args -


Copyright © 2005-2009 Internet Archive. All Rights Reserved.