java.lang.Object
  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
public class AggressiveUrlCanonicalizer
Class that performs the standard Heritrix URL canonicalization. Eventually, this should all be configurable, or perhaps be able to read the settings used within a Heritrix crawler, or even within multiple crawlers, which is hard to do.
| Constructor Summary |
|---|
| AggressiveUrlCanonicalizer() |
| Method Summary | |
|---|---|
| java.lang.String | canonicalize(java.lang.String url) Idempotent operation that will determine the 'fuzziest' form of the url argument. |
| protected boolean | doStripRegexMatch(java.lang.StringBuilder url, java.util.regex.Matcher matcher) Run a regex against a StringBuilder, removing group 1 if it matches. |
| static void | main(java.lang.String[] args) |
| java.lang.String | urlStringToKey(java.lang.String urlString) Returns the canonical string key for the URL argument. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public AggressiveUrlCanonicalizer()
| Method Detail |
|---|
protected boolean doStripRegexMatch(java.lang.StringBuilder url,
                                    java.util.regex.Matcher matcher)
Parameters:
url - Url to search in.
matcher - Matcher whose form yields a group to remove
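The summary for doStripRegexMatch says it runs a regex against a StringBuilder and removes group 1 on a match. The following is a minimal sketch of that idea, not the class's actual source: the class name `StripGroupSketch`, the helper `stripGroupOne`, and the `www.` pattern are hypothetical, and whether the real method uses a whole-input match (`matches()`) as assumed here is not stated on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StripGroupSketch {

    /**
     * Hypothetical stand-in for doStripRegexMatch: if the matcher matches the
     * whole input, delete the span covered by capture group 1 from url.
     * Assumes the pattern defines at least one capture group.
     */
    static boolean stripGroupOne(StringBuilder url, Matcher matcher) {
        if (matcher == null || !matcher.matches()) {
            return false;
        }
        int start = matcher.start(1);
        if (start < 0) {
            return false; // group 1 exists but did not take part in the match
        }
        url.delete(start, matcher.end(1));
        return true;
    }

    public static void main(String[] args) {
        // Illustrative pattern only: group 1 captures a leading "www."
        Pattern stripWww =
                Pattern.compile("^https?://(www\\.).*$", Pattern.CASE_INSENSITIVE);
        StringBuilder url = new StringBuilder("http://www.example.com/index.html");
        boolean changed = stripGroupOne(url, stripWww.matcher(url));
        System.out.println(changed + " " + url); // prints "true http://example.com/index.html"
    }
}
```

The group offsets are read before the buffer is modified, so deleting from the same StringBuilder the matcher was created on is safe in this sketch.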
public java.lang.String urlStringToKey(java.lang.String urlString)
                                throws org.apache.commons.httpclient.URIException
Specified by:
urlStringToKey in interface UrlCanonicalizer
Parameters:
urlString -
Throws:
org.apache.commons.httpclient.URIException
public java.lang.String canonicalize(java.lang.String url)
Parameters:
url - to be canonicalized.
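The summary describes canonicalize as an idempotent operation, meaning that canonicalizing an already canonical URL returns it unchanged. The sketch below illustrates that property with a toy canonicalizer; it does not reproduce this class's rules. The class name `FuzzyCanonicalizerSketch` and the three rules (dropping a leading `www` host label, dropping a `;jsessionid=` parameter, lowercasing) are assumptions chosen for illustration. Applying the rules until nothing changes is one simple way to get idempotence by construction.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class FuzzyCanonicalizerSketch {

    // Hypothetical rules, for illustration only.
    private static final Pattern LEADING_WWW =
            Pattern.compile("^(https?://)www\\d*\\.", Pattern.CASE_INSENSITIVE);
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[0-9a-z]+", Pattern.CASE_INSENSITIVE);

    /** Applies every rule exactly once. */
    static String applyRulesOnce(String url) {
        String s = url.trim();
        s = LEADING_WWW.matcher(s).replaceFirst("$1");
        s = JSESSIONID.matcher(s).replaceAll("");
        return s.toLowerCase(Locale.ROOT);
    }

    /**
     * Repeats the rules until the string stops changing. The result is a
     * fixed point of applyRulesOnce, so canonicalizing it again is a no-op.
     */
    static String canonicalize(String url) {
        String current = url;
        while (true) {
            String next = applyRulesOnce(current);
            if (next.equals(current)) {
                return current;
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        String once = canonicalize("HTTP://WWW.Example.com/Page;jsessionid=ABC123?x=1");
        String twice = canonicalize(once);
        System.out.println(once);               // prints "http://example.com/page?x=1"
        System.out.println(once.equals(twice)); // prints "true"
    }
}
```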
public static void main(java.lang.String[] args)
Parameters:
args -