org.archive.wayback.accesscontrol.robotstxt
Class RobotExclusionFilter

java.lang.Object
  extended by org.archive.wayback.accesscontrol.robotstxt.RobotExclusionFilter
All Implemented Interfaces:
ObjectFilter<CaptureSearchResult>

public class RobotExclusionFilter
extends java.lang.Object
implements ObjectFilter<CaptureSearchResult>

A CaptureSearchResult filter that uses a LiveWebCache to retrieve robots.txt documents from the live web and filters SearchResults based on the rules therein. This class caches the parsed RobotRules it retrieves, so using the same instance to filter multiple SearchResults from the same host is more efficient. Instances are expected to be transient, one per request: the internally cached StringBuilder is not thread safe.
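
The per-host caching described above can be illustrated with a minimal sketch. It is not the Wayback implementation: the class HostRulesCacheSketch, its loader function, and the use of a plain String in place of parsed RobotRules are all hypothetical stand-ins. Like the filter itself, the sketch keeps unsynchronized state and is meant to be used by one request at a time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of per-host rule caching; not the Wayback code.
final class HostRulesCacheSketch {
    // Number of times the (expensive) loader was actually invoked.
    int loads = 0;

    // Plain HashMap: deliberately not thread safe, one instance per request.
    private final Map<String, String> rulesByHost = new HashMap<>();

    // Stand-in for "fetch robots.txt for this host and parse it".
    private final Function<String, String> loader;

    HostRulesCacheSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    // Returns the cached rules for host, loading them on first use only.
    String rulesFor(String host) {
        String cached = rulesByHost.get(host);
        if (cached == null) {
            loads++;
            cached = loader.apply(host);
            rulesByHost.put(host, cached);
        }
        return cached;
    }

    public static void main(String[] args) {
        HostRulesCacheSketch cache =
                new HostRulesCacheSketch(host -> "rules for " + host);
        cache.rulesFor("example.org");
        cache.rulesFor("example.org"); // served from the cache
        cache.rulesFor("example.net");
        System.out.println("loader calls: " + cache.loads);
    }
}
```

Reusing one instance across results from the same host avoids repeated retrieval and parsing, which is the efficiency gain the class description refers to.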

Version:
$Date$, $Revision$
Author:
brad

Field Summary
 
Fields inherited from interface org.archive.wayback.util.ObjectFilter
FILTER_ABORT, FILTER_EXCLUDE, FILTER_INCLUDE
 
Constructor Summary
RobotExclusionFilter(LiveWebCache webCache, java.lang.String userAgent, long maxCacheMS)
          Construct a new RobotExclusionFilter that uses webCache to pull robots.txt documents.
 
Method Summary
 int filterObject(CaptureSearchResult r)
          Inspect record and determine whether it should be included in the results, or whether processing of new records should stop.
protected  java.util.List<java.lang.String> searchResultToRobotUrlStrings(java.lang.String resultHost)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RobotExclusionFilter

public RobotExclusionFilter(LiveWebCache webCache,
                            java.lang.String userAgent,
                            long maxCacheMS)
Construct a new RobotExclusionFilter that uses webCache to pull robots.txt documents. Filtering is based on userAgent, and cached documents in webCache that are newer than maxCacheMS are considered valid.

Parameters:
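webCache - LiveWebCache used to pull robots.txt documents
userAgent - user agent on which filtering is based
maxCacheMS - age limit for cached documents; documents in webCache newer than this are considered valid (the name suggests milliseconds)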
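
As a rough illustration of the maxCacheMS rule, the following sketch treats a cached document as valid while its age is below the limit. The class and method names are hypothetical, the unit is assumed to be milliseconds, and the strict comparison at the boundary is an assumption rather than documented behavior.

```java
// Hypothetical sketch of a cache freshness check; not the Wayback code.
final class CacheFreshnessSketch {
    // True when the document fetched at fetchedAtMs is younger than
    // maxCacheMs at time nowMs. All values are assumed to be milliseconds.
    static boolean isFresh(long fetchedAtMs, long nowMs, long maxCacheMs) {
        long ageMs = nowMs - fetchedAtMs;
        return ageMs < maxCacheMs;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneHourMs = 60L * 60L * 1000L;
        // A document fetched ten minutes ago, checked against a one-hour limit.
        System.out.println(isFresh(now - 10L * 60L * 1000L, now, oneHourMs));
    }
}
```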
Method Detail

searchResultToRobotUrlStrings

protected java.util.List<java.lang.String> searchResultToRobotUrlStrings(java.lang.String resultHost)

filterObject

public int filterObject(CaptureSearchResult r)
Description copied from interface: ObjectFilter
Inspect record and determine whether it should be included in the results, or whether processing of new records should stop.

Specified by:
filterObject in interface ObjectFilter<CaptureSearchResult>
Parameters:
r - Object which should be checked for inclusion/exclusion or abort
Returns:
one of FILTER_INCLUDE, FILTER_EXCLUDE, or FILTER_ABORT
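
A caller might act on the three return values roughly as sketched below. The nested Filter interface and the constants are stand-ins defined only for this example, with arbitrary values; they are not the org.archive.wayback.util.ObjectFilter definitions, and the loop is an assumption about typical use rather than the Wayback code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of consuming filterObject results; not the Wayback code.
final class FilterLoopSketch {
    // Stand-in constants; the values are arbitrary for this sketch.
    static final int FILTER_INCLUDE = 0;
    static final int FILTER_EXCLUDE = 1;
    static final int FILTER_ABORT = 2;

    // Stand-in for the ObjectFilter interface.
    interface Filter<T> {
        int filterObject(T r);
    }

    // Keeps included records, skips excluded ones, stops at the first abort.
    static <T> List<T> apply(Filter<T> filter, List<T> records) {
        List<T> kept = new ArrayList<>();
        for (T r : records) {
            int verdict = filter.filterObject(r);
            if (verdict == FILTER_ABORT) {
                break; // no further records are processed
            }
            if (verdict == FILTER_INCLUDE) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy rule: paths under /private are excluded.
        Filter<String> f = path ->
                path.startsWith("/private") ? FILTER_EXCLUDE : FILTER_INCLUDE;
        System.out.println(apply(f, Arrays.asList("/a", "/private/x", "/b")));
    }
}
```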


Copyright © 2005-2009 Internet Archive. All Rights Reserved.