org.archive.wayback.accesscontrol.robotstxt
Class RobotExclusionFilter

java.lang.Object
  extended by org.archive.wayback.resourceindex.filters.ExclusionFilter
      extended by org.archive.wayback.accesscontrol.robotstxt.RobotExclusionFilter
All Implemented Interfaces:
ObjectFilter<CaptureSearchResult>

public class RobotExclusionFilter
extends ExclusionFilter

CaptureSearchResult filter that uses a LiveWebCache to retrieve robots.txt documents from the live web, and filters SearchResults based on the rules therein. This class caches the parsed RobotRules it retrieves, so using the same instance to filter multiple SearchResults from the same host is more efficient. Instances are expected to be transient for each request: the internally cached StringBuilder is not thread safe.
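The per-host caching and maxCacheMS freshness window described above can be sketched in plain Java. This is a minimal illustration of the idea, not the Wayback implementation; the `Rules` type and `fetcher` parameter are hypothetical stand-ins for RobotRules and the LiveWebCache lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch (not the Wayback implementation) of the caching idea the class
// description mentions: parsed robots.txt rules are kept per host, and a
// cached entry is reused only while it is newer than maxCacheMS.
class RobotRulesCacheSketch {
    // Hypothetical stand-in for a parsed RobotRules document.
    static final class Rules {
        final String text;
        final long fetchedAtMS;
        Rules(String text, long fetchedAtMS) {
            this.text = text;
            this.fetchedAtMS = fetchedAtMS;
        }
    }

    private final Map<String, Rules> cache = new HashMap<>();
    private final long maxCacheMS;

    RobotRulesCacheSketch(long maxCacheMS) {
        this.maxCacheMS = maxCacheMS;
    }

    // Returns the cached rules for the host while still fresh;
    // otherwise fetches again and replaces the cached entry.
    Rules rulesFor(String host, long nowMS, Function<String, String> fetcher) {
        Rules cached = cache.get(host);
        if (cached != null && nowMS - cached.fetchedAtMS <= maxCacheMS) {
            return cached; // fresh enough: no live-web fetch
        }
        Rules fresh = new Rules(fetcher.apply(host), nowMS);
        cache.put(host, fresh);
        return fresh;
    }
}
```

Because the cache is keyed by host, repeated captures from the same host within one request hit the cache rather than the live web, which is why reusing one instance per request is efficient.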

Version:
$Date: 2010-09-29 05:28:38 +0700 (Wed, 29 Sep 2010) $, $Revision: 3262 $
Author:
brad

Field Summary
 
Fields inherited from class org.archive.wayback.resourceindex.filters.ExclusionFilter
filterGroup
 
Fields inherited from interface org.archive.wayback.util.ObjectFilter
FILTER_ABORT, FILTER_EXCLUDE, FILTER_INCLUDE
 
Constructor Summary
RobotExclusionFilter(LiveWebCache webCache, String userAgent, long maxCacheMS)
          Construct a new RobotExclusionFilter that uses webCache to pull robots.txt documents.
 
Method Summary
 int filterObject(CaptureSearchResult r)
          Inspect a record and determine whether it should be included in the results, or whether processing of new records should stop.
protected  List<String> searchResultToRobotUrlStrings(String resultHost)
          Construct the list of robots.txt URL strings to check for the given result host.
 
Methods inherited from class org.archive.wayback.resourceindex.filters.ExclusionFilter
setFilterGroup
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RobotExclusionFilter

public RobotExclusionFilter(LiveWebCache webCache,
                            String userAgent,
                            long maxCacheMS)
Construct a new RobotExclusionFilter that uses webCache to pull robots.txt documents. Filtering is based on userAgent, and cached documents newer than maxCacheMS in the webCache are considered valid.

Parameters:
webCache - LiveWebCache from which documents can be retrieved
userAgent - String user agent to use for requests to the live web.
maxCacheMS - long number of milliseconds to cache documents in the LiveWebCache
Method Detail

searchResultToRobotUrlStrings

protected List<String> searchResultToRobotUrlStrings(String resultHost)
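The javadoc gives no description for this method, so the following is only a plausible sketch of what mapping a result host to candidate robots.txt URLs could look like. Trying both the www and non-www variant of the host is an assumption for illustration, not something this page confirms about the real implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a host-to-robots.txt-URL mapping; the real
// searchResultToRobotUrlStrings implementation may differ.
class RobotUrlSketch {
    static List<String> robotUrlsFor(String resultHost) {
        List<String> urls = new ArrayList<>();
        urls.add("http://" + resultHost + "/robots.txt");
        // Assumption: also try the www/non-www variant of the host.
        if (resultHost.startsWith("www.")) {
            urls.add("http://" + resultHost.substring(4) + "/robots.txt");
        } else {
            urls.add("http://www." + resultHost + "/robots.txt");
        }
        return urls;
    }
}
```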

filterObject

public int filterObject(CaptureSearchResult r)
Description copied from interface: ObjectFilter
Inspect a record and determine whether it should be included in the results, or whether processing of new records should stop.

Parameters:
r - Object which should be checked for inclusion/exclusion or abort
Returns:
int of FILTER_INCLUDE, FILTER_EXCLUDE, or FILTER_ABORT
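The three return values drive how a result iterator treats each record. The sketch below shows that contract with a stub filter interface; the constant values used here are placeholders for illustration, since the real constants are defined on org.archive.wayback.util.ObjectFilter.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the ObjectFilter contract: keep a record on
// FILTER_INCLUDE, skip it on FILTER_EXCLUDE, and stop processing
// entirely on FILTER_ABORT. Constant values are placeholders.
class FilterContractSketch {
    static final int FILTER_INCLUDE = 0;
    static final int FILTER_EXCLUDE = 1;
    static final int FILTER_ABORT = 2;

    interface Filter<E> {
        int filterObject(E record);
    }

    // Applies a filter the way a result iterator would.
    static <E> List<E> apply(Filter<E> filter, List<E> records) {
        List<E> kept = new ArrayList<>();
        for (E r : records) {
            int decision = filter.filterObject(r);
            if (decision == FILTER_ABORT) {
                break; // stop processing new records
            }
            if (decision == FILTER_INCLUDE) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

For RobotExclusionFilter, FILTER_EXCLUDE corresponds to a capture whose host's robots.txt rules disallow the configured user agent.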


Copyright © 2005-2011 Internet Archive. All Rights Reserved.