org.archive.wayback.accesscontrol.robotstxt
Class RobotExclusionFilter
java.lang.Object
org.archive.wayback.resourceindex.filters.ExclusionFilter
org.archive.wayback.accesscontrol.robotstxt.RobotExclusionFilter
- All Implemented Interfaces:
- ObjectFilter<CaptureSearchResult>
public class RobotExclusionFilter
- extends ExclusionFilter
CaptureSearchResult Filter that uses a LiveWebCache to retrieve robots.txt
documents from the live web, and filters SearchResults based on the rules
therein.
This class caches parsed RobotRules that are retrieved, so using the same
instance to filter multiple SearchResults from the same host will be more
efficient.
Instances are expected to be transient, one per request: the internally
cached StringBuilder is not thread-safe.
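The per-host caching described above can be pictured with a small sketch. This is not the Wayback implementation: the class name PerHostRuleCacheSketch, the loader function, and the use of a plain String in place of parsed RobotRules are all hypothetical stand-ins chosen so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in, not the Wayback class: keeps one parsed rule set
// per host so repeated captures from the same host reuse it.
class PerHostRuleCacheSketch {
    // Plain HashMap: like the class documented here, this sketch is meant
    // to be used by a single request and is not thread-safe.
    private final Map<String, String> rulesByHost = new HashMap<>();
    // Assumed loader: stands in for "fetch robots.txt and parse it".
    private final Function<String, String> loader;
    private int loads = 0;

    PerHostRuleCacheSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    // Returns the cached rules for a host, loading them only on first use.
    String rulesFor(String host) {
        String cached = rulesByHost.get(host);
        if (cached == null) {
            cached = loader.apply(host);
            loads++;
            rulesByHost.put(host, cached);
        }
        return cached;
    }

    // Number of times the loader was actually invoked.
    int loadCount() {
        return loads;
    }
}
```

With this shape, filtering many captures from one host costs a single load, which is the efficiency the class description refers to. Creating a fresh instance per request discards the cache and avoids sharing unsynchronized state between threads.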
- Version:
- $Date: 2010-09-29 05:28:38 +0700 (Wed, 29 Sep 2010) $, $Revision: 3262 $
- Author:
- brad
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
RobotExclusionFilter
public RobotExclusionFilter(LiveWebCache webCache,
String userAgent,
long maxCacheMS)
- Construct a new RobotExclusionFilter that uses webCache to pull
robots.txt documents. Filtering is based on userAgent, and cached
documents in the webCache that are less than maxCacheMS milliseconds old
are considered valid.
- Parameters:
webCache - LiveWebCache from which documents can be retrieved
userAgent - String user agent to use for requests to the live web
maxCacheMS - long number of milliseconds to cache documents in the LiveWebCache
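One way to read the maxCacheMS parameter is as a maximum age for a cached robots.txt document. The sketch below illustrates that reading only; the class CacheFreshnessSketch and its method are hypothetical, and whether the real LiveWebCache treats the boundary as inclusive is an assumption made here.

```java
// Hypothetical helper, not part of Wayback: decides whether a cached
// document is still usable under a maximum-age policy.
class CacheFreshnessSketch {
    // fetchedAtMs and nowMs are epoch milliseconds; maxCacheMS is the
    // maximum allowed age. The inclusive comparison is an assumption.
    static boolean isFresh(long fetchedAtMs, long nowMs, long maxCacheMS) {
        long ageMs = nowMs - fetchedAtMs;
        return ageMs <= maxCacheMS;
    }
}
```

Under this reading, a larger maxCacheMS means fewer live-web requests at the cost of acting on older robots.txt rules.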
searchResultToRobotUrlStrings
protected List<String> searchResultToRobotUrlStrings(String resultHost)
filterObject
public int filterObject(CaptureSearchResult r)
- Description copied from interface:
ObjectFilter
- Inspect the record and determine whether it should be included in the
results, excluded from them, or whether processing of further records should stop.
- Parameters:
r - Object which should be checked for inclusion/exclusion or abort
- Returns:
- int: one of FILTER_INCLUDE, FILTER_EXCLUDE, or FILTER_ABORT
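The three return codes imply a simple contract for whoever drives the filter. The sketch below shows one plausible caller loop; it is not taken from Wayback. FilterLoopSketch is a hypothetical class, the filter is modeled as a ToIntFunction rather than the real ObjectFilter interface, and the numeric values given to the constants are illustrative only (real code should reference the constants defined on ObjectFilter).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical caller loop showing how the three verdicts could be handled.
class FilterLoopSketch {
    // Illustrative values only; use the ObjectFilter constants in real code.
    static final int FILTER_INCLUDE = 0;
    static final int FILTER_EXCLUDE = 1;
    static final int FILTER_ABORT = 2;

    // Applies the filter to each record in order and returns the kept ones.
    static <T> List<T> apply(List<T> records, ToIntFunction<T> filter) {
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            int verdict = filter.applyAsInt(record);
            if (verdict == FILTER_ABORT) {
                break; // stop processing any further records
            }
            if (verdict == FILTER_INCLUDE) {
                kept.add(record);
            }
            // FILTER_EXCLUDE: skip this record and continue with the next
        }
        return kept;
    }
}
```

In this sketch an excluded record is dropped silently, while an abort ends the loop and keeps whatever was included before it.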
Copyright © 2005-2011 Internet Archive. All Rights Reserved.