org.archive.wayback.hadoop
Class CDXSort

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.archive.wayback.hadoop.CDXSort
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class CDXSort
extends org.apache.hadoop.conf.Configured
implements org.apache.hadoop.util.Tool


Nested Class Summary
static class CDXSort.CDXCanonicalizerMapClass
          Mapper which reads an identity CDX line, outputting: key = canonicalized original URL + timestamp; val = everything else.
static class CDXSort.CDXMapClass
          Mapper which reads a canonicalized CDX line, splitting it into: key = URL + timestamp; val = everything else.
static class CDXSort.DeReffingCDXCanonicalizerMapClass
           
static class CDXSort.FunkyCDXCanonicalizerMapClass
          Mapper which reads an identity Funky format CDX line, outputting: key = canonicalized original URL + timestamp; val = everything else.
          Input lines are a hybrid format:
            ORIG_URL
            DATE
            '-' (literal)
            MIME
            HTTP_CODE
            SHA1
            REDIRECT
            START_OFFSET
            ARC_PREFIX (sans .arc.gz)
            ROBOT_FLAG (combo of AIF - no: Archive, Index, Follow, or '-' if none)
          Ex: http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A
          Need to: .
static class CDXSort.FunkyDeReffingCDXCanonicalizerMapClass
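
The key/value split and the hybrid "Funky" field order described for the mappers above can be illustrated with a small standalone sketch. It uses only the Java standard library and is not the CDXSort implementation: the class name CdxLineSketch, both method names, and the LITERAL_DASH field label are hypothetical, fields are assumed to be separated by whitespace, and URL canonicalization is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of org.archive.wayback.hadoop.
class CdxLineSketch {

    // Field order taken from the Funky format description above.
    // LITERAL_DASH is a made-up label for the literal '-' column.
    static final String[] FUNKY_FIELDS = {
        "ORIG_URL", "DATE", "LITERAL_DASH", "MIME", "HTTP_CODE",
        "SHA1", "REDIRECT", "START_OFFSET", "ARC_PREFIX", "ROBOT_FLAG"
    };

    // Splits a line into {key, value}: key = first two fields (URL + timestamp),
    // value = everything else. Assumes single-space delimiters.
    static String[] splitKeyValue(String line) {
        String[] parts = line.split(" ", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 fields: " + line);
        }
        return new String[] { parts[0] + " " + parts[1], parts[2] };
    }

    // Maps each whitespace-separated token of a Funky format line to its field name.
    static Map<String, String> parseFunkyLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length != FUNKY_FIELDS.length) {
            throw new IllegalArgumentException(
                "Expected " + FUNKY_FIELDS.length + " fields, got " + tokens.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            fields.put(FUNKY_FIELDS[i], tokens[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Example line copied from the class description above.
        String line = "http://www.myow.de:80/news_show.php? 20061126032815 - text/html 200 "
            + "DVKFPTOJGCLT3G5GUVLCETHLFO3222JM - 91098929 foo A";

        Map<String, String> fields = parseFunkyLine(line);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }

        // The real mappers canonicalize the URL before building the key;
        // this sketch keeps the original URL unchanged.
        String[] kv = splitKeyValue(line);
        System.out.println("key: " + kv[0]);
        System.out.println("val: " + kv[1]);
    }
}
```

Keeping the URL and timestamp together as the key means that sorting by key orders records first by URL and then by capture time, which matches the key layout both mapper descriptions give.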
           
 
Constructor Summary
CDXSort()
           
 
Method Summary
 org.apache.hadoop.mapred.RunningJob getResult()
          Get the last job that was run using this instance.
static void main(String[] args)
           
 int run(String[] args)
          The main driver for the sort program.
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Constructor Detail

CDXSort

public CDXSort()
Method Detail

run

public int run(String[] args)
        throws Exception
The main driver for the sort program. Invoke this method to submit the map/reduce job.

Specified by:
run in interface org.apache.hadoop.util.Tool
Throws:
IOException - If there are communication problems with the job tracker.
Exception

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

getResult

public org.apache.hadoop.mapred.RunningJob getResult()
Get the last job that was run using this instance.

Returns:
the results of the last job that was run


Copyright © 2005-2011 Internet Archive. All Rights Reserved.