This page is a list, in no particular order, of things learned over time indexing here at the Archive (it's a little spare for now, but we'll add to it over time).

Remove all external dependencies

At the Archive, we let go of all NFS mounts on slaves and we even hardcoded DNS so the cluster had minimal dependency on external services. On too many occasions, a temporary NFS stall or a DNS outage killed jobs that had been running 24 hours or more.
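
As a rough illustration of auditing a node for these dependencies, the sketch below scans mount-table text (in the Linux /proc/mounts layout) for NFS filesystems and renders static hosts-file lines from a name-to-address mapping. The function names, host names and addresses are all made up for the example; this is not the tooling used at the Archive.

```python
# Minimal sketch: spot NFS mounts and build static hosts-file entries.
# All names and addresses here are hypothetical, for illustration only.

NFS_TYPES = {"nfs", "nfs4"}


def find_nfs_mounts(mounts_text):
    """Return (device, mount_point) pairs for NFS filesystems."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts layout: device mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[2] in NFS_TYPES:
            found.append((fields[0], fields[1]))
    return found


def render_hosts(nodes):
    """Render hosts-file lines from a {hostname: ip} mapping, sorted by name."""
    return "\n".join(f"{ip}\t{name}" for name, ip in sorted(nodes.items()))


if __name__ == "__main__":
    sample = (
        "/dev/sda1 / ext3 rw 0 0\n"
        "filer01:/export/home /home nfs rw,hard 0 0\n"
    )
    print(find_nfs_mounts(sample))
    print(render_hosts({"node02": "10.0.0.12", "node01": "10.0.0.11"}))
```

On a real node you would feed find_nfs_mounts the contents of /proc/mounts and expect an empty list back before starting a long job.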

Make all slave nodes the same in a cluster

Save yourself a headache by ensuring all nodes have the same hardware profile -- RAM, number and size of disks -- and that they run the exact same operating system version and have the same software (and versions) installed all around. Hadoop more or less expects the cluster to be homogeneous, so giving it anything else will only make your indexing life harder (and cluster computing isn't easy at the best of times).
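
One way to catch drift is to collect a small profile from each node and flag anything that differs from the majority. The sketch below assumes you have already gathered the profiles into a dictionary by whatever means suits your setup; the keys (ram_gb, disks, os) and node names are hypothetical.

```python
# Minimal sketch: flag nodes whose profile deviates from the cluster majority.
# Profile keys and node names are hypothetical, for illustration only.
from collections import Counter


def find_outliers(profiles):
    """Return {node: {key: (value, majority_value)}} for deviating nodes."""
    keys = set()
    for profile in profiles.values():
        keys.update(profile)

    outliers = {}
    for key in sorted(keys):
        # The most common value for this key is treated as the expected one.
        counts = Counter(p.get(key) for p in profiles.values())
        majority, _ = counts.most_common(1)[0]
        for node, profile in profiles.items():
            if profile.get(key) != majority:
                outliers.setdefault(node, {})[key] = (profile.get(key), majority)
    return outliers


if __name__ == "__main__":
    cluster = {
        "node01": {"ram_gb": 4, "disks": 4, "os": "distro-5.2"},
        "node02": {"ram_gb": 4, "disks": 4, "os": "distro-5.2"},
        "node03": {"ram_gb": 2, "disks": 4, "os": "distro-5.2"},
    }
    print(find_outliers(cluster))
```

An empty result means every node reported the same values; anything else is worth fixing before the next big job.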

Choose how many ARCs per segment carefully

A single ARC per segment is probably too small. Tens of thousands of ARCs per segment makes for fewer objects to manage, but the resulting segment will probably be too large to download from HDFS to local disk, or will make for indices that can't be merged within the confines of a local disk, etc. Experiment to find the size best suited to your setup.
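
A back-of-the-envelope calculation can give a starting point for that experiment. The sketch below bounds the number of ARCs per segment by the local disk available, given an expansion factor (bytes of segment and index output per byte of ARC input) that you would have to measure on your own data. The function, its parameters and the numbers in the example are all assumptions for illustration, not figures from the Archive.

```python
# Minimal sketch: upper bound on ARCs per segment from local disk capacity.
# The expansion factor and headroom are assumptions you must measure or choose.


def max_arcs_per_segment(local_disk_bytes, arc_bytes, expansion, headroom=0.5):
    """Return the largest ARC count whose output should fit on local disk.

    expansion: estimated output bytes per byte of ARC input.
    headroom:  fraction of local disk kept free as merge scratch space.
    """
    if arc_bytes <= 0 or expansion <= 0:
        raise ValueError("arc_bytes and expansion must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = local_disk_bytes * (1 - headroom)
    return int(usable // (arc_bytes * expansion))


if __name__ == "__main__":
    # Hypothetical figures: 200 GB local disk, 100 MB ARCs, output half the input size.
    print(max_arcs_per_segment(200 * 10**9, 100 * 10**6, expansion=0.5))
```

Treat the result as a ceiling to test against, not a recommendation; a real run will tell you whether downloads and merges actually fit.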