Ted Dziuba with an incredibly succinct summary of why you should write less code and embrace simple solutions.
Moving on: once you have these millions of pages (or even tens of millions), how do you process them? Surely Hadoop MapReduce is necessary; after all, that’s what Google uses to parse the web, right?
Pfft, fuck that noise:

    find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process
32 parallel parsing processes and zero bullshit to manage. Requirement satisfied.
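To make the moving parts concrete: `find -print0` emits NUL-delimited paths (so filenames with spaces don’t break anything), and `xargs -n1 -0 -P32` hands one path at a time to `./process`, keeping up to 32 copies running at once. The `process` script itself can be anything; here is a hypothetical sketch where the “parsing” is just counting `href` attributes — a stand-in for whatever real work you’d do per page:

```shell
#!/bin/sh
# Hypothetical ./process worker. xargs invokes it with one crawled file
# path per run (-n1), and up to 32 copies execute in parallel (-P32).
process_one() {
  f="$1"
  # Stand-in for real parsing: count href= occurrences in the page.
  links=$(grep -o 'href=' "$f" | wc -l | tr -d ' ')
  printf '%s %s\n' "$f" "$links"
}

# Demo: fabricate a crawled page and process it.
tmp=$(mktemp)
printf '<a href="a.html">one</a> <a href="b.html">two</a>\n' > "$tmp"
process_one "$tmp"
rm -f "$tmp"
```

Because each invocation touches exactly one file and writes its own output, there’s no shared state to coordinate — the kernel’s scheduler is your job tracker.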
Complexity sucks. Embrace simplicity and unsexiness. I always like to remind myself I’m not nearly as clever as I think I am. Every time I write a piece of code or a feature, I think “now, how would someone else handle this if it breaks?”
Sexiness aside, this is perhaps the biggest reason to avoid “my new, uber leet scala-dsl-erlang-nosql platform”:
You can minimize risk by using the well-proven tool set, or you can step into the land of the unknown. It may not get you invited to speak at conferences, but it will get the job done, and help keep your pager from going off at night.