Scraper Scuffles

Web scraping is in the middle of an arms race. Last week, I read an article by Francis Kim about how scraping—every hacker’s favorite tool for getting around APIs, data-mining, and liberating data—has become harder, because sites are getting better at identifying scrapers and shutting them out.

Francis lays out a bunch of solutions (including an alarming anti-CAPTCHA service) for defeating anti-scraping measures. However, I was surprised that neither he nor any of the commenters mentioned user agent strings.

PhantomJS, Selenium, and other automation tools usually allow you to spoof a particular UA, so why aren’t more people using this to make their scrapers appear to be real, random browsers?
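
Here's a rough sketch of what that could look like with Selenium and ChromeDriver in Python — the UA string and target URL are just placeholders, not anything in particular I'm endorsing:

```python
from selenium import webdriver

# Pretend to be an ordinary desktop Chrome install instead of
# advertising an automation tool in the User-Agent header.
SPOOFED_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/58.0.3029.110 Safari/537.36"
)

options = webdriver.ChromeOptions()
options.add_argument(f"user-agent={SPOOFED_UA}")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")

# Confirm the page sees the spoofed UA, not the default one.
print(driver.execute_script("return navigator.userAgent;"))
driver.quit()
```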

A few days later, I came across Randall Degges’ useragent-api, which returns a random UA string every time you make a request. It’s not a panacea, but it is an elegant way to mask your script’s intentions & provenance.
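
In the same spirit, a scraper could grab a fresh UA before every request it makes. Treat this as a sketch only: the endpoint URL and the JSON field name below are assumptions for illustration, not the actual useragent-api contract, so check the project’s README for the real details.

```python
import requests

# Hypothetical endpoint and field name -- substitute the real ones
# from the useragent-api documentation.
UA_API_URL = "https://example-useragent-api.test/api"


def fetch_random_ua():
    """Ask the UA service for a fresh, random user agent string."""
    resp = requests.get(UA_API_URL, timeout=5)
    resp.raise_for_status()
    return resp.json()["ua"]  # assumed response shape


def scrape(url):
    """Fetch a page while presenting a random browser identity."""
    headers = {"User-Agent": fetch_random_ua()}
    return requests.get(url, headers=headers, timeout=10)


if __name__ == "__main__":
    page = scrape("https://example.com")
    print(page.status_code, page.headers.get("content-type"))
```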

Unlike the Cold War, the battle over scraping is, I think, going to get hot. Keep your eyes peeled.
