Scraper Scuffles
09.01.16 · python
Web scraping is in the middle of an arms race. Last week, I read an article by Francis Kim about how scraping—every hacker’s favorite tool for getting around APIs, data-mining, and liberating data—has become harder, because sites are getting better at identifying scrapers and shutting them out.
Francis lays out a bunch of solutions (including an alarming anti-CAPTCHA service) for defeating anti-scraping measures. However, I was surprised that neither he nor any of the commenters mentioned user agent strings.
PhantomJS, Selenium, and other automation tools usually allow you to spoof a particular UA, so why aren’t more people using this to make their scrapers appear to be real, random browsers?
A few days later, I came across Randall Degges’ useragent-api, which returns a random UA string every time you make a request. It’s not a panacea, but it is an elegant solution for masking your script’s intentions & provenance.
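To make the idea concrete, here is a minimal sketch of sending a random UA with each request, using only Python's standard library. It does not call useragent-api; instead it assumes a small, hard-coded pool of UA strings, and the names `USER_AGENTS`, `build_request`, and `fetch` are hypothetical ones chosen for this example.

```python
import random
import urllib.request

# Illustrative pool of user agent strings (an assumption for this sketch;
# a real scraper would want a larger, regularly refreshed list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/601.6.17 "
    "(KHTML, like Gecko) Version/9.1.1 Safari/601.6.17",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0",
]


def build_request(url):
    """Return a Request whose User-Agent header is picked at random from the pool."""
    ua = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": ua})


def fetch(url, timeout=10):
    """Fetch a URL, presenting a randomly chosen user agent for this request."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling `fetch("http://example.com/")` would then present a different browser identity from one request to the next. A UA string is only one of the signals a site can check, so this is a partial disguise at best.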
Unlike the Cold War, the battle over scraping is, I think, going to get hot. Keep your eyes peeled.