Scraper Scuffles

Web scraping is in the middle of an arms race. Last week, I read an article by Francis Kim about how scraping—every hacker’s favorite tool for getting around APIs, data-mining, and liberating data—has become harder, because sites are getting better at identifying scrapers and shutting them out.

Francis lays out a bunch of solutions (including an alarming anti-CAPTCHA service) for defeating anti-scraping measures. However, I was surprised that neither he nor any commenters mentioned user agent strings.

PhantomJS, Selenium, and other automation tools usually allow you to spoof a particular UA, so why aren’t more people using this to make their scrapers appear to be real, random browsers?

A few days later, I came across Randall Degges’ useragent-api, which returns a random UA string every time you make a request. It’s not a panacea, but it is an elegant solution for masking your script’s intentions & provenance.
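
Here’s a rough sketch of the idea in plain Python. To be clear, this doesn’t call useragent-api; it assumes a small, hand-picked pool of UA strings (the `USER_AGENTS` list and the `build_request` helper are mine, purely for illustration) and only builds the request rather than sending it.

```python
import random
import urllib.request

# Hypothetical pool of UA strings; swap in whichever browsers you want to imitate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 "
    "(KHTML, like Gecko) Version/10.1.1 Safari/603.2.4",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0",
]


def build_request(url):
    """Build (but don't send) a request carrying a randomly chosen User-Agent."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


# Illustrative target URL; nothing is fetched until you call urlopen().
req = build_request("https://example.com")
```

Passing `req` to `urllib.request.urlopen` would send it with the spoofed header. A random UA only changes one signal, so treat this as a starting point, not a cloak of invisibility.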

Unlike the Cold War, I think the battle over scraping is going to get hot. Keep your eyes peeled.


I’ve used pretty much every instant message & team collaboration tool ever (AIM, ICQ, IRC, Sametime, HipChat, etc.) and these days it’s all about Slack.

It’s nice that you can edit Slack messages whenever you make a typo, but that means you have to mouse over, click edit, make your edit, hit enter, etc. It’s a whole thing.

But what if you could just do this instead?

slack-typobot demo

Pretty neat, hunh? Just type in your correction, like you would have on AIM, and slack-typobot will update your original message. Bing. Bang. Boom!
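
If you’re wondering what the core trick might look like, here’s a minimal sketch. It is not slack-typobot’s actual code: I’m assuming a sed-style `s/old/new/` correction (the bot’s real syntax may well differ), and `apply_correction` is a made-up helper that only works on plain strings.

```python
import re

# Assumed correction syntax: s/old/new/ (the trailing slash is optional).
CORRECTION = re.compile(r"^s/(?P<old>[^/]+)/(?P<new>[^/]*)/?$")


def apply_correction(previous_message, new_message):
    """Return the corrected previous message, or None if new_message isn't a correction."""
    match = CORRECTION.match(new_message.strip())
    if match is None:
        return None
    # Replace only the first occurrence, treating "old" as a literal string.
    return previous_message.replace(match.group("old"), match.group("new"), 1)


fixed = apply_correction("see you tomorow", "s/tomorow/tomorrow/")
```

A real bot would then push `fixed` back to Slack, presumably through the Web API’s `chat.update` method, which is the part that saves you the mouse-over-click-edit dance.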

Hackers just wanna have fun

People often ask me why, if I love coding so much, I haven’t pursued a career in software development. I shrug and tell them I’m a hacker. Being a pro developer would take all the fun out of it because I’d have to follow ‘best practices’ (screw you PEP8), ‘architect’ solutions, do code reviews, use JIRA, and numerous other things that would kill my vibe.

Think about movie critics. In theory, they have the best job ever: watching movies all day. But, they’re incapable of enjoying mindless action flicks like The Fast & Furious because they have to focus on things like plot, character development, and the Bechdel Test.

This is all my way of saying, I’ve never used virtualenv when writing Python apps. I just pip install modules globally and get to hacking. And that’s great because I never have to think about “did I install x?”—at least until it’s time to deploy to Heroku.

Fortunately, I found another hacker out there who appreciates my philosophy. Vadim Kravcenko’s pipreqs module generates requirements.txt files based on the libraries you’re using, not just what you’ve installed in your environment (🖕 pip freeze).

For example, my latest project has lots of imports spread across 3 or 4 source files:

```python
from flask import request, session, Flask, render_template, Response, redirect, url_for
from flask_cors import CORS, cross_origin
import requests
import json
import re
import random
from os import environ
import string
from datetime import datetime

import twilio
import twilio.rest
from twilio.rest.lookups import TwilioLookupsClient
import twilio.twiml

from sqlalchemy import *
from sqlalchemy.exc import IntegrityError
from sqlalchemy.exc import CompileError
from sqlalchemy.orm import sessionmaker, deferred
from sqlalchemy.ext.declarative import declarative_base

import cloudinary
import cloudinary.uploader
import cloudinary.api
```

But with pipreqs, I can run pipreqs ~/path/to/app and I get a perfectly formatted requirements file:

    cloudinary==1.4.0
    Flask==0.11.1
    Flask_Cors==2.1.2
    requests==2.10.0
    SQLAlchemy==0.9.4
    twilio==3.8.0

Did I just blow your little best-practice, rule-following, PEP-addicted mind? Thought so. 😎
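
For the curious, here’s roughly how a tool could figure out what you import without looking at your environment. This is only a sketch of the general technique using the standard library’s `ast` module, not pipreqs’ actual implementation; `top_level_imports` is a name I made up, and a real tool would still need to map module names to PyPI packages and versions.

```python
import ast


def top_level_imports(source):
    """Return the set of top-level package names imported by some Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import cloudinary.uploader" -> "cloudinary"
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # "from flask_cors import CORS" -> "flask_cors"; relative imports are skipped
            names.add(node.module.split(".")[0])
    return names


sample = "import requests\nfrom flask import Flask\nimport cloudinary.uploader\n"
found = top_level_imports(sample)
```

Because it parses the source instead of running it, nothing here depends on what happens to be installed globally.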

Cross Origin Really Sucks

This was the first half of my Friday night:

  1. Build an awesome web app with Python and Flask
  2. Add an API to my app because Flask is so awesome
  3. Deploy my app & API to Heroku and make them public

Neat, right? Well, the second half didn’t go so well:

  1. Try to use Flask API client-side via AJAX
  2. Realize I need to set CORS headers on my server
  3. Fail to configure Apache on a server that doesn’t really exist / that I don’t actually control
  4. Cry myself to sleep at 5am

So yeah, lots of feels. I built an API to serve data from my database on Heroku to a static website hosted on S3 (via XMLHttpRequest). But, because my API wasn’t sending CORS headers, the browser blocked the requests 😡

ICYMI: The Same Origin Policy prevents scripts on one site (Foo) from accessing resources on another site (Bar) unless Foo is authorized via Cross Origin Resource Sharing. (Scripts, stylesheets, HTML, images, and a few other file types are obviously excluded from this rule.) This is also the bedrock of web security.
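
In practice, “authorized” means Bar’s server answers with an `Access-Control-Allow-Origin` header naming Foo’s origin. Here’s a framework-free sketch of that decision; the `ALLOWED_ORIGINS` set, the example origins, and the `cors_headers` helper are all hypothetical, just to show the shape of it.

```python
# Hypothetical allow-list: the origins permitted to call this API from a browser.
ALLOWED_ORIGINS = {"https://foo.example"}


def cors_headers(request_origin):
    """Return the CORS response headers for a request's Origin header value."""
    if request_origin in ALLOWED_ORIGINS:
        return {
            # Tell the browser this specific origin may read the response.
            "Access-Control-Allow-Origin": request_origin,
            # The response differs per origin, so caches must key on Origin.
            "Vary": "Origin",
        }
    # No CORS headers: the browser won't let the calling script read the response.
    return {}


allowed = cors_headers("https://foo.example")
denied = cors_headers("https://evil.example")
```

The server still handles the request either way; it’s the browser that enforces the policy by refusing to hand the response to the calling script.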

Fortunately, after a solid night’s (morning’s?) rest, I discovered Cory Dolphin’s Flask-CORS library. With 1 line of code, you can enable CORS for particular routes, resources, or your whole Flask app. No server configs. No weird DevOps voodoo.

Seriously, here’s my API route before:

```python
from flask import Flask

application = Flask(__name__)


# without CORS, I can't hit this API clientside
@application.route('/comments/<i>', methods=['GET', 'POST'])
def comments(i):
    clist = getComments(i)  # getComments is a helper defined elsewhere in my app
    return clist
```

And after, with Flask-CORS:

```python
from flask import Flask
from flask_cors import CORS, cross_origin  # 1 import

application = Flask(__name__)


@application.route('/comments/<i>', methods=['GET', 'POST'])
# Just 1 line of code (actually, just a decorator), and now I can access my API anywhere!
@cross_origin()  # this is the magic
def comments(i):
    clist = getComments(i)  # getComments is a helper defined elsewhere in my app
    return clist
```


Sneaky spy

There are lots of ways for an application to get your current location (latitude / longitude) programmatically, but wifi-triangulate is a little shady.

Most web apps ask for your permission before accessing your location (there’s a whole browser-based API for this). But with wifi-triangulate, there’s no opt-in and it’s quite accurate. Pretty sure you could use this to locate a user, silently, without consent.

wifi-triangulate demo

Yes, there are other / easier ways to get a user’s location. And no, this library isn’t inherently “evil.” But please be careful and follow the golden rule of software development: don’t be a dick (or a sucker).

2017 Neal Shyam