Recently, Facebook announced that their mobile application will implement content pre-fetching. This means that if somebody creates an FB post about a page on your website (or you run Facebook ads with a link to your site), then as FB users scroll their timeline and see that post, the mobile app fires off a “GET” request for the linked content on your server. The FB app caches that content for a short time, and if the user clicks on the post, the app serves the cached copy before sending the user on to the actual site, so the page appears to load faster.
This is both good and bad. It’s good because who doesn’t want their website to appear to load faster for users who are trying to reach their content? It’s potentially bad because the higher the “post reach” in the Facebook network, the more prefetch requests hit your server. And it looks a lot like a DDoS attack: spikes of traffic from all over the world, none of which shows up in your JS analytics solutions (since prefetching does not parse and execute the JS).
Recently, even with a small reach (60,000 users reached), we experienced a consistent traffic surge of 60 requests per minute, in bursts lasting 10 minutes at a time, as a boosted post and some ads rolled out across the FB network. All of the requests came from Android devices (according to the logged user agents), using the FB mobile app web-view user agent, and mostly from the geographic region the ads targeted.
With enough cash and desire to drive user acquisition, we could essentially pay Facebook to DDoS ourselves. (Or if you’re lucky enough to have a huge page following, your post could potentially do that without boosting).
Facebook does send the “X-Purpose: preview” header to let you know what the request is about, but standard “combined format” log files do not record it, so the traffic will be a bit confusing at first (lots of requests, random IPs, all on Android devices, and nothing logged in your JS analytics platforms).
In NGINX, try this log format, which appends the header value to each log line:
log_format combinedPurpose '$remote_addr - $remote_user [$time_local] '
                           '"$request" $status $body_bytes_sent '
                           '"$http_referer" "$http_user_agent" purpose:$http_x_purpose';
access_log /var/log/nginx/access.log combinedPurpose;
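If you would rather keep prefetch traffic out of your main log entirely, one option is to route it to a separate file. The following is only a sketch, not something we have run in production: the variable names $is_prefetch and $is_regular and the prefetch.log path are made up for illustration, and it assumes the combinedPurpose format is already defined and that your NGINX is recent enough (1.7.0 or later) to support the if= parameter on access_log.

```nginx
# Place in the http {} context, alongside the combinedPurpose log_format.

# Hypothetical variable: "1" when the request carries X-Purpose: preview.
map $http_x_purpose $is_prefetch {
    default  0;
    preview  1;
}

# Hypothetical variable: the inverse, "1" for everything that is not a preview.
map $is_prefetch $is_regular {
    default  1;
    1        0;
}

# access_log skips a request when its if= condition evaluates to "0" or an empty string.
access_log /var/log/nginx/access.log   combinedPurpose if=$is_regular;
access_log /var/log/nginx/prefetch.log combinedPurpose if=$is_prefetch;
```

With this in place, a quick line count on prefetch.log gives a rough idea of how much of a spike is Facebook previews rather than real visitors. These lines would replace the single access_log line in the basic setup.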
I’m not sure I agree with Facebook’s decision to do this. If website operators need to prepare extra capacity just to handle users scrolling through an app (with a huge install base) that they don’t control, that seems like a huge waste of resources (electricity, virtual machine images spun up to respond to traffic). The alternative is that operators need a very smart caching plan. All for the possibility that a user clicks a post and saves a few seconds.