Back in early 2015, we got spurious reports of web users experiencing blank white pages. That is, pages with no error message, content, or styling whatsoever. At first, we didn't take it too seriously, writing it off as connection issues. But after a couple of bosses and power users said they were seeing the issue on a daily basis, both on wifi and on LAN, we had to prioritize it.
This blog post is a post-mortem of how we solved it and what kind of code we had to write to fix it and keep it fixed. We've even shipped that code as an open source gem to give back to the community. Cheers and happy reading!
The issues started creeping up over Christmas 2014. Users reported that they were browsing the site as usual when, all of a sudden, they would hit a blank white page with no content.
Figuring out the problem
We monitor our sites with New Relic and a couple of user analytics services. None of the critical metrics showed any impact that couldn't be disregarded as noise, and New Relic didn't show any 500 errors. We couldn't reproduce the issue locally, so we wrote it off a couple of times, but the bosses and internal power users kept coming back claiming the sites were broken.
In February 2015 we picked the issue up again, after having down-prioritized it because we couldn't easily reproduce it. We started by looking at all the 500 errors New Relic tracked. Those requests should have shown our custom 500 error page, but we simply didn't trust users to differentiate our 500 page from a blank page in bug reports.
In March 2015, a couple of internal employees got quite anxious, since they experienced the issue more often than not, and filed a lot of bug reports. By the time we could sync up with them, the issue would be gone and the user would be frustrated.
We sent out a memo internally asking our staff to visit a debug page and send us the debug code on that page whenever they got the error. Unfortunately, to no avail: the debug page also rendered as a white page.
Damn those cookies
In April 2015 we struck gold. A technically savvy user opened up dev tools and told us he kept getting a 400 Bad Request, and that his cookies were quite sizeable. His cookies exceeded the 8 KB per-domain limit that most standard nginx configurations enforce. We weren't using nginx, but little did we know, Heroku enforced the same limit.
Advertisers don't really want to display the same ad to the same user more than a limited number of times during a set time period, and they usually control this with cookies. One of the primary ad providers we used at that time, Adtoma Fusion, had a facility to keep track of the view counts of each advert per user. Adtoma knew that a single cookie is not allowed to exceed 4 KB, and made sure that their data didn't exceed 6 KB in total (through cookie splitting for Fusion.s1), so they were not the lone culprit. Adtoma also regularly purged expired content from the cookie, usually keeping it around 3 KB.
Then there was our own T_ID cookie. A loyal customer could have a lot of expired packages and a lot of expired campaigns, in addition to a well-filled profile. That information was stored in a JWT token, but it was also needed client side, so we duplicated the information from the JWT token in clear text inside the cookie. It was easy and saved us the hassle of exposing the key client side or doing a server round trip. For a loyal user, this cookie could easily come into the 2.5 KB to 3 KB range.
In addition to this, we normally had roughly 20 smallish cookies.
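Put together, the numbers add up fast. A rough back-of-the-envelope sketch (the sizes here are illustrative approximations, not measured values):

```ruby
# Back-of-the-envelope: why two large cookies plus ~20 small ones can
# blow past an 8 KB request header limit. Sizes are illustrative.
fusion = 6 * 1024   # Adtoma Fusion cookies, worst case, split across cookies
t_id   = 3 * 1024   # our own T_ID cookie, worst case
small  = 20 * 60    # ~20 smallish cookies at ~60 bytes each

total = fusion + t_id + small
limit = 8186        # bytes Heroku accepts in the Cookie header

puts "#{total} bytes total, limit exceeded: #{total > limit}"
# => 10416 bytes total, limit exceeded: true
```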
Trying to track it down
It sure took a while before we understood that the requests with oversized cookies never registered in our New Relic metrics: Heroku cut the request before it ever reached our apps, so the metrics only ever showed valid cookie sizes.
It took roughly two weeks of discussing with both Heroku and Cloudflare support to figure out which service was cutting the requests and causing the blank white pages. Once that was settled, we had to solve the issue.
Actually solving the issue
Several of our reference users claimed they couldn't reach the website for anywhere from 30 minutes up to a day after the initial white page. Quite a bad user experience!
Since the cookies were being generated on our website, and we really couldn't disable either authorization or advertising, we had to get clever.
Solution — part 1
We thought about setting up a service on a subdomain and instructing customer service to inform the users affected to visit the “fixing service”. That wouldn’t work though since the subdomain would not have access to the cookies on our primary domain.
We then decided to set up the fixing service on a subpath called “kaka” (cookie in Swedish) which we could set up in Cloudflare to use another upstream service that would not be Heroku. That second service was set up behind a custom nginx configuration with an increased header size limit.
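For reference, the knob in question on a stock nginx is `large_client_header_buffers`, which defaults to 4 buffers of 8k each. A sketch of the kind of override involved (the exact values are illustrative, not our production config):

```nginx
http {
    # Allow request headers -- including a bloated Cookie header --
    # of up to 32k instead of the common 8k default.
    large_client_header_buffers 4 32k;
}
```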
The service was basically a static HTML file with a redirect back to the website, served after the server headers cleared the worst known offender cookies (the T_ID cookie mentioned above).
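In spirit, the fixing endpoint boiled down to something like this hypothetical bare Rack app (the cookie name aside, the details are illustrative, not our actual service):

```ruby
# Hypothetical sketch of the "kaka" fixing endpoint: expire the known
# offender cookie, then redirect the user back to the site.
# Expiring a cookie means re-setting it with a date in the past.
EXPIRE_T_ID = "T_ID=; Path=/; Expires=Thu, 01 Jan 1970 00:00:00 GMT"

KakaFixer = lambda do |env|
  [302,
   { "Location" => "/", "Set-Cookie" => EXPIRE_T_ID },
   ["Cookies cleared, redirecting..."]]
end
```

A Set-Cookie header with an Expires date in the past tells the browser to drop the cookie, which is what shrinks the next request back under the limit.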
Solution — part 2
Now for a more persistent solution, we needed to continuously clean out cookies when they collectively became too large. So we set out to write a Ruby on Rails middleware that could clean cookies both on request and response. The middleware generated less than 0.1ms overhead which we thought was a totally reasonable tradeoff.
The middleware uses a developer-configured safe list of cookie names and cookie values, expressed as regular expressions. Incoming cookies that don't match the safe list are stripped from the request object and turned into expire directives in the response headers.
Cookies set internally by Rails were also scrutinized: they were either removed before being set at all, or expired if they were preexisting.
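A minimal sketch of that idea as a Rack middleware (the safe list, header handling, and names are illustrative; the real gem is more thorough):

```ruby
# Minimal whitelist-based cookie filter as Rack middleware (sketch).
# Cookies whose names don't match the safe list are stripped from the
# incoming request and expired in the outgoing response.
class CookieFilter
  EXPIRED_SUFFIX = "=; Path=/; Expires=Thu, 01 Jan 1970 00:00:00 GMT"

  def initialize(app, allow:)
    @app = app
    @allow = allow # array of regexes matching allowed cookie names
  end

  def call(env)
    cookies = env.fetch("HTTP_COOKIE", "").split(/;\s*/)
    keep, drop = cookies.partition do |pair|
      name = pair.split("=", 2).first.to_s
      @allow.any? { |re| re.match?(name) }
    end
    env["HTTP_COOKIE"] = keep.join("; ")

    status, headers, body = @app.call(env)

    # Ask the browser to expire every cookie we dropped.
    expirations = drop.map { |pair| pair.split("=", 2).first + EXPIRED_SUFFIX }
    unless expirations.empty?
      headers["Set-Cookie"] =
        [headers["Set-Cookie"], *expirations].compact.join("\n")
    end
    [status, headers, body]
  end
end
```

Joining multiple Set-Cookie values with a newline follows the Rack 2 convention; Rack 3 expects an array of values instead.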
The light at the end of the tunnel
In June 2015, almost six months after the first report, the cookie middleware was shipped to all our sites, and the reports ground to a halt. Tracking revealed that advertisers were setting 12 million cookies for our users across our 10 sites, every single day!
The problems of yesteryear
Today the cookie solutions for both authorization and advertising are long gone, replaced by smarter solutions, and we no longer have a burning need for this cookie filtering ourselves. Since we base our development heavily on open source, we want to give back to the community and have now open sourced this middleware as a Ruby gem. We hope it will be useful for someone:
- The content part of the Cookie header can only be 8186 bytes long before Heroku cuts the request.
- We've published a gem called cookiefilter that helps you purge non-whitelisted cookies and cookies that are too big.