WordPress admin hang after upgrade

It’s 8am on a Tuesday morning, and one of the web sites we maintain has just stopped responding.  It’s not completely unexpected. We just upgraded to WordPress 3.1 last night.

The upgrade went smoothly in our development environment, and after patching a few plugins, we were off and running. The upgrade last night on our production system went off without a hitch as well. My main concern was to make sure the frontend pages still worked, and all the sites in our network looked good. At least they did on the public pages.

But now, in the light of day, something isn’t right. As it turns out, the frontend pages *are* all working. But on one of the sites, the admin interface is acting weird. The editor who is reporting the issue says the admin is locked up, and sure enough, I try to log in and…nothing. The page just loads and loads. I try again with the same result. And then again, but this time, instead of giving up after 30 seconds, I resist the urge to hit the “reload” button over and over and just let it go. Strangely enough, after a few minutes, the admin interface loads as normal. Two minutes, to be exact. I navigate to another page in the admin, wait two minutes, and it loads again. And another, and another, and all of them give the exact same response. The site is not slow; it’s just hanging for exactly two minutes and then loading the page I requested without issue.

OK, so my first thought is that it’s a plugin issue. I even find some evidence here and there to back up that claim. I disable all the plugins on the site. Two minutes later, the plugins are disabled, and the admin still behaves in the same manner. Click, two minutes… click, two minutes.

So next, I call in some help from another developer on the team. We try turning on PHP debugging, but that doesn’t help, because, first, it takes two minutes between page loads to get anything useful out of the admin, and second, there *is no error.* The pages are not broken, just delayed. Next, he puts a dead simple line of debug code into the admin files of WordPress itself.

<?php die(); ?>

When placed at the top of the file, the admin stops right away. At the bottom or in the middle, it takes two minutes, the page renders (mostly), then dies. After a few minutes of moving the line from one place in the file to the next, and following it down through a few includes, he’s found the culprit: these simple, innocent-looking lines from wp-admin/admin.php in the core WordPress code.
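Moving a `die()` around works, but every probe costs another two-minute page load. A variation on the same idea is to log the elapsed time at a few checkpoints, so that a single request shows where the delay sits. The following is a minimal sketch in plain PHP, not what we actually ran; `checkpoint()` and its labels are hypothetical names made up for illustration.

```php
<?php
/**
 * Hypothetical helper: log the seconds elapsed since the previous call.
 * Returns the elapsed time so callers can inspect it as well.
 */
function checkpoint(string $label): float
{
    static $last = null;                 // timestamp of the previous checkpoint
    $now = microtime(true);
    $elapsed = ($last === null) ? 0.0 : $now - $last;
    $last = $now;
    error_log(sprintf('[checkpoint] %s: +%.3fs', $label, $elapsed));
    return $elapsed;
}

checkpoint('start');
usleep(50000);                           // stand-in for the code under suspicion
$slow = checkpoint('after suspect block');
```

Dropping a handful of these calls between includes would narrow down the slow section in one page load instead of several.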

/**
 * On really small MU installs run the upgrader every time,
 * else run it less often to reduce load.
 *
 * @since 2.8.4b
 */
$c = get_blog_count();
if ( $c <= 50 || ( $c > 50 && mt_rand( 0, (int)( $c / 50 ) ) == 1 ) ) {
  require_once( ABSPATH . WPINC . '/http.php' );
  $response = wp_remote_get( admin_url( 'upgrade.php?step=1' ),
    array( 'timeout' => 120, 'httpversion' => '1.1' ) );
  do_action( 'after_mu_upgrade', $response );
  unset($response);
}
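Read as a probability, that condition fires on every admin page load when the count is 50 or less, and only occasionally above that. The sketch below makes the numbers concrete. It is plain PHP for illustration only: `upgrade_trigger_probability()` is a hypothetical name, not part of WordPress, and `$count` stands in for whatever `get_blog_count()` returns.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): probability that the
 * condition in the snippet above is true for a given count.
 */
function upgrade_trigger_probability(int $count): float
{
    if ($count <= 50) {
        return 1.0; // small installs: the upgrade request fires every time
    }
    // mt_rand(0, $max) picks one of $max + 1 integers uniformly,
    // and only the value 1 satisfies the comparison.
    $max = (int) ($count / 50);
    return 1 / ($max + 1);
}

foreach ([10, 50, 51, 100, 5000] as $count) {
    printf("%5d => %.4f\n", $count, upgrade_trigger_probability($count));
}
```

So a count of 100 gives roughly a one-in-three chance per page load, and a count of 5000 gives about one in a hundred.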

The site having problems is brand new and has fewer than 50 posts. And for some reason, the database upgrade script failed to run last night when the rest of the sites were upgraded. That means each and every time an admin page loads, it tries to run the database upgrade. And it runs the upgrade in a slightly odd way. Instead of calling a function to perform the upgrade, it does a wp_remote_get, which uses HTTP to request another page on the same server. We have a quirk in our production architecture that doesn’t allow the servers sitting behind our load balancer to call themselves, so instead of just running the upgrade, it hangs. And it hangs for exactly 120 seconds, the timeout passed to wp_remote_get, before giving up. The other sites are not really affected, and even if they are hanging, the code above runs only at random, and less often the more posts they have. And luckily for us, they have a *ton* of posts.
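A check along the following lines, run on a production host itself, could surface this kind of loopback problem in seconds rather than minutes. It is a minimal sketch in plain PHP rather than WordPress code, and it assumes `allow_url_fopen` is enabled. `can_fetch()` is a hypothetical helper, and the address in the example is only a stand-in for a URL the server cannot reach.

```php
<?php
/**
 * Hypothetical helper: try to fetch $url, giving up after $timeoutSeconds.
 * Returns true if any HTTP response came back, false otherwise.
 */
function can_fetch(string $url, float $timeoutSeconds = 5.0): bool
{
    $context = stream_context_create([
        'http' => [
            'timeout'       => $timeoutSeconds, // fail fast instead of waiting 120 seconds
            'ignore_errors' => true,            // count 4xx/5xx responses as "reachable"
        ],
    ]);
    // The @ suppresses the warning PHP raises when the request fails.
    $body = @file_get_contents($url, false, $context);
    return $body !== false;
}

// Example: a local port where nothing is expected to be listening.
var_dump(can_fetch('http://127.0.0.1:1/', 2.0));
```

Pointing it at the site's own public admin URL from each server behind the load balancer would show whether that server can call itself.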

So, is there a lesson here? Not really. It just happens to be a good story. I can try to spin this as a lesson about having your development environments match your production environments as much as possible, but you probably know that already. An exact mirror between production and development is a luxury that is seldom affordable. But when the bug you run into is *caused* by your production server architecture, how can you avoid it if you have those variations? No, really, I’d love to know how. I’ll take any suggestions I can get.