Wednesday, 11 November 2009

Performance Monitoring and Reporting - Our Story

This post summarises the presentation that we gave at the Google Test Automation Conference (GTAC) in November 2009. It describes some of the work that the smartFOCUS DIGITAL development team have been doing to both monitor and optimise the performance of our web applications.

The performance of a web application should be regarded as a feature rather than an afterthought. More organisations are noticing that any unintended latency in their application affects their profits. Google saw a 20% drop in traffic when they added a 500ms delay. Yahoo! saw nearly a 10% drop in full-page traffic when their load time increased by 400ms. Amazon noticed a 1% drop in sales when their pages took 100ms longer to load.

The performance of an application forms a large part of the user experience and influences the user’s impression of the application and the company it represents. There are a number of reasons to automatically measure the Key Performance Indicators (KPIs) of a web application:

  • As a developer, you can gauge the impact of changes that you are committing. Cumulative feature additions and bug fixes can lead to a web application that "feels sluggish", but has no immediately obvious culprit unless a developer keeps on top of performance tuning. If you are able to continuously measure performance, you will be able to rectify issues as they arise (functioning as the performance equivalent of a Unit test in a typical Continuous Integration setup).

  • As a tester, you can free up testing resource to be used throughout a sprint. The data that is collected can also be added to the Tester's Heads-Up Display for performance results.

  • As a manager, you gain a top-down view of site performance, which allows you to become aware of performance issues before your customers inform you.

  • As a member of an infrastructure team, you can use timing data gathered for a range of key actions on a site to gain an insight into the performance of, and load on, the application and database servers. This should reflect the experience as seen by the application’s users and may pick up on issues that are not immediately obvious through traditional server monitoring tools.

  • As a member of a support team, you can use historic performance data as a comparison point when diagnosing user issues. The changes that are needed to correct performance issues are not necessarily hard to implement, but a few small changes throughout a site can produce huge returns.

We needed to create a framework to allow us to find the performance issues affecting our application, fix them and monitor them to ensure we don't regress.

We chose YSlow, a free Firebug plugin from Yahoo!, to perform the measurements and produce useful, consistent data.

YSlow

As well as producing detailed, static reports within the plugin, YSlow has the ability to "beacon" the data it records to a pre-set web address using an HTTP GET request. YSlow can also be set to auto-run on page load, so you can manually walk through a site and measure each page.
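To make the beaconing idea concrete, here is a minimal Python sketch of what the receiving side might do with such a GET request. It is an illustration, not the code we ran: the short parameter names (`u`, `w`, `o`, `r`, `lt`), the readable field names and the `beacon.example.com` address are all assumptions, so check them against the beacon format of the YSlow version you are using.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping from short beacon parameter names to readable field
# names. These names are assumptions made for this illustration only.
FIELD_NAMES = {
    "u": "url",
    "w": "page_size_bytes",
    "o": "overall_score",
    "r": "request_count",
    "lt": "load_time_ms",
}


def parse_beacon(request_url):
    """Turn the query string of a beacon GET request into a dict."""
    query = parse_qs(urlsplit(request_url).query)
    record = {}
    for short_name, field in FIELD_NAMES.items():
        if short_name not in query:
            continue  # the beacon did not include this value
        value = query[short_name][0]
        # Everything except the page URL is treated as an integer.
        record[field] = value if field == "url" else int(value)
    return record


if __name__ == "__main__":
    example = ("http://beacon.example.com/yslow"
               "?u=http%3A%2F%2Fapp.example.com%2Fhome&w=204800&o=82&r=37&lt=1450")
    print(parse_beacon(example))
```

Keeping the parsing in one small function like this makes it easy to extend the mapping when extra fields (such as component or caching details) are added to the beacon.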

While doing this we noticed that a number of reporting aspects were missing: no detailed information about the make-up of the page (e.g. component types, caching information) was sent.

As YSlow is a standard Firefox plugin, it was easy to poke around in the source code and set this right. We unzipped the XPI file that contains YSlow and made the modifications to send the extra caching and component data. We then rezipped it and installed it on a newly created Firefox profile.

(Note: This was the case at the time - for YSlow v1 and the early betas of v2. The latest version has an extremely comprehensive and well documented beaconing system! See http://developer.yahoo.com/yslow/help/index.html#yslow_beacon )

Selenium

We then needed to look at automating the process of walking through the site. Since smartFOCUS Digital uses Selenium for a lot of its automated testing, it was a natural choice for us to use.

For each section of the walk through, a new requestID is retrieved from the database and this is then sent with the beacon data, to allow all the data we collect to be tied together.
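The idea of tying beacon data together with a requestID can be sketched as follows. This is a minimal illustration using an in-memory SQLite database; the table names, column names and helper functions are hypothetical and are not our actual schema.

```python
import sqlite3


def create_schema(conn):
    """Create two illustrative tables: one row per walkthrough section,
    and any number of beacon rows pointing back at it."""
    conn.execute(
        "CREATE TABLE requests ("
        " request_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " section TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE beacons ("
        " request_id INTEGER NOT NULL REFERENCES requests(request_id),"
        " page_url TEXT NOT NULL,"
        " overall_score INTEGER)")


def new_request_id(conn, section):
    """Allocate a fresh request ID for one section of the walkthrough."""
    cur = conn.execute("INSERT INTO requests (section) VALUES (?)", (section,))
    conn.commit()
    return cur.lastrowid


def store_beacon(conn, request_id, page_url, overall_score):
    """Record one beacon against the request ID that was sent with it."""
    conn.execute("INSERT INTO beacons VALUES (?, ?, ?)",
                 (request_id, page_url, overall_score))
    conn.commit()


def beacons_for(conn, request_id):
    """Return every beacon recorded for a given request ID, oldest first."""
    rows = conn.execute(
        "SELECT page_url, overall_score FROM beacons"
        " WHERE request_id = ? ORDER BY rowid", (request_id,))
    return rows.fetchall()
```

Because the ID is allocated before the section starts and travels with every beacon, all measurements from one run of one section can later be selected with a single query.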

Once the tests had been running for a while, we noticed that the data being recorded by YSlow was not what we were expecting. To YSlow, it appeared that we were not implementing any caching at all. After a bit of hunting through the Selenium RC code base, we found the following lines of code and commented them out.

They are within the proxy that sits between the site under test and the machine driving the tests. By default, the proxy blocks the ETag and Last-Modified headers from passing through, to ensure that the latest version is always being tested. This is brilliant for testing versionless software, but if you are testing versioned software and want to check that caching is working properly, it can be rather annoying.


//response.removeField(HttpFields.__ETag); // possible cksum?  Stop caching... 
//response.removeField(HttpFields.__LastModified); // Stop caching
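
Why removing these two headers makes a site look uncached can be shown with a small Python sketch. The real proxy is Java; the function names below are hypothetical and only model the behaviour. A client can revalidate a cached copy only if the original response carried an ETag or Last-Modified validator, because those values are what it sends back in If-None-Match and If-Modified-Since.

```python
VALIDATOR_HEADERS = ("etag", "last-modified")


def strip_validators(headers):
    """Model the proxy's default behaviour: drop ETag and Last-Modified."""
    return {name: value for name, value in headers.items()
            if name.lower() not in VALIDATOR_HEADERS}


def conditional_request_headers(cached_response_headers):
    """Build the headers a client could send to revalidate a cached copy."""
    lowered = {name.lower(): value
               for name, value in cached_response_headers.items()}
    conditional = {}
    if "etag" in lowered:
        conditional["If-None-Match"] = lowered["etag"]
    if "last-modified" in lowered:
        conditional["If-Modified-Since"] = lowered["last-modified"]
    return conditional


if __name__ == "__main__":
    original = {
        "ETag": '"abc123"',
        "Last-Modified": "Wed, 11 Nov 2009 10:00:00 GMT",
        "Content-Type": "text/css",
    }
    print(conditional_request_headers(original))
    # After stripping, there is nothing left to revalidate with.
    print(conditional_request_headers(strip_validators(original)))
```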


Action Timing

Once we had got what we wanted from running the tests with YSlow, we wanted to know what other information would be useful. The first thing that came to mind, and that we have implemented, was the ability to measure how long things take to load. We started out by recording the time it takes to load a page, but we also became interested in recording how long dialogs take to load and how long tree nodes take to expand in the management view on our site.
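
One way to time an action such as a dialog opening is to trigger it and then poll until it has finished. The sketch below shows that pattern in Python; it is an illustration rather than our test code, and `trigger` and `is_complete` are hypothetical callables that a test would supply (for example, a click followed by an "is the dialog visible?" check).

```python
import time


def time_action(trigger, is_complete, timeout=30.0, poll_interval=0.05):
    """Run trigger(), poll is_complete(), and return the elapsed seconds.

    Raises TimeoutError if the action does not complete within `timeout`.
    The result is only as precise as `poll_interval` allows.
    """
    start = time.perf_counter()
    trigger()
    while not is_complete():
        if time.perf_counter() - start > timeout:
            raise TimeoutError(
                "action did not complete within %.1fs" % timeout)
        time.sleep(poll_interval)
    return time.perf_counter() - start
```

A shorter polling interval gives finer timing at the cost of more checks against the page, so it is worth choosing one that is small compared with the durations being measured.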

Reporting

Now that we are recording all of this useful data, we need a good way to analyse it and to monitor each build that is produced. As we predominantly produce web applications, a web reporting portal was the obvious solution. We make heavy use of jQuery within our site, so we went with the really great jQuery plotting plugin, Flot. The data is pulled from the database through a JSON web service and is used to create two Flot plots of the YSlow data for each page.
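
As an illustration of the kind of JSON such a web service might return, the Python sketch below groups database rows into the series structure that Flot plots: a list of objects, each with a `label` and a `data` list of `[x, y]` points. The row layout `(build_index, component, size_kb)` is an assumption made for this example, not our actual schema.

```python
import json
from collections import defaultdict


def to_flot_series(rows):
    """Group (build_index, component, size_kb) rows into Flot series.

    Each series is a dict with a "label" and a "data" list of [x, y]
    points, sorted by build index so the line is drawn left to right.
    """
    points = defaultdict(list)
    for build_index, component, size_kb in rows:
        points[component].append([build_index, size_kb])
    return [{"label": label, "data": sorted(data)}
            for label, data in sorted(points.items())]


if __name__ == "__main__":
    rows = [(2, "css", 40), (1, "css", 38), (1, "js", 120), (2, "js", 131)]
    # This string is what the page would hand to the plotting call.
    print(json.dumps(to_flot_series(rows)))
```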

Size Plot

The size plot shows the size of the page as a whole, as well as the size of each type of component (e.g. JavaScript, CSS) that makes up the page. Along with this, the cached versions of these values are plotted as well (i.e. the size you'd download if you visited with a full cache).

The build numbers are plotted along the x-axis: vertical bars highlight when each new build was introduced. The darker bars represent minor build number changes (e.g. 1.2.0) and the lighter bars represent numbered builds (e.g. 1.2.1 and 1.2.2).

YSlow Data Plot

The YSlow data plot is in the same format as the size plot, but plots the scores for each of the YSlow categories on the y-axis.

Timing Plot

Each page has at least the load time recorded, along with other actions as appropriate. The timing plots show these in the same way as the size and YSlow plots, with times averaged over multiple runs.

Delta Plot

The delta plot was introduced to give a good "top-down" view of the site changing over time. It shows the change in page size relative to the previous numbered build for every page that is monitored. This allows you to see if there is an issue that affects every page on the site (the mass of lines will spike upwards together) or a large change that affects only a single page (a single line will break away from the pack and be noticeable).
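
The calculation behind a plot like this can be sketched in a few lines of Python. This is an illustration under stated assumptions: the change is taken as the absolute difference in size between consecutive numbered builds, and the function names and the spike threshold are hypothetical.

```python
def size_deltas(sizes_by_page):
    """Map each page to its change in size between consecutive builds.

    `sizes_by_page` maps a page name to a list of page sizes, one per
    numbered build, oldest first. The first build has no predecessor, so
    each result list is one element shorter than its input.
    """
    return {page: [cur - prev for prev, cur in zip(sizes, sizes[1:])]
            for page, sizes in sizes_by_page.items()}


def pages_that_spiked(deltas, build_offset, threshold):
    """List pages whose size grew by more than `threshold` at one step."""
    return sorted(page for page, changes in deltas.items()
                  if changes[build_offset] > threshold)


if __name__ == "__main__":
    sizes = {
        "home": [200, 205, 260],
        "login": [90, 91, 150],
        "reports": [300, 300, 302],
    }
    deltas = size_deltas(sizes)
    print(deltas)
    print(pages_that_spiked(deltas, build_offset=1, threshold=50))
```

If most pages appear in the spiked list for the same build, the cause is probably shared (a common script or stylesheet); if only one does, the change is likely local to that page.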

What's happened since the Google Test Automation Conference

We have been in contact with Yahoo! and have asked for our code changes to be merged in. They have had similar thoughts to us and have already implemented some changes and will add the rest. They also have implemented a YSLOW.firefox.run() method that can be used to trigger a YSlow run. This means that we can easily run it with selenium.GetEval().

So that is a quick run through of the work that was presented at GTAC. If you want to know more, the slides and videos from GTAC are embedded below (and available with the rest at GTAC.biz).

There was also a "live-waving" commentary of the presentation on Google Wave. You should be able to find it by searching "group:gtacgroup@googlegroups.com".




