Scalable, Flexible Performance Testing: Use caution when migrating to the cloud

Cloud platforms have become increasingly popular thanks to better sharing of hardware resources, and more and more services are being migrated to them. Along with the benefits, however, come some performance concerns, which we are going to look at in this blog.

Recently I ran a performance test against an online library. The test needed to log in to the site, pick a book, and emulate a user browsing through the pages of the book.

It's fairly easy to develop the script on the NetGend platform (URLs and names are obscured to keep the site anonymous):

 function VUSER() {  
      action(http,"http://www.example.com"); //this step will get a sessionID in cookie  
      a.login_email = toUrl("jsmith@example.com");  
      a.login_password = "abc123";  
      http.POSTData = combineHttpParam(a);  
      action(http,"http://www.example.com/password-login");  

      for (id = 1; id < 340; id ++) {  
           action(http,"http://www.example.com/1234567/19/11h${id}.swf");  
           println("${id},${http.totalRespTime},${http.url}");  
      }  
 }

To my surprise, the response times (defined as the time between the transmission of the HTTP request and the arrival of the last packet of the HTTP response) varied from 234ms to 1911ms. Since the HTTP response sizes for these transactions were about the same, I wondered what caused the variation in response times.

Luckily I have a friend called Wireshark, the world's most famous packet sniffer. The packet capture in Wireshark showed a range of packets with long delays between them. There were no dropped packets (hence no retransmissions), so two possibilities were left:

Delay was caused by the server.
Delay was caused by the network elements (like routers) along the path between the server and my PC.

At first glance, it appears impossible to determine which one is the real cause. Thanks to the TCP timestamp option (which is enabled by default on many systems), however, it's possible to determine where the delay happened. Why? Because the timestamp in the TCP option (carried in the options at the end of the TCP header, if present) is set by the server when it sends a TCP packet. By comparing the changes in the TCP timestamp with the changes in arrival time, we can infer whether the delay was caused by the server or by the network.
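To make the option concrete, here is a minimal Python sketch of how the timestamp could be pulled out of the raw option bytes of a TCP header. The option kind (8) and length (10) are the standard values for the TCP timestamps option; the function name `parse_tcp_timestamp` and the sample bytes are only illustrative, and the TSecr value in the sample is an assumption, not something read from the capture.

```python
import struct

TCP_OPT_EOL = 0        # end of option list
TCP_OPT_NOP = 1        # one-byte padding
TCP_OPT_TIMESTAMP = 8  # timestamps option, total length 10 bytes


def parse_tcp_timestamp(options: bytes):
    """Return (TSval, TSecr) from raw TCP option bytes, or None if absent."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == TCP_OPT_EOL:
            break
        if kind == TCP_OPT_NOP:
            i += 1
            continue
        if i + 1 >= len(options):
            break  # truncated option, no length byte
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break  # malformed option
        if kind == TCP_OPT_TIMESTAMP and length == 10:
            # Two 32-bit big-endian integers: sender's clock, echoed peer clock.
            return struct.unpack("!II", options[i + 2:i + 10])
        i += length
    return None


# Typical layout: two NOPs for alignment, then the timestamp option.
# The TSecr value here is hypothetical.
sample = bytes([1, 1, 8, 10]) + struct.pack("!II", 1483383927, 66747395)
print(parse_tcp_timestamp(sample))
```

TSval is the value the sender stamps from its own clock, which is what makes it useful for locating the delay. Wireshark decodes this field for you, so the sketch is only meant to show where the number comes from.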

Here is what I gathered from the Wireshark packet capture:

 1 0.000000000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383927  
 2 0.000375000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383927  
 3 0.000500000 192.168.5.105 38922 1.1.1.1 80 TSval 66747395  
 4 0.000675000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383927  
 5 0.035894000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383936  
 6 0.035929000 192.168.5.105 38922 1.1.1.1 80 TSval 66747404  
 7 0.188478000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383974  
 8 0.188825000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383974  
 9 0.188856000 192.168.5.105 38922 1.1.1.1 80 TSval 66747443  
 10 0.189142000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383974  
 11 0.189454000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383974  
 12 0.189479000 192.168.5.105 38922 1.1.1.1 80 TSval 66747443  
 13 0.189764000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383974

The second column is the sniffer timestamp (when the packet was captured by the sniffer), and the last column is the TCP timestamp. One challenge here is to find out how much time one unit of the TCP timestamp corresponds to.
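As a rough sketch, the rows where the server's TCP timestamp changes can be picked out with a few lines of Python. The names `Row`, `parse` and `tsval_changes` are made up for this example, and only the first seven rows of the capture are included to keep it short.

```python
from typing import NamedTuple


class Row(NamedTuple):
    number: int   # packet number
    time: float   # sniffer timestamp, in seconds
    src: str      # source IP address
    tsval: int    # TCP timestamp value set by the sender


# First seven rows of the capture listed above.
CAPTURE = """\
1 0.000000000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383927
2 0.000375000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383927
3 0.000500000 192.168.5.105 38922 1.1.1.1 80 TSval 66747395
4 0.000675000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383927
5 0.035894000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383936
6 0.035929000 192.168.5.105 38922 1.1.1.1 80 TSval 66747404
7 0.188478000 1.1.1.1 80 192.168.5.105 38922 TSval 1483383974
"""


def parse(text):
    """Turn the whitespace-separated capture lines into Row records."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip anything that is not a capture row
        rows.append(Row(int(fields[0]), float(fields[1]), fields[2], int(fields[7])))
    return rows


def tsval_changes(rows, src):
    """Pairs of consecutive packets from `src` whose TSval differs."""
    own = [r for r in rows if r.src == src]
    return [(a, b) for a, b in zip(own, own[1:]) if a.tsval != b.tsval]


for a, b in tsval_changes(parse(CAPTURE), "1.1.1.1"):
    print(f"packets {a.number} -> {b.number}: TSval +{b.tsval - a.tsval}")
```

Only packets sent by the server (1.1.1.1) are compared with each other, because the client and the server each stamp packets from their own, unrelated clocks.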

Let's take a look at the 3 packets from the server where the TCP timestamp changed (packets 4, 5 and 7):

Between packets 4 and 5, the TCP timestamp differs by 9 (from 1483383927 to 1483383936) and the sniffer timestamp differs by about 35ms (more precisely 35.2ms), so one unit of TCP timestamp is roughly 4ms.
Between packets 5 and 7, the TCP timestamp differs by 38 (from 1483383936 to 1483383974) and the sniffer timestamp differs by about 153ms; again, one unit of TCP timestamp is roughly 4ms.

So based on packets 4, 5 and 7, we can conclude that the changes in the TCP timestamp match those of the sniffer timestamp. The gaps were already present when the server sent the packets, which indicates that the network didn't delay them: the big delays between the packets were caused by the server.
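The arithmetic behind this conclusion can be sketched in Python. The three samples are copied from the capture; the 4ms tick and the two-tick tolerance in `delay_origin` are assumptions made for this illustration, and a real analysis would also have to allow for ordinary network jitter.

```python
# (packet number, sniffer time in seconds, server TSval), copied from the capture.
SAMPLES = [
    (4, 0.000675000, 1483383927),
    (5, 0.035894000, 1483383936),
    (7, 0.188478000, 1483383974),
]


def ms_per_tick(earlier, later):
    """Milliseconds of sniffer time per unit of TCP timestamp."""
    _, t0, ts0 = earlier
    _, t1, ts1 = later
    return (t1 - t0) * 1000.0 / (ts1 - ts0)


def delay_origin(arrival_gap_ms, ticks, tick_ms=4.0, tolerance_ms=8.0):
    """Say where a gap between two packets most likely came from.

    tick_ms (assumed 4ms per tick) and tolerance_ms (assumed two ticks)
    are illustrative defaults, not values taken from the server.
    """
    sender_gap_ms = ticks * tick_ms  # gap according to the sender's own clock
    if abs(arrival_gap_ms - sender_gap_ms) <= tolerance_ms:
        return "server"   # the gap already existed when the packets were sent
    return "network"      # the gap appeared somewhere along the path


for earlier, later in zip(SAMPLES, SAMPLES[1:]):
    gap_ms = (later[1] - earlier[1]) * 1000.0
    ticks = later[2] - earlier[2]
    print(
        f"packets {earlier[0]} -> {later[0]}: {gap_ms:.1f} ms, {ticks} ticks, "
        f"{ms_per_tick(earlier, later):.2f} ms/tick, origin: {delay_origin(gap_ms, ticks)}"
    )
```

Both gaps come out at close to 4ms per tick and are attributed to the server. If the network had held the packets back, the arrival gap would be much larger than the gap recorded in the server's timestamps.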

Later on, it was confirmed that the server was running on a cloud-based platform, possibly sharing hardware with some noisy, busy neighbors. While 153ms of unexpected delay may not seem like much, it can accumulate, and not all applications can tolerate it. Sharing hardware can be a double-edged sword, so consider yourself warned on your road to the cloud :-)

Tuesday, December 3, 2013
