Thursday, April 24, 2014

Test Application Monitoring Software



    Today’s blog is addressed to the infrastructure folks who have the job of making sure all of our applications and monitoring systems are functioning effectively. The NetGend load testing tool is awesome for testing applications at scale, but we will discuss another use – how do we generate synthetic yet realistic load to test monitoring, filtering and scanning systems that sit in our infrastructure?

    In today's internet-based economy, it's essential to ensure that servers are running efficiently and as expected, so it is important to check their health periodically.  Application monitoring software is a key tool in validating one’s performance and business-readiness.  Another example of non-traditional load testing involves filters and proxies, which help protect the infrastructure and enforce various policies such as application blocking and logging. (We will talk about this in more detail in an upcoming blog.)


    How do we check the checkers? Application monitoring software, just like any other software, needs to be load tested during development and after the functional testing phase.   A natural question is: how do we emulate a large number of servers while injecting a sufficient, configurable number of abnormal conditions?


   It turns out that NetGend, in addition to being a great performance testing platform, can also emulate the servers under monitoring.  More specifically, it can emulate the agents running on the servers, which send server statistics such as CPU and memory usage to the application monitoring software.

   Here is an example of a NetGend script that can be used to emulate thousands of agents.  From the standpoint of TCP/IP, each agent is simply a TCP client that repeatedly sends data to the TCP server (the monitoring system).
function VUSER () {
    connect("tcp", "monitorserver.example.com", 3000);
    isGoodServer = randNumber(0, 50);
    cpuLimit     = 100;
    memoryLimit  = 100;
    if (isGoodServer) {
        cpuLimit    = 80;
        memoryLimit = 90;
    }
    cpu    = randNumber(1, cpuLimit);
    memory = randNumber(20, memoryLimit);
    while (1) {
        cpu    = randWalk(cpu, 0, cpuLimit, 10);
        memory = randWalk(memory, 0, memoryLimit, 10);
        send("${cpu},${memory}");
        sleep(1000);
    }
}

Here is a brief explanation of the above code:
  • Line 2 sets up a TCP connection to the server.
  • Line 3 decides whether the instance is a healthy server. A result of 0 marks it as a "bad"/"unhealthy" server, whose CPU or memory usage may shoot up to 100%.
  • Lines 4-9 set the upper limits. For healthy servers, the upper limit is 80% on CPU and 90% on memory.
  • Lines 10-11 set the starting values for cpu and memory.
  • Lines 13-16 update the cpu and memory percentages, send them to the server, and pause before the next report.

   This script is so concise and efficient thanks to our function randWalk(), which gets its name from the phrase "random walk".  This function takes 4 parameters:
<currentValue>, <lowerLimit>, <upperLimit>, <step>.
The returned value can go either up or down from <currentValue> by up to <step>.
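
   For readers who want to see the idea outside of NetGend, here is a minimal Python sketch of a bounded random walk. It is not NetGend's implementation: the name rand_walk, the use of integer steps, and the clamping of the result to the two limits are all assumptions made for illustration.

```python
import random

def rand_walk(current, lower, upper, step):
    """Move `current` up or down by at most `step` (assumed: integer steps),
    then clamp the result to [lower, upper] (assumed behavior at the limits)."""
    delta = random.randint(-step, step)
    return max(lower, min(upper, current + delta))

# Emulate a few consecutive CPU readings of a "healthy" server (limit 80).
cpu = 40
readings = []
for _ in range(5):
    cpu = rand_walk(cpu, 0, 80, 10)
    readings.append(cpu)
print(readings)
```

   Because each reading starts from the previous one, consecutive values drift gradually instead of jumping around, which is what makes the emulated statistics look realistic to a monitoring system.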

   This brief example reports only CPU and memory usage, but it can easily be extended to report other statistics as well, such as disk usage, I/O, and bandwidth usage.  Also, some monitoring software may expect the agents to report statistics as HTTP/HTTPS requests or via other protocols over TCP or SSL/TLS.  Rest assured, NetGend supports all of the above transport mechanisms.

   At NetGend, we are proud of the flexibility and scalability of our platform, and we are especially happy that it can be used to test application monitoring software, a sister to application performance testing.  Also, if you are interested in how this can be used to generate load for proxies or Big Data analysis, please don’t hesitate to drop us a line: info@netgend.com.

Monday, April 21, 2014

What We Learned in Scanning for Heartbleed Vulnerable servers




    According to many press reports and analysts, the Heartbleed vulnerability is one of the most serious vulnerabilities ever discovered.  While companies with dedicated IT/security staff could act quickly to patch the vulnerability as soon as the news broke, many smaller companies' web sites remain unattended from a security standpoint, and their owners, amazingly, might not even have been aware of the vulnerability for days after the alert was issued.

    As a service to the IT community and, honestly, to test the power of our platform, NetGend set out to scan the internet to identify vulnerable sites/servers and notify the owners.  Our first step was to find a list of servers. After some research, we found that Amazon AWS has a list of the top 1 million sites in the world.   Amazon also publishes the public IP ranges that it owns.

    As mentioned in a previous blog, a simple PoC Heartbleed scanner worked on our lab servers.  The idea here is to extend the PoC to do internet-scale scanning.  This blog describes some lessons we learned.

    1.  Detecting Server-Hello-Done is trickier than first thought.  A typical Heartbleed vulnerability scanner works as follows:

  • Sends a special CLIENT_HELLO,
  • Waits for the server to send 3 messages: SERVER_HELLO, CERTIFICATE, SERVER_DONE,
  • Sends a Heart-Beat message,
  • Checks whether the server reply contains a large chunk of data (revealing server-side memory).
 The simple-minded script used in the previous blog assumed that all the server messages arrive in the same TCP data packet.  As it turned out, the real world is a lot more complicated: the CERTIFICATE message can have dramatically different sizes, and the 3 messages can be packaged in the same or in different SSL records, which in turn can span multiple TCP data packets, with record boundaries and packet boundaries not aligned.  We had to enhance the Server-Hello-Done detection routine to take care of all these cases, but fortunately doing so is relatively easy on the NetGend platform.
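
    To make the boundary problem concrete, here is a hedged Python sketch of an incremental detector. It is not the NetGend routine; the class and helper names are hypothetical. It relies only on the standard TLS framing: a 5-byte record header (type, version, 2-byte length) and a 4-byte handshake header (type, 3-byte length), with handshake type 14 meaning Server-Hello-Done.

```python
HANDSHAKE = 22          # TLS record content type for handshake data
SERVER_HELLO_DONE = 14  # handshake message type

class ServerHelloDoneDetector:
    """Accepts TCP payloads of any size and reports when Server-Hello-Done is seen."""

    def __init__(self):
        self._records = bytearray()    # bytes not yet split into records
        self._handshake = bytearray()  # handshake bytes not yet split into messages
        self.seen_types = []
        self.done = False

    def feed(self, chunk):
        self._records += chunk
        # Peel off complete records; a partial record waits for more data.
        while len(self._records) >= 5:
            rec_len = int.from_bytes(self._records[3:5], "big")
            if len(self._records) < 5 + rec_len:
                break
            if self._records[0] == HANDSHAKE:
                self._handshake += self._records[5:5 + rec_len]
            del self._records[:5 + rec_len]
        # Peel off complete handshake messages, which may span several records.
        while len(self._handshake) >= 4:
            msg_len = int.from_bytes(self._handshake[1:4], "big")
            if len(self._handshake) < 4 + msg_len:
                break
            self.seen_types.append(self._handshake[0])
            if self._handshake[0] == SERVER_HELLO_DONE:
                self.done = True
            del self._handshake[:4 + msg_len]
        return self.done

def handshake_msg(msg_type, body):
    """Test helper: wrap `body` in a handshake header."""
    return bytes([msg_type]) + len(body).to_bytes(3, "big") + body

def record(payload):
    """Test helper: wrap `payload` in a handshake record header."""
    return bytes([HANDSHAKE, 3, 2]) + len(payload).to_bytes(2, "big") + payload

# SERVER_HELLO (2), a large CERTIFICATE (11) and SERVER_DONE (14), split across
# two records at an arbitrary point and delivered in 7-byte "packets".
msgs = handshake_msg(2, bytes(70)) + handshake_msg(11, bytes(3000)) + handshake_msg(14, b"")
stream = record(msgs[:1500]) + record(msgs[1500:])
detector = ServerHelloDoneDetector()
for i in range(0, len(stream), 7):
    detector.feed(stream[i:i + 7])
print(detector.seen_types)  # [2, 11, 14]
```

    The two buffers are the key design choice: one absorbs the mismatch between packets and records, the other absorbs the mismatch between records and handshake messages.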

    2.  DNS resolution is rate limited.  Before we started scanning the top 1 million web sites, we thought that we could scan them at a rate of thousands of servers per second, which is easily supported on our test platform.  It turned out that we had to fight with DNS resolution.
    We used the Google DNS server (8.8.8.8), thinking that it's the most powerful DNS server in the world and could easily handle thousands of DNS requests per second.  In fact, this service is rate limited. In our experiment, we encountered a limit of about 100 DNS resolutions per second.  This greatly slowed down our scanning.
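
    One way to live with such a limit is to pace the lookups on the client side. The Python sketch below illustrates that idea only; it is not what our scanner does internally. The Pacer class and resolve_all function are hypothetical names, the default of 100 lookups per second simply mirrors the limit we observed, and socket.gethostbyname uses the system resolver rather than 8.8.8.8 specifically.

```python
import socket
import time

class Pacer:
    """Spaces successive calls to wait() at least 1/rate seconds apart."""

    def __init__(self, rate_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_slot = clock()

    def wait(self):
        now = self.clock()
        if now < self.next_slot:
            self.sleep(self.next_slot - now)
            now = self.next_slot
        self.next_slot = now + self.interval

def resolve_all(hostnames, rate_per_sec=100, resolver=socket.gethostbyname):
    """Resolve hostnames one by one without exceeding rate_per_sec lookups."""
    pacer = Pacer(rate_per_sec)
    results = {}
    for name in hostnames:
        pacer.wait()
        try:
            results[name] = resolver(name)
        except OSError:
            results[name] = None  # unresolved; a real scan might retry later
    return results

# Demo with a stub resolver so that no real DNS traffic is generated.
demo = resolve_all(["a.example", "b.example"], rate_per_sec=1000,
                   resolver=lambda name: "192.0.2.1")
print(demo)
```

    At 100 lookups per second, resolving 1 million names takes close to 3 hours, which is why the DNS step dominated the scan time.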

    3.  There are still a considerable number of servers not patched.  Even though many web sites in the top 1 million list were patched on the first day, over 20,000 servers had still not been patched after 2 days.  When we tried again about a day later, 25% of the unpatched servers had been patched.  A week after the alert was issued, a site that boasts 100 million subscribers was still not patched.   Another interesting finding was that, among the top 1 million web sites, at least 750K have HTTPS support.

    During our scanning effort, we were glad to see that some web sites had started monitoring attempts to exploit this vulnerability.  Our scanning triggered alarms at some of these companies and organizations.  After we explained our purpose to them, they unblocked our IP from their sites.

    To avoid triggering false alarms, we included something in the message to communicate that we had no evil intention: we set the 32-byte random field of the SSL CLIENT-HELLO message to contain the string "test done by netgend.com".
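
    As a small illustration, the Python sketch below builds such a 32-byte field. The marker string is the one quoted above; how the remaining bytes are filled is an assumption (random padding here), and the function name is hypothetical.

```python
import os

MARKER = b"test done by netgend.com"  # 24 bytes, as quoted in the post

def marked_client_random(marker=MARKER, size=32):
    """Return a `size`-byte value that starts with `marker`.
    Assumption: the leftover bytes are filled with random data."""
    if len(marker) > size:
        raise ValueError("marker does not fit in the random field")
    return marker + os.urandom(size - len(marker))

random_field = marked_client_random()
print(len(random_field), random_field[:len(MARKER)])
```

    Anyone inspecting a packet capture of the handshake can then read the marker directly in the CLIENT-HELLO bytes.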

   In addition to the top 1 million sites, we also looked at the IP ranges for Amazon Web Services.  We were able to find thousands of HTTPS servers that were vulnerable.   There were two surprises.  One is that AWS owned a whopping 7 million public IP addresses.  The other is more interesting: many of the vulnerable servers led us to the owner's main web site.  When you point your browser to such a server, either the content reveals the name of the main site or the server redirects the browser to the main site.   It turns out that, for many vulnerable servers, the server(s) for the main web site are patched. It's unclear whether the admins for the main site forgot about the other servers that they own or simply haven't had a chance to patch them yet.

   We have notified many site admins of their vulnerabilities. But, unfortunately, it's a manual process.  To speed things up, we have notified both CERT and Amazon AWS in the hope that they can carry on the task of notifying the owners/admins of the vulnerable sites.

   At NetGend, we are proud that our test platform is so powerful that it can do interesting security testing in addition to the advanced performance testing.  We are also happy that it can be put to good use - like scanning for vulnerable servers and notifying the owners.  If you are interested in learning more, please drop us a line:  info@netgend.com.


Wednesday, April 16, 2014

Averaging over JSON fields




    JSON has become very popular these days and many web services are based on JSON.  It's fairly common to see an interesting question related to JSON on Stack Overflow.  This blog looks into a typical JSON test scenario. To paraphrase the question, a server sends a JSON response in the following format.

{
"global_id": 11111,
"name": "IMG_001.JPG",
"width": "1111",
"height": "1111",
"time_taken": {
  "segment_1_time": 1,
  "segment_2_time": 1,
  "segment_3_time": 27,
  "segment_4_time": 1,
  "segment_5_time": 56,
  "segment_6_time": 8,
  "total_time": 94
 }
}
  The test platform will need to emulate HTTP clients and:
  • Do a bunch of HTTP transactions.
  • After each transaction, extract the values of the fields "segment_x_time" from the server response, where x ranges from 1 to 6.
  • Update the running total of each of the 6 fields.
  • Calculate the averages on the 6 fields by dividing the running total by the count at the end of the test.
  The net result is that there should be 6 averages, one for each of segment_1_time, segment_2_time, ... segment_6_time.
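
   The bookkeeping in the steps above can be sketched in plain Python as follows. This is only an illustration of the arithmetic, with hypothetical function names and no HTTP involved: the response bodies are passed in as strings, and a second, made-up response is used to show the averaging.

```python
import json

FIELDS = [f"segment_{i}_time" for i in range(1, 7)]

def update_totals(totals, counts, body):
    """Add one JSON response to the running totals and per-field counts."""
    taken = json.loads(body).get("time_taken", {})
    for field in FIELDS:
        if field in taken:
            totals[field] = totals.get(field, 0) + taken[field]
            counts[field] = counts.get(field, 0) + 1

def averages(totals, counts):
    """Divide each running total by its count at the end of the test."""
    return {f: totals[f] / counts[f] for f in FIELDS if counts.get(f)}

totals, counts = {}, {}
responses = [
    '{"time_taken": {"segment_1_time": 1, "segment_2_time": 1, "segment_3_time": 27,'
    ' "segment_4_time": 1, "segment_5_time": 56, "segment_6_time": 8, "total_time": 94}}',
    # Second response is invented for the demo.
    '{"time_taken": {"segment_1_time": 3, "segment_2_time": 1, "segment_3_time": 13,'
    ' "segment_4_time": 1, "segment_5_time": 44, "segment_6_time": 2, "total_time": 64}}',
]
for body in responses:
    update_totals(totals, counts, body)
print(averages(totals, counts))
```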

   The question was raised for the JMeter platform, and a few answers were proposed. One post suggested using regular expressions to extract the fields, which is error-prone and not the right approach.  Another post suggested using a JMeter plugin for JSON path, but that requires some installation work and the solution may not work for those without the plugin.

   Even after the values for the 6 fields are extracted, it is still unclear how to do a running total and how to calculate the average at the end of the test.

   It's a simple exercise on the NetGend platform. The following short script prints the 6 averages at the end of the test.

function userInit() {
    var totals = [];
    var counts = [];
}

function userEND() {
    for (i=1; i< totals.length; i++) {
        printf("average for ${i} is %d\n", totals[i] DIV counts[i]);
    }
}

function VUSER() {
    action(http,"http://www.example.com/test");
    x = fromJson(http.replyBody);
    for (i=1; i<=6; i++) {
        counts[i] ++;
        totals[i] += x.time_taken."segment_${i}_time";
    }
}

Here is a brief explanation of the script:

  • Lines 1 - 4: the userInit() function is called at the beginning of a test, before the first virtual user starts. It initializes two arrays: "totals" holds the running totals for the fields "segment_x_time", and "counts" keeps track of the counts.
  • Lines 6 - 10: the userEND() function is called at the end of the test. It calculates the averages and prints them out.
  • Line 13: does a transaction with the server to get the JSON message, which is stored in the variable http.replyBody.
  • Line 14: uses fromJson() to parse the server response and create a variable x. We can access any part of the JSON message through x; for example, to get the value of the field "global_id", we just need to use x.global_id.
  • Line 15: creates a simple loop with "i" going from 1 to 6.
  • Line 16: increments counts[i] by 1.
  • Line 17: increases totals[i] by the value of the field "segment_${i}_time".
   The solution looks simple, right?  With a small change, the script can even handle a more complex case where not all 6 segment_x_time fields are present in a server response.  The only change needed (the "if" line inside the loop) is to check whether the field is present.


function VUSER() {
    action(http,"http://www.example.com/test");
    x = fromJson(http.replyBody);
    for (i=1; i<=6; i++) {
        if (x.time_taken."segment_${i}_time" == "") { continue; }
        counts[i] ++;
        totals[i] += x.time_taken."segment_${i}_time";
    }
}


   Some of the more astute readers may ask whether it is a problem when multiple virtual users try to update the same global variable at the same time. This concern is reasonable because a thread-based application may have a race condition when accessing a global variable from different threads, so access to the data needs to be synchronized.  On the NetGend platform, however, access to global variables from different VUsers is always synchronized under the covers.  Programmers do not have to worry about these tedious concerns.
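
   For comparison, here is a hedged Python sketch of the explicit synchronization that a thread-based tool would typically need for the same bookkeeping. It says nothing about how NetGend implements this internally; the SharedTotals class and the worker function are hypothetical.

```python
import threading

class SharedTotals:
    """Running totals and counts shared by several threads, guarded by a lock."""

    def __init__(self, n):
        self._lock = threading.Lock()
        self.totals = [0] * n
        self.counts = [0] * n

    def add(self, index, value):
        # Both updates happen under the lock so they stay consistent.
        with self._lock:
            self.totals[index] += value
            self.counts[index] += 1

def worker(shared, rounds):
    """Stand-in for a virtual user: reports a value for each of 6 fields."""
    for _ in range(rounds):
        for i in range(6):
            shared.add(i, i + 1)

shared = SharedTotals(6)
threads = [threading.Thread(target=worker, args=(shared, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.counts)  # 8 threads x 1000 rounds -> 8000 per field
```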

    At NetGend, we are proud that we have the right architecture to make difficult test cases easy.  If you have complex test scenarios, you are welcome to give NetGend a try and save yourselves unnecessary headaches.