Thursday, November 14, 2013

Processing html message, from regexp kung-fu to DOM parser


    There is no doubt that regular expressions (regexps) are useful in performance testing. They can be handy when we need to extract some fields from a server response.  Most performance testing platforms support them; the question is how flexible and easy that support is.

    To get an idea, let's look at an example based on an interesting question asked on Stack Overflow.  Given a sample html message like the following, we need to extract the URLs within "h3" nodes.

 <html><head>  
     <title>A sample webpage!</title>  
   </head>  
 <body>  
 <h3 class="content-title">  
 <!-- change when this is completed -->  
   <a href="/container/recentIssue.jsp?punumber=201"> Title 1 </a>  
   <a href="/container/recentIssue.jsp?punumber=101"> Title 0.1 </a>
 </h3>  
 <a href="/container/mostRecentIssue.jsp?punumber=999">abc</a>  
 <h3 class="content-title">  
 <!-- change when this is completed -->  
   <a href="/container/mostRecentIssue.jsp?punumber=202">  
   Title 1  
   </a>                    
 </h3>  
 </body></html>  
     It's easy to come up with a regexp to extract a URL:  /href=\"([^\"]+)\"/ ;   the hard part is to do it only within an h3 node.
 
    Multiple solutions were proposed, but it was not clear whether they would skip links that are outside of h3 nodes (such as the one with value "999").   The problem may appear simple at first, but it's not trivial to solve with only one regexp.    One reason is that an h3 node may contain an unknown number of hyperlinks with a punumber.  With 2 regexps, however, the solution is straightforward: first we use a regexp to grab the h3 nodes, then we apply the simple regexp /href=\"([^\"]+)\"/ to each grabbed h3 node to extract its URLs.  I am not sure how other testing platforms support this type of operation, but it's fairly straightforward on the netgend platform.
 function VUSER() {  
      action(http,"http://www.example.com");  
      a = regexp(http.replyBody, /(?s)<h3 (.*?)<\/h3>/g);  
      i = 0;  
      while (i < getSize(a)) {  
            b = regexp(a[i], /href=\"([^\"]+)\"/g); 
            j = 0;
            while (j < getSize(b)) { 
                  println(b[j]);  
                  j ++;
            }  
            i ++;
      }  
 }  
In the above code snippet, we first use a regexp to grab all h3 nodes and assign them to the array variable "a"; then, for each element in "a", we use a second regexp to grab all the links.  Note that the "g" after the regular expression instructs the regexp engine to return all the matches.  Also note that the "(?s)" in the first regexp makes "." match newline characters as well, so the string is treated as one line and an h3 node that spans several lines can still be matched.
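
For readers who want to try the two-step idea outside netgend, here is a minimal Python sketch of the same technique. It is not the netgend API: the names SAMPLE_HTML, H3_BLOCK, HREF and links_in_h3 are made up for illustration, and re.DOTALL plays the role of "(?s)".

```python
import re

# The sample page from this post, used as test input.
SAMPLE_HTML = """<html><head>
<title>A sample webpage!</title>
</head>
<body>
<h3 class="content-title">
<!-- change when this is completed -->
<a href="/container/recentIssue.jsp?punumber=201"> Title 1 </a>
<a href="/container/recentIssue.jsp?punumber=101"> Title 0.1 </a>
</h3>
<a href="/container/mostRecentIssue.jsp?punumber=999">abc</a>
<h3 class="content-title">
<!-- change when this is completed -->
<a href="/container/mostRecentIssue.jsp?punumber=202">
Title 1
</a>
</h3>
</body></html>"""

# Step 1: grab each h3 node; DOTALL lets "." cross line breaks.
H3_BLOCK = re.compile(r"<h3\b(.*?)</h3>", re.DOTALL)
# Step 2: grab every href value inside one h3 node.
HREF = re.compile(r'href="([^"]+)"')


def links_in_h3(html):
    """Return the href values that appear inside h3 nodes only."""
    links = []
    for block in H3_BLOCK.findall(html):
        links.extend(HREF.findall(block))
    return links


for link in links_in_h3(SAMPLE_HTML):
    print(link)
```

The link with value "999" sits outside any h3 node, so the first regexp never hands it to the second one.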

    Of course, in the real world, requirements may be more complex. For example, we may need to do an HTTP transaction for each of the URLs extracted.  It's not hard on the netgend platform.  We just need to replace the "println(b[j])" with the following line:
 action(http, "http://www.example.com${b[j]}");  
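
The line above simply appends each extracted path to the site address. As a rough Python equivalent of that step (an illustration only, with BASE_URL and absolute_urls as made-up names, and without sending any request), the standard library's urljoin builds the same absolute URLs:

```python
from urllib.parse import urljoin

# Assumed base address, the same placeholder host used in this post.
BASE_URL = "http://www.example.com"


def absolute_urls(base, paths):
    """Turn the extracted relative paths into absolute URLs."""
    return [urljoin(base, path) for path in paths]


for url in absolute_urls(BASE_URL, ["/container/recentIssue.jsp?punumber=201"]):
    print(url)
```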

     Extraction of fields using regular expressions is handy, but it's not robust. For example, when the relative positions of attributes change, some regexps may give bad surprises.   For an html message, it's better to use the DOM parser function "fromHtml".   This function takes the HTML message as one argument and an XPath expression as another.  It's easier (provided that you know the XPath expression) and more robust.  Here is a solution to the same problem using the "fromHtml" function.
 function VUSER() {  
      action(http, "http://www.example.com");  
      a = fromHtml(http.replyBody, '//h3/a/@href');  
      i = 0;  
      while (i < getSize(a)) {  
            println(a[i]);  
            i ++;
      }  
 }  
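
To see the parser-based idea outside netgend, here is a small Python sketch using the standard library's ElementTree. It is not how "fromHtml" is implemented. It assumes the markup is well-formed XML, which happens to be true for a trimmed copy of the sample page but often is not for real-world HTML, where a more lenient HTML parser would be needed. ElementTree's limited XPath support cannot select "@href" directly, so the sketch selects the "a" elements and reads the attribute afterwards. The names SAMPLE_HTML and h3_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page; it is well-formed, so an XML parser accepts it.
SAMPLE_HTML = """<html><body>
<h3 class="content-title">
<a href="/container/recentIssue.jsp?punumber=201"> Title 1 </a>
<a href="/container/recentIssue.jsp?punumber=101"> Title 0.1 </a>
</h3>
<a href="/container/mostRecentIssue.jsp?punumber=999">abc</a>
<h3 class="content-title">
<a href="/container/mostRecentIssue.jsp?punumber=202"> Title 1 </a>
</h3>
</body></html>"""


def h3_links(markup):
    """Return href values of <a> elements that are direct children of <h3>."""
    root = ET.fromstring(markup)
    anchors = root.findall(".//h3/a")
    return [a.get("href") for a in anchors if a.get("href") is not None]


for link in h3_links(SAMPLE_HTML):
    print(link)
```

Because the parser works on the document tree, the result does not depend on attribute order or on line breaks inside a node.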
     Much simpler, right?  But you would need to learn a little bit of XPath.   In case you can't use XPath (for example, the message is not HTML or XML), you can still apply your regexp kung-fu on netgend, a platform that can emulate 50,000 concurrent sessions.
