Friday, January 10, 2014

Scrape content from web pages



   Scraping web content is rewarding - you get the content you need and at the same time sharpen your technical skills.  One of our earlier blogs was on downloading a list of MP3 files.  This blog is based on an interesting post by Stu on scraping the list of WordPress plugins from web pages.  Stu has created many creative scripts even with the limitations posed by LoadRunner, and this is not the first time I have been inspired by his blog!

   Things have changed a lot since Stu wrote his blog three years ago.  Here is the current structure of a plugin block on the web page.  The only fields that can be scraped are plugin name, version, last-updated date, downloads, and stars.

 <div class="plugin-block">  
      <h3><a href="http://wordpress.org/plugins/googl-url-shorter/">Goo.gl Url Shorter</a></h3>  
      This WordPress plugin can short your long url via Goo.gl       
        <ul class="plugin-meta">  
           <li><span class="info-marker">Version</span> 1.0.1</li>  
           <li><span class="info-marker">Updated</span> 2011-11-16</li>  
           <li><span class="info-marker">Downloads</span> 330</li>  
           <li>  
                <span class="info-marker left">Average Rating</span>  
                <div class="star-holder">  
                     <div class="star-rating" style="width: 0px">0 stars</div>  
                </div>  
           </li>  
        </ul>  
      <br class="clear" />  
 </div>  

    The following is a script on the NetGend platform.  It's interesting to compare it to the LoadRunner script in Stu's blog - note how short the NetGend script is.

 function userInit() {
      var page = 1;        // first page of the popular-plugins listing
      var maxPage = 2000;  // stop after this many pages
 }
 function VUSER() {
      currentPage = page++;
      if (currentPage > maxPage) { exit(0); }
      action(http, "http://wordpress.org/plugins/browse/popular/page/${currentPage}");
      // extract the plugin blocks as an array of XML snippets
      b = fromHtml(http.replyBody, '//div[@class="plugin-block"]', "node");
      for (i = 0; i < length(b); i++) {
           c = fromXml(b[i]);  // parse one block, then pick out its fields
           a = [ c.div.h3.a.value, c.div.ul.li[0].value, c.div.ul.li[1].value,
                 c.div.ul.li[2].value, c.div.ul.li[3].div.div.value ];
           println(join(a, ","));
      }
 }

    Why is the NetGend script so short?  If you compare it to the LoadRunner script, you will see that the NetGend script doesn't extract values by the left/right boundary method.  That method may be convenient, but it's error-prone.  Instead, it:

  • uses XPath (via the function "fromHtml") to extract an array of HTML blocks for the plugins, each of which is an XML message;
  • parses each XML message using the function "fromXml" and accesses its fields.
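The same two-step idea - select the blocks with XPath, then walk each block's element tree - can be sketched outside NetGend as well. Below is a rough Python equivalent using only the standard library's xml.etree.ElementTree; it parses the sample plugin block from this post (which happens to be well-formed XML) and prints the same comma-separated fields. The variable names here are mine, not part of any NetGend API.

```python
# Sketch: extract plugin fields from one plugin block using the
# standard library's ElementTree (which supports a subset of XPath).
import xml.etree.ElementTree as ET

block = """<div class="plugin-block">
  <h3><a href="http://wordpress.org/plugins/googl-url-shorter/">Goo.gl Url Shorter</a></h3>
  <ul class="plugin-meta">
    <li><span class="info-marker">Version</span> 1.0.1</li>
    <li><span class="info-marker">Updated</span> 2011-11-16</li>
    <li><span class="info-marker">Downloads</span> 330</li>
    <li>
      <span class="info-marker left">Average Rating</span>
      <div class="star-holder">
        <div class="star-rating" style="width: 0px">0 stars</div>
      </div>
    </li>
  </ul>
</div>"""

root = ET.fromstring(block)

# Plugin name: the text of the <a> inside <h3>.
name = root.find("h3/a").text

# In each <li>, the value sits after the <span> label, so it is the
# span's "tail" text in ElementTree terms.
items = root.findall("ul/li")
version   = items[0].find("span").tail.strip()
updated   = items[1].find("span").tail.strip()
downloads = items[2].find("span").tail.strip()
rating    = items[3].find('.//div[@class="star-rating"]').text

print(",".join([name, version, updated, downloads, rating]))
```

On a full page you would first collect all the plugin blocks (the counterpart of the `//div[@class="plugin-block"]` step) and run this extraction once per block.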

   It's true that the part //div[@class="plugin-block"] requires a little extra knowledge, namely XPath.  However, accessing the fields of an XML message is straightforward.  Take c.div.h3.a.value as an example: "c" is the variable name, the part "div.h3.a" follows the structure of the HTML block in an obvious manner, and you append "value" to the expression to get the value of the field.

    Note that the NetGend script is immune to cosmetic changes in the HTML tags - adding extra spaces or extra tags will not affect the script.  The same can't be said of the left/right boundary method.

    By the way, for those who truly want to extract values by left and right boundaries, the NetGend platform supports that too.

   Scraping content from a web site may or may not generate a lot of web traffic, but it definitely gives a good clue as to how flexible a performance test platform is.
