Screen Scraping in Ruby with Watir and Nokogiri


I was given an interesting challenge: scrape some data from a specific site.  Not to write a completed, packaged solution, but rather just to scrape the data.  The rub is that the site uses JavaScript paging, so one couldn’t simply use something like Mechanize.  A self-contained product would need to include a JavaScript engine such as V8 (as the JavaScript would need to be run and evaluated), but just scraping the data allows making use of whatever is easy and available.  Enter Watir.

Watir allows “mechanized/automated” browser control.  Essentially, we can script a browser to go to pages, click links, fill out forms, and what have you.  Its mainstay is testing, but it’s also pretty damned handy in cases where we need some JavaScript on a page processed… like in this case.  Keep in mind, though, that it is literally automating a browser, so you’ll see your browser open and navigate to pages, etc. when the script runs.  There is also a headless browser option.  This is nowhere near as fast as just sucking down a page via a socket, but it’s a quick and easy solution to code, and not prohibitively slow unless you’re dealing with thousands of pages (in which case one might want to include V8 in the solution, etc.).

The site in question displays info in a master-detail format:  a Master Page might hold, say, 20 results, each result linking to a Detail Page.  So let’s say there are 20 Master Pages, each listing 20 Detail Pages.  That’s… 420 pages that need to be processed (first the 20 Master Pages, then each of the 20 * 20 = 400 Detail Pages)!  No way we’re doing that by hand.

I’m just going to pull down the Master Pages for this.  One can cherry-pick info (scrape) from each Detail Page using Nokogiri.  And so we’re all on the same page, I use OS X.  I have no interest in, or use for, Windows.  Bless your heart if that’s what you use; you may need to do some tweaking to make things work, I don’t know.  So, here are the broad strokes of the strategy:

  • Write a script for Watir to pull down the HTML for each Master Page
  • Save said Master Pages to disk for Nokogiri to scrape for us
  • Have Nokogiri generate the URLs of the Detail Pages

This would be followed by having Nokogiri pull in the 400 detail pages, and then scrape them for data, saving it to a database.  I’m not going that far here, as what I am covering can easily be extended.

NOTE: watir-webdriver is necessary for Firefox and Chrome browser control.  If you’re a sick puppy that’s into IE (not standards-compliant; better on this than they ever were, but still not there yet, and they do whatever the frak they want because they’re “Microsoft”), I believe the straight watir gem handles that… head to the Watir site, and follow the directions to get it installed…

(as a hint:  gem install watir and/or gem install watir-webdriver).


So, the code for the page-slurping script (slurp.rb):


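    require 'watir-webdriver'
    browser = Watir::Browser.start "http://Foobar.com"  # open a browser on the first page
    for i in 1..20
        l = browser.link :text => "#{i}"                 # the paging link whose text is the page number
        raise "no link for page #{i}" unless l.exists?   # bail out if the link is missing
        l.click
        File.open("page_#{i}.html", "w") { |f| f.puts browser.html }  # save the rendered HTML
        sleep 2                                          # be polite to the server
    end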
    

The above code is in the file slurp.rb, in its own folder.  Let’s call the folder “Scrape” for illustration.  Each HTML file will be saved within this same folder.

Line by line, this is what’s going on:

Obviously, first one must require the gem (line 1).

Instantiate a Watir browser, and give it a page to open (line 2).

We know going in that there are 20 Master Pages, and the site uses the standard “paging UI”: <1 2 3> etc.  So, a simple for loop, iterating from page 1 to page 20, is the most basic way to attack that (line 3).

We want to “click” each page link, from 1 to 20.  We’ve already looked at the source HTML, and know that each page’s link has the text of its number, i.e. <a>1</a>, <a>2</a>, etc., so we tell Watir to find the link with text “#{i}”.  “i” will be 1 on the first iteration of the for loop, then 2, and so forth.  We’re all Rubyists here, right?  In case not, the quoted portion, “#{i}”, is the syntax needed to have Ruby interpolate the value of “i” into a string, so Ruby will insert 1 on the first iteration, then 2, and so forth (line 4).
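
To make the interpolation concrete, here’s a tiny stand-alone sketch. The helper name page_link_texts is made up for illustration and is not part of Watir; it just builds the same strings the loop hands over as the link text.

```ruby
# Hypothetical helper (not part of Watir): build the link text for each page.
def page_link_texts(last_page)
  (1..last_page).map { |i| "#{i}" }   # "#{i}" interpolates the integer i into a String
end

puts page_link_texts(3).inspect   # prints ["1", "2", "3"]
puts "#{7}" == 7.to_s             # prints true: interpolation calls to_s for us
```

In other words, "#{i}" and i.to_s give the same string here; the interpolated form is just the one used in slurp.rb.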

Next, just a sanity check that this link is in fact on the page (line 5).

Since it is, let’s go ahead and click it (line 6).

Now we save the HTML from this new page into a file with the corresponding number, i.e., page_1.html for the first page, page_2.html for the second, and so on.  We get the HTML from the page via Watir’s html method (line 7).

Finally, we don’t want to hammer the server, and we want to make sure the file operation has a moment, so we pause for 2 seconds (line 8).

After the 2-second pause, we start the next iteration of the for loop, and continue on for 19 more iterations.
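
One way to make the loop easier to rerun (and to try out without a real browser) is to wrap it in a method that takes the browser as an argument. This is only a sketch under a few assumptions: the names slurp_pages, out_dir and pause are mine, not Watir’s; I’ve put the pause between the click and the save, on the guess that the paging JavaScript may need a moment to redraw before the HTML is worth saving; and a page number with no link is skipped rather than treated as fatal. The only calls made on the browser are the ones already used in slurp.rb (link, exists?, click, html).

```ruby
# Hypothetical wrapper around the slurp.rb loop. `browser` is anything that
# responds to #link and #html the way a Watir::Browser does.
def slurp_pages(browser, pages, out_dir: ".", pause: 2)
  saved = []
  pages.each do |i|
    link = browser.link(:text => i.to_s)
    next unless link.exists?          # assumption: skip a missing link instead of crashing
    link.click
    sleep pause                       # let the page redraw, and go easy on the server
    path = File.join(out_dir, "page_#{i}.html")
    File.open(path, "w") { |f| f.puts browser.html }
    saved << path
  end
  saved                               # paths of the files actually written
end

# Usage with a real browser (assumes the watir-webdriver gem is installed):
#   require 'watir-webdriver'
#   browser = Watir::Browser.start "http://Foobar.com"
#   slurp_pages(browser, 1..20)
#   browser.close
```

Because the browser is passed in, the same method can be exercised with a stand-in object while you’re getting the file handling right, and only later pointed at the live site.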

When it’s all done, we’ve got a bunch of newly created HTML files, one for each of the Master Pages. Now we need to use Nokogiri to pull out the links for the Detail Pages…

Screen Scraping 101, condensed

There was a time when one had to go through the source HTML by hand, and divine the DOM path to each morsel of data desired. Those were the bad old days. Now we’ve got great tools to find these paths for us, either via CSS or XPath. I’m going to use Firefox as an example, but Chrome, Safari, and Opera all have similar tools. I suppose IE might have them, but why use diseased meat when you can have filet mignon?

Within the browser’s web tools, or Firebug, use “select element” on the data you want from the page. Now right-click the highlighted HTML in the web tools’ window, and you should be given an option along the lines of copying the unique selector, copying the CSS Path, copying the XPath, or some other likely candidate. You’ll know it when you see it, for whatever tool you’re using.

Select it, open your editor, and paste with [command]-v on OS X (I think it’s [control]-v on Windoze). What you’ve just pasted is either the CSS Path or the XPath to the unique data you’re looking for. This is what Nokogiri will use to locate the data you want.

As a side note: the paths returned are kitchen-sink-level paths; they are way more than what’s needed to isolate the data you want. Additionally, I found that the XPath returned in, I believe it was Chrome, didn’t start correctly for an XPath, and I needed to add a forward slash. Your mileage will vary. I’m not getting deep into XPath here; there are plenty of tutorials on the web. If for some reason Nokogiri isn’t isolating your data, and you’re using XPath, you will need to tweak the path. Even at that, it’s still light-years better than having to go through all the HTML by hand and figure out the path from scratch.

If instead you opted for a CSS Path, you still might need to massage it a bit. The nice thing about CSS Paths is you can easily simplify them. If the element has an ID (and you know it’s unique to the page, because you did a find and confirmed that), you can just use that ID. Regardless, you are almost guaranteed not to need the full path copied from the browser’s tools. Take just what you need, and cut the rest.

So, let’s do a couple quick examples:

Suppose I want the first 4 digits of the text from a deeply nested span tag, and I have an XPath to it.  The following will display it using Nokogiri (where doc has been assigned the parsed Nokogiri document; we’ll get to that):



    puts doc.xpath('//div[6]/div[3]/section/div/div[1]/span/span').text[0..3]
    

That puts will display those first 4 digits.

Now, suppose that I had 4 pieces of data, all located in span tags, nested in a div that itself is nested in a couple of other divs with IDs applied.  Using the developer tools in the browser, it’s a simple matter of selecting the first piece of data, then, within the web tools’ listing of HTML that has focused on that item, right-clicking it and selecting copy CSS Path (or the equivalent thereof). Then move into the editor, paste it, cut out any unnecessary cruft, and apply the following code (where doc has been assigned the parsed Nokogiri document, which we’ll get to):



    for i in 0..3
        puts doc.css('div#leftcolumn div#topleft div span')[i].text
    end


This will yield the text within the first 4 spans.

That brings us to grabbing the payload from the Master Pages we’ve downloaded. Let’s just grab the URLs for a single page. The principle is the same for all of them, as the markup will be the same.

Let’s suppose that for the Detail Pages, it’s a straightforward HTML link, no JavaScript magic required. And the link whose href we need has a CSS class applied to it, and it sits within a div that also has a CSS class applied. Further, these are unique to the data we need. Here is the complete code to pull out these URLs:



    require 'nokogiri'

    # Parse the saved Master Page from disk
    doc = Nokogiri::HTML(File.read("page_1.html"))

    # Print the href of every detail link inside div.leftside
    doc.css('div.leftside a.info').each { |link| puts link['href'] }


That’s it!  Every one of the 20 URLs we need for the detail pages will be printed out. It would be trivial to make this short script handle all 20 page_x.html pages.

Happy Coding
