Your First Extractor


OK, so you have read the getting started section, you know what we are getting into, and you are ready to start on something tangible.


I recommend using Firefox, simply because we are going to use a Firefox extension, XPather or Firebug, which can conveniently find XPaths and objects on the page. Other browsers work as well (scRUBYt! does not depend on any browser), but then it is up to you to figure out the equivalents of the steps I describe here. In the following, I assume that you are using Firefox and XPather.


First, make sure that you have the 'DOM Inspector' in your Firefox installation. DOM Inspector allows you to explore the parsed document object model (DOM) of any page. You can get details on each HTML element, attribute, and text node. To check whether you have it installed, press Ctrl + Shift + I or choose Tools -> DOM Inspector from the menu. If you don't have it, you will need to install DOM Inspector support into Firefox. You can find more information on this at the grease monkey DOM inspector guide. Once you have DOM Inspector, install XPather before you move on.

Tip: on Ubuntu, you can install it separately (if missing) with apt-get install firefox-dom-inspector. Thanks to Jason Evans for the tip.

I will show you how to scrape the first 10 result URLs from Google for the search term 'ruby'. When writing an extractor, you will almost always need a browser (O.K., technically you need just a text editor, but to know what to enter into the editor you will need a browser), so launch Firefox. If you use something else, reproduce the Firefox-specific steps in your browser of choice.


Open an empty file, and require the packages needed:

require 'rubygems'
require 'scrubyt'

Next, we define the extractor:

require 'rubygems'
require 'scrubyt'
 
google_data = Scrubyt::Extractor.define do
end

Note that we called the extractor google_data because, after it runs, it will return data from Google. This is just a convention, but it is usually wise to name your extractors after the things they extract: bigger scenarios have more extractors, and good names help avoid confusion. Next, let's fetch the Google start page!

require 'rubygems'
require 'scrubyt'
 
google_data = Scrubyt::Extractor.define do
  fetch          'http://www.google.com/ncr'
end

Remember that you must specify the protocol (http://) for the fetch action; otherwise it will think you want to fetch a file. If you wonder what '/ncr' is good for: it stops Google from localizing itself. For example, if you are in Spain and just type 'http://www.google.com' into your browser, you will be automatically redirected to google.es. Since we would like to avoid this, we use '/ncr'.
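
To make the protocol point concrete, here is a minimal sketch using Ruby's standard uri library. The helper name remote_target? is hypothetical and is not part of scRUBYt!; it only shows how a target with an explicit http:// scheme can be told apart from something that looks like a plain path. How scRUBYt! itself makes this decision is not shown here.

```ruby
require 'uri'

# Hypothetical helper (an assumption for illustration, not scRUBYt! API):
# true only when the target parses as an http or https URL.
def remote_target?(target)
  URI.parse(target).is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
rescue URI::InvalidURIError
  false
end

puts remote_target?('http://www.google.com/ncr') # has a scheme
puts remote_target?('www.google.com/ncr')        # no scheme, parsed as a bare path
```

Without the scheme, the string parses as a generic URI consisting only of a path, which is why it is easy to mistake for a local file name.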


Let's simulate entering the search term 'ruby' into Google's search textfield. To do this, we need to know the name of the textfield. Point your browser to http://www.google.com/ncr and open the DOM Inspector (Ctrl + Shift + I, or Tools -> DOM Inspector). If you managed to install XPather correctly, you should see its icon in the top-left corner of the DOM Inspector.

Click on it. Now click on the search text field in the Google page (it should blink red if you did it right). Now go back to XPather. If you did everything right, on the right side of XPather you should see something like this (actually, it is not there any longer):


What we need from here is the name of the input field, which is q. (This was not the reason we installed XPather, though it was the easiest way I know of to find the textfield name; there will be far more complex tasks where XPather will come in handy.)
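
If you would rather look the name up from code than from the browser, the same idea can be sketched with an XPath query and REXML. The form markup below is a simplified, hypothetical stand-in (the real Google page is different and is not well-formed XML); only the field name q comes from the steps above.

```ruby
require 'rexml/document'

# Simplified, hypothetical stand-in for a search form (an assumption);
# the real Google markup differs.
html = <<~HTML
  <form action="/search">
    <input type="hidden" name="hl" value="en"/>
    <input type="text" name="q"/>
    <input type="submit" name="btnG" value="Search"/>
  </form>
HTML

doc = REXML::Document.new(html)

# Select every text input and collect its name attribute.
text_field_names = REXML::XPath.match(doc, "//input[@type='text']")
                               .map { |input| input.attributes['name'] }

puts text_field_names.inspect
```

The XPath expression //input[@type='text'] is the kind of path a tool like XPather helps you find interactively.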

Now that we know the name of the textfield, we can continue with the extractor:

require 'rubygems'
require 'scrubyt'
 
google_data = Scrubyt::Extractor.define do
  fetch          'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
 
end
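
The navigation steps read as plain method calls inside a block. As a rough illustration of that block-based style only, here is a toy class that records the steps it is given. ToyExtractor is hypothetical and says nothing about how scRUBYt! is implemented; it fetches nothing and submits nothing.

```ruby
# Toy stand-in (an assumption for illustration), NOT scRUBYt!'s implementation:
# it only records the navigation steps in the order they are called.
class ToyExtractor
  attr_reader :steps

  def self.define(&block)
    extractor = new
    extractor.instance_eval(&block) # run the block with the extractor as self
    extractor
  end

  def initialize
    @steps = []
  end

  def fetch(url)
    @steps << [:fetch, url]
  end

  def fill_textfield(name, value)
    @steps << [:fill_textfield, name, value]
  end

  def submit
    @steps << [:submit]
  end
end

toy_data = ToyExtractor.define do
  fetch          'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
end

toy_data.steps.each { |step| puts step.inspect }
```

The point is only that the order of calls in the block is the order of the navigation: fetch, fill in the field, submit.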

The navigation is ready! Let's see the scraping part!

In the browser, type 'ruby' into the search text field and submit it (just as you normally would). We want to scrape the results, like 'Ruby Programming Language' or 'Ruby Home Page - What's Ruby' etc. We create a pattern which will do this for us. On the result page, copy the text of the first result ('Ruby Programming Language') and paste it into the extractor:

require 'rubygems'
require 'scrubyt'
 
google_data = Scrubyt::Extractor.define do
  fetch          'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
        
  result 'Ruby Programming Language'
end

Believe it or not, we are done here! Of course, we would like to inspect the results, so we need to add two more lines to do that:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch          'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
        
  result 'Ruby Programming Language'
end
 
google_data.to_xml.write($stdout, 1)
Scrubyt::ResultDumper.print_statistics(google_data)
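
The write($stdout, 1) call has the shape of REXML's Document#write(output, indent). Assuming to_xml returns a REXML document (an assumption based on that call shape), the following sketch shows what the two arguments do on a hand-built document. The element names and the single result are placeholders, not real extractor output.

```ruby
require 'rexml/document'

# Hand-built placeholder document (an assumption for illustration);
# the element names and the single result are not real extractor output.
doc = REXML::Document.new
root = doc.add_element('root')
root.add_element('result').add_text('Ruby Programming Language')

# Same call shape as in the extractor script: output target, then indentation.
doc.write($stdout, 1)
puts

# A String buffer also works as a target, which is handy for
# saving or post-processing the XML instead of printing it.
buffer = String.new
doc.write(buffer, 1)
```

Passing a different first argument, such as an open File, writes the same XML somewhere other than the terminal.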
