Using Different Types of Examples




scRUBYt! works from examples specified by the user. Every pattern has one or more examples that specify what it should extract.
You can specify different types of examples, not just strings found on the page, although that is what you will use most of the time, so it is the default example type.
There are two basic types of patterns: tree and string. A tree pattern evaluates to an HTML region (an HTML element that can have child elements, and so on: an HTML tree, hence its name). A string pattern evaluates to a string.


Do not confuse the pattern type with the example type. There are just two types of patterns, but six types of examples. Four of them create a tree pattern and two of them a string pattern.
The example types so far are (the pattern type is shown in parentheses):

    1. String from the page (tree) - You have mostly seen this in action, so when you think of an 'example', this is probably the type that springs to mind. However, specifying this type is by far the trickiest, and a handful of things have to be kept in mind - see *Example Specification for this.


Though this example type is the most commonly used, the other ones can also come in handy:

    1. XPath (tree) - If you would like to extract something with your own XPath rather than leave it to the system to figure it out from a page example, you can use an XPath example. An XPath example should always begin with a slash ('/'). There is no need to specify the example type explicitly here - scRUBYt! will figure it out automatically. In an exported extractor, all 'string from the page' examples are replaced with XPath examples, so check out any exported extractor (which originally contained at least one 'string from the page' example) to see a concrete example.

    1. Attribute (string) - Extracts an attribute of the parent pattern. If the parent pattern is a tree pattern, you can extract attributes from it. Let's revisit the Google example, this time also extracting the URLs of the result pages:
require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do

  # Perform the action(s)
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  # Construct the wrapper
  link "Ruby Programming Language" do
    # Extract the href attribute of each matched link
    url "href", :type => :attribute
  end

end

google_data.to_xml.write($stdout, 1)


This is again fairly straightforward: we have instructed scRUBYt! that the 'url' pattern should extract the href attribute of the 'link' pattern.
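To make "extracting an attribute of the parent" concrete outside of scRUBYt!, here is a minimal sketch using REXML from Ruby's standard distribution. The markup fragment and the helper name attribute_of are hypothetical, chosen only for illustration; this is not how scRUBYt! implements attribute examples internally.

```ruby
require 'rexml/document'

# Hypothetical markup standing in for one matched 'link' element.
html = '<a href="http://www.ruby-lang.org/">Ruby Programming Language</a>'

# Hypothetical helper (not a scRUBYt! API): reads one attribute from the
# root element of a small, well-formed fragment.
def attribute_of(fragment, name)
  REXML::Document.new(fragment).root.attributes[name]
end

# The 'url' pattern in the extractor asks for the equivalent of this value.
puts attribute_of(html, 'href')
```

The parent (tree) pattern supplies the element, and the attribute example only names which attribute to read from it, which is why the result is a string pattern.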

    1. Image (tree) - If you want to scrape an image tag (perhaps to further extract its dimensions, alternate text, or another attribute by specifying an attribute example), just specify its src attribute as the example. You can easily get this from your browser: right-click the image and choose 'copy image location' or similar. This works if the page does not use relative URLs for its images; if it does, you had better look the value up in the page source. Unless something goes really awry, you do not need to specify the type explicitly - scRUBYt! will figure it out based on the image extension. If your image has some very exotic extension (or no extension at all), use :type => :image.

To see an image pattern in action, check out the us1camera example in the official example set of 0.2.0!
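The guessing rules mentioned so far (an XPath example begins with a slash, an image example is recognized by its extension, and an explicit :type overrides the guess) can be sketched in plain Ruby as follows. This is only an illustration under stated assumptions: guess_example_type and the extension list are hypothetical and do not reflect scRUBYt!'s actual implementation.

```ruby
# Hypothetical sketch of the guessing rules; NOT scRUBYt!'s real code.
# The list of image extensions is an assumption made for this example.
IMAGE_EXTENSIONS = %w[.jpg .jpeg .gif .png .bmp].freeze

def guess_example_type(example, explicit_type = nil)
  return explicit_type if explicit_type       # an explicit :type always wins
  return :xpath if example.start_with?('/')   # XPath examples begin with a slash

  # Look at the extension, ignoring any query string.
  extension = File.extname(example.split('?').first.to_s).downcase
  return :image if IMAGE_EXTENSIONS.include?(extension)

  :string                                     # default: string from the page
end

puts guess_example_type('//div[@id="main"]/a')
puts guess_example_type('http://example.com/images/camera.jpg')
puts guess_example_type('Ruby Programming Language')
puts guess_example_type('http://example.com/photo', :image)
```

Note that in this sketch a root-relative src such as /img/logo.png would be taken for an XPath, so passing the type explicitly is the safer choice for such values.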

    1. Regular expression (string) - Takes the parent pattern's textual content and runs scan() on it with the regular expression provided. Again, let's see an example:
table do
  row :generalize => true do
    cell '1, 2, 3', :generalize => true do
      numbers /\d+/
    end
  end
end


:generalize will be covered later (for the curious, it means: extract all the rows of a table, not just the one defined by the example). The interesting part here is the 'numbers' pattern. For the input '1, 2, 3' it will generate output like this:


<cell>
  <numbers>1</numbers>
  <numbers>2</numbers>
  <numbers>3</numbers>
</cell>
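The reason one input string yields several results is the behaviour of Ruby's built-in String#scan, which returns every match rather than just the first. A minimal plain-Ruby sketch follows; the helper name scan_parent_text is hypothetical and is not part of scRUBYt!.

```ruby
# Hypothetical helper (not a scRUBYt! API): returns every match of
# `regexp` in the parent pattern's text, exactly as String#scan does.
def scan_parent_text(parent_text, regexp)
  parent_text.scan(regexp)
end

matches = scan_parent_text('1, 2, 3', /\d+/)

# Each match corresponds to one <numbers> element in the output.
matches.each { |match| puts "<numbers>#{match}</numbers>" }
```

One caveat that comes from String#scan itself: if the regular expression contains capture groups, scan returns arrays of the captured groups instead of whole matches, so a pattern without groups, like /\d+/ here, is the simplest choice.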


    1. No example (tree) - In this case the system knows that the pattern is used for grouping its child patterns together (for example, an address groups together street, number, city, ZIP, country, etc., and a book has a title, author, and ISBN) and generates the rule based on the child patterns.

I guess this tutorial was a bit heavy, but don't worry if not everything was clear - it will fall into place once you start creating extractors, and you can always refer back here.