Example Specification

Example Specification from the Page - Known Issues and Pitfalls

At the moment, when providing examples from the page, you must specify the whole text content of a node. This occasionally causes trouble in the present version. Most of the time it does not, so treat the points below as hints and tips for when something is not working out:


    1. The rendered element and its actual source code may be quite different: the text of the element in the page source may be split between 2 elements and still look like 1 element on the page, may be formatted with a lot of whitespace which the browser renders differently, or may be mixed with other elements, images and so on. If scRUBYt! stops with a 'FATAL: Node for example #{text} Not found!' message, and you think that your text is there, use XPather to find out what's happening: click on the node, then observe its text content.

Let's see an example, this time from Amazon:

It looks as if the text content of the element next to 'You save' were “$17.00 (34%)”. That is not really so. Check it in XPather by clicking on the node and observing its text content.

As you can see, there is a lot of additional whitespace which is invisible on the page. Currently it fools the system, because the text in the source code differs from what is rendered, as we have seen in XPather.
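
The mismatch can be reproduced in plain Ruby, outside scRUBYt!. The sketch below is only an illustration: the `raw_text` value is a made-up stand-in for what XPather might show (the actual whitespace on the Amazon page is not reproduced here), and `normalize` is a hypothetical helper, not part of scRUBYt!.

```ruby
# Hypothetical text content of the node as it might appear in the page source.
# The exact whitespace is an assumption made for illustration.
raw_text = "\n      $17.00\n      (34%)\n    "

# What the browser appears to show, and what one is tempted to use as the example.
rendered_text = "$17.00 (34%)"

# Hypothetical helper: collapse runs of whitespace and trim both ends,
# roughly what a browser does when rendering text.
def normalize(text)
  text.gsub(/\s+/, " ").strip
end

puts raw_text == rendered_text             # false: the exact comparison fails
puts normalize(raw_text) == rendered_text  # true: equal once whitespace is collapsed
```

The exact comparison is the one that matters here, which is why an example copied from the rendered page can fail to match even though the text looks identical.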

In future versions, problems like this will be solved by making it possible to specify an example with XPaths, through containment and other rules (see the conclusion at the end).


    2. If you are grouping several examples together (for example an item name and a price into a web shop item), you have to be sure that all the examples are first occurrences on the page. One example is worth a thousand words.

Let's say you would like to extract the title of the article and the number of diggs, so the pattern structure will look something like:

article do
   title
   diggs
end 

So, you can specify these example pairs:

    1. Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1 as an example of 'title' and 22 diggs as an example of 'diggs' - because '22 diggs' is the first occurrence of that string on the page
    2. Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails as an example of 'title' and 35 diggs as an example of 'diggs' - because '35 diggs' is the only occurrence of that string on the page

but not

    1. Get more data comparison options in MySQL with operators you may not know as an example of 'title' and 22 diggs as an example of 'diggs' - because '22 diggs' is NOT the first occurrence of that string on the page

So, be sure to choose the first occurrence of the string on the page as an example!
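
The rule can be checked mechanically. The following is a minimal sketch in plain Ruby, not scRUBYt! code: `TextNode`, `PAGE`, `first_owner` and `usable_group?` are hypothetical names, and the order of the articles in `PAGE` is an assumption chosen to agree with the pairs listed above.

```ruby
# Hypothetical model of the page: text nodes in document order, each tagged
# with the index of the article it belongs to. The order is an assumption.
TextNode = Struct.new(:article, :text)

PAGE = [
  TextNode.new(0, "Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1"),
  TextNode.new(0, "22 diggs"),
  TextNode.new(1, "Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails"),
  TextNode.new(1, "35 diggs"),
  TextNode.new(2, "Get more data comparison options in MySQL with operators you may not know"),
  TextNode.new(2, "22 diggs")
]

# Article that owns the FIRST node whose text equals the example, or nil.
def first_owner(nodes, example)
  node = nodes.find { |n| n.text == example }
  node && node.article
end

# A group of examples is usable only if every example is found and all of
# their first occurrences fall inside the same article.
def usable_group?(nodes, examples)
  owners = examples.map { |e| first_owner(nodes, e) }
  !owners.include?(nil) && owners.uniq.size == 1
end

puts usable_group?(PAGE, [PAGE[0].text, "22 diggs"])  # true
puts usable_group?(PAGE, [PAGE[2].text, "35 diggs"])  # true
puts usable_group?(PAGE, [PAGE[4].text, "22 diggs"])  # false: the first '22 diggs' belongs to another article
```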

Tip: always make sure that all the examples exist at the very moment when you are launching the learning extractor!

I could mention the digg example again: I was constructing an extractor on digg, and I could not get it to work for all the tea in China. After some minutes of banging my head against the wall, I noticed that the problem was rather mundane (i.e. not hidden in the deep pit of metaprogramming logic or a similarly cool place): I had specified the number of diggs and launched the extractor, but meanwhile somebody dugg the article, so the digg count example was no longer valid. It was very similar with eBay: the price examples were 'corrupted' there very fast because of bidding.
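
A quick sanity check before launching can catch this kind of stale example. The sketch below is plain Ruby under stated assumptions: `missing_examples` is a hypothetical helper (not part of scRUBYt!), and `page_text` is a made-up stand-in for the text of a freshly fetched page in which the count has already moved on.

```ruby
# Hypothetical helper: return the examples that do not occur in the page text.
def missing_examples(page_text, examples)
  examples.reject { |example| page_text.include?(example) }
end

# Stand-in for the text of a freshly fetched page; the content is made up.
page_text = "Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1\n23 diggs"

examples = [
  "Data Extraction for Web 2.0: Screen Scraping in Ruby/Rails, Episode 1",
  "22 diggs"
]

stale = missing_examples(page_text, examples)
puts "Stale examples: #{stale.inspect}" unless stale.empty?
# prints: Stale examples: ["22 diggs"]
```

This is only a substring check; it does not verify that the example is the whole text content of a node, so it can catch examples that have disappeared but not every mismatch.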

You can work around these cases in the following ways:

    1. taking a snapshot of the page (by saving it from your browser, or, if the page is really Ajaxy and all that jazz, by trying a Firefox plugin like this one)
    2. picking an example from a page that is not likely to change (e.g. on digg, go to the 10th page and choose an example there; on eBay, choose an item which has a lot of time to go)


    3. Mixed content: I have received a problematic case from a scRUBYt! user. The problem was that the example content was mixed with an image:

TODO:: Snapshots to come…

Possible solution: an additional possibility of specifying the example in a more sophisticated way, for example:

:begins_with => 'Landed', :ends_with => 'Successfully'


Conclusion: In the future, I would like to beef up example selection with multiple rules instead of a single string. I can imagine something like:

airport_record [:begins_with => 'Landed', :ends_with => 'Successfully', :matches_regexp => /\w+\d/]
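
To make the idea concrete, here is one way such rules could be evaluated against a candidate node's text. This is a sketch in plain Ruby, not existing scRUBYt! API: `matches_rules?` is a hypothetical helper, the rule names simply mirror the proposed syntax, and the sample strings are invented.

```ruby
# Hypothetical rule checker: every rule given must hold for the text.
def matches_rules?(text, rules)
  rules.all? do |rule, value|
    case rule
    when :begins_with    then text.start_with?(value)
    when :ends_with      then text.end_with?(value)
    when :matches_regexp then !(text =~ value).nil?
    else raise ArgumentError, "unknown rule: #{rule.inspect}"
    end
  end
end

rules = { :begins_with => 'Landed', :ends_with => 'Successfully', :matches_regexp => /\w+\d/ }

puts matches_rules?("Landed at gate B12 Successfully", rules)  # true
puts matches_rules?("Departed from gate B12", rules)           # false
```

Because only the beginning, the end and a pattern are tested, whatever sits in the middle of the node would no longer have to be spelled out in the example.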


Tip: observe, observe, observe everything with XPather, and if you are still sure there is a problem, file a bug report or add an enhancement request at scRUBYt!’s RubyForge tracker.