Introduction to Constraints




The two most typical problems with a set of rules are that they match either fewer or more instances than you would like.
Constraints remedy the second problem: they filter out unwanted result instances based on additional rules.


Constraints are nearly always applied after you run the extractor, observe the result and find that it extracted more results than needed. Let me demonstrate this with an example.
Let's construct a simple extractor which scrapes the name and price of every camera:


camera_data = Scrubyt::Extractor.define do
  fetch File.join(File.dirname(__FILE__), "input.html")

  item do
    item_name "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT"
    price "$179.00"
  end
end


That was easy - or was it? Let's examine the output:

  <root>
    <item>
      <item_name></item_name>
    </item>
    <item>
      <item_name></item_name>
    </item>
    <item>
      <item_name>Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT</item_name>
      <price>$179.00</price>
    </item>
    <item>
      <item_name>Canon Vertical Battery Grip BG-E4 For EOS 5D</item_name>
      <price>$249.00</price>
    </item>
    <!-- ...
         22 items omitted
    ... -->
    <item>
      <item_name>Canon EOS Digital Rebel XT Body (Black) - EOS 350D</item_name>
      <price>$696.00</price>
    </item>
    <item>
      <item_name>Shopping Cart</item_name>
    </item>
    <item>
      <item_name></item_name>
    </item>
    <item>
      <item_name></item_name>
    </item>
    <item>
      <item_name></item_name>
    </item>
    <item>
      <item_name></item_name>
    </item>
    <item>
      <item_name></item_name>
    </item>
  </root>


Well, something is not quite right here: there are 25 records on the page that we are interested in, but unfortunately other objects on the page match the same XPath as our records.
Those were extracted as well and are littering our otherwise nice output. How do we remove them?


As mentioned previously, if an extractor returns too many results, we have to apply constraints: additional rules that tell the system which results should be thrown out.
There are currently only a few constraints implemented in scRUBYt!, but even among these there are two types that can solve our problem.
First let's check the easier one, ensure_presence_of_pattern (this is the one you would use in such cases anyway; we will look into the other one for illustration purposes).
By observing the output, it is clear that we need only those items which have a price. The ensure_presence_of_pattern constraint, as its name says, does exactly this: it returns only those elements which have a specific child pattern.
Let's see the modified example:


camera_data = Scrubyt::Extractor.define do
  fetch File.join(File.dirname(__FILE__), "input.html")

  item do
    item_name "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT"
    price "$179.00"
  end.ensure_presence_of_pattern 'price'
end


If you observe the result now, you can see that it extracts only the correct items.
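
To make the effect concrete, the filtering can be pictured in plain Ruby. This is only a sketch of the idea, not scRUBYt!'s actual implementation: the sample records and the keep_with_child helper are hypothetical, and each result is modeled as a hash that maps child pattern names to the extracted values.

```ruby
# Hypothetical model of the extracted results: one hash per <item>,
# keyed by the child patterns that actually produced a value.
records = [
  { "item_name" => "" },
  { "item_name" => "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT",
    "price" => "$179.00" },
  { "item_name" => "Canon Vertical Battery Grip BG-E4 For EOS 5D",
    "price" => "$249.00" },
  { "item_name" => "Shopping Cart" }
]

# Hypothetical helper: keep only the records in which the named
# child pattern is present.
def keep_with_child(records, child_name)
  records.select { |record| record.key?(child_name) }
end

# Only the two records that have a price survive the filter.
keep_with_child(records, "price").each do |record|
  puts "#{record['item_name']}: #{record['price']}"
end
```

The empty items and the 'Shopping Cart' entry are dropped simply because they have no 'price' child, which is the behaviour described above.
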
The second type of constraint is called ensure_presence_of_ancestor_node. Its purpose is to accept only those results which have an ancestor HTML node with a given name and set of attributes.
For example, if a pattern extracts a <tr>, and you add an ensure_presence_of_ancestor_node constraint to it with the values :td (the name of the ancestor HTML node)
and 'colspan' => '3' (the attribute which has to be present), only those table rows will be returned which contain a <td> ancestor with the attribute 'colspan', where the value of 'colspan' is '3'.
This may sound complicated at first, but it is a very easy concept once you get used to it.


So how do we apply this constraint to our case?
We can observe from the statistics that the prices were extracted correctly. So let's check that HTML element in XPather.
Open XPather and click e.g. the first price ($179.00) on the page.


You can observe that the price is inside a <span> element which has an attribute 'class' with the value 'searchProductPrice'.
Therefore add an ensure_presence_of_ancestor_node constraint to the pattern 'item':

camera_data = Scrubyt::Extractor.define do
  fetch File.join(File.dirname(__FILE__), "input.html")

  item do
    item_name "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT"
    price "$179.00"
  end.ensure_presence_of_ancestor_node :span, 'class' => 'searchProductPrice'
end


Since the 'price' pattern is an ancestor of the 'item' pattern, their HTML input chunks have to be in the same relation (i.e. the HTML input of 'price' is an ancestor of 'item'). So we can tell 'item' that we need only those result instances which have a 'price'; translated to HTML inputs, this means that only those results should be extracted which have a <span> ancestor with the attribute name 'class' and the attribute value 'searchProductPrice'.
ensure_presence_of_ancestor_node has a negative counterpart, ensure_absence_of_ancestor_node, which rejects (rather than accepts) results with such properties.


The last type of constraint currently supported is ensure_presence_of_attribute (which again has an ensure_absence version).
Its parameters are “attribute_name” and “attribute_value”. If this type of constraint is added to a pattern, the HTML node it targets must have an attribute named “attribute_name” with the value “attribute_value”.
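
The attribute check and its negative counterpart can be sketched in plain Ruby as well. Again, this is an illustration under assumptions rather than scRUBYt!'s own code: HtmlNode, attribute_present?, ensure_presence and ensure_absence are hypothetical names, and an HTML node is reduced to a tag name plus a hash of attributes.

```ruby
# Hypothetical stand-in for an HTML node: a tag name plus its attributes.
HtmlNode = Struct.new(:name, :attributes)

# True when the node carries the attribute with exactly the given value.
def attribute_present?(node, attribute_name, attribute_value)
  node.attributes[attribute_name] == attribute_value
end

# The presence version keeps the matching nodes ...
def ensure_presence(nodes, attribute_name, attribute_value)
  nodes.select { |node| attribute_present?(node, attribute_name, attribute_value) }
end

# ... and the absence version keeps everything else.
def ensure_absence(nodes, attribute_name, attribute_value)
  nodes.reject { |node| attribute_present?(node, attribute_name, attribute_value) }
end

nodes = [
  HtmlNode.new("span", { "class" => "searchProductPrice" }),
  HtmlNode.new("span", { "class" => "header" }),
  HtmlNode.new("span", {})
]

puts ensure_presence(nodes, "class", "searchProductPrice").size
puts ensure_absence(nodes, "class", "searchProductPrice").size
```

In this model the two versions always split the results into two disjoint groups: every node ends up in exactly one of them.
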


As a means of rejecting unneeded results, the concept of constraints is quite powerful. However, many more constraints will have to be implemented to really leverage its power.