Reference

This document aims to cover the entire API provided by Scrubyt 0.3.

Navigation

Before you extract data, you need to find it. You always start with a call to fetch(), then you can optionally navigate from page to page until you find the data that needs to be extracted. Here's a simple example:

require 'rubygems'
require 'scrubyt'

extractor = Scrubyt::Extractor.define do
 fetch 'http://www.google.com/ncr'
 fill_textfield 'q', 'ruby'
 submit
end
extractor.to_text.write($stdout, 1)

TODO: why is this script illegal? I just want it to print the final page, but Scrubyt says "[ERROR] No extractor defined, exiting..."

Probably because it has no root pattern (see Root Pattern below). Try adding this line after submit:

 text 'Ruby Programming Language'

fetch(url)

Fetches the given URL or file path. Accepts any number of options after the url:

  • :proxy => "HOST:PORT" -- the proxy to use, given as a host:port string.
  • :mechanize_doc => TODO
  • :resolve => ?? -- :full, :host or a default string that gets prepended to the doc_url. TODO: what do :full and :host mean?
  • :basic_auth => [login, password] -- the login and password for basic HTTP authentication.
  • :user_agent => STRING -- the user agent to send. By default it uses a very generic string like "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)"

Examples:
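A minimal sketch of how the documented options might be combined. The proxy address, credentials and user-agent string are placeholder values, not Scrubyt defaults, and the hash is built in plain Ruby so it can be inspected without fetching anything.

```ruby
# Placeholder values for the documented fetch options (all hypothetical).
FETCH_OPTIONS = {
  :proxy      => 'proxy.example.com:8080', # "HOST:PORT" string
  :basic_auth => ['login', 'password'],    # [login, password] pair
  :user_agent => 'MyScraper/0.1'           # replaces the generic default string
}

# The proxy option is a single string, so host and port come apart with split.
PROXY_HOST, PROXY_PORT = FETCH_OPTIONS[:proxy].split(':')
puts "proxy host=#{PROXY_HOST} port=#{PROXY_PORT}"

# Inside an extractor body the options would follow the URL, for example:
#   fetch 'http://www.example.com/', FETCH_OPTIONS
```

The commented-out fetch line follows the "options after the url" form described above; it has not been run against Scrubyt here.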

click_link(text, index=0)

Clicks the link that contains the given text. Do not supply embedded tags, so if the link is "<a>go here</a>", you would call click_link("go here").

An index in square brackets selects among several matches: "Books[2]" would click on the second link with Books in the title. TODO: this is a problem because then how do I specify a link whose text contains square brackets? How do I turn this feature off? And why not just use the index parameter?

You can also supply a compound example: click_link({:begins_with => 'to', :contains => /\d+/}) would find "<a>to 999 zebra</a>". The compound may contain any number of the keys :contains, :begins_with and :ends_with, and each value can be either a regexp or a string.
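The compound matching rules can be modelled in plain Ruby. This is only a sketch of the semantics described above, not Scrubyt's own code: compound_match? is a hypothetical helper, and it assumes that every key in the compound must hold for the link text to match.

```ruby
# Returns true when +text+ satisfies every key of the compound example.
# Values may be Strings or Regexps; regexp flags such as /i are ignored here.
def compound_match?(text, compound)
  compound.all? do |key, value|
    # Turn plain strings into literal regexp sources so both types share one path.
    source = value.is_a?(Regexp) ? value.source : Regexp.escape(value)
    pattern =
      case key
      when :begins_with then /\A(?:#{source})/
      when :contains    then /(?:#{source})/
      when :ends_with   then /(?:#{source})\z/
      else raise ArgumentError, "unknown key: #{key.inspect}"
      end
    pattern.match?(text)
  end
end

puts compound_match?('to 999 zebra', :begins_with => 'to', :contains => /\d+/)
puts compound_match?('to the zoo',   :begins_with => 'to', :contains => /\d+/)
```

The first call prints true because the text starts with "to" and contains digits; the second prints false because no digits are present.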

click_image_map(index)

Clicks the image map on the page. TODO: there's no way to say where to click it, right?

Forms

The functions in this section allow you to fill out and submit forms as if a user were doing it. AJAX-driven forms might require the use of Firewatir (support is in development), but static forms are trivial.

fill_textfield(name, string)

Fills the named text field with the given string.

fill_textarea(name, text)

Fills the named textarea with the supplied text.

select_option(list_name, option)

Selects an item from a drop-down list.

check_checkbox(name)

Checks the named checkbox. TODO: is there any way to uncheck it? Or does it toggle? If it toggles, then how do I set it to a known value?

check_radiobutton(name, index=0)

Selects the indexed radio button in the named group.

submit()

Submits a form.

  • submit() -- submits the form containing the last item that was edited
  • submit(form) -- submits the given form
  • submit(form,button) -- submits the given form using the given button
  • submit(form,button,type) -- TODO??

end

Reserved but not used? TODO: it looks like including 'end' will cause a method missing error? Check this.

todo

TODO: how do I handle cookies?

Extraction

In the extractor body, you specify what data you want copied to the results by specifying one or more patterns.

 require 'rubygems'
 require 'scrubyt'
 
 extractor = Scrubyt::Extractor.define do
   fetch "http://lwn.net"      # navigate to the desired page
   headlines "Headlines for"   # select results with a pattern
 end

Root Pattern

Any symbol that is not recognized as a navigation call (see Navigation above) is interpreted as a root pattern. You can only have one root pattern per extraction. In the example above, 'headlines' is the root pattern. Patterns nest, so the root pattern can contain as many child patterns as you want in the result set.

 record do
   item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
   price '$71.99'
 end

Assuming there's an item with that name on the first page, this will produce a record for each similar item, such as:

   <record>
     <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
     <price>$149.95</price>
   </record>

If no result nodes match, you can specify the content using :default:

 price "$71.99", :default => "$0.00"

TODO: is this right??

TODO: what on earth does :generalize => false do?

Text Example

This is the default type of example. It is only used for learning extractors. When a learning extractor is converted into a production extractor, text examples are expanded to the XPath of the matching text on the page.

 record do
   item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
   price '$71.99'
 end

XPath Example

You can specify an XPath yourself instead of having Scrubyt figure it out for you.

 user_count "/html/body/table/tr/td/table/tr[1]/td[2]"   # sets 'user_count' to everything matched by this expression

Just make sure to start the string with "/" and Scrubyt will automatically detect that you're using an XPath. In a production extractor, this will be converted to:

 user_count("/html/body/table/tr/td/table/tr[1]/td[2]", { :generalize => true })

Attribute Example

This type of example selects the value of a particular attribute of a node.

 url "href", :type => :attribute

In a production extractor, this will be converted to:

 url("href", { :type => :attribute })

Note that any attribute example can also be specified in an XPath. The above example is equivalent to:

 url "/@href"

Regular Expression Example

This example selects the result of applying a regular expression to a text value.

 numbers /\d+/

In a production extractor, this will be converted to:

 numbers(/\d+/, { :type => :regexp })
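In plain Ruby terms, applying such a pattern to a text value looks like the sketch below. The sample string is made up, and this page does not say whether Scrubyt keeps the first match or all of them, so both are shown.

```ruby
SAMPLE_TEXT = 'Order 66 shipped 3 items' # made-up text value

FIRST_MATCH = SAMPLE_TEXT[/\d+/]      # first match only
ALL_MATCHES = SAMPLE_TEXT.scan(/\d+/) # every non-overlapping match

puts FIRST_MATCH
puts ALL_MATCHES.inspect
```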

HTML Subtree

The :html_subtree type outputs the HTML code which is the result of the parent pattern. This is in contrast to other example types, which will only pass the text content (stripping out tags) to any string pattern children.

 html :type => :html_subtree do
 end

Shortcut Patterns

This is a short-hand way of specifying patterns with default behavior. For instance, detail_url is equivalent to "detail_url 'href', :type => :attribute".

You can always override default values in shortcut patterns.

Right now there's only one shortcut pattern, which matches any href attribute:

  • name_url -- name_url 'href', :type => :attribute

TODO: is name_detail (:type => :detail_page) also a shortcut pattern? What is it? How is it used?


Other

select_indices

Allows you to only store a subset of everything that was matched.

  • article_title("xkcd: Commitment").select_indices([:first, :every_third])

You can pass:

  • A range of indices: 3..5
  • An array of indices: [3,4,7,12]
  • Keywords: :first, :last, :all_but_last, :all_but_first, :every_even, :every_odd, :every_second, :every_third, :every_fourth TODO: why is there no :all?
  • Any combination of the above. For example, :first, :all_but_first would return everything.
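The selector forms listed above can be modelled on an ordinary array. This is a plain Ruby sketch, not Scrubyt's implementation: pick_by_indices and indices_for are hypothetical helpers, indices are assumed to be zero-based, and the :every_* keywords are left out because this page does not say where their counting starts.

```ruby
# Maps one selector (Range, Integer or keyword) to a list of array positions.
def indices_for(selector, size)
  case selector
  when Range          then selector.to_a
  when Integer        then [selector]
  when :first         then size.zero? ? [] : [0]
  when :last          then size.zero? ? [] : [size - 1]
  when :all_but_first then (1...size).to_a
  when :all_but_last  then (0...(size - 1)).to_a
  else raise ArgumentError, "not modelled: #{selector.inspect}"
  end
end

# Combines any number of selectors and returns the chosen results in order.
def pick_by_indices(results, *selectors)
  wanted = selectors.flatten.flat_map { |s| indices_for(s, results.size) }
  # Drop duplicates and out-of-range positions, keep the original order.
  wanted.uniq.sort.select { |i| i >= 0 && i < results.size }.map { |i| results[i] }
end

titles = %w[a b c d e f]
puts pick_by_indices(titles, 3..5).inspect
puts pick_by_indices(titles, :first, :all_but_first).inspect
```

The second call illustrates the remark above that :first combined with :all_but_first selects everything.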

Constraints

If your extractor is producing too much data, you could pare the results down with constraints.

 item do
   item_name "Canon Vertical Battery Grip BG-E3 For EOS Digital Rebel XT"
 end.ensure_presence_of_pattern 'price'

  • ensure_presence_of_pattern(tag) -- the given pattern must be somewhere in the node's ancestry.

TODO: what pattern? what's allowed here? How is this different from ensure_presence_of_ancestor_node?
TODO: Can you specify a regular expression as a constraint?
TODO: How do you specify that a tag have certain value? For example, suppose you have table with the following rows:

<tr>
 <td>cat 1</td>
</tr>
<tr>
 <td>cat 2</td>
</tr>
<tr>
  <td>hamster</td>
</tr>

How do you select only those rows with a cat in it?

  • ensure_presence_of_attribute() -- ensure_presence_of_attribute("attr", "value"); value is optional. If value is not specified, any value matches.
  • ensure_absence_of_attribute()
  • ensure_presence_of_ancestor_node() -- ensure_presence_of_ancestor_node :span, 'class' => 'searchProductPrice'
  • ensure_absence_of_ancestor_node()
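The attribute constraints amount to a filter over the matched nodes. The sketch below models that in plain Ruby under stated assumptions: nodes are represented as hashes with an :attrs entry, and has_attribute?, keep_with_attribute and drop_with_attribute are hypothetical names, not Scrubyt calls.

```ruby
# True when the node carries the attribute; if +value+ is given it must match too.
def has_attribute?(node, attr, value = nil)
  return false unless node[:attrs].key?(attr)
  value.nil? || node[:attrs][attr] == value
end

# Models a "presence" constraint: keep only nodes that have the attribute.
def keep_with_attribute(nodes, attr, value = nil)
  nodes.select { |node| has_attribute?(node, attr, value) }
end

# Models an "absence" constraint: drop nodes that have the attribute.
def drop_with_attribute(nodes, attr, value = nil)
  nodes.reject { |node| has_attribute?(node, attr, value) }
end

NODES = [
  { :tag => 'span', :attrs => { 'class' => 'searchProductPrice' } },
  { :tag => 'span', :attrs => { 'class' => 'title' } },
  { :tag => 'td',   :attrs => {} }
]

puts keep_with_attribute(NODES, 'class').size
puts keep_with_attribute(NODES, 'class', 'title').size
puts drop_with_attribute(NODES, 'class').size
```

Leaving the value out matches any value, which mirrors the optional second argument described in the list above.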

Constraints can be stacked. For example:

 symbols_table "//table[@width='100%']" do
   symbol_row "//tr" do
     symbol "//td[1]" do
     end

     total_value "//td[5]" do
     end
   end.ensure_presence_of_pattern('price').select_indices(:all_but_last)
 end

The select_indices constraint is stacked on the ensure_presence_of_pattern constraint.

Learning

First, you write a quick skeleton script that locates data using sample text found on the page. For instance, this

   article_title("xkcd: Commitment").select_indices([:first,:every_third])

searches for an element containing that string, then stores all elements similar to that one in the results.

There's no need for this step if you locate the information manually using Firebug.

Production

In theory, to turn a learning script into a production script, you just print the results like this:

   extractor.export(__FILE__)

Unfortunately, all I get is an "integer 46950979137200 too big to convert to int" error from RubyInlineAcceleration and a big fat stack trace.

TODO: what is the difference between a learning script and a production script?

next_page

If you place this call after your root pattern, you can repeat the pattern and collect information from a sequence of pages. As long as the Next button is always named the same, Scrubyt will keep clicking Next and pulling data from the page until TODO: (what? other than :limit, how do I stop this?)

 next_page 'Next >', :limit => 5

TODO: This is the only code that uses :limit? It's also the only call that can appear after the root pattern. I guess it just keeps clicking on the named link until the page won't load or it hits the limit? TODO: Is there any other magic to next_page? I see something in the code about patterns...
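The guess above (follow the link until it is missing or the limit is reached) can be modelled in plain Ruby. Everything here is an assumption for illustration: the PAGES table stands in for a real site, crawl is a hypothetical helper, and :limit is treated as a maximum number of pages visited, which this page does not confirm.

```ruby
# Hypothetical site: each page has some items and, optionally, a "next" link.
PAGES = {
  '/p1' => { :items => ['a', 'b'], :next => '/p2' },
  '/p2' => { :items => ['c'],      :next => '/p3' },
  '/p3' => { :items => ['d', 'e'], :next => nil }
}

# Collects items page by page, stopping when there is no next link
# or when +limit+ pages have been visited (nil means no limit).
def crawl(pages, start, limit = nil)
  items = []
  url = start
  visited = 0
  while url && (limit.nil? || visited < limit)
    page = pages.fetch(url)
    items.concat(page[:items])
    visited += 1
    url = page[:next]
  end
  items
end

puts crawl(PAGES, '/p1').inspect
puts crawl(PAGES, '/p1', 2).inspect
```

Without a limit the loop ends on the page that has no next link; with a limit of 2 it stops after the second page even though a next link exists.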

TODO: "I collected a bunch of links from a page, and I was wondering if it was possible, in the same extractor, to fetch each of these links and extract info on the resulting page." Is there any way to do this other than next_page?

Results

This is how you print the results of your extraction. TODO: all of these calls accept a pattern. What's it for?

  • to_xml -- returns the results in xml format
  • to_text -- returns the results as plain text
  • to_csv -- returns the results in csv format
  • to_hash -- returns the results as nested Ruby hashes.

Utility Functions

  • print_statistics -- prints some simple statistics like the count of hits by each pattern.
  • remove_empty_leaves -- removes empty leaves before printing the results.

   extractor.remove_empty_leaves.to_xml

TODO: is this right??

More examples:

  • extractor.to_xml.write($stdout, 1) -- prints the result of the scraping as XML
  • amazon_books.to_xml.write(open('result.xml', 'w'), 1) -- prints to a file
  • puts amazon_books.book[1].title[0].to_text -- finds individual items in the result set (BROKEN IN 0.3?)
  • extractor.to_hash -- books = amazon_books.to_hash; puts books[1][:title]

TODO

Scrubyt automatically recognizes and searches inside frames. Very impressive.

TODO: what does the :write_text option do?

TODO: what is this? "Temp patterns are skipped in the output (their ancestors are appended to the parent of the pattern which was skipped)". Something to do with :output_type=>:model (default) vs :output_type=>:temp.

TODO: what the heck is a :detail_page?
