Navigating to the page of interest
The goal of this step is to navigate to and download the input document(s) that the following steps operate on. scRUBYt! has to be pointed to the document which contains the data before it can perform the actual extraction.
scRUBYt! comes with a navigation module built on Mechanize. However, you won't really see Mechanize when defining an extractor, since everything is hidden behind a DSL. Let's look at the commands currently available (in scRUBYt! terminology they are called actions, so let's stick with that from now on):
fetch
Fetches a file from the file system or a URL from the Web.
Every extractor (and thus every navigation) must start with a fetch action (there is not much to click on or scrape if nothing is loaded).
If you are loading a file, make sure that you either supply the full path or that the path can be resolved from the directory where the extractor is executed. If you are fetching from a URL, you need to specify the protocol (http://).
Let's see some examples!
fetch "http://digg.com/"
fetch File.join(File.dirname(__FILE__), "input.html")
fetch "test_record.html"
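These examples do not require much explanation: the first fetches a page from the Web, the second loads a file located next to the extractor script, and the third loads a file relative to the directory where the extractor is executed.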
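To make the file-versus-URL rule concrete, here is a minimal sketch of how a fetch target could be classified. The helper `classify_fetch_target` and the directory `/tmp/extractors` are hypothetical and only illustrate the rule stated above; this is not how scRUBYt! itself decides.

```ruby
# Hypothetical helper, not part of scRUBYt!: shows how a fetch target
# would be interpreted under the rule "URLs need a protocol".
def classify_fetch_target(target, base_dir = Dir.pwd)
  if target =~ %r{\A[a-z][a-z0-9+.\-]*://}i
    # A protocol prefix such as http:// marks the target as a URL.
    { :kind => :url, :location => target }
  else
    # Anything else is treated as a path; relative paths are resolved
    # against the directory the extractor is executed from.
    { :kind => :file, :location => File.expand_path(target, base_dir) }
  end
end

p classify_fetch_target("http://digg.com/")
p classify_fetch_target("test_record.html", "/tmp/extractors")
p classify_fetch_target("digg.com") # no protocol, so it is taken for a file path
```

The last call shows why the protocol matters in this sketch: without it, the string is indistinguishable from a relative file name.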
click_link
Clicks a link (i.e. loads the page it points to), specified by the text of the link. Let's see some examples:
# This is from the Amazon scenario. Once we have navigated to the search result page, narrow the results further by clicking on the link in the left navigation sidebar.
click_link "Books"
# This one is from the eBay example. After searching for 'iPod', we are not redirected to an actual result page yet - we have to specify that we are looking for an 'Apple iPod'.
click_link "Apple iPod"
Again, I don't think there is much to add here: you call click_link with the text of the link, and as a result of this action you navigate to the page the link points to.
fill_textfield(name, query)
Finds the text field with the specified name and enters the query string into it.
# From the Google scenario: searching for the term 'ruby'
fill_textfield 'q', 'ruby'
submit
Submits the current form, i.e. the one that was edited most recently. scRUBYt! automatically figures out which form the text field or other widget you last interacted with belongs to. However, it is your responsibility to submit a form once you are done with it, and only then begin editing a different one. If you edit several forms and submit only once, only the most recently edited form gets submitted.
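The "last edited form wins" rule can be illustrated with a toy model. The class `FormTracker`, the form names `search` and `login`, and the field `user` are all hypothetical; this is a sketch of the behaviour described above, not scRUBYt!'s internals.

```ruby
# Toy model (hypothetical, not scRUBYt! code) of the rule that submit
# sends only the form that was edited most recently.
class FormTracker
  attr_reader :submitted

  # fields_by_form maps a form name to the names of its text fields.
  def initialize(fields_by_form)
    @fields_by_form = fields_by_form
    @values = Hash.new { |hash, form| hash[form] = {} }
    @last_edited = nil
    @submitted = []
  end

  # Find the form that owns the field, store the value, remember the form.
  def fill_textfield(name, query)
    form, _fields = @fields_by_form.find { |_form, fields| fields.include?(name) }
    raise ArgumentError, "no text field named #{name.inspect}" if form.nil?
    @values[form][name] = query
    @last_edited = form
  end

  # Submit only the most recently edited form.
  def submit
    raise "no form has been edited yet" if @last_edited.nil?
    @submitted << [@last_edited, @values[@last_edited].dup]
  end
end

page = FormTracker.new("search" => ["q"], "login" => ["user"])
page.fill_textfield "q", "ruby"
page.fill_textfield "user", "alice"
page.submit # only "login" is submitted; the edit to "search" is never sent
p page.submitted
```

To have both forms submitted in this model, submit would have to be called after each fill_textfield, which mirrors the advice above.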
Once navigation is finished, the active document is passed to the scraping module automatically - you don't have to do anything, just begin defining the scraper.
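To see how the actions above fit together as a sequence, here is a small stand-in that only records navigation actions written in DSL style. `NavigationRecorder` is hypothetical and performs no network access; it uses instance_eval so the actions can be written as bare words, and it enforces the rule that navigation starts with fetch. The URL is an assumption chosen for illustration.

```ruby
# Hypothetical stand-in for a navigation DSL; it only records actions
# and is not how scRUBYt! is implemented.
class NavigationRecorder
  attr_reader :actions

  # Evaluate the block with the recorder as self, so actions can be
  # written as bare method calls.
  def self.define(&block)
    recorder = new
    recorder.instance_eval(&block)
    recorder
  end

  def initialize
    @actions = []
  end

  def fetch(target)
    record(:fetch, target)
  end

  def click_link(text)
    record(:click_link, text)
  end

  def fill_textfield(name, query)
    record(:fill_textfield, name, query)
  end

  def submit
    record(:submit)
  end

  private

  # Enforce the rule that navigation has to begin with a fetch.
  def record(name, *args)
    if @actions.empty? && name != :fetch
      raise ArgumentError, "navigation must start with fetch, got #{name}"
    end
    @actions << [name, *args]
  end
end

navigation = NavigationRecorder.define do
  fetch "http://www.google.com/" # assumed URL for the Google scenario
  fill_textfield "q", "ruby"
  submit
end
navigation.actions.each { |action| p action }
```

In a real extractor, the scraping definitions would follow the last navigation action; the recorder stops where navigation ends.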
