Scrapin' Google in no sec
Google is back on stage! I have dusted off one of the best-known examples in scRUBYt!'s history and given it a brand new coat. Here it is, hot off the frying pan. If you are a scRUBYt! newbie, this is a perfect place to start 'scrubbing' around!
Example of:
- filling and submitting a textfield
- extracting and using an href attribute
- recursively crawling to the next page(s)
Goal: Go to google.com. Enter 'ruby' into the search textfield and submit the form. Extract the URLs of the results on the first 2 pages.
Solution: Use the navigational commands 'fetch', 'fill_textfield' and 'submit' to navigate to the page of interest. There, extract the links with the pattern 'link'. The URLs are extracted with the 'link' pattern's child pattern, 'url', which is an attribute pattern. On its own this extracts only the first 10 results, but you need the first 20. To get them, use the 'next_page' idiom with ':limit' set to 2.
Check out the code:
require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  # Perform the action(s)
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit

  # Construct the wrapper
  link "Ruby Programming Language" do
    url "href", :type => :attribute
  end
  next_page "Next", :limit => 2
end

puts google_data.to_xml
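If the 'next_page' idiom with ':limit' still feels a bit magical, here is a minimal plain-Ruby sketch of the idea: collect the links from a page, follow the "next" pointer, and stop after a given number of pages. This is not how scRUBYt! implements it internally; the PAGES hash, the example.com URLs and the crawl method are all made up for illustration, and nothing here touches the network.

```ruby
# Hypothetical in-memory "site": every page holds some result URLs
# and the key of the next page (nil on the last one).
PAGES = {
  'page1' => { links: ['http://example.com/1', 'http://example.com/2'], next: 'page2' },
  'page2' => { links: ['http://example.com/3'], next: 'page3' },
  'page3' => { links: ['http://example.com/4'], next: nil }
}

# Collect links from at most `limit` pages, starting at `start`
# and following each page's :next pointer.
def crawl(pages, start, limit)
  urls = []
  current = start
  visited = 0
  while current && visited < limit
    page = pages.fetch(current)
    urls.concat(page[:links])
    current = page[:next]
    visited += 1
  end
  urls
end

urls = crawl(PAGES, 'page1', 2)
puts urls
```

With a limit of 2 only 'page1' and 'page2' are visited, which mirrors the goal above: two result pages, then stop.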
Figure: http://scrubyt.org/wp-content/uploads/2007/08/google_visualize.png
What's 'q' supposed to mean?
'q' is the name of the Google search box. How did I figure it out? Well, you could search the page source back and forth to find it, but using a tool like XPather or Firebug is far easier!
XPather?!? Firebug?!? …never heard of them
If you are not familiar with XPather/Firebug, check out either this quick kick-off tutorial, recommended for smaller appetites, or the full user guide (just to tease your taste buds). As an extra topping, here is a yummy cheat sheet.
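For the curious, "searching the source code back and forth" can also be done with a few lines of Ruby. The sketch below scans a small, made-up HTML form for the name attribute of every input tag. The markup is a hypothetical stand-in, not Google's real page source, and a regex like this is only good enough for a quick look, not for serious HTML parsing.

```ruby
# Hypothetical form markup, loosely modelled on a search page.
html = <<~HTML
  <form action="/search">
    <input type="hidden" name="hl" value="en">
    <input type="text" name="q" title="Search">
    <input type="submit" name="btnG" value="Google Search">
  </form>
HTML

# Grab the name attribute of every <input> tag.
input_names = html.scan(/<input\b[^>]*\bname="([^"]+)"/).flatten

puts input_names.inspect
```

The text field is the one you would pass to 'fill_textfield'; in this sample it is named 'q'.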
Install XPather or Firebug:
XPather Install
Firebug Install and the cheat sheet.
Still confused? This visual step-by-step 'how-to' may help:
This tool is actually very handy to use (that is one of the reasons the team recommends it). Give it a try and you'll get the hang of it. Here is a little demonstration (let's assume you have DOM Inspector in your Firefox):
- go to www.google.com.
- launch the DOM Inspector (you can find it under the Tools menu in any Mozilla window).
- click on the first icon in the top row (see snapshot)
- now click on the Google search box (if you did it right, a red frame should appear around the box and blink several times)
- go back to the DOM Inspector and enjoy the result
- go back to the DOM Inspector and enjoy the result
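What the inspector shows you at the end of these steps is the input node itself, with its attributes. As a rough sketch of the same lookup done in code, the snippet below uses an XPath expression with REXML (which ships with Ruby) on a tiny, well-formed sample form. The markup and the XPath are assumptions made for illustration; real pages are rarely valid XML, so treat this as a demonstration of the idea rather than a way to parse Google.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup (not Google's real source).
html = <<~HTML
  <form action="/search">
    <input type="hidden" name="hl" value="en"/>
    <input type="text" name="q" title="Search"/>
    <input type="submit" name="btnG" value="Google Search"/>
  </form>
HTML

doc = REXML::Document.new(html)

# Select the first text input, much like clicking the search box
# in the inspector, then read its name attribute.
search_box = REXML::XPath.first(doc, "//input[@type='text']")
field_name = search_box.attributes['name']

puts field_name
```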
