Playing around with the output





So we have extracted something. Great. Now what can we do with it?


The easiest and fastest way to inspect the result is to dump it to the screen as XML, using the to_xml method. Suppose your extractor is named amazon_books; then this can be done with:

amazon_books.to_xml.write($stdout, 1)

Alternatively, if you would like to dump the result to a file, you can use

File.open('result.xml', 'w') { |f| amazon_books.to_xml.write(f, 1) }


(The block form of File.open closes the file automatically once the block finishes, so no explicit close is needed.)
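Both dumps can be tried without running a scraper. The sketch below assumes, as the write(output, indent) call suggests, that to_xml returns a REXML::Document. The document built here is a hand-made stand-in for amazon_books.to_xml; its element names and values are hypothetical.

```ruby
require 'rexml/document'
require 'stringio'
require 'tmpdir'

# Hand-made stand-in for amazon_books.to_xml (hypothetical data).
doc = REXML::Document.new
root = doc.add_element('root')
book = root.add_element('book')
book.add_element('title').text = 'Example Title'
book.add_element('price').text = '$10.00'

# Same call as in the text, but into an in-memory buffer instead of $stdout.
buffer = StringIO.new
doc.write(buffer, 1)
xml_text = buffer.string

# File variant: the block form closes the file when the block ends.
path = File.join(Dir.tmpdir, 'result.xml')
File.open(path, 'w') { |f| doc.write(f, 1) }
file_text = File.read(path)

puts xml_text
```

The second argument to write is the indentation level; with 1, nested elements are pretty-printed one space deeper than their parent.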

Further, you can get statistics on how many instances of each pattern were extracted. The statistics can be dumped to the screen with

Scrubyt::ResultDumper.print_statistics(amazon_books)

and they look like this:

   book extracted 32 instances.
       title extracted 32 instances.
       price extracted 28 instances.


This data can help a lot when things are not working out.
First of all, you can see that the extractor scraped 32 books. Are there really 32 books on the page? If so, great. If not, you know the extractor is not quite right yet, and you can fine-tune it to find the instances it missed (or to exclude extra ones that should not be there).
Furthermore, although there are 32 books, only 28 of them have a price. Is this OK? In this case yes: 4 of them were unavailable from Amazon at the time, so no price was shown. It could just as well have been the case that a price was there and the extractor failed to find it.
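The same check can be done mechanically on the dumped XML. This is only a minimal sketch using REXML; the XML below is a hand-written stand-in for a real result, and the element names root, book, title and price are assumptions chosen to mirror the pattern names in the statistics.

```ruby
require 'rexml/document'

# Hypothetical stand-in for a dumped result: three books, one without a price.
xml = <<~XML
  <root>
    <book><title>First</title><price>$10.00</price></book>
    <book><title>Second</title></book>
    <book><title>Third</title><price>$7.50</price></book>
  </root>
XML

doc = REXML::Document.new(xml)
books = REXML::XPath.match(doc, '//book')

# Count instances per pattern, in the spirit of the statistics listing.
counts = {
  'book'  => books.size,
  'title' => REXML::XPath.match(doc, '//book/title').size,
  'price' => REXML::XPath.match(doc, '//book/price').size
}

# Collect the titles of books that have no price, so they can be checked by hand.
unpriced = books.select { |b| b.elements['price'].nil? }
                .map { |b| b.elements['title'].text }

counts.each { |name, n| puts "#{name} extracted #{n} instances." }
puts "books without a price: #{unpriced.join(', ')}"
```

Listing the titles of the unpriced books makes it quick to look them up on the page and decide whether the price is really absent or the extractor missed it.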


Let's see a different method of accessing results.

puts amazon_books.book[1].title[0].to_text
# In the future this will be slightly improved so you can write
#   puts amazon_books.book[1].title

If more than one document is involved (the extractor crawled to further pages), you can select the document by index like this:

puts amazon_books[1].book[1].title[0].to_text

That is, amazon_books itself is indexed too, and the index means 'give me the results from the n-th page'.


Output is one of the weakest parts of scRUBYt! at the moment. In the future there will be many more output methods and functions, and many different formats will be supported (txt, csv, Ruby Struct, ruport, atom, rss, DB, html, …). However, I believe that there is more than enough to explore in the other areas until then :-).