Object Oriented Scraper Backed With Tests
I just drank a ton of coffee and I’m blasting music in my headphones, so this post might be a bit more scatter-shot than most since I can’t really focus :]
Yesterday I managed to build a pretty naive scraper using Nokogiri which counts how often each word is used in the first 10 posts of this blog. Basically it scrapes the home URL of the site and grabs everything inside the div.entry-content selector.
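For reference, the counting half of that scraper can be sketched in plain Ruby. This is only a rough illustration and not yesterday's actual code: the word_counts method and the sample text are made up here, and the Nokogiri call is mentioned only in a comment so the sketch has no gem dependencies.

```ruby
# Count how often each word appears in a chunk of text.
# In the real scraper the text would come from Nokogiri, for example
#   Nokogiri::HTML(html).css('div.entry-content').text
# That call is not run here, so the sketch stays dependency-free.
def word_counts(text)
  # Lowercase everything, pull out runs of letters and apostrophes,
  # and tally them in a hash whose missing keys default to 0.
  text.downcase.scan(/[a-z']+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Hypothetical sample text standing in for the scraped posts.
SAMPLE_TEXT = 'The octopus has eight arms. The arms are strong.'

# Print the three most frequent words, ties broken alphabetically.
word_counts(SAMPLE_TEXT).sort_by { |word, n| [-n, word] }.first(3).each do |word, n|
  puts "#{word}: #{n}"
end
```

The regular expression is deliberately crude (it ignores curly apostrophes and non-ASCII letters), which is probably fine for a first pass over a blog.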
Today I want to convert it into a more OO library so it’s a bit more modular and reusable. I also want to back everything with RSpec tests to get into the habit. While it won’t be true TDD, I’ll try to write the tests for the library before putting the classes together.
Design Decisions
I’m calling the project Tentacles for now since it relates to my Octopress blog. I’m still trying to figure out exactly what the end product will be. So far I know that I want it to produce a page of statistics about my blog. I figure that for now it can be just one page with stats that cover the entire blog. In the future I might want to make it more granular so that each post can get special attention. For now it’s easiest for me if I just think of the whole blog as a big data set and this page as the output.
I also know that since Octopress is heavily integrated with Rake, I’d probably like to trigger the process as part of a Rake task. IMO the logical place would be to amend Octopress’ rake generate so that it not only builds our static pages but also produces our statistics. Down the line I might want to change this but for now it seems OK to me.
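One way to amend an existing task without editing it is Rake's enhance, which appends extra work to a task that is already defined. The sketch below is an assumption about how that could look, not Octopress's real Rakefile: the generate task here is a stand-in, and BUILD_LOG exists only so the order of the steps is visible.

```ruby
require 'rake'

# Records what ran, purely for illustration.
BUILD_LOG = []

# Stand-in for Octopress's generate task (assumption: the real one is
# defined in the blog's Rakefile and builds the static site).
Rake::Task.define_task(:generate) do
  BUILD_LOG << 'site built'
end

# Append a statistics step to the existing task. In the real project this
# is where something like Tentacles::Runner would be called (hypothetical).
Rake::Task[:generate].enhance do
  BUILD_LOG << 'stats written'
end
```

Invoking the task with Rake::Task[:generate].invoke runs the original action first and the appended block second, so the stats would always be built from a freshly generated site.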
Finally I figure I’ll want to have some kind of configuration file so the parser knows what to look for.
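I haven't settled on what goes in that file yet, but a small YAML document is the obvious candidate. Here is a rough sketch of loading one; every key in it (url, selector, post_limit) is a guess made up for this example.

```ruby
require 'yaml'

# Hypothetical config.yml contents; all three keys are assumptions.
SAMPLE_CONFIG = <<~YAML
  url: http://example.com/
  selector: div.entry-content
  post_limit: 10
YAML

# safe_load only builds plain types (strings, numbers, hashes, arrays),
# which is all a config file like this should need.
config = YAML.safe_load(SAMPLE_CONFIG)

puts config['selector']
puts config['post_limit']
```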
For now I’m fine with the output being a plain text file with a few stats on it. We’ll work on making the output more robust after we’ve figured out the basics of our module and integrated it with Rake.
Here’s the folder structure I’m using:
- tentacles
  - bin <— contains our executable program
    - tentacles
  - lib <— contains our library of classes
    - crawler.rb
    - config.yml
    - runner.rb
  - spec <— contains our RSpec tests
    - crawler_spec.rb
    - runner_spec.rb
Playing with IRB
One of the first issues I’ve run up against is figuring out how to play with my classes in IRB. Being new to Ruby I tend to build everything in one folder. Since this is my first time embarking on some actual modular structure I’m unsure how to require or include a module in IRB. What I’ve settled on for now is to cd into my lib folder and use the -I flag to set the $LOAD_PATH.
Here’s the grep from the irb man page.
So we end up in tentacles/lib and call IRB like so:
And now we can require our classes
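Outside of IRB, the same effect as the -I flag can be had by pushing a directory onto $LOAD_PATH yourself. A small sketch, assuming the lib folder from the structure above; add_to_load_path is a made-up helper name.

```ruby
# Add a directory to Ruby's load path, roughly what the -I flag does.
# add_to_load_path is a hypothetical helper, not part of Ruby or IRB.
def add_to_load_path(dir)
  full = File.expand_path(dir)
  # Put it at the front so it wins over other entries, but only once.
  $LOAD_PATH.unshift(full) unless $LOAD_PATH.include?(full)
  full
end

lib_dir = add_to_load_path('lib')
puts lib_dir

# With the directory on the load path a bare name is enough, e.g.
#   require 'runner'
# (left as a comment because runner.rb only exists inside the project).
```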
Skeletons
I’m going to create a basic Runner class so we can verify that the stuff in IRB is working properly.
Here’s what I’ve thrown together:
and here’s how we test it in IRB.
Looks good so far!
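For anyone following along, a skeleton along these lines is enough to check the IRB setup. It is a sketch rather than the exact class: the constructor argument and the string that run returns are assumptions.

```ruby
module Tentacles
  class Runner
    attr_reader :config

    # config is assumed to be the path to a YAML config file.
    def initialize(config)
      @config = config
    end

    # Placeholder behaviour: just report what we were given.
    def run
      "Running with #{config}"
    end
  end
end

runner = Tentacles::Runner.new('config.yml')
puts runner.run
```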
Tests
OK on to the tests then. I’m going to be using RSpec so if you don’t have that set up already you should do a gem install rspec.
I’m a total noob when it comes to testing so let me take my best stab at this…
I’m going to write tests for Runner first since it’s already stubbed out. I want to make sure of the following things:
- It should respond to the run method
- When I pass it an invalid config file it should throw an error
- When I pass it an empty string or nil in place of config it should throw an error
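To make those three requirements concrete, here is a rough sketch of a Runner that would satisfy them. It is not the real runner.rb: ConfigError, settings and load_settings are names invented for this example, and treating anything that isn't a YAML mapping as invalid is my own assumption. In a spec, these behaviours map onto RSpec's respond_to and raise_error matchers.

```ruby
require 'yaml'

module Tentacles
  # Hypothetical error class used for every config problem.
  class ConfigError < StandardError; end

  class Runner
    attr_reader :settings

    def initialize(config_path)
      # Requirement 3: nil or an empty string is rejected up front.
      if config_path.nil? || config_path.to_s.strip.empty?
        raise ConfigError, 'config path is missing'
      end
      @settings = load_settings(config_path)
    end

    # Requirement 1: the public API is a single run method.
    def run
      settings
    end

    private

    # Requirement 2: an unreadable or invalid file raises an error.
    def load_settings(path)
      data = YAML.safe_load(File.read(path))
      # Gibberish often parses as one plain string, so insist on a mapping.
      raise ConfigError, "#{path} is not a YAML mapping" unless data.is_a?(Hash)
      data
    rescue Psych::SyntaxError, Errno::ENOENT => e
      raise ConfigError, "could not load #{path}: #{e.message}"
    end
  end
end
```

Funnelling every failure into one error class keeps the specs simple, since each bad input can be checked against the same exception.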
For now that’s the only public API this object has. Pretty simple but of course I’m immediately running into issues. Here’s what my spec looks like:
and here’s what runner.rb looks like:
aaaaaand here’s the error:
It looks like the test is bailing out in my before block when I try to create an instance of Runner and pass it the config file. Folks on IRC were kind enough to point out that require and the methods run inside RSpec don’t resolve paths the same way: require searches the load path, while a relative path handed to a method is resolved against whatever directory rspec was launched from. So trying ../lib/tentacles/config.yml won’t work either. The solution is to use File.dirname(__FILE__) + '/../lib/tentacles/config.yml'. Since I don’t want my line lengths to get any longer I define a helper module and give it a relative_path method which should spit out File.dirname(__FILE__).
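A helper like that could look roughly like the sketch below. The module name SpecHelpers and the config_path method are made up for illustration; only relative_path comes from the description above.

```ruby
module SpecHelpers
  # Directory of the file this method is defined in. __FILE__ always means
  # the file containing the code, so this assumes the helper lives in spec/.
  def relative_path
    File.dirname(__FILE__)
  end

  # Hypothetical convenience method built on relative_path.
  def config_path
    File.join(relative_path, '..', 'lib', 'tentacles', 'config.yml')
  end
end

# Mix the helper into any object to try it out.
helper = Object.new.extend(SpecHelpers)
puts helper.config_path
```

On newer Rubies, File.expand_path('../lib/tentacles/config.yml', __dir__) does the same job in one call and returns an absolute path.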
After I include it my tests look like this:
You’ll also notice I added a test for an invalid YAML file. Basically I created a mocks folder and tossed in a YAML file that’s full of gibberish. Probably not the best way to mock stuff but whatever, I’m learning!
With that all of our tests for Tentacles::Runner are passing. Yay! But now it’s 10:37pm and I gotta call it a night. We’ll continue tomorrow by writing tests for Tentacles::Crawler. See ya!
- Mood: Wired, Lazy
- Sleep: 7.5
- Hunger: 0
- Coffee: 2