Yesterday we verified that our Crawler was able to hit a document and, given the right selector, pull down a list of words and their frequency on the page. We also created a custom exception to be used whenever the selector fails to pull down the right content. I’m going to repeat this process today with the get_metadata_by_selector method. If there’s time we’ll try to output another file with our data, otherwise that’ll be tomorrow’s homework :D
Let’s take a moment to look at today’s metadata to figure out what we’d like our output to reflect.
That’s the actual markdown that goes into the editor, but it gets converted into a ul. I don’t think you can pass a CSS class through markdown syntax, otherwise I’d use one here. We could go back and wrap everything in regular HTML tags, but since we know that our metadata is going to be the last ul per entry, we’ll just use that knowledge to build our selector. Obviously a more robust solution would use a CSS class, so that might be a good refactoring for the future.
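To make the “last ul wins” idea concrete, here’s a quick sketch. The markup below is made up for illustration, and I’m using Ruby’s bundled REXML just to demonstrate the concept (the crawler itself may use a different parser):

```ruby
require 'rexml/document'

# Hypothetical entry markup: a regular list in the post body,
# followed by the metadata list as the final ul.
html = <<~HTML
  <div class="entry">
    <p>Some post content goes here.</p>
    <ul>
      <li>a list that is part of the post</li>
    </ul>
    <ul>
      <li>mood: happy</li>
      <li>sleep: 8</li>
    </ul>
  </div>
HTML

doc = REXML::Document.new(html)

# Grab every ul, then keep only the last one — that's our metadata list.
metadata_ul = REXML::XPath.match(doc, '//ul').last
puts REXML::XPath.match(metadata_ul, 'li').map { |li| li.text.strip }
```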
I figure for now we’ll just parse the metadata into a Hash that’ll look something like this:
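Something along these lines — every field name and value here is hypothetical, it just shows the shape:

```ruby
# Hypothetical fields — whatever the metadata ul actually contains.
metadata = {
  'mood'   => 'happy',
  'sleep'  => 8,
  'coffee' => 2
}
```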
In the final iteration we’ll toss all of our Metadata Hashes into an ordered Array so we can visualize them over time.
Red, Green, Refactor
Ok, time for a failing test. Let’s make sure that our selector pulls something down, and if it doesn’t we should raise the custom SelectionError we defined yesterday. I’m already seeing some repetitive code in our Crawler so I’m refactoring it. Wherever we need to get a group of XML nodes from the document via selector, I’ve created a private helper called nodes_by_selector. This is also where we’ll raise our exception if nothing comes back. I’m also cleaning up some of the wordy cruft from our public API, so instead of get_words_by_selector it’s now just words_by_selector. The same goes for our metadata method.
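Here’s roughly the shape I mean, sketched with the Crawler internals heavily simplified (the document object here is a stand-in that just needs to respond to css; SelectionError is yesterday’s custom exception):

```ruby
class SelectionError < StandardError; end

class Crawler
  def initialize(doc)
    @doc = doc
  end

  # Public API: no more get_ prefix.
  def words_by_selector(selector)
    nodes_by_selector(selector).flat_map { |node| node.text.split }
  end

  private

  # One place to fetch nodes — and one place to fail loudly
  # when the selector comes back empty.
  def nodes_by_selector(selector)
    nodes = @doc.css(selector)
    raise SelectionError, "The selector '#{selector}' returned nothing" if nodes.empty?
    nodes
  end
end
```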
Going back to the tests, we need to refactor any place that’s been broken. Immediately I saw that my nodes_by_selector method wasn’t initially returning the nodes, so I added that back in. The tests brought that to my attention before I had to do any potentially painful debugging. Beyond that we just need to fix up our method names:
We’ve got a duplicate test in there: both #words_by_selector and #metadata_by_selector check that they raise an exception if nothing comes down. Let’s see if we can refactor those into an RSpec shared example. I’m not sure if this is a best practice or not, but here’s my implementation:
Basically we’re putting our method name as a symbol into a variable using let and then calling that method in the shared_examples_for block. Notice how we’re using @crawler.send(selector_method, ...)? In this case selector_method refers to our method name symbol.
If you run this in RSpec’s nested mode it looks pretty cool:
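As an aside, the send trick at the heart of the shared example is just plain Ruby dynamic dispatch. Here’s a stripped-down illustration with no RSpec involved (class and selector names made up):

```ruby
# Names made up for illustration — this only shows the send mechanic.
class FakeCrawler
  def words_by_selector(selector)
    "words for #{selector}"
  end

  def metadata_by_selector(selector)
    "metadata for #{selector}"
  end
end

crawler = FakeCrawler.new

# The shared example only knows the method name as a symbol;
# send dispatches to the real method at runtime.
[:words_by_selector, :metadata_by_selector].each do |selector_method|
  puts crawler.send(selector_method, '.entry')
end
```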
Ok, so we know that all of our selector methods raise the proper exception if they are called with a bunk selector. Now let’s make sure we can get our metadata downloaded and structured.
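Here’s the kind of structuring I’m after, assuming each li’s text looks like “key: value” (the helper name and the fields are hypothetical):

```ruby
# Hypothetical helper: turn each li's text ("key: value") into Hash pairs.
def metadata_from_items(items)
  items.each_with_object({}) do |item, hash|
    key, value = item.split(':', 2).map(&:strip)
    hash[key] = value
  end
end

items = ['mood: happy', 'sleep: 8', 'coffee: 2']
metadata = metadata_from_items(items)
puts metadata['mood'] # prints "happy"
```

Note that the values stay strings here; coercing things like sleep and coffee counts to integers could be a later refinement.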
Unfortunately I’m realizing that if the ul for our metadata is part of the post then those words get counted along with everything else, which is not what I want. I need to figure out how to exclude that content…
I could either tell my crawler to explicitly ignore that content or wrap my blog entry in an even more specific class and just select that. I guess that’ll be an exercise for tomorrow :\