Yeesh, I gotta stop writing so late at night… Last night I was trying to get my spider to follow all the links on the blog’s archive page and then sum up all the words from every post. Unfortunately I was way too tired to get that to actually work. Tonight I finished that step of the process, but it required some ugly code and some refactoring of our unit tests. Without further ado…
One of the first things I realized was that my paths to the output folder were getting all weird depending on the context in which I was running my tests. So I switched to using Ruby’s __FILE__ to create paths relative to our crawler. words_by_selector is kind of gross with some nested iterators, but whatever, it works. We’ll probably need to refactor it when we get the metadata spider working. For now I’m just glad that it actually visits all the pages and produces the right output.
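To give a rough idea of both tricks, here's a minimal sketch. This is not the actual crawler code: words_by_selector here is my reconstruction of the general shape (nested iterators summing word counts across pages), the page data is stand-in text, and only the __FILE__-relative path trick is taken directly from what I described above.

```ruby
# Build the output path relative to THIS file, not the working directory,
# so the tests find it no matter where they're launched from.
OUTPUT_DIR = File.expand_path(File.join(File.dirname(__FILE__), 'output'))

# Stand-in for the text each crawled page yields for a CSS selector.
# (In the real crawler this would come from parsing the fetched HTML.)
pages = {
  'http://example.com/post-1' => ['hello world', 'foo bar baz'],
  'http://example.com/post-2' => ['one two three four']
}

# The nested-iterator shape: outer loop over pages, inner loop over the
# text chunks matched on each page, summing up word counts.
def words_by_selector(pages)
  pages.values.reduce(0) do |total, texts|
    total + texts.reduce(0) { |sum, text| sum + text.split(/\s+/).size }
  end
end

puts words_by_selector(pages)  # => 9
```

The key line is the File.expand_path call: anchoring on __FILE__ means the path is computed from the script's own location, which is what fixed the context-dependent weirdness.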
Our spec also needed updating so it could find the output directory properly. One downside to our current hacked-together setup is that I haven’t produced a proper mock for things, so the test takes FOREVER to run: something like 30+ seconds, because it’s actually crawling our site instead of just hitting a dummy file. Definitely need to fix that at some point :)
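The fix I have in mind is the usual one: inject the fetching step so the spec can swap in a canned response instead of hitting the live site. A sketch, with entirely hypothetical class and method names (this is not the current spider's API):

```ruby
# Hypothetical spider whose page-fetching step is injected, so a test
# can replace the slow network call with an instant fake.
class Crawler
  def initialize(fetcher)
    @fetcher = fetcher
  end

  # Fetch a page's text via the injected fetcher and count its words.
  def word_count(url)
    @fetcher.call(url).split(/\s+/).size
  end
end

# In the spec: a lambda returning canned text, no network, no 30 seconds.
fake_fetcher = ->(_url) { 'just some canned page text' }
crawler = Crawler.new(fake_fetcher)
puts crawler.word_count('http://robdodson.me')  # => 5
```

The production code would pass in a real HTTP-backed fetcher; the spec passes the lambda. Same crawler logic gets exercised either way.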
But once we get it all working the output from robdodson.me ends up looking like this:
We can use that JSON to start graphing which I’ll hopefully have time to get into before going to Europe. We shall seeeeee. - Rob