Alan Harnum

The Annotated Joseph Bloorg (or, How To Use an RSS Feed as an API)

18 Nov 2015


Last weekend, I mentored at Toronto Public Library's first Hackathon. It's the first significant thing I've done with the library since I stopped working there in June, and while not exactly odd to go back in this way, it's also clear I still haven't transitioned to being "outside" the library (I caught myself again and again saying "we" in reference to TPL or the library world in general when discussing why something was the way it was / worked the way it did).

While it was pitched as an "open data hackathon" and a number of datasets were provided, the interest I saw on the ground was largely in using the library's real-time data and services as a platform for building new things: mobile sites and apps, tools to scratch particular itches, etc.

I sent feedback to that end this morning to the organizers:

A specific piece of feedback for the future would be this (and I promise this isn’t just me continuing to beat a drum I beat a lot when I was employed by the library, but feedback I heard from participants throughout the weekend, and have heard in the past from members of the Toronto tech community): investment by the library in building a well-documented public API to its collection, event and account data would pay off significantly down the line in terms of civic engagement, positive publicity for the library, and, plainly, free labour to surface products and services of interest to the library’s patrons.

Simply put, the library is sitting on a very desirable amount of civic data and functionality. As long as the library continues to treat this largely as a product where access for the public is mediated through things built by the library, rather than a platform that the public can use to build the things it desires, its potential is significantly limited.

Given that, much of my value as a mentor who had worked for many years on the library's digital properties lay in my insider knowledge of how the website works and how to use its highly configurable search URLs to extract information from the back-end Endeca repository.

Therefore, I bring you my own thing I made at / because of the Hackathon...

Behold Joseph Bloorg!

Current status: Real Bothood achieved. Lives in "the cloud" now. #tplmakers #hackathon @torontolibrary

— Joseph Bloorg (@tpldbot) November 15, 2015

Joseph Bloorg is a Twitter bot who tweets random items from TPL's digital collections. Bloorg:

The Python code is available at (it's not particularly spectacular Python), with some inline comments. I'm going to explain the fourth point in more detail because I think there's somewhat wider interest in how to use the TPL website in an API-like way, and using the RSS feed feature programmatically requires some knowledge of how search URLs to the site are constructed.

At a high level, the code does three things:

I'm not going to explain the code aside from the URL construction logic and some of the parsing of the RSS XML, except as necessary to inform that.

Step 1: Finding out how many potential digital items there are

from random import randint

from lxml import etree

result_feed_URL = ""

def get_random_item_number():
    tree = etree.parse(result_feed_URL)
    # Get the total results
    total_results = tree.find('channel').find('results').find('total-results').text
    # Generate a random number from the total results
    item_number = randint(1, int(total_results))
    return item_number

Breaking down result_feed_URL

The path
The parameters

The rest of the function

The rest of the function just parses the result feed XML for a "total-results" tag, extracts the text from that (which is a number representing the total number of records available for the search), generates a random integer between 1 and the total number of records, and returns that for use by the next function, which will actually retrieve a record by number.

I've included the relevant portion of the RSS XML that we're parsing below.

<rss xmlns:record="" xmlns:results="" xmlns:content="" xmlns:wfw="" xmlns:dc="" xmlns:atom="" xmlns:sy="" xmlns:slash="" version="2.0">
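As a self-contained illustration of the counting step, here is a minimal sketch that runs against an in-memory feed rather than the live site. The sample XML is a simplified, hypothetical stand-in (no namespaces, one element), the function name pick_random_item_number is made up for this example, and it uses the standard library's ElementTree instead of lxml so it runs without extra installs.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the result feed; the real feed
# declares several namespaces and carries many more elements.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <results>
      <total-results>250</total-results>
    </results>
  </channel>
</rss>"""


def pick_random_item_number(feed_xml, rng=random):
    """Return a random 1-based record number no larger than the feed's total."""
    root = ET.fromstring(feed_xml)
    total_text = root.findtext("channel/results/total-results")
    if total_text is None:
        raise ValueError("feed has no total-results element")
    total = int(total_text)
    if total < 1:
        raise ValueError("search returned no records")
    # randint is inclusive at both ends, so every record can be picked
    return rng.randint(1, total)


number = pick_random_item_number(SAMPLE_FEED)
print(number)
```

Passing a seeded random.Random instance as rng makes the choice repeatable, which is handy when testing a bot like this.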

Step 2: Retrieving a single item record based on the random number

def get_item_by_number(number):
    record_feed_URL = "" + str(number)
    record_tree = etree.parse(record_feed_URL)
    item = record_tree.find('channel').find('item')
    return item

Breaking down record_feed_URL

The path

Nothing different from what we've seen before - requesting the RSS search results feature.

The parameters

Returning a single item record for the tweet to use

Again, we do some XML parsing with lxml to extract the information we care about - in this case, it's the item record, which we return. More elaborate parsing gets done in the construct_tweet function...
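The same pattern can be sketched in a form that runs offline. Everything site-specific here is a placeholder: the base URL is example.org, and the parameter name "offset" is hypothetical, not the name the TPL site actually uses. The sample feed is likewise a simplified stand-in, and the sketch uses the standard library's ElementTree rather than lxml.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Placeholder values: NOT the real TPL path or parameter names.
BASE_URL = "https://example.org/rss.jsp"

# Hypothetical single-record feed, trimmed to the parts this sketch reads.
SAMPLE_RECORD_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>Sample title</title>
      <link>https://example.org/detail?R=123</link>
    </item>
  </channel>
</rss>"""


def build_record_url(number, base_url=BASE_URL, offset_param="offset"):
    """Build a feed URL asking for the record at the given position."""
    # urlencode escapes the value, which is safer than string concatenation
    query = urlencode({offset_param: number})
    return f"{base_url}?{query}"


def first_item(feed_xml):
    """Return the first <item> element in the feed, or None if there is none."""
    return ET.fromstring(feed_xml).find("channel/item")


print(build_record_url(42))  # https://example.org/rss.jsp?offset=42
item = first_item(SAMPLE_RECORD_FEED)
print(item.findtext("title"))  # Sample title
```

Returning None for an empty feed lets the caller decide whether to retry with a different random number.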

Step 3: Constructing a tweet from an item record

def construct_tweet(item):
    item_title = item.find('title').text
    item_link = item.find('link').text
    item_record = item.find('record')
    item_id = item_record.find('recordId').text
    item_image_file_name = item_record.xpath("./attributes/attr[@name='p_file_name']/text()")

    # We have to convert the file name to lowercase
    item_image_URL = "" + item_image_file_name[0].lower()

    # print("Title: " + item_title)
    # print("Record ID: " + item_id)
    # print("Link: " + item_link)
    # print("Image URL:" + item_image_URL)

    # Manual length, should be safe for a while
    tweet_URL_max_length = 25
    tweet_title_trim_length = 140 - tweet_URL_max_length

    tweet_text = item_title[:tweet_title_trim_length] + " " + item_link

    return tweet_text

This function parses the returned item record for fields we care about. While we only actually make our tweet from the item_title and item_link variables, I want to highlight item_image_file_name in particular because it points to a rather buried but key feature of the RSS feed behaviour.

Specifically, all the item records that may appear in a browser or RSS reader as simple "news" items have a huge amount of additional information buried under the "attributes" tag. A snippet:

    <attr name="p_dig_access_rights">Copyright</attr>
    <attr name="p_dig_caption">
    Front lines: The first line of defence is the soldiers who keep an eye out for enemy troop movements. This soldier looks downright comfortable as he sits in wait at a sometimes boring lookout post; while those behind him prepare to set up operations. But; huddled against the cold; he is prepared to react on a moment's notice to any threat to his platoon.
    </attr>
    <attr name="p_dig_creator">Cooper, David</attr>
    <attr name="p_dig_subject_topical">Arctic regions</attr>
    <attr name="p_dig_subject_topical">Canada. Canadian Army -- Drills and tactics</attr>

It's a rather ugly information dump, but there's a lot there.
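One way to make that dump easier to work with is to fold the attr elements into a dictionary keyed by attribute name. The sketch below is an illustration, not part of the bot: the function name attributes_to_dict is made up, the sample record reuses a few attributes from the snippet above inside an assumed record/attributes wrapper, and repeated names (such as p_dig_subject_topical) are collected into lists. It uses the standard library's ElementTree rather than lxml.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Sample built from the attributes shown above; the surrounding
# <record>/<attributes> wrapper is assumed from the bot's XPath.
SAMPLE_RECORD = """<record>
  <attributes>
    <attr name="p_dig_access_rights">Copyright</attr>
    <attr name="p_dig_creator">Cooper, David</attr>
    <attr name="p_dig_subject_topical">Arctic regions</attr>
    <attr name="p_dig_subject_topical">Canada. Canadian Army -- Drills and tactics</attr>
  </attributes>
</record>"""


def attributes_to_dict(record):
    """Group a record's <attr> values by name; each name maps to a list."""
    grouped = defaultdict(list)
    for attr in record.findall("./attributes/attr"):
        name = attr.get("name")
        value = (attr.text or "").strip()
        # Skip unnamed or empty attributes rather than storing blanks
        if name and value:
            grouped[name].append(value)
    return dict(grouped)


attrs = attributes_to_dict(ET.fromstring(SAMPLE_RECORD))
print(attrs["p_dig_creator"])
print(attrs["p_dig_subject_topical"])
```

Keeping every value as a list avoids special-casing the attributes that happen to repeat.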

Big Picture Stuff

When I worked in library tech, I was fond of saying that if you had a web-based catalogue and account self-service system, you had a public API; it was just a shitty one. The lack of an API never prevented someone from trying to programmatically extract data or automate website functionality - it just made it harder for them!

Bloorg is a case in point, and far from the most extreme one I know of (after all, he's parsing something that's intended to be parsed, albeit in a perhaps unanticipated way - I know or have heard of many projects that achieve their ends by scraping library HTML and automating form interactions).

I think the basic success loop for library digital properties is the following:

Library linked data, makerspaces and some other trends in the professional conversation are helping move things in this direction, but my anecdotal sense is that it's all a rather piecemeal process (though that's often enough how change gets made...)