The Annotated Joseph Bloorg (or, How To Use an RSS Feed as an API)

18 Nov 2015

Background

Last weekend, I mentored at Toronto Public Library's first Hackathon. It's the first significant thing I've done with the library since I stopped working there in June, and while not exactly odd to go back in this way, it's also clear I still haven't transitioned to being "outside" the library (I caught myself again and again saying "we" in reference to TPL or the library world in general when discussing why something was the way it was / worked the way it did).

While it was pitched as an "open data hackathon" and a number of datasets were provided, the interest I saw on the ground was largely on using the library's real-time data and services as a platform for building new things: mobile sites and apps, tools to scratch particular itches, etc.

I sent feedback to that end this morning to the organizers:

A specific piece of feedback for the future would be this (and I promise this isn’t just me continuing to beat a drum I beat a lot when I was employed by the library, but feedback I heard from participants throughout the weekend, and have heard in the past from members of the Toronto tech community): investment by the library in building a public, well-documented API to its collection, event and account data would pay off significantly down the line in terms of civic engagement, positive publicity for the library, and, plainly, free labour to surface products and services of interest to the library’s patrons.

Simply put, the library is sitting on a very desirable amount of civic data and functionality. As long as the library continues to treat this largely as a product where access for the public is mediated through things built by the library, rather than a platform that the public can use to build the things it desires, its potential is significantly limited.

To that extent, a lot of my value as someone who worked for many years on the library's digital properties was in my insider knowledge of how the website works and how to use its highly-configurable search URLs to extract information from the back-end Endeca repository.

Therefore, I bring you my own thing I made at / because of the Hackathon...

Behold Joseph Bloorg!

Current status: Real Bothood achieved. Lives in "the cloud" now. #tplmakers #hackathon @torontolibrary
— Joseph Bloorg (@tpldbot) November 15, 2015

Joseph Bloorg is a Twitter bot who tweets random items from TPL's digital collections. Bloorg:

is modelled on the similar YUDLbot
is written in Python using tweepy (for tweeting) and lxml (for parsing the RSS feed)
runs in a Docker container on Digital Ocean
uses a (not terribly well-known) feature of the TPL website to turn any search result into an RSS feed as an "API"

The Python code is available at https://github.com/waharnum/tpldbot/blob/master/tpldbot.py, with some inline comments (it's not particularly spectacular Python). I'm going to explain the fourth point in more detail because I think there's somewhat wider interest in how to use the TPL website in an API-like way, and using the RSS feed feature programmatically requires some knowledge of how search URLs for the site are constructed.

At a high level, the code does three things:

queries the RSS feed once to find out the current number of items in the digital collections, and generates a random number out of that range
queries the RSS feed a second time to get information about the item corresponding to that random number - basically, we do a two-step call to "randomly" select an item
constructs and posts a tweet from that item's details
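The three steps can be wired together roughly like this. It is a sketch of the control flow only: the three callables are stand-ins supplied by the caller (in the real bot they would hit the RSS feed and Twitter), and the names `run_once`, `fetch_total`, `fetch_item` and `post` are hypothetical rather than taken from tpldbot.py.

```python
from random import Random


def run_once(fetch_total, fetch_item, post, rng=None):
    """Pick one random item and post it; returns the posted text."""
    rng = rng or Random()
    total = fetch_total()             # step 1: how many items exist?
    number = rng.randint(1, total)    # random position within that range
    item = fetch_item(number)         # step 2: fetch that one item
    text = "{} {}".format(item["title"], item["link"])  # step 3: build tweet
    post(text)
    return text


# Usage with in-memory stand-ins instead of the network:
fake_items = {n: {"title": "Item %d" % n, "link": "http://example.org/%d" % n}
              for n in range(1, 6)}
posted = []
run_once(lambda: len(fake_items), fake_items.__getitem__, posted.append)
print(posted[0])
```

Passing the network-facing pieces in as arguments makes the selection logic easy to exercise without touching the live site.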

I'm not going to explain the code aside from the URL construction logic and some of the parsing of the RSS XML, except as necessary to inform that.

Step 1: Finding out how many potential digital items there are

from random import randint

from lxml import etree

result_feed_URL = "http://www.torontopubliclibrary.ca/rss.jsp?N=38550&Erp=0"

def get_random_item_number():
    tree = etree.parse(result_feed_URL)
    # Get the total results
    total_results = tree.find('channel').find('results').find('total-results').text
    # Generate a random number from the total results
    item_number = randint(1, int(total_results))
    return item_number

Breaking down result_feed_URL

The path

http://www.torontopubliclibrary.ca/rss.jsp: this requests the RSS feed feature of the site. Any search you can make through the site's search UI can be turned into a request for the equivalent RSS instead.
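Since the only difference is the page being requested, converting a search URL copied from the browser into its feed equivalent can be done mechanically. A minimal sketch using the standard library; `search_url_to_rss_url` is an illustrative helper of my own, and it assumes the query string can be carried over unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def search_url_to_rss_url(search_url):
    """Swap search.jsp for rss.jsp, keeping the query string untouched."""
    parts = urlsplit(search_url)
    if not parts.path.endswith("/search.jsp"):
        raise ValueError("not a search.jsp URL: " + search_url)
    rss_path = parts.path[: -len("search.jsp")] + "rss.jsp"
    return urlunsplit(parts._replace(path=rss_path))


print(search_url_to_rss_url("http://www.torontopubliclibrary.ca/search.jsp?N=38550"))
```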

The parameters

N=38550: this is a dimension ID. Essentially, it filters a search to records with a specific facet (facets are the search refinement links along the left side of a search results page). In this case, this dimension ID is the one for "Type : Images", which gets us the non-PDF items from the digital collection. You can see this yourself via the equivalent search URL at http://www.torontopubliclibrary.ca/search.jsp?N=38550
Erp=0: this is short for "Endeca Records per Page" (Endeca is the back-end search / indexing / repository system that powers the site) and controls how many records are returned for one request, with remaining ones needing to be accessed through pagination logic. In this case, we request 0 records (the default is 10) because we only care about a piece of metadata from the search.
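Rather than concatenating strings, the parameters can be assembled with the standard library's `urlencode`. This is only a sketch: `build_tpl_url` is a hypothetical helper, not something from the bot's code, and it does nothing to check that the parameter names are ones the site understands.

```python
from urllib.parse import urlencode

BASE = "http://www.torontopubliclibrary.ca"


def build_tpl_url(page, **params):
    """Build a URL such as rss.jsp?N=...&Erp=... from keyword arguments.

    Keyword order is preserved, so the URL reads the way it was written.
    """
    return "{}/{}?{}".format(BASE, page, urlencode(params))


# The feed URL used in step 1, and the equivalent browser search URL:
print(build_tpl_url("rss.jsp", N=38550, Erp=0))
print(build_tpl_url("search.jsp", N=38550))
```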

The rest of the function

The rest of the function just parses the result feed XML for a "total-results" tag, extracts the text from that (which is a number representing the total number of records available for the search), generates a random integer between 1 and the total number of records, and returns that for use by the next function, which will actually retrieve a record by number.

I've included the relevant portion of the RSS XML that we're parsing below.

<rss xmlns:record="http://www.torontopubliclibrary.ca/rss" xmlns:results="http://www.torontopubliclibrary.ca/rss" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" version="2.0">
    <channel>
    ...
        <results>
            <total-results>23691</total-results>
        </results>
    </channel>
</rss>
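To try the parsing without hitting the live site, the fragment above can be fed to a parser as a string. This sketch uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs with no extra dependencies; the sample XML is trimmed from the snippet above, and `extract_total_results` is an illustrative name.

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0">
    <channel>
        <results>
            <total-results>23691</total-results>
        </results>
    </channel>
</rss>"""


def extract_total_results(feed_xml):
    """Return the total-results count from a feed document as an int."""
    root = ET.fromstring(feed_xml)
    # findtext returns None when any step of the path is missing
    text = root.findtext("channel/results/total-results")
    if text is None:
        raise ValueError("feed has no total-results element")
    return int(text)


print(extract_total_results(SAMPLE_FEED))
```

Raising on a missing element, rather than letting a `None` propagate, makes it more obvious when the feed's shape has changed.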

Step 2: Retrieving a single item record based on the random number


def get_item_by_number(number):

    record_feed_URL = "http://www.torontopubliclibrary.ca/rss.jsp?N=38550&Erp=1&No=" + str(number)
    record_tree = etree.parse(record_feed_URL)
    item = record_tree.find('channel').find('item')
    return item

Breaking down record_feed_URL

The path

http://www.torontopubliclibrary.ca/rss.jsp

Nothing different from what we've seen before - requesting the RSS search results feature.

The parameters

N=38550: the same dimension ID as before, restricting results to the "Type : Images" facet
Erp=1: gives us a single record per page
No={a random number from previous function}: this indicates the start number of the records returned for this specific query. Basically, it's the pagination directive.

Returning a single item record for the tweet to use

Again, we do some XML parsing with lxml to extract the information we care about - in this case, it's the item record, which we return. More elaborate parsing gets done in the construct_tweet function...
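One case get_item_by_number does not handle is a feed that comes back with no item element at all: `find` then returns `None`, and any later lookup on the result would fail. A hedged sketch of a guard, using `xml.etree.ElementTree` on inline sample strings rather than lxml against the live feed; `first_item` and both sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

FEED_WITH_ITEM = """<rss version="2.0"><channel>
    <item><title>Example title</title><link>http://example.org/1</link></item>
</channel></rss>"""

FEED_WITHOUT_ITEM = """<rss version="2.0"><channel></channel></rss>"""


def first_item(feed_xml):
    """Return the first <item> element, or None if the feed has no items."""
    root = ET.fromstring(feed_xml)
    return root.find("channel/item")


item = first_item(FEED_WITH_ITEM)
# Use `is not None`: truth-testing an Element depends on its children.
if item is not None:
    print(item.findtext("title"))
```

A caller could retry with a fresh random number whenever `first_item` returns `None`.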

Step 3: construct a tweet from an item record


def construct_tweet(item):
    item_title = item.find('title').text

    item_link = item.find('link').text

    item_record = item.find('record')

    item_id = item_record.find('recordId').text

    item_image_file_name = item_record.xpath("./attributes/attr[@name='p_file_name']/text()")

    # We have to convert the file name to lowercase

    item_image_URL = "http://static.torontopubliclibrary.ca/da/images/MC/" + item_image_file_name[0].lower()

    # print("Title: " + item_title)
    # print("Record ID: " + item_id)
    # print("Link: " + item_link)
    # print("Image URL:" + item_image_URL)

    # Manual t.co length, should be safe for a while
    tweet_URL_max_length = 25
    tweet_title_trim_length = 140 - tweet_URL_max_length

    tweet_text = item_title[:tweet_title_trim_length] + " " + item_link

    return tweet_text

This function parses the returned item record for fields we care about. While we only actually make our tweet from the item_title and item_link variables, I want to highlight item_image_file_name in particular because it points to a rather buried but key feature of the RSS feed behaviour.

Specifically, all the item records that may appear in a browser or RSS reader as simple "news" items have a huge amount of additional information buried under the "attributes" tag. A snippet:

<attributes>
    <attr name="p_dig_access_rights">Copyright</attr>
    <attr name="p_dig_caption">
    Front lines: The first line of defence is the soldiers who keep an eye out for enemy troop movements. This soldier looks downright comfortable as he sits in wait at a sometimes boring lookout post; while those behind him prepare to set up operations. But; huddled against the cold; he is prepared to react on a moment's notice to any threat to his platoon.
    </attr>
    <attr name="p_dig_creator">Cooper, David</attr>
    ...
    <attr name="p_dig_subject_topical">Arctic regions</attr>
    <attr name="p_dig_subject_topical">Canada. Canadian Army -- Drills and tactics</attr>
</attributes>

It's a rather ugly information dump, but there's a lot there.
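To make the dump easier to work with, the attr elements can be grouped by name. The sketch below uses `xml.etree.ElementTree` from the standard library rather than lxml, on a shortened copy of the snippet above; `attributes_to_dict` is an illustrative helper, not part of the bot. Names such as p_dig_subject_topical repeat, so each name maps to a list of values.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE_ATTRIBUTES = """<attributes>
    <attr name="p_dig_access_rights">Copyright</attr>
    <attr name="p_dig_creator">Cooper, David</attr>
    <attr name="p_dig_subject_topical">Arctic regions</attr>
    <attr name="p_dig_subject_topical">Canada. Canadian Army -- Drills and tactics</attr>
</attributes>"""


def attributes_to_dict(attributes_element):
    """Group <attr> values by their name attribute, preserving order."""
    grouped = defaultdict(list)
    for attr in attributes_element.findall("attr"):
        # Some values are wrapped in whitespace (e.g. the caption), so strip.
        value = (attr.text or "").strip()
        grouped[attr.get("name")].append(value)
    return dict(grouped)


attrs = attributes_to_dict(ET.fromstring(SAMPLE_ATTRIBUTES))
print(attrs["p_dig_subject_topical"])
```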

Big Picture Stuff

When I worked in library tech, I was fond of saying that if you had a web-based catalogue and account self-service system, you had a public API, it was just a shitty one. The lack of an API never prevented someone from trying to programmatically extract data or automate website functionality - it just made it harder for them!

Bloorg is a case in point, and far from the most extreme one I know of (after all, he's parsing something that's intended to be parsed, albeit in a perhaps unanticipated way - I know or have heard of many projects that achieve their ends by scraping library HTML and automating form interactions).

I think the basic success loop for library digital properties is the following:

build, buy or adapt platform approaches to delivering digital services - things with well-defined APIs and standardized data models
build services and products on top of these platforms
expose and document the platforms publicly so others can build on them

Library linked data, makerspaces and some other trends in the professional conversation are helping move things in this direction, but my anecdotal sense is that it's all a rather piecemeal process (though that's often enough how change gets made...)

Alan Harnum