Visualising words across time

Tristan Roddis
Published in cogapp
May 3, 2019


Summary: we created an interface that charts how often search terms occur over time in the Qatar Digital Library, and you can play around with it yourself using the online demo.

Deciding what to do

During the most recent Cogapp hack day, we invited members of the British Library’s Qatar Foundation Partnership Programme to join us here at Cogapp towers, with an exclusive focus on working with the dataset available from the Qatar Digital Library (QDL).

For the uninitiated, the QDL is a free online resource of primary source historical documents related to the Gulf region, and currently features over 1.7 million digitised pages, most of which have transcriptions available, thanks to the wonders of optical character recognition (OCR).

Before the hack day, we came up with various ideas for potential projects, and the one that most interested me was the idea of charting the occurrence of different search terms within this body of OCR text, to see how they varied over time.

Hack day ideas mapped out on the office window

Now this is, of course, an idea blatantly stolen from the excellent Google Ngrams viewer, which lets you examine how the use of words varies over time in Google’s corpus of digitised books.

Google Ngrams viewer

However, I was keen to see what results we’d find with the data from the digitised pages of the QDL, so, working with Susannah Gillard, one of the archivists at the British Library, we sketched a plan of what our version should do.

The good news was that the raw data is already accessible. This is because the QDL has a ‘timeline’ feature that will show search matches for a single search term:

QDL’s embedded timeline viewer

Picking a JavaScript library

However, the timeline only handles one term at a time, so the next step was to write some code that would query this interface repeatedly, once per term, and display the results on a single graph. To do this, I initially thought of D3.js, as it is the most popular library for data visualisation. However, while researching, I also found the wonderful Chart.js package, which seemed closer to my needs: it already plots beautiful line charts, is extremely configurable, and has lovely animations as you add and remove lines from the graph.
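To make the plotting step concrete, here is a minimal sketch of how per-term yearly counts could be turned into a Chart.js line-chart configuration. The function name `buildLineChartConfig`, the input shape and the colour palette are assumptions made for illustration, not the code written on the hack day; only the `type`, `data.labels` and `data.datasets` structure comes from Chart.js itself.

```javascript
// Hypothetical helper: turns { term: { year: count } } into a Chart.js
// line-chart config. Input shape and palette are assumptions for illustration.
const PALETTE = ["#e6194b", "#3cb44b", "#4363d8", "#f58231", "#911eb4"];

function buildLineChartConfig(years, countsByTerm) {
  const datasets = Object.keys(countsByTerm).map((term, i) => ({
    label: term,
    // Years with no recorded matches are plotted as zero.
    data: years.map((year) => countsByTerm[term][year] || 0),
    borderColor: PALETTE[i % PALETTE.length],
    fill: false, // draw lines only, not filled areas
  }));
  return {
    type: "line",
    data: { labels: years.map(String), datasets },
  };
}

// Made-up numbers, purely to show the shape of the input.
const config = buildLineChartConfig([1900, 1901, 1902], {
  pearls: { 1900: 12, 1902: 7 },
  petroleum: { 1901: 3, 1902: 9 },
});
```

In a page that has loaded Chart.js, the resulting object would be passed to `new Chart(canvasContext, config)` to draw one line per search term.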

Absolute values

A few hours of furious coding later, I had a first version: we could now see matches over time!

Pearls versus petroleum (absolute values)

However, as you can see from the graph above, we have a problem: almost every search term gives the same general profile, with two humps corresponding to the two world wars. This simply reflects the overall number of documents in the archive for each year. The solution is to ‘normalise’ the data by displaying relative rather than absolute values: i.e. instead of showing the total number of matches per year, the y-axis shows the percentage of documents for that year that include a match.
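The normalisation itself is only a division per year. The sketch below shows the idea; the function name `toRelative` and the shape of the inputs are hypothetical, and the numbers are made up. Returning `null` for a year with no documents is a design choice of this sketch, so that such a year shows as a gap rather than a misleading zero.

```javascript
// Hypothetical helper: convert absolute match counts into the percentage of
// that year's documents that contain a match.
function toRelative(years, matchesByYear, totalsByYear) {
  return years.map((year) => {
    const total = totalsByYear[year] || 0;
    if (total === 0) return null; // no documents that year: leave a gap
    return (100 * (matchesByYear[year] || 0)) / total;
  });
}

// Made-up numbers: 50 of 1000 documents match in 1914, 20 of 200 in 1915,
// and there are no documents at all for 1916.
const relative = toRelative(
  [1914, 1915, 1916],
  { 1914: 50, 1915: 20 },
  { 1914: 1000, 1915: 200 }
);
// relative is [5, 10, null]
```

Note that 1915 has fewer matches than 1914 in absolute terms, but twice the share once the smaller number of documents is taken into account.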

Relative values

Pearls versus petroleum (relative values — truncated)

The graph above shows the same search terms, but this time as relative values (which is, after all, what Google Ngrams does). This has the intended effect of accentuating how things vary over time, and we can see that matches for ‘pearls’ are higher towards the end of the 19th century, before the discovery of oil. It should be noted, though, that it also has an unintended side effect: where the total number of documents available for a year is low, we see gross over-representation, because it does not take many hits to skew the percentage. This artefact is mostly visible at the far ends of the graph, where the document totals are small. For example, the graph below shows the same data as above, but across the full range. Values beyond 1950, where the total number of documents is low, lead to huge spikes:

Pearls versus petroleum (relative values — full range)
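One simple way to get a truncated view like the earlier graph is to drop the years outside a chosen range before plotting. The sketch below is illustrative only: `truncateToRange` is a hypothetical helper, the cut-off years are arbitrary, and the values are made up to mimic a late-year spike.

```javascript
// Hypothetical helper: keep only the points whose year lies within
// [fromYear, toYear], inclusive at both ends.
function truncateToRange(years, values, fromYear, toYear) {
  const kept = { years: [], values: [] };
  years.forEach((year, i) => {
    if (year >= fromYear && year <= toYear) {
      kept.years.push(year);
      kept.values.push(values[i]);
    }
  });
  return kept;
}

// Made-up numbers: if 1955 had only 4 documents, 2 matches would already
// give 50%, dwarfing the other years.
const full = { years: [1930, 1940, 1955], values: [2.1, 3.4, 50] };
const truncated = truncateToRange(full.years, full.values, 1900, 1950);
// truncated.years is [1930, 1940]; truncated.values is [2.1, 3.4]
```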

Archival descriptions

Over hack day pizza lunch, I got talking to the British Library’s Arabic Scientific Manuscripts Curator, Bink Hallum. He pointed out that while the system we were creating may be useful for OCR transcriptions, these are only available for English-language, typewritten documents. Given that the material Bink deals with is Arabic-language, handwritten documents, I wondered what I could do to help. For this reason, I added another option to the graph: whether to search the OCR transcriptions or the archival descriptions, which exist for all records on the QDL, albeit in much smaller quantities than the OCR text.

Searches across archival descriptions only give very different profiles compared with…
… matches within OCR for the same search terms

Smooth operator

Talking of artefacts, the other thing we noticed was that sometimes the line can appear to bend. The reason for this is that Chart.js helpfully attempts to ‘smooth’ lines so that trends are easier to see (using the lineTension parameter, if you’re interested). However, in some cases this can end up being off-putting rather than illuminating.

Smoothing for a search term with few hits can make the line bend or skew

For this reason I added another option that allows you to switch off this smoothing:

‘spikey’ mode removes this effect

And finally, just for fun, I tried setting it deliberately high, which turns out to create an amusing ‘scribble mode’ for all your graphs:

‘scribble’ mode: artistic, but qualitatively useless
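The three modes boil down to different values of `lineTension` on each dataset. Here is a minimal sketch of such a switch, assuming Chart.js 2.x, where the option is called `lineTension` and defaults to 0.4 (newer major versions call it `tension`). The helper name `applySmoothing` and the ‘scribble’ value of 2 are guesses for illustration, not the values used in the demo.

```javascript
// Hypothetical mapping from a display mode to a Chart.js 2.x lineTension.
// 0 draws straight segments; the 'scribble' value is an illustrative guess.
const TENSION_BY_MODE = { smooth: 0.4, spikey: 0, scribble: 2 };

function applySmoothing(config, mode) {
  if (!Object.prototype.hasOwnProperty.call(TENSION_BY_MODE, mode)) {
    throw new Error("Unknown smoothing mode: " + mode);
  }
  // Return a copy so the original config is left untouched.
  return {
    ...config,
    data: {
      ...config.data,
      datasets: config.data.datasets.map((dataset) => ({
        ...dataset,
        lineTension: TENSION_BY_MODE[mode],
      })),
    },
  };
}

// Made-up config with a single series.
const base = {
  type: "line",
  data: {
    labels: ["1900", "1901"],
    datasets: [{ label: "pearls", data: [1, 2] }],
  },
};
const spikey = applySmoothing(base, "spikey");
```

Building a fresh config rather than mutating the old one keeps the sketch easy to reason about; on a live chart you would more likely update the datasets in place and then call `chart.update()`.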

Ta-da!

So that was the final version that was ready for the presentation at the end of the day: a graph that can take arbitrary queries and plot them over time, with options to configure the scale (absolute or relative), the source (transcriptions or descriptions) and the smoothing (smooth, spikey or scribbley).

You can try it for yourself here.

Pro tip: if you click on the label of a search you have just done, it will toggle its visibility: useful when different results have very different scales.

And here are some sample results:

English transliterations of Arabic words: before 1910 the state of Kuwait was more commonly called Koweit
The popularity of different sports as mentioned in the archive
Camel, boat, aeroplane: varying methods of transport over time

So there you have it. An easy-to-use interface to compare search terms over time, with a few configurable parameters.

Next on the list of improvements would be to click through to see actual results for a given year, and to add the ability to just search a specific range of years. Or if you have any suggestions yourself, please let me know.

Want to hack?

The above was one of several hack day projects that we completed with the British Library Qatar project team in March 2019. If your organisation would like to join us for a future hack day, please get in touch.
