Automated image analysis with IIIF

Using Artificial Intelligence for bulk image analysis

Adrian Hindle
Cogapp

--

In this article we’ll show how to use the IIIF Presentation and Image APIs as the source of images for a range of bulk analysis techniques, covering:

  • Finding interesting images
  • Image recognition and automatic tagging
  • Colour analysis
  • Finding similar images
  • Term extraction
  • The best image analysis API

And we will show the interesting, valuable, and occasionally hilarious outcomes from these techniques for bulk image analysis.

Background

Our research focused on image analysis and machine learning using the million images available from the Qatar Digital Library, as well as other IIIF repositories.

We started with a simple problem: how to automatically identify “visually interesting” documents (i.e. those with illustrations and diagrams) among the hundreds of thousands that do not have these features.

This led us to explore techniques for image analysis, ranging from colour extraction using services such as Imagga, to machine learning techniques to classify documents after training with sample data sets. For these we used a mixture of bespoke software and third-party API services.

Here’s what we did, the results, and what we learnt…

Finding interesting images

It is easier to define what is of interest by explaining what is not: a page or image that is not visually interesting is one that is blank or contains only text. A visually interesting one, by contrast, contains illustrations, drawings, diagrams, geometric shapes and so on.

Visually interesting: “A Syrian Voyage in Central and South America” http://www.qdl.qa/en/archive/qnlhc/12957.56

For some images, like these, it can be quite easy to distinguish the visually interesting from the not visually interesting. But for others, like the following two images, it gets a lot harder.

Visually interesting: “Nourishment for the Ailing” and “Nourishment for the Healthy” p.20 http://www.qdl.qa/en/archive/qnlhc/9549.20
Not visually interesting: “Nourishment for the Ailing” and “Nourishment for the Healthy” p.19 http://www.qdl.qa/en/archive/qnlhc/9549.19

In this example we would want to pick out the first one, as it has illustrations on the right-hand page. The second, by contrast, is not visually interesting as it contains neither a diagram nor an illustration. You can easily imagine that we might well get the second one as a false positive.

You might think we could simply use the archives’ metadata, but for most of these visually interesting images we don’t have information about these features. The underlying EAD or MODS records sometimes describe these visually interesting parts, but not in a form that we could programmatically extract and use to find these pages.

Colour variance

The first thing we tried was to simply look at their colour, working on the assumption that illustrations used more colours than the plain black of handwritten pages. To do this we used the colours API of the Imagga image analysis service. The colours endpoint analyses and extracts the predominant colours from one or several images.

We categorised our images using the colour variance value returned by Imagga, as we thought the visually interesting images would have high values of colour variance.
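As a rough sketch of what that involves (assuming Imagga’s v2 REST API with key/secret authentication; the exact endpoint and response field names may have changed since we did this work), the IIIF Image API makes it easy to hand each page image to the colours endpoint by URL:

```python
import requests

IMAGGA_AUTH = ("your_api_key", "your_api_secret")  # Imagga uses HTTP basic auth

def colour_variance(iiif_image_url):
    """Ask Imagga for the predominant colours of an image and return its colour variance."""
    # A IIIF Image API URL for a reduced-size JPEG keeps the request small,
    # e.g. "<image service>/full/!600,600/0/default.jpg"
    response = requests.get(
        "https://api.imagga.com/v2/colors",
        params={"image_url": iiif_image_url},
        auth=IMAGGA_AUTH,
    )
    response.raise_for_status()
    # The response nests the extracted colours alongside an overall colour variance score
    return response.json()["result"]["colors"]["color_variance"]
```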

We tested this theory against a random archive and quickly realised we wouldn’t be able to use this approach. Although the images differed a lot in colour variance (from 7 to 49), the visually interesting ones were dispersed across this range, so there was no obvious threshold value to use.

Image recognition and automatic tagging

We then started looking into extracting terms and other features using Clarifai. Clarifai is an image and video recognition API that can automatically tag, organise and search visual content with machine learning.

The best feature of Clarifai is that you can create your own concept sets. We first tried using a combination of the existing concepts ‘illustration’ and ‘drawing’, but we weren’t getting very good results, probably because Clarifai was picking up stains and watermarks as illustrations. We also tried the opposite approach with the negative concepts ‘text’ and ‘manuscript’, but the images (being manuscript pages) all scored highly for text and manuscript, regardless of whether they were visually interesting.

We then extended Clarifai by creating our own custom sets: ‘arabic_manuscript’ and ‘arabic_manuscript_with_image’, each with their own training images. The names of these sets have no effect; what defines them is that each contains examples of one of the two types of image we wish to disambiguate.

Clarifai interface, arabic_manuscript training set, 46 images
Clarifai interface, arabic_manuscript_with_image training set, 26 images

We wrote a Python script to create the sets, train them and test against random images.
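In outline, the script did something like the following sketch. This is written against the clarifai 2.x Python client that was current at the time (the modern Clarifai SDK is quite different), and the API key, the lists of training-image URLs and the model name ‘manuscripts’ are placeholders:

```python
from clarifai.rest import ClarifaiApp

app = ClarifaiApp(api_key="YOUR_CLARIFAI_API_KEY")

# Upload the training images, tagging each one with the concept it exemplifies
for url in manuscript_only_urls:
    app.inputs.create_image_from_url(url, concepts=["arabic_manuscript"])
for url in manuscript_with_image_urls:
    app.inputs.create_image_from_url(url, concepts=["arabic_manuscript_with_image"])

# Create a custom model covering both concepts, then train it on those inputs
model = app.models.create(
    "manuscripts", concepts=["arabic_manuscript", "arabic_manuscript_with_image"]
)
model = model.train()

# Score a candidate page and compare the two concept values to decide
# whether it counts as "visually interesting"
result = model.predict_by_url(candidate_url)
scores = {c["name"]: c["value"] for c in result["outputs"][0]["data"]["concepts"]}
is_interesting = scores["arabic_manuscript_with_image"] > scores["arabic_manuscript"]
```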

To get the random images, the script would retrieve the Qatar Digital Library’s IIIF collection, randomly pick ten archives from the collection and then for each archive pick ten random images using reservoir sampling.
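Reservoir sampling lets you pick k items uniformly at random from a sequence without holding the whole sequence in memory, which suits walking through an archive’s canvases one at a time. A minimal sketch of the per-archive step (assuming a IIIF Presentation 2.x manifest, where the page images hang off sequences → canvases; the manifest URL is whatever the collection-level step picked, and choosing the ten archives works the same way):

```python
import random
import requests

def reservoir_sample(items, k):
    """Pick k items uniformly at random from an iterable of unknown length."""
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

def random_images_from_manifest(manifest_url, k=10):
    """Return IIIF image URLs for k random canvases in a Presentation 2.x manifest."""
    manifest = requests.get(manifest_url).json()
    canvases = manifest["sequences"][0]["canvases"]
    chosen = reservoir_sample(canvases, k)
    return [canvas["images"][0]["resource"]["@id"] for canvas in chosen]
```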

The results are then displayed in HTML using a Jinja2 template. The images to which Clarifai has given a higher value for the ‘arabic_manuscript_with_image’ set than for ‘arabic_manuscript’ have a red border around them.

Clicking on any image will automatically add it to the opposite set. This is used to correct and train the sets.

First run with 10 random archives and 10 images for each

Even with these small sets Clarifai was giving impressive results.

In the image above we can see that it got a few wrong on the first ten images, caused by water-stained pages. We knew that we could improve this, so using the HTML output page we corrected and re-trained Clarifai with the updated sets. In the end we were able to get close to 100% accuracy, as shown below.

Example output for one archive, http://www.qdl.qa/en/archive/81055/vdc_100000000044.0x0003ca

We took this data from Clarifai and created a custom IIIF manifest that contained only the visually interesting images.
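Because a IIIF manifest is just JSON, building this filtered version is mostly a matter of keeping the canvases that Clarifai flagged. A rough sketch, assuming a Presentation API 2.x manifest and a set of canvas @ids judged to be interesting:

```python
import copy

def interesting_manifest(manifest, interesting_canvas_ids, label="Visually interesting pages"):
    """Return a copy of a IIIF 2.x manifest containing only the flagged canvases."""
    new_manifest = copy.deepcopy(manifest)
    new_manifest["label"] = label
    # In practice you would also give the new manifest its own @id
    for sequence in new_manifest["sequences"]:
        sequence["canvases"] = [
            canvas for canvas in sequence["canvases"]
            if canvas["@id"] in interesting_canvas_ids
        ]
    return new_manifest
```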

The custom manifest visible in Mirador viewer.

The next steps for this could be to use these results to boost and/or filter the search results.

We could also use this technique for other challenges specific to our data. For example, to find blank sheets vs those with text, or handwritten vs typewritten.

I’m sure you can think of similar questions/problems about your own collection.

Colour analysis

We now set ourselves the challenge of sorting images by colour.

To investigate colour analysis we used images provided by the Internet Archive via the experimental IIIF service provided by ArchiveLabs. We retrieved around 84,000 images, from which we were able to extract 23,000 unique colours.

R, G, B values plotted in 3D, with a coloured dot per colour, for 1,000, 5,000, 10,000 and 23,000 colours

The graphs above show the red, green and blue values for these colours, plotted for 1,000, 5,000, 10,000 and 23,000 colours. From these you can see that a diagonal line appears, caused by all the black and white images. You can also see that in the final version we have pretty good coverage of the colour cube. We then used this data to create a beautiful rainbow (because everybody loves rainbows, right?).
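For the curious, a plot like that only takes a few lines of matplotlib. A minimal sketch, assuming `colours` is a list of (r, g, b) tuples in the 0–255 range:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)

def plot_colour_cube(colours):
    """Scatter each colour at its (R, G, B) coordinate, drawn in that colour."""
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    r, g, b = zip(*colours)
    dot_colours = [(x / 255, y / 255, z / 255) for x, y, z in colours]
    ax.scatter(r, g, b, c=dot_colours, s=4)
    ax.set_xlabel("Red")
    ax.set_ylabel("Green")
    ax.set_zlabel("Blue")
    plt.show()
```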

Try the Rainbow demo now

Finding similar images

The next problem we challenged ourselves with was finding similar images.

We began by looking into image analysis using the Python image processing library Scikit-Image.

Original http://www.qdl.qa/en/archive/81055/vdc_100022555293.0x000001

We examined various filters for image analysis, shown from left to right in the image above.

All of these are really good for finding images that have been modified (e.g. rotated, or with added noise), but we couldn’t get them to find similar images.
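As an illustration of the kind of filters scikit-image provides (not necessarily the exact set shown above, and with a placeholder file name):

```python
from skimage import color, feature, filters, io

image = io.imread("page.jpg")  # a local file or a IIIF Image API URL
grey = color.rgb2gray(image)

sobel_edges = filters.sobel(grey)                 # edge magnitude
canny_edges = feature.canny(grey, sigma=2)        # binary edge map
binarised = grey > filters.threshold_otsu(grey)   # global thresholding
```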

We also tried using mean squared error and structural similarity.

The mean squared error is calculated by squaring the difference between corresponding pixels in image A and image B, summing those values, and dividing by the number of pixels. The Structural Similarity Index (SSIM) measures how closely an image’s structure matches a reference image, and is normally used to assess image quality against an original. So we gave those two a try.
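Both measures are available directly in scikit-image. A minimal sketch (the function locations vary slightly between scikit-image versions; recent releases keep them in `skimage.metrics`):

```python
from skimage import color, io, transform
from skimage.metrics import mean_squared_error, structural_similarity

def compare(path_a, path_b, size=(500, 500)):
    """Return (MSE, SSIM) for two images, resized to a common size in greyscale."""
    a = transform.resize(color.rgb2gray(io.imread(path_a)), size)
    b = transform.resize(color.rgb2gray(io.imread(path_b)), size)
    mse = mean_squared_error(a, b)                      # 0 means identical pixels
    ssim = structural_similarity(a, b, data_range=1.0)  # 1 means structurally identical
    return mse, ssim
```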

No error and structurally similar
High error and not structurally similar
Lower error than the previous pair, but more structurally similar

In the final image above, the system suggested the pictures of the Eiffel Tower and Statue of Liberty were the same. We quickly realised we weren’t going to be able to use this to identify similar images.

Scikit-Learn

We looked into writing our own machine learning algorithm using Scikit-Learn, but realised that machine learning is not easy, especially when starting from scratch and for this problem. Our main constraint was time, so we had to reconsider in order to have something interesting to show at the IIIF conference in the Vatican.

Nonetheless we will be exploring this, maybe using pre-trained models, in the near future.

Term extraction

It was time for a different approach, so we started looking into term extraction, using the resulting tags to find similar images instead. We tested the three main services: Clarifai, Google Vision API and Microsoft Computer Vision. Here is an example extraction from each service for the same image.

Clarifai term extraction
Google Vision API term extraction
Microsoft Computer Vision term extraction

To test this we used 2,000 images from our friends at the Nationalmuseum Sweden, found via the Europeana API. We extracted terms for all the images using Clarifai, Google Vision API and Microsoft Computer Vision. Where the confidence of a term was above 0.75 we indexed it into Elasticsearch, and we visualised the results using Searchkit.
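The indexing step is simple: keep only the terms each service is reasonably sure about and store one document per image. A rough sketch using the official Python client (the index name, document shape and the 0.75 cut-off follow the description above; the exact client call signature varies between client versions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
CONFIDENCE_THRESHOLD = 0.75

def index_image(image_id, iiif_url, terms_by_service):
    """Index one image with the confident terms from each tagging service.

    terms_by_service maps a service name (e.g. 'clarifai') to a list of
    (term, confidence) pairs returned by that service.
    """
    doc = {"image": iiif_url}
    for service, terms in terms_by_service.items():
        doc[service] = [
            term for term, confidence in terms
            if confidence > CONFIDENCE_THRESHOLD
        ]
    es.index(index="iiif-ml", id=image_id, document=doc)
```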

We can use this to find popes with a moustache
Or rabbits

You can try this yourself here http://labs.cogapp.com/iiif-ml.

A bit of fun

Microsoft Computer Vision also tries to add a caption to the images. It sometimes gives really good results:

Emanuel Swedenborg

Microsoft’s caption starts with “Emanuel Swedenborg…”. We were impressed because that is correct, it is indeed Emanuel Swedenborg:

http://collection.nationalmuseum.se

But then Microsoft adds extra information and things get a bit weird:

Emanuel Swedenborg sitting in front of a laptop

The MS Computer Vision caption for this image is “Emanuel Swedenborg sitting in front of a laptop”. This is caused by the images that Microsoft used to train their algorithm, which are evidently contemporary photos, which is why it tries to find laptops, phones, people taking selfies, etc. In much the same way, Google’s Deep Dream tries to find faces and cats everywhere.
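For reference, a caption like this comes back from a single REST call. A sketch assuming the Computer Vision ‘analyze’ endpoint with the Description feature (the endpoint region, API version and subscription key are placeholders for whatever your subscription uses):

```python
import requests

ENDPOINT = "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/analyze"
SUBSCRIPTION_KEY = "YOUR_COMPUTER_VISION_KEY"

def caption(image_url):
    """Return Microsoft Computer Vision's best-guess caption for an image URL."""
    response = requests.post(
        ENDPOINT,
        params={"visualFeatures": "Description"},
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        json={"url": image_url},
    )
    response.raise_for_status()
    captions = response.json()["description"]["captions"]
    return captions[0]["text"] if captions else None
```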

You can see more captions by filtering using Microsoft Computer Vision: http://labs.cogapp.com/iiif-ml.

A pizza sitting on top of a window

The best image analysis API

Having studied Clarifai, Google Vision API and Microsoft Computer Vision, we found the three APIs are broadly similar.

Clarifai’s advantage is that it can be trained, so if it can’t detect something, all you have to do is create a set and train it in the same way we did to find visually interesting images, discussed at the beginning of this article.

MS Computer Vision offers a basic caption for each image. This can be quite useful, but may give you weird captions depending on your images. These can be a lot of fun, though.

The Google Vision API links to similar images, offers crop hints and location identification. This API also seems to try to be more specific than the others we trialled.

Each AI has a “personality” of sorts. As always with machine learning, identification is only as good as the training data, although for better results you could combine these services with human input (e.g. tagging incorrect items, or comparing with existing metadata).

In conclusion, computer vision is becoming increasingly accurate, easy to use, and has reached the stage of being another tool for developers to employ to make sense of their data.

We gave this talk as part of the International Image Interoperability Framework (IIIF) 2017 Conference in the Vatican. You can also see a PDF of the slides from the talk, which includes some extra screenshots and lots more amusing captions (skip to page 49).

If you’d be interested in having Cogapp analyse your collection, please get in touch.
