Monday, 18 August 2008

Re-using video compression code to aid document quality checking

(Expanding on this video post from the Crigshow)


The volume of pages from a large digitisation project can be overwhelming. Add into that the simple fact that all (UK) institutional projects are woefully underfunded and underresourced, it's surprising that we can cope with them really.

One issue that repeatedly comes up is the idea of quality assurance; How can we know that a given book has been scanned well? How can we spot images easily? Can we detect if foreign bodies were present in the scan, such as thumbs, fingers or bookmarks?

A quick solution:

Inspired by a talk at one of the conference strands at WorldComp, where the author talked about the use of a component of a commonly used video compression standard (MPEG2) to detect degrees of motion and change in a video, without having to analyse the image sequences using a novel, or smart algorithm.

He talked about using the motion vector stream to be a good rough guide to the amount of change between frames of video.

So, why did this inspire me?
  • MPEG-2 compression is a pretty much a solved problem; there are some very fast and scalable solutions out there today - direct benefit: No new code needs to be written and maintained
  • The format is very well understood and stripping out the motion vector stream wouldn't be tricky. Code exists for this too.
  • Pages of text in printed documents tend towards being justified so that the two edges of the text columns are straight lines. There is also (typically) a fixed number of lines on a page.
  • A (comparatively rapid) MPEG2 compression of the scans of a book would have the following qualities:
    • The motion vectors between pages of text would either shown little overall change (as differing letters are actually quite similar) or a small, global shift if the page was printed on a slight offset.
    • The motion vectors between a page of text and a page with an image embedded in text on the next, or a thumb on the edge, would show localised and distinct changes that differ greatly from the overall perspective.
  • In fact, a real crude solution could be, just using the vector stream to create a bookmark list for all the suspect changes. This might bring the number of pages to check down to a level that a human mediator could handle.
How much needs to be checked?

Via basic sample survey statistics: to be sure to 95% (±5%) that the scanned images of 300 million pages are okay, just 387 totally random pages need to be checked. However, to be sure that each individual book is okay to the same degree, a book being ~300 pages, 169 pages need to be checked in each book. I would suggest that the above technique would significantly lower this threshold, but it would be by an empirically found amount.

Also note that the above figures carry the assumption that the scanning process doesn't change over time, which of course it does!

No comments: