SVG images on Commons and their textual content: some statistics

As part of my summer project to enable the easy translation of SVG images hosted on Wikimedia Commons, I’ve recently been compiling some statistics on their content.

At the time of writing, there are 538,152 SVG images on Commons, approximately 123 gigabytes’ worth. Analysing them all would have been too big a task, so instead I selected 10,000 at random (based, in fact, on the first character of their SHA1 hash – in this case, ‘m’) to test.
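The selection rule can be sketched in a few lines of Python. This is an illustration, not necessarily the script used for the sample: it assumes the hash is written in base 36 and padded to 31 characters (which would explain how a first character of ‘m’ is possible at all, since hexadecimal has no such digit), and the names sha1_base36 and in_sample are invented for the example.

```python
import hashlib

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data: bytes, width: int = 31) -> str:
    """Return the SHA-1 of `data` as a zero-padded base-36 string.

    Assumption: base 36 with a width of 31 characters; a 160-bit
    number never needs more than 31 base-36 digits.
    """
    n = int.from_bytes(hashlib.sha1(data).digest(), "big")
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits)).rjust(width, "0")


def in_sample(data: bytes, first_char: str = "m") -> bool:
    """Hypothetical filter: keep a file if its hash starts with `first_char`."""
    return sha1_base36(data).startswith(first_char)
```

Because the hash depends only on the file's bytes, a filter like this picks the same files every time it is run, which makes the sample easy to reproduce.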

Of those 10,000, 71% do not include a single <text> tag; the flipside is that 29% do, or, to put it another way, TranslateSvg has the potential to allow for internationalisation of roughly 156,000 files Commons-wide.
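To make "includes a <text> tag" concrete, here is a minimal sketch of one way to count them with Python's standard XML parser. It is not how the figures above were produced; the function name count_text_elements is invented for the example, and it assumes the file declares the usual SVG namespace (files that omit it would be reported as having no text).

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def count_text_elements(svg_source: str) -> int:
    """Count <text> elements in the SVG namespace, at any depth."""
    root = ET.fromstring(svg_source)
    return sum(1 for _ in root.iter(f"{{{SVG_NS}}}text"))


example = """<svg xmlns="http://www.w3.org/2000/svg">
  <text x="10" y="20">Hello</text>
  <g><text x="10" y="40"><tspan>World</tspan></text></g>
  <rect width="5" height="5"/>
</svg>"""

print(count_text_elements(example))  # prints 2
```

A file counts as translatable in the sense used here whenever the result is at least 1.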

11.5% of SVG files (~40% of files with any strings at all) have between 1 and 10 strings. As the number of strings increases, the frequency tends to decrease, with the notable exception of the 7.5% of all SVG files which include exactly 16 strings (that’s not an error, by the way). The 20 largest files in my sample ranged from 189 strings to a massive 815 strings. In any case, just 0.6% of all SVG files – or 1 in 50 translatable files – include over 100 strings.

In total, I extracted some 57,805 strings from my sample, suggesting the existence of some 3 million <text> tags on Commons, each of which could be translated. We can, of course, look more closely at what makes up those strings. (I should note that the following ignores attributes, and – because I wasn’t expecting self-closing <text /> tags – may suffer from a slight error rate.)
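The Commons-wide figures are simple scaling from the sample. As a quick check of the arithmetic, using only the numbers quoted in this post (the variable names are just for illustration, and the estimate assumes the sample is representative of Commons as a whole):

```python
TOTAL_FILES = 538_152      # SVG images on Commons at the time of writing
SAMPLE_SIZE = 10_000       # files in the sample
SAMPLE_STRINGS = 57_805    # strings extracted from the sample
SHARE_WITH_TEXT = 0.29     # share of sampled files with at least one <text>

# Files that could be translated, scaled up from the 29% share.
estimated_translatable_files = round(TOTAL_FILES * SHARE_WITH_TEXT)

# Strings across all of Commons, scaled up by total / sample size.
estimated_strings = round(SAMPLE_STRINGS * TOTAL_FILES / SAMPLE_SIZE)

print(estimated_translatable_files)  # 156064, i.e. roughly 156,000
print(estimated_strings)             # 3110788, i.e. roughly 3 million
```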

Nevertheless, I can say that slightly over half of those strings look a bit like “<text><tspan>…</tspan></text>”, which is coincidentally Inkscape’s default (despite the fact there’s no reason I know of to use <tspan>s like that). A further quarter, give or take, use plain ol’ <text> syntax. 8.7% consist of multiple pairs of <tspan>s back to back (a relatively sane construction).

Of the wackier constructions, 2.5% choose to nest <tspan>s, whilst 1.5% of all <text> tags have no visible content whatsoever. A handful of people managed to use the <textPath> tag in their files, which I can’t see a good way of supporting in TranslateSvg.
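The categories above can be expressed as a small classifier. This is a rough sketch under stated assumptions rather than the code behind the percentages: it uses Python's standard XML parser, expects the standard SVG namespace, and the function name classify_text and the category labels are invented for the example.

```python
import xml.etree.ElementTree as ET

SVG = "{http://www.w3.org/2000/svg}"


def classify_text(text_el: ET.Element) -> str:
    """Put one <text> element into one of the rough categories above."""
    children = list(text_el)
    # <textPath> anywhere inside the element is its own category.
    if any(el.tag == SVG + "textPath" for el in text_el.iter()):
        return "textPath"
    # No visible content: only whitespace (or nothing) in all descendants.
    if not "".join(text_el.itertext()).strip():
        return "empty"
    if not children:
        return "plain text"
    if all(child.tag == SVG + "tspan" for child in children):
        # Any child element inside a <tspan> is treated as nesting here.
        if any(len(child) for child in children):
            return "nested tspans"
        return "single tspan" if len(children) == 1 else "sibling tspans"
    return "other"


def classify_source(xml_source: str) -> str:
    """Convenience wrapper: parse a lone <text> element and classify it."""
    return classify_text(ET.fromstring(xml_source))
```

For example, classify_source('<text xmlns="http://www.w3.org/2000/svg"><tspan>Hi</tspan></text>') falls into the "single tspan" category, the Inkscape-style default described above.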

Okay, so the above isn’t that interesting by itself, but it’s going to inform the design choices I make with TranslateSvg in order to ensure it handles the full variety of constructions properly and optimises in the right places. Hurray for research 🙂
