SVG images on Commons and their textual content: some statistics

As part of my summer project to enable the easy translation of SVG images hosted on Wikimedia Commons, I’ve recently been compiling some statistics on their content.

At the time of writing, there are 538,152 SVG images on Commons, approximately 123 gigabytes’ worth. Evidently, analysing them all was going to be too big a task, so instead I selected 10,000 at random (based, in fact, on the first letter of their SHA1 hash – in this case, ‘m’) to test.
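A minimal sketch of how that kind of hash-prefix sampling might look in Python. It assumes the hash is written in base 36, which is what makes a letter such as ‘m’ possible as a first character (a hexadecimal digest only uses 0-9 and a-f). The helper names `sha1_base36` and `in_sample` are hypothetical, and this is an illustration rather than the script actually used.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data: bytes, width: int = 31) -> str:
    """Return the SHA-1 of `data` as a zero-padded, lowercase base-36 string."""
    n = int.from_bytes(hashlib.sha1(data).digest(), "big")
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    # A 160-bit number always fits in 31 base-36 digits.
    return out.rjust(width, "0")


def in_sample(data: bytes, prefix: str = "m") -> bool:
    """True if the file's base-36 hash starts with the chosen prefix (assumed: 'm')."""
    return sha1_base36(data).startswith(prefix)


# Toy usage with in-memory "files"; real use would read each SVG from disk.
files = {"a.svg": b"<svg/>", "b.svg": b"<svg></svg>"}
sample = [name for name, data in files.items() if in_sample(data)]
print(sample)
```

Because the hash depends only on the file's bytes, the selection is effectively random with respect to what the image depicts, yet fully repeatable.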

Of those 10,000, 71% do not include a single <text> tag; the flipside is that 29% do, or, to put it another way, TranslateSvg has the potential to allow for internationalisation of 156,000 files Commons-wide.

11.5% of SVG files (~40% of files with any strings at all) have between 1 and 10 strings. As the number of strings targeted increases, so frequency tends to decrease, with the notable exception of the 7.5% of all SVG files which include exactly 16 strings (that’s not an error, by the way). The topmost 20 in my sample ranged from 189 strings to a massive 815 strings. In any case, just 0.6% of all SVG files – or 1 in 50 translatable files – include over 100 strings.

In total, I extracted some 57,805 strings from my sample, suggesting the existence of some 3 million <text> tags on Commons, each of which could be translated. We can, of course, look more closely at what comprises those strings. (I should note that the following ignores attributes, and – because I wasn’t expecting <text /> tags – might suffer from a slight rate of error.)
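The Commons-wide figures quoted so far are straightforward scale-ups from the sample. The short Python sketch below reproduces the arithmetic using only the numbers given above; the variable names are illustrative, and it assumes the 10,000-file sample is representative of Commons as a whole.

```python
TOTAL_FILES = 538_152    # SVG files on Commons
SAMPLE_SIZE = 10_000     # files in the random sample
SHARE_WITH_TEXT = 0.29   # share of sampled files with at least one <text> tag
SAMPLE_STRINGS = 57_805  # strings extracted from the sample

# Each sampled file stands in for roughly 53.8 files on Commons.
scale = TOTAL_FILES / SAMPLE_SIZE

translatable_files = TOTAL_FILES * SHARE_WITH_TEXT  # roughly 156,000
estimated_strings = SAMPLE_STRINGS * scale          # roughly 3.1 million

print(f"{translatable_files:,.0f} files with at least one <text> tag")
print(f"{estimated_strings:,.0f} <text> tags in total")
```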

Nevertheless, I can say that slightly over half of those strings look a bit like “<text><tspan>…</tspan></text>”, which is coincidentally Inkscape’s default (despite the fact there’s no reason I know of to use <tspan>s like that). A further quarter, give or take, use plain ol’ <text> syntax. 8.7% consist of multiple pairs of <tspan>s back to back (a relatively sane construction).

Of the wackier constructions, 2.5% choose to nest <tspan>s, whilst 1.5% of all <text> tags have no visible content whatsoever. A handful of people managed to use the <textPath> tag in their files, which I can’t see a good way of supporting in TranslateSvg.
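To make those categories concrete, here is a rough Python sketch that sorts each <text> element of an SVG into one of the constructions described above. The category labels and the function names `classify` and `summarise` are my own for illustration; this is not the analysis script itself, and like the figures above it ignores attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SVG = "{http://www.w3.org/2000/svg}"  # namespace prefix in ElementTree notation


def classify(text_el):
    """Put one <text> element into a rough structural category."""
    children = list(text_el)
    tspans = [c for c in children if c.tag == SVG + "tspan"]
    if text_el.find(f".//{SVG}textPath") is not None:
        return "textPath"
    if not "".join(text_el.itertext()).strip():
        return "empty"  # no visible content at all
    if not children:
        return "plain"  # <text>…</text>
    if len(tspans) != len(children):
        return "other"  # some child that is not a <tspan>
    if any(t.find(f".//{SVG}tspan") is not None for t in tspans):
        return "nested tspans"
    # Note: stray text outside the <tspan>s is not treated separately here.
    return "single tspan" if len(tspans) == 1 else "multiple tspans"


def summarise(svg_source):
    """Count how often each category occurs in one SVG document."""
    root = ET.fromstring(svg_source)
    return Counter(classify(t) for t in root.iter(SVG + "text"))


EXAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <text>Plain</text>
  <text><tspan>Inkscape style</tspan></text>
  <text><tspan>One</tspan><tspan>Two</tspan></text>
  <text><tspan>Outer <tspan>inner</tspan></tspan></text>
  <text><tspan></tspan></text>
</svg>"""

print(summarise(EXAMPLE))
```

Summing the counters over every file in a sample would give percentages of the kind quoted above.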

Okay, so the above isn’t that interesting by itself, but it’s going to inform the design choices I make with TranslateSvg in order to ensure it handles the full variety of constructions properly and optimises in the right places. Hurray for research 🙂

GSoC – Week One

  • I compiled a list of SVGs to be analysed
  • I posted on various relevant mailing lists, getting responses from translators using both Arabic and Cyrillic scripts
  • I downloaded Translate from Git, then installed and configured it on a local test wiki
  • I downloaded 10,000 SVGs ready to be analysed
  • I investigated Translate’s terminology and how TranslateSvg will fit with that
  • On the basis of those investigations, I posted a preliminary workflow suggestion for discussion and criticism (it has already come a long way)
  • I revised that preliminary workflow to accommodate new thoughts
  • I submitted all necessary paperwork to Google to accept my place as part of GSoC

Accepted onto Google Summer of Code

Today, the nine successful Google Summer of Code applicants were announced on the Wikimedia blog. One of them was me 🙂 Yay!

I’ll be working on TranslateSvg, a translation interface for SVG files. The next steps from here involve three things, which I shall be working on over the next week or two:

  • Analysing SVG files, to assess which data structures should have implementation priority;
  • Opening a discussion with translators to understand their interface needs better;
  • Analysing the code behind the Translate extension to understand its potential for underpinning the interface.

You can read my full proposal, which I will now be working towards, here.

Driver problems with SiS163u

This is just a quick note to anyone googling for problems experienced when trying to connect to networks using a Wireless LAN device of the SiS163u range (such as “Fujitsu Siemens Computers WLAN 802.11b/g (SiS163u)”) – and in particular when getting an authentication timeout error.

What I found helped resolve the issue was to ensure that my WLAN driver was updated to version 6.0.1039.1110. Drivers of the 6.0.1039.109x family experience problems with modern routers. The easiest way to update your driver is via your laptop manufacturer’s website (e.g. the Fujitsu website, as in the case noted above).

UPDATE: I’ve posted a copy of that driver here, but YMMV.

SVG files as application/x-octetstream

Unfortunately, Firefox seems to have a tendency to glitch over SVG files, assigning them the mime-type of application/x-octetstream instead of the correct image/svg+xml. This causes problems when using my Toolserver tool SVGCheck from Mozilla Firefox.

Fortunately, this is easily fixed in a couple of minutes. To do so:

  1. Close Firefox, if open, taking a note of these and other instructions if necessary.
  2. Navigate to your Firefox profile folder (for instructions on how to do so, refer to this article on MozillaZine.org).
  3. Open the file mimeTypes.rdf in your favourite text editor.
  4. Using the editor’s search functionality, locate and remove the lines:
  5. Save, exit and reload Firefox.
  6. All done 🙂
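If you would rather not hunt through the file by eye, the following Python sketch may help with steps 3 and 4. It only makes a backup copy of mimeTypes.rdf and lists the lines that mention SVG so they can be inspected; it deliberately does not delete anything. The function names are hypothetical, and the path to your own profile folder is something you will need to supply yourself.

```python
import shutil
from pathlib import Path


def backup(rdf_path):
    """Copy the file alongside itself with a .bak suffix; return the copy's path."""
    src = Path(rdf_path)
    dst = src.with_name(src.name + ".bak")
    shutil.copy2(src, dst)
    return dst


def find_svg_lines(rdf_path):
    """Return (line number, text) pairs for every line that mentions SVG."""
    text = Path(rdf_path).read_text(encoding="utf-8", errors="replace")
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if "svg" in line.lower()
    ]
```

With Firefox closed, call `backup(...)` and then `find_svg_lines(...)` on the mimeTypes.rdf inside your profile folder, and check the reported lines in your text editor before removing anything.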

This is a (semi) known bug with Firefox, and an annoying one indeed.

Git up and running

I just submitted my first patchsets via Git, a simple fix to a problem I first submitted a patch for under the old Subversion system.

Within a couple of hours it had been reviewed in some form or another by three developers, approved and merged. On my end, it had involved a commit and two amends, but I had as little problem with my half of the bargain as my reviewers did with theirs (and many thanks to them for being so prompt).

Thus, for me at least, the whole process worked well. If only code review can be kept this snappy, MediaWiki will no doubt prosper.

Applying for Google Summer of Code

I’ve just applied to be one of MediaWiki’s Google Summer of Code students. The project gives students from around the world a stipend in order to cover open source development work over the summer months, and it would be ideal for me.

However, it’s also highly competitive, so I welcome comments on my proposal, which is still very much in the draft stage. It’s entitled “TranslateSvg: Bringing the translation revolution to Wikimedia Commons”, and it shouldn’t be too boring a read – although I’ll probably sit down and rewrite most of it tomorrow to make sure of that!

Open Economics Hackday

Yesterday, I travelled to the Barbican Arts Centre in London for an economics-themed “hackday” – that is to say, a day which included participants with a wide range of skills, but focussed on discrete achievements rather than abstract discussions. It was organised by the Open Knowledge Foundation.

Overall, it was an enjoyable day: okay, so the Hammersmith and City line was closed, and then I missed my stop, but I got there in one piece, found the group within the vast public “space” on the ground floor, and enjoyed a mixture of data mining and trying out visualisations made using the “d3” JavaScript framework. The framework is incredibly powerful, but takes a long time to learn: not ideal for a single-day event, but interesting nonetheless.

I didn’t have a lot to show for the day – particularly because my visualisation was more of a “proof of concept” than an integrated component of a final app – but it was something. And it was great to meet like-minded people, even if one does get a sense of despondency with regard to the many struggles involved in creating a single data hub.

I left at 6-ish; many of the 20-strong group will no doubt have stayed longer. As it happened I might as well have, since the main London-Oxford line was shut. In the event I couldn’t be bothered sitting around at Paddington for an hour waiting for them to fix “a signalling problem at Reading”, so I got the tube to Marylebone, then trains to Banbury and from there Oxford, arriving an hour later than expected. Ah well.