GSOC – Week 2

  • I analysed the 10,000 SVGs I downloaded last week (see previous blog post for the results)
  • I generated all three important initial designs
  • I sought feedback on those designs
  • I established a new central page for the project
  • I got to grips with how parameters other than string could be saved within the Translate infrastructure
  • I exchanged security-aspect-related correspondence
  • I created a test message group in the correct format, beginning work on getting Translate to understand it and display it using the new design

SVG images on Commons and their textual content: some statistics

As part of my summer project to enable the easy translation of SVG images hosted on Wikimedia Commons, I’ve recently been compiling some statistics on their content.

At the time of writing, there are 538,152 SVG images on Commons (approximately 123 gigabytes’ worth). Evidently, analysing them all was going to be too big a task, so instead I selected 10,000 at random (based, in fact, on the first letter of their SHA1 hash – in this case, ‘m’) to test.
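For illustration only, here is a minimal Python sketch of that kind of hash-based selection. It is not the script actually used: the base-36 encoding padded to 31 characters is an assumption (it would explain a first letter such as ‘m’, which a hexadecimal hash cannot contain), and the names sha1_base36 and in_sample are hypothetical.

```python
import hashlib

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_base36(data: bytes) -> str:
    """Return the SHA-1 of `data` in base 36, zero-padded to 31 characters.

    Assumption: the hash is compared in this form; the post does not say so.
    """
    number = int.from_bytes(hashlib.sha1(data).digest(), "big")
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits)).rjust(31, "0")


def in_sample(data: bytes, first_char: str = "m") -> bool:
    """Keep a file only if its hash starts with the chosen character."""
    return sha1_base36(data).startswith(first_char)
```

Because a file’s hash has nothing to do with its subject matter, filtering on the first character behaves like a random draw while remaining reproducible.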

Of those 10,000, 71% do not include a single <text> tag; the flipside is that 29% do, or, to put it another way, TranslateSvg has the potential to allow for internationalisation of 156,000 files Commons-wide.

11.5% of SVG files (~40% of files with any strings at all) have between 1 and 10 strings. As the number of strings increases, the frequency tends to decrease, with the notable exception of the 7.5% of all SVG files which include exactly 16 strings (that’s not an error, by the way). The topmost 20 in my sample ranged from 189 strings to a massive 815 strings. In any case, just 0.6% of all SVG files – or 1 in 50 translatable files – include over 100 strings.
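A count like this can be sketched in a few lines of Python. What follows is a hedged illustration rather than the analysis script itself: it assumes each file is well-formed XML that declares the standard SVG namespace, and the function names and bucket boundaries are hypothetical, chosen to mirror the ranges quoted above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard SVG namespace in ElementTree's "{uri}tag" notation.
SVG_NS = "{http://www.w3.org/2000/svg}"


def count_text_elements(svg_source: str) -> int:
    """Count every <text> element anywhere in one SVG document."""
    root = ET.fromstring(svg_source)
    return sum(1 for _ in root.iter(SVG_NS + "text"))


def bucket(count: int) -> str:
    """Map a per-file string count onto a coarse range."""
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return "over 100"


def summarise(svg_sources) -> Counter:
    """Tally how many files fall into each range."""
    return Counter(bucket(count_text_elements(source)) for source in svg_sources)
```

Files that omit the namespace declaration, or that are not well-formed, would need separate handling in practice.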

In total, I extracted some 57,805 strings from my sample, suggesting the existence of some 3 million <text> tags on Commons, each of which could be translated. We can, of course, look more closely at what comprises those strings. (I should note that the following ignores attributes, and – because I wasn’t expecting <text /> tags – might suffer from a slight rate of error.)
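The Commons-wide estimates are simple scalings of the sample figures. The short Python sketch below reproduces the arithmetic using only numbers quoted in this post; the variable and function names are hypothetical, and it assumes the sample is representative of Commons as a whole.

```python
TOTAL_FILES = 538_152        # SVG images on Commons at the time of writing
SAMPLE_SIZE = 10_000         # files analysed
SHARE_WITH_TEXT = 0.29       # share of sampled files with at least one <text> tag
STRINGS_IN_SAMPLE = 57_805   # strings extracted from the sample


def scale_up(sample_value, sample_size=SAMPLE_SIZE, population=TOTAL_FILES):
    """Extrapolate a sample total to the whole population."""
    return sample_value * population / sample_size


files_with_text = SHARE_WITH_TEXT * TOTAL_FILES    # roughly 156,000 files
strings_on_commons = scale_up(STRINGS_IN_SAMPLE)   # roughly 3.1 million strings
small_share = 0.115 / SHARE_WITH_TEXT              # roughly 40% of translatable files
large_share = 0.006 / SHARE_WITH_TEXT              # roughly 1 in 50 translatable files

print(round(files_with_text), round(strings_on_commons))
```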

Nevertheless, I can say that slightly over half of those strings look a bit like “<text><tspan>…</tspan></text>”, which is coincidentally Inkscape’s default (despite the fact there’s no reason I know of to use <tspan>s like that). A further quarter, give or take, use plain ol’ <text> syntax. 8.7% consist of multiple pairs of <tspan>s back to back (a relatively sane construction).

Of the wackier constructions, 2.5% choose to nest <tspan>s, whilst 1.5% of all <text> tags have no visible content whatsoever. A handful of people managed to use the <textPath> tag in their files, which I can’t see a good way of supporting in TranslateSvg.
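To make those categories concrete, here is a rough Python sketch of how a <text> element might be sorted into them. It illustrates the idea and is not the code behind the figures above: the category labels and function names are hypothetical, attributes are ignored, and files are assumed to be well-formed XML using the standard SVG namespace.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
TSPAN = SVG_NS + "tspan"
TEXT_PATH = SVG_NS + "textPath"


def classify(text_el) -> str:
    """Label one <text> element by the shape of its contents."""
    # No visible characters at all (this also covers a bare <text />).
    if not "".join(text_el.itertext()).strip():
        return "empty"
    if text_el.find(".//" + TEXT_PATH) is not None:
        return "textPath"
    children = list(text_el)
    if not children:
        return "plain"
    tspans = [child for child in children if child.tag == TSPAN]
    # A <tspan> containing another <tspan> at any depth.
    if any(t.find(".//" + TSPAN) is not None for t in tspans):
        return "nested tspans"
    if len(tspans) == len(children):
        return "single tspan" if len(tspans) == 1 else "multiple tspans"
    return "other"


def classify_all(svg_source: str) -> list:
    """Classify every <text> element in one SVG document."""
    root = ET.fromstring(svg_source)
    return [classify(el) for el in root.iter(SVG_NS + "text")]
```

Running classify_all over each sampled file and tallying the labels would give a breakdown of the kind described above.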

Okay, so the above isn’t that interesting by itself, but it’s going to inform the design choices I make with TranslateSvg, to ensure it handles the full variety of constructions properly and optimises in the right places. Hurray for research 🙂

GSOC – Week One

  • I compiled a list of SVGs to be analysed
  • I posted on various relevant mailing lists, getting responses from translators using both Arabic and Cyrillic scripts
  • I downloaded Translate from Git, then installed and configured it on a local test wiki
  • I downloaded 10,000 SVGs ready to be analysed
  • I investigated Translate’s terminology and how TranslateSvg will fit with that
  • On the basis of those investigations, I posted a preliminary workflow suggestion for discussion and criticism (it has already come a long way)
  • I revised that preliminary workflow to accommodate new thoughts
  • I submitted all necessary paperwork to Google to accept my place as part of GSoC

Accepted onto Google Summer of Code

Today, the nine successful Google Summer of Code applicants were announced on the Wikimedia blog. One of them was me 🙂 Yay!

I’ll be working on TranslateSvg, a translation interface for SVG files. The next steps from here involve three things, which I shall be working on over the next week or two:

  • Analysing SVG files, to assess which data structures should have implementation priority;
  • Opening a discussion with translators to understand their interface needs better;
  • Analysing the code behind the Translate extension to understand its potential for underpinning the interface.

You can read my full proposal, which I will now be working towards, here.

Applying for Google Summer of Code

I’ve just applied to be one of MediaWiki’s Google Summer of Code students. The project gives students from around the world a stipend in order to cover open source development work over the summer months, and it would be ideal for me.

However, it’s also highly competitive, so I welcome comments on my proposal, which is still very much in the draft stage. It’s entitled “TranslateSvg: Bringing the translation revolution to Wikimedia Commons”, and it shouldn’t be too boring a read – although I’ll probably sit down and rewrite most of it tomorrow to make sure of that!