How to convert existing XML based translations into PO files

Jens Seidel jensseidel at users.sf.net
Thu Aug 21 08:24:06 CDT 2008


Hi,

I'll try to explain how to convert existing translations to PO format. I
will use the German translation as an example. Please ensure that you have
no local changes in your working copy before you start.

It seems that the German translators specify the current sync/merge state of XML
files in the Subversion log.

The following files are currently translated:

book/book.xml
book/ch00-preface.xml
book/ch01-fundamental-concepts.xml
book/ch02-basic-usage.xml
book/foreword.xml

According to svn log r3268 all files are in sync with r3206 of the
English version. Additionally ch00-preface.xml corresponds to revision
3266 (see r3269).

Let's get the corresponding English files:
$ svn up -r 3206 en
$ svn up -r 3266 en/ch00-preface.xml

The conversion process is error-prone and slow. That's why we remove the
identical (untranslated) files in en/ to avoid (bogus) conflicts.

$ rm en/book/{appa-quickstart.xml,appb-svn-for-cvs-users.xml,appc-webdav.xml,ch03-advanced-topics.xml,ch04-branching-and-merging.xml,ch05-repository-admin.xml,ch06-server-configuration.xml,ch07-customizing-svn.xml,ch08-embedding-svn.xml,ch09-reference.xml,copyright.xml,index.xml}
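
The file list above was typed by hand. As a rough sketch, it could also be
derived automatically, assuming that the untranslated files in de/book/ are
still byte-identical copies of their English counterparts (this is an
assumption, and the function name remove_untranslated is made up for this
illustration):

```shell
# Hypothetical helper: delete every English XML file whose counterpart in the
# translation directory is byte-identical, i.e. not translated yet.
# Usage: remove_untranslated en/book de/book
remove_untranslated() {
    en_dir=$1
    de_dir=$2
    for en_file in "$en_dir"/*.xml; do
        [ -f "$en_file" ] || continue      # the glob matched nothing
        de_file=$de_dir/$(basename "$en_file")
        # cmp -s is silent and succeeds only if both files are identical
        if [ -f "$de_file" ] && cmp -s "$en_file" "$de_file"; then
            echo "removing untranslated $en_file"
            rm -- "$en_file"
        fi
    done
}
```

Only the English copy is removed; the file in de/ stays untouched.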

Now we have to get matching <English string> and <translation of it> pairs to
be put into the PO file. That's why we ensure that the translated XML files
have the same structure as the English files and just assume that the
<i>th paragraphs of both files match. This is the main assumption and without
it we're lost!

OK, ready to start? Change into de/.

I assume that you use the UTF-8 encoding. If not, please adapt the
Makefile or recode your files. Then run make xml-to-po:

de$ make xml-to-po
po4a-gettextize --format=docbook --master-charset=utf-8 -o doctype="book" --localized-charset=utf-8 \
        --master=../en/book/book.xml --master=../en/book/ch00-preface.xml --master=../en/book/ch01-fundamental-concepts.xml --master=../en/book/ch02-basic-usage.xml --master=../en/book/foreword.xml \
        --localized=book/book.xml --localized=book/ch00-preface.xml --localized=book/ch01-fundamental-concepts.xml --localized=book/ch02-basic-usage.xml --localized=book/foreword.xml | \
        msgattrib --clear-fuzzy > de.po
po4a gettextize: Original has less strings than the translation (631<632). 
               Please fix it by removing the extra entry from the translated
               file. You may need an addendum (cf po4a(7)) to reput the 
               chunk in place after gettextization. A possible cause is that
               a text duplicated in the original is not translated the same 
               way each time. Remove one of the translations, and you're 
               fine.
po4a gettextization: Structure disparity between original and translated 
files:
msgid (at ../en/book/ch00-preface.xml:902) is of type 'Content of: 
<preface><sect1><sect2><title>' while
msgstr (at book/ch00-preface.xml:1773 book/ch00-preface.xml:1788) is of type
'Content of: <preface><sect1><sect2><figure><title>'.
Original text: Subversion's Architecture
Translated text: Die Architektur von Subversion
(result so far dumped to gettextization.failed.po)
The gettextization failed (once again). Don't give up, gettextizing is a 
subtle art, but this is only needed once to convert a project to the 
gorgeous luxus offered by po4a to translators.
Please refer to the po4a(7) documentation, the section "HOWTO convert a 
pre-existing translation to po4a?" contains several hints to help you in 
your task


It didn't work! Don't worry. Read the above instructions carefully and
also the po4a(7) manpage. Such problems can occur if the translation differs
from the original English file in tag usage, paragraph splitting, ...

Try to bring your file into sync with the English one as much as possible.
Diff viewers such as gvimdiff, kompare, mgdiff, ... can be a great help.
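
Before opening a diff viewer it can help to compare only the element
structure of the two files. The following is just a sketch: the function
name tag_outline and the list of elements it looks at (para, title, figure,
screen, sidebar) are assumptions made for this illustration, not something
po4a provides. It skips XML comments and prints the remaining opening tags
in document order, one per line:

```shell
# Hypothetical helper: print the opening tags of a few structural DocBook
# elements in document order, ignoring everything inside XML comments.
# Usage: tag_outline ../en/book/ch00-preface.xml > en.outline
tag_outline() {
    awk '
    {
        line = $0
        text = ""
        # Copy the parts of the line that are outside of <!-- ... -->.
        while (line != "") {
            if (in_comment) {
                p = index(line, "-->")
                if (p == 0) break
                line = substr(line, p + 3)
                in_comment = 0
            } else {
                p = index(line, "<!--")
                if (p == 0) { text = text line; break }
                text = text substr(line, 1, p - 1)
                line = substr(line, p + 4)
                in_comment = 1
            }
        }
        # Print the name of every interesting opening tag that is left.
        while (match(text, /<(para|title|figure|screen|sidebar)[ >]/)) {
            print substr(text, RSTART + 1, RLENGTH - 2)
            text = substr(text, RSTART + RLENGTH)
        }
    }
    ' "$1"
}
```

Writing the outline of the English and of the translated file to two
temporary files and running diff on them shows the first place where the
structure differs. Tags whose name and attributes are split over several
lines are not detected by this simple version.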

The German files use comments to also include the English text (to ease
proofreading, ...):

<!--
    <attribution>Greg Hudson, Subversion developer</attribution>
-->
    <attribution>Greg Hudson, Subversion-Entwickler</attribution>

These comments do *not* cause trouble; po4a simply skips them. Yippee!

The problem in ../en/book/ch00-preface.xml:902, book/ch00-preface.xml:1773 seems
to be related to the different recognized tags:
<preface><sect1><sect2><title> vs. <preface><sect1><sect2><figure><title>

The text looks good. To help po4a I decided to just remove the problematic
<figure> tag in both ../en/book/ch00-preface.xml and book/ch00-preface.xml
(both files need to be in sync!).

The following patch (shown only against the English file) demonstrates it:

Index: book/ch00-preface.xml
===================================================================
--- book/ch00-preface.xml (Revision 3266)
+++ book/ch00-preface.xml (Arbeitskopie)
@@ -905,11 +905,6 @@
         a <quote>mile-high</quote> view of Subversion's
         design.</para>

-      <figure id="svn.intro.architecture.dia-1">
-        <title>Subversion's architecture</title>
-        <graphic fileref="images/ch01dia1.png"/>
-      </figure>
-  
       <para>On one end is a Subversion repository that holds all of your
         versioned data.  On the other end is your Subversion client
         program, which manages local reflections of portions of that

(No, we do not plan to commit this. It is just a help for po4a to get proper
string pairs. It failed here, so we have to translate the text
"Subversion's architecture" again later :-))

Attention: Don't call "make clean" at this stage as this will remove our modified
XML files (because an (empty) PO file exists in the directory!)

Let's try "make xml-to-po" again. It fails now in the next file
ch01-fundamental-concepts.xml at a similar <figure> tag. OK, we know what
to do: We remove all <figure> tags from this file (both in en/ and de/)
as well.

Now it complains with:
msgid (at ../en/book/ch01-fundamental-concepts.xml:359) is of type 'Content 
of: <chapter><sect1><sect2><sidebar>' while
msgstr (at book/ch01-fundamental-concepts.xml:513) is of type 'Content of: 
<chapter><sect1><sect2><para>'.
Original text: <sidebar id=\"svn.basic.in-action.wc.sb-1\">

Inspecting this file shows that an additional paragraph was added to the German
file. The following patch fixes it:

Index: book/ch01-fundamental-concepts.xml
===================================================================
--- book/ch01-fundamental-concepts.xml  (Revision 3279)
+++ book/ch01-fundamental-concepts.xml  (Arbeitskopie)
@@ -522,14 +497,12 @@
         and behave as though you had typed:</para>
 -->
       <para>wird Subversion die unsicheren Zeichen umwandeln, als ob
-        Sie</para>
+        Sie Folgendes geschrieben hätten:</para>

       <screen>
 $ svn checkout http://host/path%20with%20space/project/espa%C3%B1a
 </screen>

-      <para>geschrieben hätten.</para>
-
 <!--  
       <para>If the URL contains spaces, be sure to place it within quotation
         marks so that your shell treats the whole thing as a single

The content of the translation matched the English one, but the structure
was a little bit different and required the translation to be rephrased. If such
rephrasing is not possible, the English file needs to be changed to allow it!
Some tricks such as using XML entities, ... could be abused as a workaround as well
but that doesn't seem to be necessary here.

Now it works and we get a proper PO file de.po.

I attached the required patch against revision r3279 for your reference. Please note
that this patch should not be applied. It's just to demonstrate what was necessary
to change. (The <figure> problem should be reported as a bug against po4a.)

Let's check the statistics:
$ msgfmt -cv de.po
de.po:7: some header fields still have the initial default value
606 translated messages.

Attention: These statistics are partly wrong. Remember that our files
were not fully translated? All English text left in our translated XML files
is counted as a translation as well. That in itself is not critical, but it is
advisable to clear the msgstr of such entries in de.po (replace it with
msgstr "").

PS: Does anyone have an idea how to remove all messages where the translated
msgstr string is identical to the English msgid? I'm sure it is
possible without perl, e.g. with the help of msgfilter ...
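
One possible answer, as a sketch only and without perl or msgfilter: a plain
awk filter that reads the PO file entry by entry and empties the msgstr
whenever it equals the msgid. The function name clear_identical is made up
for this illustration. It assumes that entries are separated by blank lines
and that there is no whitespace after the closing quotes; entries with
plural forms are left alone.

```shell
# Hypothetical helper: print a copy of a PO file in which every entry whose
# msgstr is identical to its msgid gets an empty msgstr instead.
# Usage: clear_identical de.po > de.cleared.po
clear_identical() {
    awk '
    BEGIN { RS = ""; ORS = "\n\n" }    # paragraph mode: one entry per record
    {
        n = split($0, lines, "\n")
        id = ""; str = ""; part = ""; str_line = 0
        for (i = 1; i <= n; i++) {
            line = lines[i]
            if (line ~ /^msgid "/) {
                part = "id";  sub(/^msgid /, "", line)
            } else if (line ~ /^msgstr "/) {
                part = "str"; sub(/^msgstr /, "", line); str_line = i
            } else if (line !~ /^"/) {
                part = ""; continue    # comment, msgctxt, plural forms, ...
            }
            # Strip the surrounding quotes and join continuation lines.
            text = substr(line, 2, length(line) - 2)
            if (part == "id")  id  = id text
            if (part == "str") str = str text
        }
        if (id != "" && id == str) {
            entry = ""
            for (i = 1; i < str_line; i++) entry = entry lines[i] "\n"
            print entry "msgstr \"\""
        } else {
            print
        }
    }
    ' "$1"
}
```

The header entry (empty msgid) is never touched. As the file is rewritten
textually, it is a good idea to check the result with msgfmt -c before
replacing de.po with it.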

Great, 606 message translations could be reused! The header of the PO
file currently looks as follows and needs to be fixed:
# SOME DESCRIPTIVE TITLE
# Copyright (C) YEAR Free Software Foundation, Inc.
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL at ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2008-08-21 14:32+0300\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL at ADDRESS>\n"
"Language-Team: LANGUAGE <LL at li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: ENCODING"

Let's change this to

# German translation of the Subversion book
#
# This file is distributed under the same license as the Subversion book.
# Jens Seidel <jensseidel at users.sf.net>, 2008.
#
msgid ""
msgstr ""
"Project-Id-Version: svnbook 2008\n"
"POT-Creation-Date: 2008-08-21 12:09+0300\n"
"PO-Revision-Date: 2008-08-21 12:11+0200\n"
"Last-Translator: Jens Seidel <jensseidel at users.sf.net>\n"
"Language-Team: Debian German <debian-l10n-german at lists.debian.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

and now we have a proper PO file.


This file is still incomplete as it is missing the <figure> tags and also
all strings from the files we removed at the beginning. Let's revert our
local changes (e.g. by using svn revert) and update the working copy
to HEAD.

We call
$ make update-po
po4a-updatepo --po=de.po --format=docbook --master-charset=utf-8 -o \
doctype="book" --master=../en/book/appa-quickstart.xml \
--master=../en/book/appb-svn-for-cvs-users.xml \
--master=../en/book/appc-webdav.xml --master=../en/book/book.xml \
--master=../en/book/ch00-preface.xml \
--master=../en/book/ch01-fundamental-concepts.xml \
--master=../en/book/ch02-basic-usage.xml \
--master=../en/book/ch03-advanced-topics.xml \
--master=../en/book/ch04-branching-and-merging.xml \
--master=../en/book/ch05-repository-admin.xml \
--master=../en/book/ch06-server-configuration.xml \
--master=../en/book/ch07-customizing-svn.xml \
--master=../en/book/ch08-embedding-svn.xml \
--master=../en/book/ch09-reference.xml --master=../en/book/copyright.xml \
--master=../en/book/foreword.xml --master=../en/book/index.xml \
--previous
..........[snip]........

(PS: It's the msgmerge tool called by po4a which takes a very long time.
In the past I read a posting from Clytie Siddall (on
debian-i18n at lists.debian.org) about a patch to speed it up, but I don't
know its current status.)

It works without any problem (the error-prone step was already performed).

Now we have a much larger file (there is still work to do for the German
team :-):
$ msgfmt -cv de.po
606 translated messages, 376 fuzzy translations, 3204 untranslated messages.

Later, if the English book changes again, make update-po should be
called again. If e.g. a typo was fixed in an English message
(let's say "Subvversion"), the PO file will change from

msgid "Version Control with Subvversion"
msgstr "Version Control with Subversion"

to

#, fuzzy
#| msgid "Version Control with Subvversion"
msgid "Version Control with Subversion"
msgstr "Version Control with Subversion"

Such a fuzzy message is just msgmerge's guess at the translation.
In this case it's optimal (as the typo isn't part of the translation). The
current msgstr matches the msgid given in the "#|" comment. So you just have
to check the changes between "Version Control with Subvversion" and
"Version Control with Subversion" and integrate them into your
translation. Afterwards, remove the "#, fuzzy" and "#|" line(s).

By default, fuzzy strings are not used; the English string is used instead.
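
To find the entries that still need such a review, it is enough to search
for the flag lines. A small sketch using only grep (the function names are
hypothetical); it assumes that the flags are written on a "#," line as in
the example above:

```shell
# Hypothetical helpers for reviewing fuzzy entries in a PO file.

# Print "line-number:flag-line" for every entry flagged as fuzzy.
# Usage: list_fuzzy de.po
list_fuzzy() {
    grep -n '^#,.*fuzzy' "$1"
}

# Print only the number of entries flagged as fuzzy.
# Usage: count_fuzzy de.po
count_fuzzy() {
    grep -c '^#,.*fuzzy' "$1"
}
```

The count should agree with the number of fuzzy translations reported by
msgfmt -cv, as long as the file contains no obsolete entries.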

****************************************************************************
* This also means that such minor fixes which do not change translations   *
* should be applied to all msgid strings (and rarely to msgstr if affected)*
* in PO files to avoid additional work for translators!                    *
****************************************************************************
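
A minimal sketch of how such a fix could be applied to the msgid strings of a
PO file without touching the translations. The function name fix_msgid_typo
is made up for this illustration, and the replacement is a plain literal
text substitution, so it does not find a typo that is wrapped across two
lines of a msgid:

```shell
# Hypothetical helper: replace a literal string in all msgid lines (and
# their continuation lines) of a PO file. msgstr lines and comments,
# including "#|" previous-msgid comments, are printed unchanged.
# Usage: fix_msgid_typo Subvversion Subversion de.po > de.fixed.po
fix_msgid_typo() {
    [ -n "$1" ] || return 1            # refuse an empty search string
    awk -v from="$1" -v to="$2" '
    /^msgid /                 { in_msgid = 1 }
    /^msgstr/ || /^#/ || /^$/ { in_msgid = 0 }
    in_msgid {
        rest = $0; fixed = ""
        # Literal (non-regex) replacement of every occurrence.
        while ((pos = index(rest, from)) > 0) {
            fixed = fixed substr(rest, 1, pos - 1) to
            rest = substr(rest, pos + length(from))
        }
        $0 = fixed rest
    }
    { print }
    ' "$3"
}
```

Note that awk interprets backslash escapes in values passed with -v, so
search strings containing backslashes need extra care. If the same fix is
applied to the English XML file, the next make update-po run should find a
matching msgid and not mark the message as fuzzy.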

PS: For your reference I made the German PO file available under
http://alioth.debian.org/~jseidel-guest/de_svnbook.po
It is not ready to be committed and should be accepted by the
translators first.

PS2: It is not yet ready because the sync information mentioned above
is not always right! E.g. r513 of ../en/book/ch00-preface.xml is not
yet merged (search for the string "Demonstrates how to use the public
APIs to write a"). The contributors list also contains "Øyvind A. Holm", who was
never part of the English file (I think). Is he a German translator?
Such credits should be given in a different way! po4a supports so-called
addenda which can extend the original document with translator credits, ...
Also, ch02-basic-usage.xml is far from being in sync with r3206.

PS3: To start a new translation, a PO file template (usually called
svnbook.pot, containing all msgids but only empty msgstrs) can be
generated in a new language directory by calling make update-po.
This file should be sent to translators (it currently contains 4186
messages to translate :-)).

Jens
-------------- next part --------------
Index: src/en/book/ch00-preface.xml
===================================================================
--- src/en/book/ch00-preface.xml	(Revision 3279)
+++ src/en/book/ch00-preface.xml	(Arbeitskopie)
@@ -905,11 +905,6 @@
         a <quote>mile-high</quote> view of Subversion's
         design.</para>
       
-      <figure id="svn.intro.architecture.dia-1">
-        <title>Subversion's architecture</title>
-        <graphic fileref="images/ch01dia1.png"/>
-      </figure>
-  
       <para>On one end is a Subversion repository that holds all of your
         versioned data.  On the other end is your Subversion client
         program, which manages local reflections of portions of that
Index: src/en/book/ch01-fundamental-concepts.xml
===================================================================
--- src/en/book/ch01-fundamental-concepts.xml	(Revision 3279)
+++ src/en/book/ch01-fundamental-concepts.xml	(Arbeitskopie)
@@ -31,11 +31,6 @@
       client receives information from others.  <xref
       linkend="svn.basic.repository.dia-1"/> illustrates this.</para>
 
-    <figure id="svn.basic.repository.dia-1">
-      <title>A typical client/server system</title>
-      <graphic fileref="images/ch02dia1.png"/>
-    </figure>
-
     <para>So why is this interesting?  So far, this sounds like the
       definition of a typical file server.  And indeed, the repository
       <emphasis>is</emphasis> a kind of file server, but it's not your
@@ -102,11 +97,6 @@
         probably by accident.  This is definitely a situation we want
         to avoid!</para>
 
-      <figure id="svn.basic.vsn-models.problem-sharing.dia-1">
-        <title>The problem to avoid</title>
-        <graphic fileref="images/ch02dia2.png"/>
-      </figure>
-
       </sect2>
 
     <!-- =============================================================== -->
@@ -128,11 +118,6 @@
         linkend="svn.basic.vsn-models.lock-unlock.dia-1"/>
         demonstrates this simple solution.</para>
 
-      <figure id="svn.basic.vsn-models.lock-unlock.dia-1">
-        <title>The lock-modify-unlock solution</title>
-        <graphic fileref="images/ch02dia3.png"/>
-      </figure>
-
       <para>The problem with the lock-modify-unlock model is that it's
         a bit restrictive and often becomes a roadblock for
         users:</para>
@@ -216,16 +201,6 @@
         linkend="svn.basic.vsn-models.copy-merge.dia-2"/> show this
         process.</para>
 
-      <figure id="svn.basic.vsn-models.copy-merge.dia-1">
-        <title>The copy-modify-merge solution</title>
-        <graphic fileref="images/ch02dia4.png"/>
-      </figure>
-
-      <figure id="svn.basic.vsn-models.copy-merge.dia-2">
-        <title>The copy-modify-merge solution (continued)</title>
-        <graphic fileref="images/ch02dia5.png"/>
-      </figure>
-
       <para>But what if Sally's changes <emphasis>do</emphasis> overlap
         with Harry's changes?  What then?  This situation is called a
         <firstterm>conflict</firstterm>, and it's usually not much of
@@ -490,11 +465,6 @@
         top-level subdirectory, as shown in <xref
         linkend="svn.basic.in-action.wc.dia-1"/>.</para>
 
-      <figure id="svn.basic.in-action.wc.dia-1">
-        <title>The repository's filesystem</title>
-        <graphic fileref="images/ch02dia6.png"/>
-      </figure>
-
       <para>To get a working copy, you must <firstterm>check
         out</firstterm> some subtree of the repository.  (The term
         <emphasis>check out</emphasis> may sound like it has something to do
@@ -616,11 +586,6 @@
         each tree is a <quote>snapshot</quote> of the way the
         repository looked after a commit.</para>
 
-      <figure id="svn.basic.in-action.revs.dia-1">
-        <title>The repository</title>
-        <graphic fileref="images/ch02dia7.png"/>
-      </figure>
-
       <sidebar>
         <title>Global Revision Numbers</title>
 
Index: src/de/book/ch00-preface.xml
===================================================================
--- src/de/book/ch00-preface.xml	(Revision 3279)
+++ src/de/book/ch00-preface.xml	(Arbeitskopie)
@@ -1781,15 +1781,7 @@
         einen <quote>kilometerhohen</quote> Blick auf das Design von
         Subversion.</para>
       
-      <figure id="svn.intro.architecture.dia-1">
 <!--
-        <title>Subversion's architecture</title>
--->
-        <title>Die Architektur von Subversion</title>
-        <graphic fileref="images/ch01dia1.png"/>
-      </figure>
-  
-<!--
       <para>On one end is a Subversion repository that holds all of your
         versioned data.  On the other end is your Subversion client
         program, which manages local reflections of portions of that
Index: src/de/book/ch01-fundamental-concepts.xml
===================================================================
--- src/de/book/ch01-fundamental-concepts.xml	(Revision 3279)
+++ src/de/book/ch01-fundamental-concepts.xml	(Arbeitskopie)
@@ -55,11 +55,6 @@
       gestellt.  <xref linkend="svn.basic.repository.dia-1"/>
       verdeutlicht das.</para> <!-- jmf: XSLT macht daraus "Abbildung..." -->
 
-    <figure id="svn.basic.repository.dia-1">
-      <title>Ein typisches Client/Server System</title>
-      <graphic fileref="images/ch02dia1.png"/>
-    </figure>
-
 <!--    <para>So why is this interesting?  So far, this sounds like the
       definition of a typical file server.  And indeed, the repository
       <emphasis>is</emphasis> a kind of file server, but it's not your
@@ -193,11 +188,6 @@
         probably by accident.  This is definitely a situation we want
         to avoid!</para>
 
-      <figure id="svn.basic.vsn-models.problem-sharing.dia-1">
-        <title>The problem to avoid</title>
-        <graphic fileref="images/ch02dia2.png"/>
-      </figure>
-
     </sect2>
 
     <!-- =============================================================== -->
@@ -219,11 +209,6 @@
         linkend="svn.basic.vsn-models.lock-unlock.dia-1"/>
         demonstrates this simple solution.</para>
 
-      <figure id="svn.basic.vsn-models.lock-unlock.dia-1">
-        <title>The lock-modify-unlock solution</title>
-        <graphic fileref="images/ch02dia3.png"/>
-      </figure>
-
       <para>The problem with the lock-modify-unlock model is that it's
         a bit restrictive and often becomes a roadblock for
         users:</para>
@@ -307,16 +292,6 @@
         linkend="svn.basic.vsn-models.copy-merge.dia-2"/> show this
         process.</para>
 
-      <figure id="svn.basic.vsn-models.copy-merge.dia-1">
-        <title>The copy-modify-merge solution</title>
-        <graphic fileref="images/ch02dia4.png"/>
-      </figure>
-
-      <figure id="svn.basic.vsn-models.copy-merge.dia-2">
-        <title>The copy-modify-merge solution (continued)</title>
-        <graphic fileref="images/ch02dia5.png"/>
-      </figure>
-
       <para>But what if Sally's changes <emphasis>do</emphasis> overlap
         with Harry's changes?  What then?  This situation is called a
         <firstterm>conflict</firstterm>, and it's usually not much of
@@ -522,14 +497,12 @@
         and behave as though you had typed:</para>
 -->
       <para>wird Subversion die unsicheren Zeichen umwandeln, als ob
-        Sie</para>
+        Sie Folgendes geschrieben hätten:</para>
 
       <screen>
 $ svn checkout http://host/path%20with%20space/project/espa%C3%B1a
 </screen>
 
-      <para>geschrieben hätten.</para>
-
 <!--
       <para>If the URL contains spaces, be sure to place it within quotation
         marks so that your shell treats the whole thing as a single
@@ -750,15 +723,7 @@
         Hauptverzeichnis abgelegt, wie in <xref
         linkend="svn.basic.in-action.wc.dia-1"/> dargestellt.</para>
 
-      <figure id="svn.basic.in-action.wc.dia-1">
 <!--
-        <title>The repository's filesystem</title>
--->
-        <title>Das Dateisystem des Repositorys</title>
-        <graphic fileref="images/ch02dia6.png"/>
-      </figure>
-
-<!--
       <para>To get a working copy, you must <firstterm>check
         out</firstterm> some subtree of the repository.  (The term
         <emphasis>check out</emphasis> may sound like it has something to do
@@ -991,14 +956,6 @@
         <quote>Schnappschuss</quote> des Repositorys nach einem Commit
         ist.</para>
 
-      <figure id="svn.basic.in-action.revs.dia-1">
-<!--
-        <title>The repository</title>
--->
-        <title>Das Repository</title>
-        <graphic fileref="images/ch02dia7.png"/>
-      </figure>
-
       <sidebar>
 <!--
         <title>Global Revision Numbers</title>