BugzilLi0n - Summary of Problem & Suggested Approaches [INCOMPLETE]

Tue Jul 23 21:03:22 UTC 2013

Apologies for the cross-posting. I'm not sure exactly where this discussion should take place (I think developers@, but I'm Cc'ing localizers@ just in case).

Below is my assessment of the linguistic adaptation problem and the state of Bugzilla's solution(s) to that problem. I have included recommendations for future action, but that is secondary. The assessment below is primarily to solicit comment and encourage discussion in order to establish a clear direction for any sustained effort in this area.

WARNING: This is INCOMPLETE. I had originally planned on dedicating more time to this effort, but life happens, and I am no longer able to. However, I did not want my research to go to waste. I hope this will be useful to someone who comes in after me.

    --Matt (IRC: posita)

BACKGROUND & INTRODUCTION

About a year ago, somewhat by accident, I started looking into the state of the internationalization and localization world for Bugzilla (i.e., i18n, l10n, l20n, which I will collectively call "BugzilLi0n"*). I started off with (what I thought was) a simple experiment: I wanted to see if I could adapt a local Bugzilla installation to serve as a law firm issue tracker and work flow manager.

(Footnote *: Pronounced, "bug-ZIL-ee-yun", or "bug-zih-LO-kah-lih-ZAY-shun".)

Lawyers and paralegals are notoriously slow to adopt new technology. Technological offerings in that field are not only clunky and often well behind the feature curve, but they are almost always offensively expensive. Having spent about a decade in software development, but finding myself now working in a different field, I longed for familiar tools and features I knew would be immediately useful in my new context (even with minimal training of others).

I settled on an approach: take a stable release of Bugzilla and see if I could adapt it to a relatively foreign environment through template editing, perhaps some extensions, but with minimal patching. It occurred to me that some of the difficulties I was experiencing were broader than my narrow goals, so I decided to investigate further. I quickly and necessarily encountered some of the localization issues that garnered much discussion about 4-5 years ago. It didn't seem like much more work to consider solutions that, in addition to serving my own narrow goals, could be reusable in a broader context. That is what led me to participate in this forum.

Disclaimer

These comments and observations are mine alone. They are the result of only a handful of hours of investigation as an outsider with specific goals (i.e., biases). They should be afforded appropriate skepticism.

Also, I am not an expert in linguistics. English is my native language. I learned French in high school (and have forgotten most of it). I speak enough Spanish to awkwardly order a cup of coffee and that's about it.

Some of the material presented herein will be remedial or obvious to participants, but I thought a comprehensive summary would be useful for onlookers. Please note that I do not address the concept of automatic language selection or "fallback". That is a problem I consider related to, but separate from, this discussion.

WHERE IS BUGZILLI0N TODAY?

(I believe) I have looked through most of the materials relating to the problem of adaptation. The Bugzilla universe appears largely unchanged from around 2008-9-ish with some tapering off in 2010. However, other Mozilla projects have maintained various discussions (see, e.g., <http://diary.braniecki.net/2012/02/27/l20n-feedback-round/>, <https://groups.google.com/forum/?fromgroups#!forum/mozilla.dev.l10n>, <https://wiki.mozilla.org/L20n:Roadmap:Next>, etc.).

Bugzilla currently delegates nearly all of its language adaptation to templates which are parsed and processed by Template Toolkit with Bugzilla-specific extensions. I say "nearly" because there are situations (e.g., custom fields, <https://bugzilla.mozilla.org/show_bug.cgi?id=426222>, etc.) where display values can come straight from non-template sources which are not currently localized (see also <https://wiki.mozilla.org/Bugzilla:L10n:Roadmap#Localizable_database_data>). There are probably work-arounds for use cases employing custom fields concurrently with support for different languages, but those work-arounds likely require a level of technical expertise not available to most (e.g., Perl, an understanding of Bugzilla's Template Toolkit, Bugzilla Perl APIs, etc.) and a level of hacking ugliness which would likely frustrate attempts to upgrade by patch.

PROBLEM IN TWO PARTS

Is Bugzilla currently as adaptable as it "needs" to be?

This is currently unclear to me. To follow my discussion on this topic, it may help to review <http://search.cpan.org/dist/Locale-Maketext/lib/Locale/Maketext/TPJ13.pod> and <https://wiki.mozilla.org/L20n/Issues>, which describe the types of contextual linguistic nuances present in languages other than English that might benefit from accommodation.

In my assessment, the difficulties Bugzilla is experiencing with localization stem in part from the fact that the product is (or at least was initially) developed primarily by English speakers (so it has a natural bias toward solving linguistic problems in that domain), and in part from the fact that English has relatively simple rules respecting plurality, gender, formality, etc.
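To make the plurality point concrete, here is a minimal sketch (in Python, purely for illustration; none of this is Bugzilla code) of the category selection a linguistic layer would have to perform. The function names are hypothetical. The Russian rule follows the commonly published plural rules for whole numbers (one/few/many) and ignores fractions.

```python
def plural_category_en(n: int) -> str:
    """English: one form for exactly 1, another form for everything else."""
    return "one" if n == 1 else "other"


def plural_category_ru(n: int) -> str:
    """Russian, non-negative whole numbers only: three categories."""
    last_digit = n % 10
    last_two = n % 100
    if last_digit == 1 and last_two != 11:
        return "one"    # 1, 21, 31, ... but not 11
    if 2 <= last_digit <= 4 and not 12 <= last_two <= 14:
        return "few"    # 2-4, 22-24, ... but not 12-14
    return "many"       # 0, 5-20, 25-30, ...


# Forms of the Russian noun for "result", keyed by category.
RESULT_FORMS_RU = {
    "one": "результат",
    "few": "результата",
    "many": "результатов",
}


def count_results_ru(n: int) -> str:
    return f"{n} {RESULT_FORMS_RU[plural_category_ru(n)]}"
```

A template that only knows how to append an "s" has no slot for the third form, which is the kind of English bias described above.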

I have yet to find a counterexample**, but for the most part, the problem of localization within a UI can be modeled with the following distinct(ish) but related layers:

    Query
        Display Logic
                v
            (Context Data)
                v
            Linguistic Logic

    Response
        Display Logic
                ^
            (Compliant Text)
                ^
            Linguistic Logic

(Footnote **:  Except perhaps that of Japanese, where displaying the same data in the same layout may require different language depending on the status of the viewer [e.g., worker bee, boss, customer, etc.], although in a broad sense, the presented model could accommodate such situations with appropriately rich context data. See, e.g., <https://wiki.mozilla.org/L20n/Issues>.)

"Display logic" can be thought of as providing structure to the display of contextually relevant data. For seeing users***, this largely involves laying out the parts or subparts of a page or screen (e.g., a tabular format for a report, field name/value pairs for attributes, a linear list of comments as part of a discussion, input for performing a search, etc.).

(Footnote ***:  Alternatively accessible formats are beyond the scope of this discussion, but useful to keep in the back of one's mind.)

Part of the larger responsibility of the display logic layer is to display blocks of written language (text) within those subparts. However, language has its own rules (which I've labeled "linguistic logic" in the model above). In order to apply those linguistic rules correctly, context data must be provided. It is possible this context is appropriately discovered and communicated by the display logic layer, but this is not necessarily a requirement (it could come from somewhere else).

This is easier to model in broad theory than to apply in practice. Assuming one could discover the number and genders of an audience, consider the following (contrived) example where one wants to display a status string above a list of query results: "As of [date/time], there were [#] result[s] to your query, [sir/ma'am/gentlemen/ladies/lady and gentlemen/ladies and gentleman/ladies and gentlemen]."

Date and time formatting is a fairly common and well-understood aspect of localization. Despite this, the layer responsible for formatting isn't always clearly delineated. Static string libraries are ill-suited to formatting dynamic data, so this often gets done in the display logic layer. Additionally, the display logic layer may not care about the specific number of results from a query. It might be perfectly happy just iterating through a list of zero or more items to present the contents of zero or more rows in a table. But, in order for the linguistic layer to do its job, metadata about the query results needs to be gathered or computed (i.e., a count). In addition, the number and gender of the audience are likely irrelevant to the layout, but are very relevant to the linguistic layer.
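The contrived status string can be sketched roughly as follows. This is an illustration in Python, not anything Bugzilla does today: StatusContext, address_en, and status_line_en are hypothetical names, and the date format is an arbitrary choice. Note that even the English version needs verb agreement ("was"/"were"), which the bracketed template glosses over.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class StatusContext:
    """Context data the display layer must compute for the linguistic layer."""
    as_of: datetime
    result_count: int
    women: int  # audience members to be addressed as women
    men: int    # audience members to be addressed as men


def address_en(women: int, men: int) -> str:
    """Pick the English form of address from the audience's number and gender."""
    if women < 0 or men < 0 or women + men == 0:
        raise ValueError("need at least one audience member")
    if women == 0:
        return "sir" if men == 1 else "gentlemen"
    if men == 0:
        return "ma'am" if women == 1 else "ladies"
    ladies = "lady" if women == 1 else "ladies"
    gentlemen = "gentleman" if men == 1 else "gentlemen"
    return f"{ladies} and {gentlemen}"


def status_line_en(ctx: StatusContext) -> str:
    """English-only rendering; other languages would need their own rules."""
    verb = "was" if ctx.result_count == 1 else "were"
    noun = "result" if ctx.result_count == 1 else "results"
    stamp = ctx.as_of.strftime("%Y-%m-%d %H:%M UTC")
    return (f"As of {stamp}, there {verb} {ctx.result_count} {noun} "
            f"to your query, {address_en(ctx.women, ctx.men)}.")
```

None of the fields in StatusContext affects how the result table is laid out, yet every one of them has to be gathered before the sentence can be written.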

There are two messy solutions: either provide everything to the linguistic logic layer, and let that layer figure it out, or figure out what is needed by the linguistic layer and provide only that. In the first case, the linguistic layer is no longer purely linguistic. It needs to know fairly detailed information about the data it receives in order to make sense of it. In the second case, the display logic layer needs to know fairly detailed information about what is needed by all potential instances of the linguistic layer. I say "all" because some information may be needed by some linguistic contexts, but ignored by others. The display logic layer would likely need to provide all data potentially needed by all linguistic contexts if it wanted to avoid being in the business of figuring out which went with which. A controller implementation may be an appropriate intermediary, but the work needs to be done somewhere. There Is No Free Lunch.
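The second solution can be sketched as follows (again in Python, and again hypothetical throughout). Each linguistic context declares the context keys it needs, and the display layer computes the union so that it does not have to know which key goes with which language. The per-locale requirements shown are illustrative guesses, not linguistic claims.

```python
from typing import Any, Callable, Dict, Set

# Hypothetical declaration of what each linguistic context needs.
REQUIRED_CONTEXT: Dict[str, Set[str]] = {
    "en": {"result_count"},
    "fr": {"result_count", "audience_gender"},
    "ja": {"result_count", "viewer_status"},
}

# Hypothetical functions that derive each context value from raw data.
PROVIDERS: Dict[str, Callable[[dict], Any]] = {
    "result_count": lambda raw: len(raw["rows"]),
    "audience_gender": lambda raw: raw["viewer"]["gender"],
    "viewer_status": lambda raw: raw["viewer"]["status"],
}


def build_full_context(raw: dict) -> Dict[str, Any]:
    """Compute every value that any registered locale might need."""
    needed = set().union(*REQUIRED_CONTEXT.values())
    missing = needed - PROVIDERS.keys()
    if missing:
        raise KeyError(f"no provider for: {sorted(missing)}")
    return {key: PROVIDERS[key](raw) for key in sorted(needed)}


def context_for(locale: str, full: Dict[str, Any]) -> Dict[str, Any]:
    """Narrow the full context to what one locale declared it needs."""
    return {key: full[key] for key in REQUIRED_CONTEXT[locale]}
```

The cost is visible in build_full_context: adding one locale with one new requirement means teaching the calling layer how to compute it.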

Currently, Bugzilla attacks this problem by munging the two layers inside the template. As I alluded to above, I believe the current implementation is likely sufficient to afford solutions to most localization issues. However, there are several drawbacks. First, there is little reuse because reuse is either hard or ugly. Second, templates are already difficult to create/maintain.**** One really needs to understand the Bugzilla-specific oddities like Template Toolkit parsing extensions, field_descs, global/variables.none.tmpl, etc., and one should probably know Perl. This shrinks the market of potential new maintainers considerably. By the way, I don't think this is specific to Bugzilla. Many architectures, including Google Apps (Django), necessitate significant coordination between display logic and linguistic logic layers, etc. It's a Hard Problem.

(Footnote ****: Please understand, I am not talking about existing maintainers, although their pain is of concern as well. I'm talking about making it easier for outsiders to adapt Bugzilla to new linguistic contexts. The barrier to entry is high, even if it is documented [see <http://www.bugzilla.org/docs/tip/en/html/cust-templates.html>]. There is, of course, a risk that even with the requisite expertise, existing maintainers will give up if modifications are too difficult. This should be avoided, as loss of expertise is very costly.)

Without a clearly effective way to separate and easily maintain the above-identified layers, I am reluctant to recommend changing the current implementation, simply because the problem is significant. Effort should be proportionate to realized gain. One needs to be careful where significant work is likely, but where the magnitude of any benefit is unclear, and where implementation of one solution may be difficult to leverage if another solution is adopted later. I will attempt to address this in the next section.

If Bugzilla needs increased adaptability, what are the viable solutions?

INCOMPLETE. This section needs more work and more details.

There are tools out there like Locale::Maketext and perllocale, but their use likely requires a lot of care and thought or one will just end up with a spaghetti mess.

I do not know this to be true, but Bugzilla's current template set feels like it evolved organically (kind of like a scrappy patchwork class hierarchy) where contributors came and went and abstracted where it was immediately convenient to add features or fix bugs without too much thought about overarching conventions. Please note, I mean no offense by that observation. It is normal (and perhaps healthy). For example, I suspect where several pages need to display a table list of bugs, that table list was pulled out and referenced by the pages on which it appears. Subtle (and not so subtle) differences were parameterized as needed by the client pages. Feature requests came in, and the table list was further parameterized. Ultimately, these kinds of extract-and-reference bits become hyper-specific to their own tasks.

Basically, any recommendation likely comes down to: 1) continue collapsing the display/language layers inside of things like templates and leave the hard work to the template maintainers; or 2) try to separate them out knowing that calling layers are going to need to provide a lot of language-specific data to the language layer.

Regarding option 1, see (metaphorically) Networking Truth #5 (RFC 1925). It also has the effect of potentially alienating extension writers who have to patch templates to inject display material where they need it. This sucks, but it seems to have been the de facto convention for quite some time, so who am I to judge?

Regarding option 2, it might be cleaner, but only if it can provide clear and (perhaps more importantly) consistent interfaces (contracts) to (e.g.) the display layer. There are probably a small number of contextual types (e.g., counts, gender/count pairs, date/time, money, etc.), but someone would likely need to invest a lot of time designing and experimenting with those interfaces to avoid just making things messier. See Networking Truths #s 6 and 6a.
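As a starting point for discussion only, the "small number of contextual types" might look something like the sketch below. It is written in Python for brevity (Bugzilla itself is Perl), and every name in it (Count, GenderedCount, Moment, Money, LinguisticLayer, EnglishLayer, the "bugs_found" message id) is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Protocol, Union


@dataclass(frozen=True)
class Count:
    value: int


@dataclass(frozen=True)
class GenderedCount:
    feminine: int
    masculine: int


@dataclass(frozen=True)
class Moment:
    value: datetime


@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str


ContextValue = Union[Count, GenderedCount, Moment, Money]


class LinguisticLayer(Protocol):
    """The contract the display layer codes against."""

    def render(self, message_id: str, **context: ContextValue) -> str: ...


class EnglishLayer:
    """One possible implementation of the contract."""

    def render(self, message_id: str, **context: ContextValue) -> str:
        if message_id == "bugs_found":
            bugs = context["bugs"]
            if not isinstance(bugs, Count):
                raise TypeError("'bugs' must be a Count")
            noun = "bug" if bugs.value == 1 else "bugs"
            return f"{bugs.value} {noun} found"
        raise KeyError(message_id)
```

The display layer would then call something like EnglishLayer().render("bugs_found", bugs=Count(3)) without knowing how plurals work. Whether four types are enough is exactly the kind of question that would need experimentation.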