MySQL UTF-8 Conversion in Bugzilla
Fergus Sullivan
fergus at yahoo-inc.com
Tue Nov 28 02:01:28 UTC 2006
Hi guys.
Long time listener here, first time caller...
Our Bugzilla usage here at Yahoo may help illustrate the need for a
broader approach to this problem. We've found that a simple
conversion to UTF-8 may not be the answer. A more careful conversion
process may be needed to avoid data loss.
As background, our Bugzilla instance is based on version 2.20 and has
not had a default charset. We have bugs logged by internal users in
many locales, including multi-byte locales. The text of these bugs
can often be in an East Asian language. These display fine because
MySQL doesn't really care what charset its bytes are in, and the
browser default ensures a consistent charset is used within each
locale.
Among the existing bugs, we have bugs that may fall into any of the
following categories:
== The simple bugs ==
- bug is entirely in ASCII
- bug is entirely in UTF-8
- bug is in Shift-JIS, EUC-JP, Big5, GB2312 or any other 'native
character set'.
All of these can be converted from charset to charset relatively
easily, especially if we can work out what the source charset is.
There are libraries for this.
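As a rough illustration of the simple case, here is a minimal Python
sketch (Bugzilla itself is Perl; this is illustration only, not
Bugzilla code). The names CANDIDATES, guess_charsets and to_utf8 are
made up for this example. It assumes we only care about the charsets
listed above. Note that the same bytes can decode cleanly under more
than one charset, so a successful decode is only a hint, not proof.

```python
# Hypothetical candidate list, taken from the charsets named above.
CANDIDATES = ["ascii", "utf-8", "shift_jis", "euc_jp", "big5", "gb2312"]


def guess_charsets(raw: bytes) -> list:
    """Return every candidate charset under which raw decodes without error."""
    matches = []
    for name in CANDIDATES:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Convert raw from a known source charset to UTF-8.

    Raises UnicodeDecodeError instead of silently dropping bytes.
    """
    return raw.decode(source_charset).encode("utf-8")
```

When guess_charsets returns more than one name, a human (or the
reporter's locale) still has to pick the right one.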
== The tricky bugs ==
- bug is mostly in ASCII or UTF-8 or any other single charset, but
includes characters from another valid charset. Why? A bug could be
logged in English but include pasted Japanese text.
- bug is mostly in ASCII or UTF-8 or any other single charset, but
includes garbage characters. Why? A bug report might refer to a
Chinese product, and include pasted characters that show some kind
of corruption took place prior to display time.
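One way to find the tricky bugs before converting anything is to
locate the byte runs that do not decode under the bug's dominant
charset, so they can be reviewed rather than silently replaced. A
minimal Python sketch, assuming the dominant charset is already
known; find_undecodable is a made-up name, not an existing tool:

```python
def find_undecodable(raw: bytes, charset: str) -> list:
    """Return (offset, bad_bytes) for each run that fails to decode.

    The original bytes are never modified, so nothing is lost.
    """
    problems = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode(charset)
            break  # the rest of the text decodes cleanly
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, raw[start:end]))
            pos = end  # resume just past the bad run
    return problems
```

This only tells us where the odd bytes are. It cannot tell pasted
text in another valid charset apart from corruption that the bug is
deliberately demonstrating, which is exactly why these bugs are
tricky.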
== The Problem ==
We'd very much like to have our existing database start defaulting to
UTF-8. The problem is: how can we convert existing reports to UTF-8
without losing the information in existing bugs that demonstrates
charset-related problems?
== Problematic Solution ==
We considered the following.
- have Bugzilla default to UTF-8 for all NEW bugs.
- set a Unicode bit on all new bugs.
- when showing a bug (show_bug.cgi), if its Unicode bit is set,
declare UTF-8 as the charset in the response header.
Flaw:
- What about contexts where we need to display both a Unicode bug
and a non-Unicode bug in the same page?
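To make the flaw concrete, here is a minimal Python sketch of the
per-page decision (illustration only; header_charset and the list of
per-bug flags are made-up names, not Bugzilla code). A page carries
one charset declaration, so a page mixing flagged and unflagged bugs
has no correct answer:

```python
from typing import Iterable, Optional


def header_charset(unicode_bits: Iterable[bool]) -> Optional[str]:
    """Pick the charset to declare for a page showing these bugs.

    unicode_bits holds the hypothetical per-bug Unicode flag.
    """
    bits = set(unicode_bits)
    if bits == {True}:
        return "utf-8"
    if bits <= {False}:
        # Legacy behaviour: declare nothing, browser default applies.
        return None
    raise ValueError("page mixes Unicode and non-Unicode bugs")
```

A single-bug page works either way. A bug list containing both kinds
of bug raises, which is the case we have no answer for.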
== Problems with email ==
We also need to enforce a charset when bugs are submitted via the
email gateway, and when notifications are sent. Also, let's not
forget whinemails.
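For the outgoing side only, a minimal sketch of what "enforcing a
charset" could look like, using Python's standard email package for
illustration (Bugzilla itself is Perl, and build_notification is a
made-up name). It assumes the bug text has already been converted to
Unicode text before the mail is built:

```python
from email.message import EmailMessage


def build_notification(sender: str, recipient: str,
                       subject: str, body: str) -> EmailMessage:
    """Build a plain-text mail whose body is explicitly declared as UTF-8."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # Non-ASCII subjects are encoded per RFC 2047 when the mail is serialized.
    msg["Subject"] = subject
    # Declares Content-Type: text/plain with charset utf-8.
    msg.set_content(body, charset="utf-8")
    return msg
```

Incoming mail through the gateway is the harder half: there the
declared charset of each message would have to be read and the body
converted, and this sketch does not cover that.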
Thoughts?
/ferg
--
fergus sullivan | yahoo bugzilla admin | fergus at yahoo-inc.com | o. 408.349.6807
On Sun 19-Nov-06, at 10:20 a, Frédéric Buclin wrote:
>> checksetup.pl will warn you and give you 60 seconds to stop it,
>> before it goes ahead and converts your database.
>
>
> As dataloss is something really critical, wouldn't it be safer to
> request the admin to type some character (such as [Y]es) to confirm
> the DB conversion, and cancel it if the admin didn't do it within 60
> seconds? Of course, checksetup.pl should be able to get this answer
> from the answer file too. I would feel much more confident doing it
> this way.
>
> LpSolit