control characters and Util::clean_text()

Tom Emerson tree at basistech.com
Wed Dec 21 18:45:15 UTC 2005


Benton, Kevin writes:
> Actually, it does.  If I put a character in and you strip it without
> telling me, you've hurt my ability to get my "stuff" done.  We ran into
> this with Perforce and non-utf8 Unicode files that were getting
> corrupted on check-in.  Perforce messed up the file format but didn't
> tell the user that it couldn't handle Unicode in any format except
> UTF-8.  Users didn't know any better so they went on with their work
> only to find out later that their check-ins had been corrupted when they
> tried to use them from a different system at a later date (not easy to
> troubleshoot).

This isn't the issue.

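The issue, as I read the question, was this: if you have Unicode text
encoded in UTF-8 and you strip control characters (including 0x7F)
from it, will that corrupt the text? The answer is no, it will not.
In UTF-8, every byte of a multi-byte sequence has its high bit set, so
a byte in the range 0x00-0x7F is always a complete ASCII character.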

If you have UTF-16 or UCS-2 or UTF-32 text and you start stripping
bytes by value, then you will have a problem, because in those
encodings a byte such as 0x00 or 0x0A can be one half of a larger code
unit. The fact that Perforce choked and did the wrong thing is a bug on
their side.

Stripping control codes from UTF-8 is no different from stripping
control codes from Latin-1.
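
To make the point concrete, here is a minimal Python sketch. It is not
the code behind Util::clean_text(); the helper name strip_control_bytes
and the sample string are made up for illustration, and "control
character" is assumed to mean the bytes 0x00-0x1F plus 0x7F (so tabs and
newlines are removed as well).

```python
# Assumption: "control characters" means the C0 range 0x00-0x1F plus DEL (0x7F).
CONTROL_BYTES = bytes(range(0x20)) + b"\x7f"


def strip_control_bytes(data: bytes) -> bytes:
    """Delete every control byte from data, without looking at the encoding."""
    return data.translate(None, CONTROL_BYTES)


# Sample text: "café", a BEL, a space, two CJK characters, and a DEL.
dirty = "caf\u00e9\x07 \u65e5\u672c\x7f"
clean = "caf\u00e9 \u65e5\u672c"

# UTF-8: lead and continuation bytes are all >= 0x80, so only the real
# control characters are removed and the result is still valid UTF-8.
utf8_result = strip_control_bytes(dirty.encode("utf-8"))
utf8_ok = utf8_result.decode("utf-8") == clean

# UTF-16: the high byte of every ASCII character is 0x00, which this
# byte-level filter treats as a control character, so the data is damaged.
utf16_result = strip_control_bytes(dirty.encode("utf-16-le"))
utf16_damaged = utf16_result != clean.encode("utf-16-le")

print("UTF-8 survives byte-level stripping:", utf8_ok)
print("UTF-16 damaged by byte-level stripping:", utf16_damaged)
```

For UTF-16 or UTF-32 input the safe approach is to decode first, strip
at the character level, and re-encode.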

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)


