control characters and Util::clean_text()
Tom Emerson
tree at basistech.com
Wed Dec 21 18:45:15 UTC 2005
Benton, Kevin writes:
> Actually, it does. If I put a character in and you strip it without
> telling me, you've hurt my ability to get my "stuff" done. We ran into
> this with Perforce and non-utf8 Unicode files that were getting
> corrupted on check-in. Perforce messed up the file format but didn't
> tell the user that it couldn't handle Unicode in any format except
> UTF-8. Users didn't know any better so they went on with their work
> only to find out later that their check-ins had been corrupted when they
> tried to use them from a different system at a later date (not easy to
> troubleshoot).
This isn't the issue.
The issue, as I read the question, was this: if you have Unicode text
encoded in UTF-8 and you strip control characters (including 0x7F)
from it, will that corrupt the text? The answer is no, it will not.
If you have UTF-16 or UCS-2 or UTF-32 text and you start stripping
random bytes, then you will have a problem. The fact that Perforce
choked and did the wrong thing is a bug on their side.
Stripping control codes from UTF-8 is no different from stripping
control codes from Latin-1.
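To make that concrete, here is a rough Python sketch. It is not the actual
Util::clean_text() code; the helper name strip_control_bytes and the sample
strings are made up for illustration. It relies on the fact that every byte
of a multi-byte UTF-8 sequence is 0x80 or higher, so a byte in 0x00-0x1F or
0x7F can only be a control character, never part of another character.

```python
import re

# Hypothetical helper, for illustration only: removes the C0 control bytes
# (0x00-0x1F, which includes tab and newline) and DEL (0x7F) from raw bytes.
CONTROL_BYTES = re.compile(rb"[\x00-\x1f\x7f]")


def strip_control_bytes(data: bytes) -> bytes:
    return CONTROL_BYTES.sub(b"", data)


# UTF-8: lead and continuation bytes are all >= 0x80, so only the real
# control characters (BEL and DEL here) are removed.
text = "na\u00efve \u65e5\u672c\u8a9e\x07\x7f text"
cleaned = strip_control_bytes(text.encode("utf-8"))
assert cleaned.decode("utf-8") == "na\u00efve \u65e5\u672c\u8a9e text"

# Latin-1 behaves the same way: one byte per character, controls below 0x20.
latin1 = strip_control_bytes("na\u00efve\x07".encode("latin-1"))
assert latin1.decode("latin-1") == "na\u00efve"

# UTF-16: the bytes of ordinary characters fall in the control range.
# "A" is 41 00 and U+0100 is 00 01 in UTF-16LE, so byte-level stripping
# leaves a single stray byte that no longer decodes.
utf16 = "A\u0100".encode("utf-16-le")
damaged = strip_control_bytes(utf16)
try:
    damaged.decode("utf-16-le")
    utf16_survived = True
except UnicodeDecodeError:
    utf16_survived = False
assert not utf16_survived
```

For UTF-16 or UTF-32 input, the safe route is to decode to characters first
and strip control characters there, rather than filtering bytes.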
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)