Term of the Week: Character Encoding

What is it?

A method for representing characters in a data format, typically binary, so that the characters can be transmitted electronically and decoded properly by the receiver.

Why is it important?

As localizers, we work exclusively with text that has been encoded for storage and transmission. If we don’t know how it’s encoded, we’ll read or write it incorrectly.

Why does a technical communicator need to know this?

If two people exchange handwritten letters, they can be reasonably confident that if one writes the letter A, the other will recognize it. But what if they send those messages electronically? The sender and receiver have to agree in advance how the sender should convert the text to binary data, so the receiver can reverse the process and read what was sent.

That agreement is a character encoding: a system to map characters to a transmission format, and back. Character encoding predates computers (Morse code is one example), but in localization, we are primarily concerned with encoding characters to binary.
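
Here’s a minimal sketch of that round trip (the examples in this entry use Python purely for brevity; any language with Unicode string support behaves the same way):

    # Encoding maps characters to bytes; decoding reverses the mapping.
    text = "résumé"
    data = text.encode("utf-8")      # b'r\xc3\xa9sum\xc3\xa9' -- the characters as binary
    restored = data.decode("utf-8")  # the receiver reverses the process
    assert restored == text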

If text is stored using one encoding and read back using a different one, the result is corrupted characters (often called mojibake). To reduce this risk, many applications always use the same encoding, but it’s still perilously easy for buggy localization tools to corrupt data accidentally during processing.
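
Here’s the same sketch gone wrong: the text is written as UTF-8 but read back as Latin-1 (ISO 8859-1), a different encoding:

    # The same bytes, interpreted with the wrong decoding method:
    data = "résumé".encode("utf-8")
    garbled = data.decode("latin-1")
    print(garbled)  # rÃ©sumÃ© -- each é is misread as two characters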

Historically, the term character encoding has been used interchangeably with character set, but with the rise of Unicode, it’s important to maintain the distinction, because Unicode is a single character set that supports multiple character encodings. That is, Unicode data can be stored in different ways (UTF-8, UTF-16, etc.), and you need to know which method was used. In general, knowing the character encoding is enough to infer the character set, but the converse is not always true.
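
To make the distinction concrete, this sketch encodes the same two Unicode characters with three different encodings (note: bytes.hex with a separator requires Python 3.8 or later):

    text = "A\u20ac"  # two Unicode code points: U+0041 (A) and U+20AC (€)
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        print(encoding, text.encode(encoding).hex(" "))
    # utf-8      41 e2 82 ac
    # utf-16-le  41 00 ac 20
    # utf-32-le  41 00 00 00 ac 20 00 00

The character set is identical in all three cases; only the byte-level representation changes, which is why the receiver still has to know which encoding was used.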

If you’re lucky, you should rarely need to interact with character encodings directly [W3C-1][W3C-2].

References

About Chase Tingley

Chase Tingley is VP of Engineering at Spartan Software, Inc. He has 15 years of experience developing localization tools, including work on WorldServer, GlobalSight, and Ontram. An advocate for the greater adoption of localization standards and open source tools, he is a core contributor to the Okapi Framework (an open source framework for creating interoperable localization tools, http://okapi.sourceforge.net) and a member of the OASIS XLIFF and XLIFF-OMOS Technical Committees.

Term: Character Encoding

Email: chase@spartansoftwareinc.com

Website: spartansoftwareinc.com/

Twitter: @ctatwork

LinkedIn: linkedin.com/in/ctingley/