Just what is the Character (Set) of that Document?

In this age of the Internet, where systems exchange information constantly, it is all too easy to forget that computer systems can store their “plain-text” data in many different ways. If you thought UNIX files versus Windows files were annoying with their Line Feed versus Carriage Return+Line Feed line endings, can you imagine the trouble we would have if ASCII didn’t exist?

ASCII, the American Standard Code for Information Interchange, has become a subset of many other character sets in common usage today, so you can exchange a lot of documents without too much hassle, but what do you do if you get something else?

Since ASCII is commonly a subset, it is very easy to process some documents as if they were ASCII, but that only works as long as the document contains nothing outside the ASCII range. For example, while working with some SAP IDOCs last year, we came across foreign characters that were getting lost in translation. More precisely, they weren’t being lost so much as being interpreted as different characters than they really were. Fixing this problem wasn’t that difficult, though. In our EBI product, we have included tasks that let you decode and encode text documents from one character set to another very easily. We just added the Character Set Converter task to our existing Business Process Script and the problem was resolved.
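To make the “interpreted as a different character” failure concrete, here is a minimal Python sketch. It is not the EBI Character Set Converter task, and the character sets used (UTF-8 and ISO-8859-1) are assumptions chosen for illustration; the post does not say which ones the IDOCs involved. The helper name `convert_charset` and the sample name are likewise hypothetical.

```python
def convert_charset(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes using the source character set, then re-encode in the target."""
    return data.decode(source).encode(target)


original = "Müller"                      # sample text with one non-ASCII character
wire_bytes = original.encode("utf-8")    # assumption: the sender wrote UTF-8

# Reading the bytes with the wrong character set does not raise an error;
# it silently produces different characters.
misread = wire_bytes.decode("iso-8859-1")

# Naming the correct source character set during conversion preserves the text.
latin1_bytes = convert_charset(wire_bytes, "utf-8", "iso-8859-1")
recovered = latin1_bytes.decode("iso-8859-1")

print(misread)    # the "ü" shows up as two unrelated characters
print(recovered)  # the original text
```

Notice that the ASCII letters survive the wrong decoding untouched, which is exactly why this kind of problem can go unnoticed until a non-ASCII character shows up.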

However, what would you do if you received a text document that appeared completely illegible? In that case, chances are the entire document is encoded in a character set that doesn’t include ASCII at all, such as EBCDIC. Here, the best place to address character encoding is on the system where the document is created in the first place. This is because computer systems can include special hardware to accelerate the translation from a native form of EBCDIC, for example IBM037, to a more multi-language-friendly encoding such as UTF-16 or UTF-8, both of which are in common use today for exchanging documents between systems, particularly as XML. And by the way, UTF-8 is one of those character set encodings that contains ASCII as a subset.
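As a rough illustration of the EBCDIC case, the Python sketch below uses Python’s built-in `cp037` codec, which corresponds to IBM037. It is a software-only conversion and says nothing about the hardware acceleration mentioned above; the sample text is made up.

```python
text = "Invoice 42"                      # made-up sample content

# The same text as an EBCDIC (IBM037) system would store it.
ebcdic_bytes = text.encode("cp037")

# Treating those bytes as an ASCII-compatible character set gives
# something that looks nothing like the original.
garbled = ebcdic_bytes.decode("latin-1")

# Decoding with the correct character set, then re-encoding, gives UTF-8.
utf8_bytes = ebcdic_bytes.decode("cp037").encode("utf-8")

# ASCII is a subset of UTF-8: pure ASCII text has identical bytes in both.
same_as_ascii = text.encode("ascii") == utf8_bytes

print(repr(garbled))
print(utf8_bytes)
print(same_as_ascii)
```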

I hope that the next time you come across some “plain-text” that appears to be missing something or doesn’t appear to make any sense, you’ll remember to check what the character set of the document is, and find that fixing the problem isn’t as difficult as you might think.

2 thoughts on “Just what is the Character (Set) of that Document?”

  2. Roy Hayward

    Jason,

    This is one of those problems that just keeps seeming to rear its ugly head. But only like once every couple of years. In dealing with EDI trading partners, for some reason there are still tools out there that want to use the EBCDIC “pipe” for a delimiter as opposed to the ASCII “pipe”.

    If you haven’t seen this before and haven’t been warned, dealing with a character set problem can be a hair pulling experience until you figure out what is happening.
