The Lowdown on ASCII, ANSI, and Unicode

You must understand Unicode. At least the basics, anyway. Every programmer needs to understand character sets: ASCII, ANSI, OEM, and UNICODE. Without that understanding, you'll soon find yourself lost.

These are character sets used on our beloved computers, and they're very important to programmers. Character sets exist to associate a number (or even a multi-byte series of two or more numbers) with a particular character: that could be a letter, a digit, punctuation, or even a special symbol. All of these character sets provide the same functionality, but they map many of the number-to-character relationships very differently.

Exactly what do we mean by a multi-byte character? You may not have encountered it before. That's common, but it's a fairly simple concept. A byte can hold just 256 different values. However, there are many more than 256 possible characters needed. So, in some cases, it's been necessary to use multiple numbers, in a series, to represent some of these characters. Just keep in mind that characters may need 1 byte, 2 bytes, or even more, to be accurately represented.
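To make the idea concrete, here's a minimal sketch in Python (chosen only as a neutral illustration, not PowerBASIC). It assumes the Japanese code page known to Python as "cp932" as an example of a multi-byte character set; the helper name byte_count is made up for this sketch.

```python
# Count how many bytes one character occupies in a given character set.
# "cp932" (Japanese) is an assumption, picked as a familiar
# multi-byte code page; byte_count is a hypothetical helper name.
def byte_count(ch: str, encoding: str = "cp932") -> int:
    return len(ch.encode(encoding))

# Plain English letters and digits fit in one byte.
# Japanese characters need two bytes in this code page.
for ch in ("A", "7", "\u65e5", "\u672c"):
    print(repr(ch), byte_count(ch))
```

The same string can therefore contain a mix of 1-byte and 2-byte characters, which is why its length in bytes and its length in characters may differ.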

A good compiler offers you ANSI strings. This has been the standard for many years. A better compiler lets you choose between ANSI and UNICODE, but only one of them per program. If you want Unicode, you can't keep binary values in a string. It simply won't work. A great compiler, like PowerBASIC, supports all of them, in the same program, transparently. One variable with ANSI. Another with UNICODE. Mix and match any way you choose with PowerStrings. All the messy details, and even the needed translations, are handled automatically by the compiler. But more about that later.

ASCII characters

This is the original. The first character set to appear on personal computers. It was very simple, as long as you only needed plain old American English characters. No accents. No symbols. No drawing. No international characters. We used the numbers between 32 and 126 to represent what was needed. A blank space was 32, the letter "A" was 65, lower case "c" was 99. The entire set of characters could be stored in 7 bits. Very convenient for an 8-bit computer. Numbers below 32, along with 127 (delete), were reserved for control codes. They were commonly known as non-printing, since they weren't associated with a character. The number 13 was a carriage return, 10 a line feed, 9 was a tab. While ASCII was certainly limited, it remains the basis for all the other character sets to follow. ASCII characters are unchanged in the other character sets.
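The codes quoted above are easy to check. This is a small sketch in Python (used here only for illustration); the helper name is_ascii_control is hypothetical.

```python
# Look up the ASCII codes mentioned in the text.
codes = {name: ord(ch) for name, ch in [
    ("space", " "), ("A", "A"), ("c", "c"),
    ("carriage return", "\r"), ("line feed", "\n"), ("tab", "\t"),
]}
for name, code in codes.items():
    print(f"{name}: {code}")

# Hypothetical helper: control codes are 0-31, plus 127 (delete).
def is_ascii_control(code: int) -> bool:
    return code < 32 or code == 127

# Every ASCII code fits in 7 bits, so the top bit of the byte is never set.
all_seven_bit = all(code < 128 for code in codes.values())
```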

OEM characters

It wasn't long before folks realized that bytes had one more bit. The characters from 128 to 255 were a great temptation. IBM added international characters, line drawing symbols, accents, and even more. WordStar used them for formatting. Who knows what else. Of course, when people outside the U.S. got involved, they needed to support their own characters. The IBM set didn't work, as it was just too small. There were lots of different ideas about what should be supported, and every region of the planet was sure that its scheme was the best. We had lots of incompatible, even incomprehensible text. These are OEM character sets. However, to this day, the original IBM character set (code page 437) is commonly known as the OEM character set.

ANSI characters

After a time, some resolution was found in what came to be called the ANSI character sets. Strictly speaking, the Windows code pages behind that name were never formal ANSI standards, but the name stuck. Pretty much everyone agreed on the lower 128 characters... they remained the same as ASCII. But the upper 128? That depended on your region of the world. Every region had its own "code page", depending upon the characters in its language. You couldn't use two code pages simultaneously, so characters remained just as incompatible as before. One character code above 127 might represent something totally different in each of the dozens of code pages.
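Here's a short sketch of that incompatibility, written in Python purely as an illustration. It decodes the very same byte value under three code pages that Python happens to ship with: cp437 (the original IBM set), cp1252 (Western European), and cp1251 (Cyrillic). The choice of byte &HE9 is arbitrary.

```python
# One byte above 127, interpreted under three different code pages.
raw = bytes([0xE9])

decoded = {page: raw.decode(page) for page in ("cp437", "cp1252", "cp1251")}
for page, ch in decoded.items():
    print(page, repr(ch), f"U+{ord(ch):04X}")

# The lower 128 values do not have this problem: &H41 is "A" everywhere.
lower = {page: bytes([0x41]).decode(page) for page in decoded}
```

The byte comes out as a Greek theta, an accented "e", or a Cyrillic letter, depending only on which code page the reader assumes.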

Of course, that wasn't the only issue. Some Asian writing systems have thousands of characters. That would never work in an 8-bit byte. So, the "multi-byte" character was born. Identify a character by a specific set of 2, 3, or even more bytes. While this gave us much more capacity, it was difficult for the programmer to use. It was easy to move forward, character by character, while scanning a string. But what if you had to back up? Was that preceding code a one-byte character? Or was it part of a multi-byte set of codes? All very confusing.
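The confusion is easy to reproduce. The sketch below uses Python for illustration and assumes the Japanese code page "cp932", where the second byte of a two-byte character can have the same value as an ordinary ASCII character.

```python
# The katakana character U+30BD followed by the letter "A".
text = "\u30bdA"
data = text.encode("cp932")          # three bytes: &H83 &H5C &H41

# Looked at in isolation, the middle byte is the ASCII code for a backslash.
middle_is_backslash = data[1] == ord("\\")

# A naive byte search "finds" a backslash that the text does not contain.
found_in_bytes = b"\\" in data
found_in_text = "\\" in text

print(data.hex(" "), middle_is_backslash, found_in_bytes, found_in_text)
```

Scanning forward from the start, the lead byte &H83 tells you that &H5C belongs to it. Landing on &H5C from behind, nothing in the byte itself says so.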

UNICODE characters

The creation of Unicode was an exhaustive effort to build a single character set which could represent every character in any language with a unique code number. This was sorely needed in this age of the Internet to avoid mass confusion. Although not widely known, there are several forms of Unicode. On Windows, by far the most common form is known as UTF-16, because it's built from 16-bit units. Each unit is a 2-byte value in the range of 0 to 65535. This form is used by PowerBASIC, Microsoft, and other compiler publishers. Most characters of most languages can be defined with a single 16-bit value, so it's a very convenient form for us to use. The rarer characters beyond that range are stored as a "surrogate pair" of two 16-bit values, drawn from a reserved range (&HD800 to &HDFFF) which is never used for anything else. It's a great tool for programmers, because every code identifies just one unique character, no matter what the language or region. The code page ambiguity is over. Just as before, the lowest 128 values are inherited directly from ASCII. They're just extended to a 2-byte representation. The letter "Z" in ASCII is represented by the byte &H5A. In Unicode, it's represented by the word &H005A. Very straightforward.
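A quick way to see the 16-bit words is to encode a string as UTF-16 and split the result into 2-byte values. This sketch uses Python for illustration only; utf16_units is a hypothetical helper name. It also shows the one wrinkle in UTF-16: a character whose code is above 65535 occupies two 16-bit words, known as a surrogate pair.

```python
# Hypothetical helper: return the 16-bit words that make up a string in UTF-16.
def utf16_units(text: str) -> list[int]:
    raw = text.encode("utf-16-le")   # little-endian, no Byte-Order Mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# "Z" is the byte &H5A in ASCII and the word &H005A in UTF-16.
z_units = utf16_units("Z")

# U+1F600 lies above 65535, so it needs a surrogate pair.
pair_units = utf16_units("\U0001F600")

print([f"&H{u:04X}" for u in z_units], [f"&H{u:04X}" for u in pair_units])
```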

When a file is created with Unicode characters, you'll find that it is sometimes identified by a "Byte-Order Mark" (BOM). If the first two bytes are &HFF, then &HFE, the file format is UTF-16 with "Little Endian" encoding. The low-order byte of each word precedes the high-order byte. This is the format used by all Intel CPUs, PowerBASIC, and the vast majority of other sources. However, if you encounter a Byte-Order Mark of &HFE, then &HFF, the encoding method is "Big Endian", and the byte order is reversed. It would be nice if every Unicode text file contained a Byte-Order Mark, but that's just not the case. Don't ever count on its presence.
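Checking for a Byte-Order Mark takes only a few lines. The following is a minimal sketch in Python (for illustration only); the function name detect_utf16_bom is hypothetical, and it deliberately returns nothing when no mark is present, since the text warns that you can't count on one.

```python
import codecs

# Hypothetical helper: inspect the first two bytes of some file data.
# Note: a UTF-32 little-endian file also begins with &HFF &HFE, so a
# fuller check would look at four bytes. This sketch covers UTF-16 only.
def detect_utf16_bom(data: bytes):
    if data.startswith(codecs.BOM_UTF16_LE):   # &HFF, &HFE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # &HFE, &HFF
        return "utf-16-be"
    return None                                # no mark: caller must decide

little = b"\xff\xfe" + "Z".encode("utf-16-le")
big = b"\xfe\xff" + "Z".encode("utf-16-be")
print(detect_utf16_bom(little), detect_utf16_bom(big), detect_utf16_bom(b"Z\x00"))
```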

You might guess that there's no real maximum to the number of characters defined by Unicode. In fact, the standard caps the code range at &H10FFFF, a little over a million possible codes, which is far more than a single 16-bit word can hold. For that reason, there's also a UTF-32 form, with each character defined as a 4-byte DWord. While this form gives every character the same fixed width, it's also very wasteful. UTF-32 is rarely used, and merits just a passing mention at this time.
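The waste is easy to measure. This sketch, in Python for illustration, encodes the same five-letter English word in three Unicode forms and counts the bytes. The sample word is arbitrary.

```python
# Byte counts for the same text in three Unicode forms (no Byte-Order Mark).
text = "Hello"
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)
```

For plain English text, UTF-32 takes four times the space of a single-byte form, and three of every four bytes are zero.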

UTF-8 UNICODE characters

And then came the Internet. Massive amounts of text, with great pressure to present a web page quickly. All those extra zeros on every UTF-16 character were called a huge waste of both bandwidth and time. How can we speed it all up? By making a compromise between ANSI and UNICODE. UTF-8 is an all out effort to minimize the size of the text which must be served up on a web page. In that context, text size is the #1 issue. In UTF-8, each character from 0 to 127 is stored as a single byte. Characters encoded above 127 are stored as a set of 2 to 4 bytes. (The original design allowed up to 6 bytes, but the current standard stops at 4.) The most used characters are assigned codes with the smallest byte count.
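Here's a brief sketch of those variable lengths, using Python for illustration. The four sample characters are arbitrary picks, one for each possible UTF-8 length.

```python
# One sample character for each UTF-8 length.
samples = {
    "A": "A",                    # U+0041, plain ASCII
    "e-acute": "\u00e9",         # U+00E9
    "euro sign": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",       # U+1F600
}
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(lengths)
```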

Just as before, UTF-8 also inherits the ASCII values. If you're working in American English only, you won't even notice a change. The inventors made some other nice changes as well. For multi-byte characters, they used unique values for the lead-in byte, and other unique values for the following bytes. This allows you to step through a string, in either direction, with absolutely no ambiguity. A huge improvement in the overall scheme of things. While UTF-8 is used most heavily on the Internet, PowerBASIC includes simple, easy-to-use functions for quick translation of UTF-8 Unicode to and from every other character set. A big boost for your Internet-aware applications.
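Stepping backwards can be sketched as follows. This is Python, used only for illustration, and it assumes the input is valid UTF-8. The rule it relies on is that every following byte has the bit pattern 10xxxxxx, and no lead-in byte ever does. The function names are hypothetical.

```python
# A following (continuation) byte always has its top two bits set to 10.
def is_continuation(byte: int) -> bool:
    return (byte & 0xC0) == 0x80

# Hypothetical helper: given an index just past a character, return the
# index where that character starts. Assumes valid UTF-8 and index > 0.
def prev_char_start(data: bytes, index: int) -> int:
    i = index - 1
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i

# "a" (1 byte), e-acute (2 bytes), euro sign (3 bytes): 6 bytes in all.
data = "a\u00e9\u20ac".encode("utf-8")

# Walk the string from the end back to the beginning.
starts = []
pos = len(data)
while pos > 0:
    pos = prev_char_start(data, pos)
    starts.append(pos)
print(starts)
```

Contrast this with the older multi-byte code pages, where a byte seen in isolation could be either a whole character or the tail end of one.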

PowerStrings make it so simple

The new PowerBASIC offers PowerStrings. They actually show signs of intrinsic intelligence. They know the form of the characters they hold. They know if they're Unicode... They know if they're ANSI... They know if they're OEM. And they act accordingly. If a conversion is needed, it's all automatic. Totally automatic. Concatenate ANSI with UNICODE? Sure! It's all automatic and totally transparent. For example, suppose you have an ANSI string as a$, and a Unicode string as u$$. You wish to concatenate them, storing the result as b$$. It's easy. No different than before.

b$$ = a$ + u$$

PowerBASIC automatically converts a$ to Unicode format, appends u$$ to it, then stores the result in the variable b$$. It's just that simple. But is it fast? Of course! As always, PowerBASIC leads the way in performance. It's very special. Just try the execution of an INSTR() function against any other. Unicode or ANSI. As with every time-sensitive function, PowerBASIC keeps two versions handy. One for OEM or ANSI, another for Unicode. Each of them is built with explicit, hand-crafted machine code. When it's time to create your EXE, PowerBASIC includes only the one which best suits your code.
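For comparison, here's roughly what that conversion amounts to when it must be spelled out by hand. This is a sketch in Python, not PowerBASIC, and it says nothing about how PowerBASIC does the work internally. It assumes the ANSI code page is cp1252 (Western European); the variable names are stand-ins for a$, u$$, and b$$.

```python
# Stand-in for a$: text held as single bytes in an ANSI code page.
# cp1252 is an assumption; the real code page depends on the system.
ansi_bytes = "caf\u00e9 ".encode("cp1252")

# Stand-in for u$$: text already held as Unicode.
unicode_text = "\u65e5\u672c"

# Convert the ANSI bytes to Unicode first, then append.
combined = ansi_bytes.decode("cp1252") + unicode_text
print(combined, len(combined))
```

The point of the one-line PowerBASIC statement is that the decode step in the middle never has to be written.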

The POWER changes everything

Unicode is important today. It will be pervasive tomorrow. Don't be left behind, or it may not be possible to catch up. PowerBASIC makes Unicode easy, so you typically don't need to give it much thought at all. Use that fact to your advantage... then spend the time you saved for other important issues. If you don't prepare today, you could face real problems later on. Plan now, or forever hold your peace.