Being in North America (the English speaking side), made me understand that many people are not aware (and don’t actually care) about character sets too much. Everything supports English, and everything works. Coming from Israel, I realized that some of the language issues we have in computerized systems are less understood in the western world. For example, we write from right to left (we are not the only ones, Arabic is the same for example), and I always get strange looks when I sign a piece of paper. Another example is the completely different letters (unlike English and most European languages), and more. But we are not the only ones, many countries must have these difficulties, so I decided to write this post.
Let’s start with a quick introduction to some terms that many people are confused about:
Character Set – a character set is the numeric representation of characters (letters, numbers and control characters). Since computers work with bytes (8 bits, can represent the numbers 0-255), in order to use letters we assign each letter a different byte value. For example, in ASCII (one of the most basic character sets) ‘A’ is represented by the value 65 and ‘a’ by the value 97.
Unicode – Unicode is not an actual character set, but a standard that specifies how to standardize character sets that will include multi language support. The Unicode evolves over time as more languages and new characters are added, so there are different versions of Unicode, the current one is version 9 and it was introduced in June 2016.
UTF8 – UTF8 is an actual character set that implements the Unicode standard. The UTF8 implementation determines that each character will be represented by 1-4 bytes dynamically. The most common characters (numbers and English letters for example) will be represented using a single byte, while more unique ones might take up to 4 bytes. In UTF-8, European languages, Arabic and Hebrew take 2 bytes per character, while the Euro sign takes 3 bytes.
UTF16 – UTF16 is a different character set that implements the Unicode standard. Here, the base unit is 2 bytes, or 16 bits (hence the name), so even the most common characters will take 2 bytes. UTF16 is dynamic as well, so characters that couldn’t fit into the 2-byte mapping will take another unit of 2 bytes, so 4 bytes in total.
AL32UTF8 – This is an Oracle character set which is actually UTF8. The difference is that Oracle, probably because of compatibility issues, kept the database character set called “UTF8” static since Oracle 8.1.7 (Unicode version 3), while they keep updating AL32UTF8 (Unicode version 7 in Oracle 12.2). More details on the Unicode character sets in the database can be found in MOS note# 260893.1.
Oracle Database Character Sets
Oracle database supports many different character sets. In each database there are two character sets, the “character set” and the “national character set”. The “national character set” allows us to save Unicode data even if the main database character set doesn’t support Unicode. When we need to use the “national character set” we simply configure a table column as “NVARCHAR2” instead of “VARCHAR2”.
The character set in the database determines how the data is stored physically. This will affect the characters you can save in the database and the space they will take. For example, the Greek letter Beta (β) takes two bytes in UTF8, but only one byte in EL8ISO8859P7, which is the Latin/Greek character set in Oracle.
In Oracle, VARCHAR2 is limited to 4000 bytes (starting with 12.1 it can be extended to 32767 bytes if the MAX_STRING_SIZE parameter is set to EXTENDED). This is a hard limit, meaning that in no case Oracle can exceed this limit and this is very important when using Unicode character sets:
- With a non-Unicode (single byte) character set, setting a column in a table to VARCHAR2(100 chars) or VARCHAR2(100 bytes) is completely identical.
- With Unicode character set (let’s take AL32UTF8), setting a column to VARCHAR2(100 chars) is different than VARCHAR2(100 bytes).
- Limiting column to 100 chars (with AL32UTF8) can actually take more than 100 bytes (up to 400 theoretically)
- Limiting column to 100 bytes (with AL32UTF8) can raise an error about exceeding the column length even with less than 100 characters
- In any case, the hard limit is 4000 bytes (without extending it in 12.1), so if you have multi-byte characters in your column, you will never be able to get to 4000 characters in VARCHAR2 columns. For example, if the entire text is in Greek, Arabic or Hebrew, we will be able to store something around 3000 characters (as spaces, commas, periods and other simple characters take only one byte while the letters take two). This behaves the same with NVARCHAR2 and with the string extension in 12.1, which only allows longer strings.
As I wrote this post, it got longer and longer, so I decided to split it. In this part I introduced the concept of character sets in the database and the difference between Unicode and non-Unicode character sets. I hope I also made the difference between Unicode, UTF8, UTF16 and AL32UTF8 clear, as I know this is confusing.
In the next post I’ll show you how it works and how we see the characters. I will also discuss about the client side character set and different conversions.