Thursday, March 29, 2012

Does SQL Server support all languages using Unicode

I have an international Internet business and I use SQL Server with the Unicode field types of nchar, nvarchar and ntext. I seem to have no problem with customers entering Slovakian, Slovenian, Romanian, English or German, but Hebrew and Japanese do not seem to work. The resulting data coming back from the database after adding a record is garbage.

I had the software email me the data immediately before adding the data to SQL Server. It was appearing correctly with respect to the foreign language character format. The software does the database add and then returns a list of records currently in that table. The information for Japanese or Hebrew entries is always a string of question marks, e.g. ?

I have been told that the only solution is individual databases for those languages. Is this true?|||SQL Server supports all languages using UNICODE. You don't need a separate database for each language.

There may be some issues with collation order if you have strings from different languages in the same column. You have to choose a particular collation for the column so indexing can work correctly, and the collation you choose may give some unconventional ordering for sorted data in some languages, because conventional ordering differs from region to region. Sharp s (ß) is ordered the same as s in English, but not in German Phonebook collation, for example. You can overcome this by specifying a collation based on the UI locale for the result ordering in your queries.
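The store-once, collate-per-query idea above can be sketched in a few lines. This uses Python's built-in SQLite as a stand-in for SQL Server, and SQLite's simple NOCASE collation as a stand-in for a linguistic collation like German Phonebook; the table and column names are made up, but the COLLATE-clause-in-the-query technique is the same one described above.

```python
import sqlite3

# SQLite stands in for SQL Server here. Both engines let a query override
# the column's default collation with a COLLATE clause: store the data once,
# then choose the ordering convention at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("apple",), ("Banana",)])

# Default (binary/codepoint) ordering puts uppercase 'B' before lowercase 'a'.
binary = [r[0] for r in conn.execute("SELECT w FROM words ORDER BY w")]
# Overriding the collation in the query changes the ordering, not the data.
nocase = [r[0] for r in conn.execute(
    "SELECT w FROM words ORDER BY w COLLATE NOCASE")]

print(binary)  # ['Banana', 'apple']
print(nocase)  # ['apple', 'Banana']
```

In SQL Server the same shape applies, e.g. `ORDER BY col COLLATE German_PhoneBook_CI_AS`, with the stored column untouched.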

The "?" problem generally happens when you convert UNICODE data to multi-byte characters using the incorrect locale. If your application must use multi-byte characters for some reason, make sure the locale matches the data's language. If your middle tier and UI can use UNICODE data directly, it's best to avoid the wide-character-to-multi-byte and multi-byte-to-wide-character conversions entirely so this problem doesn't occur.|||You don't understand the problem

1.) The Internet is global in nature.

2.) I have one database.

3.) Data is entered from the contents of a form in a program.

4.) Customers can come from anywhere in the world. They speak many languages. They fill out the form in their native language (Hebrew, Japanese, etc.), and the form contents are written to the database.

5.) The database field types are nchar, nvarchar and ntext.

6.) Each table has 10 fields, and there are 15 tables. If these fields had to be replicated for each unique language, and we assume about 10,000 different languages in the world, the database would not fit on any known hard drive. It would also require hundreds of programming hours to determine what language the input data is in and which set of table elements it should go to or be retrieved from (once we figure out what language the viewer uses).

7.) If all of the above is true, no new language can be used without prior setup.

8.) Why does MySQL handle this and not SQL Server?

We're not looking to translate anything. What we want is something eminently simple. Whatever they write, in whatever language, goes into the database. When that record is retrieved, whatever is in the record, in any language, is returned as data, unaltered.

There's no sorting and no reporting. The database holds the data and it's delivered when asked for.

Surely there is a solution here that doesn't involve something almost as complex as cloning humans.|||

You don't need to replicate your data for each language. You just need to make sure your application stack is using UNICODE strings from end-to-end, including the web forms that are displaying the data. You've already got the SQL Server end covered if you are using nothing but nchar, nvarchar, and ntext.
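The end-to-end principle above can be demonstrated concretely. This sketch uses Python with its built-in SQLite in place of the real SQL Server stack (the `customers` table and `name` column are made up): as long as every layer keeps the text as Unicode, strings in any language round-trip unaltered through one shared column, with no per-language tables.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; SQLite's TEXT plays the role
# of nvarchar. The point is the principle, not the product: keep Unicode
# end-to-end and one column holds every language.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")

# English, German, Hebrew, and Japanese samples in the same column.
samples = ["Hello", "Grüße", "שלום", "こんにちは"]
conn.executemany("INSERT INTO customers (name) VALUES (?)",
                 [(s,) for s in samples])

returned = [row[0] for row in conn.execute("SELECT name FROM customers")]
assert returned == samples  # every string comes back unaltered
```

The "?" garbage appears only when some layer between the form and the database drops out of Unicode, which is what the next paragraph walks through.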

I saw the ? problem many, many times back when I was working on international programs that used ANSI strings. The system function that does the UNICODE-to-multi-byte translation uses question-marks for the characters it can't translate into the specified multi-byte locale. In my applications, the problem was never in the server, it was always in my application. Invariably, my application was trying to convert UNICODE data from the server into multi-byte (ANSI) characters in my application, but the locale I had specified for the translation was incorrect for the data, so there was no mapping from the UNICODE data to the ANSI locale. (For example, there is no representation for Japanese characters in the ANSI Hebrew code page.) I stopped seeing this problem once the applications I was working on used nothing but UNICODE in the database, database access layer, middle-tier, and the UI.
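The failure mode described above is easy to reproduce. This is a sketch using Python's standard codecs, with Windows code pages standing in for the ANSI locales: converting Unicode text into a code page that has no representation for its characters substitutes a "?" for each one, which is exactly the garbage described in the original question.

```python
# Japanese text has no representation in the ANSI Hebrew code page (cp1255),
# so a lossy conversion replaces every character with '?'.
japanese = "こんにちは"
as_hebrew_ansi = japanese.encode("cp1255", errors="replace")
print(as_hebrew_ansi)  # b'?????'

# With the locale that actually matches the data (cp932, the Japanese ANSI
# code page), the conversion round-trips losslessly.
as_shift_jis = japanese.encode("cp932")
assert as_shift_jis.decode("cp932") == japanese
```

Keeping Unicode end-to-end avoids the conversion entirely, so there is no locale to get wrong.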

The next things I would check are whether you've specified UNICODE strings as the data types in your data access layer (for example, ADO automatically converts between UNICODE and ANSI strings if you tell it you want ANSI strings) and whether the UI is using UNICODE or ANSI display controls. Those have been the areas where I've had the most trouble with globalized applications in the past.
I hope this helps,
Steve
