Wednesday 1 December 2010

Sixth Task - Review the correct use of character sets in XML, discussing the advantages and disadvantages of different sets, with examples. How would you select your character sets to present Chinese characters?

The most widely used encoding is the American Standard Code for Information Interchange (ASCII), a code devised during the 1950s and 1960s under the sponsorship of the American National Standards Institute (ANSI) to standardize teletype technology. This encoding comprises 128 character assignments (7-bit) and is suitable primarily for North American English. Historically, from my research I have found that languages which did not fit into the ASCII 7-bit character set (a-z, A-Z) pretty much created their own character sets, sometimes with local standards acceptance and sometimes not. Some languages have many character encodings, and some encodings, particularly for Chinese and Japanese, have very complex systems for handling the large number of unique characters.

The foremost characteristic of ASCII is that it uses a single distinct byte to signify each character (only seven of the eight bits are actually used), which makes it well suited to documents written in English and, with extensions, other European languages. Nevertheless, non-European languages and their symbols cannot be handled by ASCII encodings. An uncomplicated solution came by introducing an assortment of ASCII-based character sets, so that an individual is capable of choosing the necessary encoding from the collection. Out of this came the ISO 8859 series. The ISO-8859-1 character set, often simply referred to as Latin 1, can represent most Western European languages including: Albanian, Catalan, Danish, Dutch, English, Faroese, Finnish, French, Galician, German, Irish, Icelandic, Italian, Norwegian, Portuguese, Spanish and Swedish.

In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), the other was the Unicode Project organized by a consortium of (initially mostly US) manufacturers of multi-lingual software. Fortunately, the participants of both projects realized in 1991 that two different unified character sets did not make sense and they joined efforts to create a single code table, now referred to as Unicode. 

Unicode sets out to consolidate many different encodings, each using its own separate code pages, into a single system that can represent all written languages within the same character encoding. Unicode is first of all a set of code tables that assign an integer number, called a code point, to every character.
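
To make the idea of a code point concrete, here is a small sketch (Python is assumed here purely as an illustration tool; it is not part of XML or Unicode itself) that prints the code point assigned to a few characters, including a Chinese one:

# Every character is assigned an integer code point by Unicode.
for ch in ("A", "é", "人"):
    print(ch, hex(ord(ch)))   # prints: A 0x41, é 0xe9, 人 0x4eba

The value 0x4EBA for 人 is the same number that reappears later in the numeric character reference &#x4EBA;.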

Unicode then has several methods, generally prefixed by "UTF", for representing a sequence of such characters (or rather their integer code points) as a sequence of bytes. A naive translation of ASCII documents into Unicode would simply add a zero byte (00000000) in front of every ASCII byte, doubling the size of the document. By means of UTF-8 encoding, however, one single byte is still all that is required for each ASCII character, so existing ASCII documents need no conversion at all. UTF-16, which uses two bytes (or four for supplementary characters) per character, is also able to handle Eastern languages such as Japanese, Korean and Chinese.
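
As a rough illustration of the difference between these transformation formats, the following sketch (my own Python example, with the expected output shown in comments) encodes the same Chinese character in UTF-8 and UTF-16:

# The same character ends up as different byte sequences in different UTFs.
ch = "人"                      # code point U+4EBA
print(ch.encode("utf-8"))      # b'\xe4\xba\xba' : three bytes
print(ch.encode("utf-16-be"))  # b'N\xba'        : two bytes, big-endian, no BOM
print("A".encode("utf-8"))     # b'A'            : plain ASCII stays a single byte

This is why UTF-8 remains backwards compatible with ASCII, while UTF-16 tends to be more compact for Chinese, Japanese and Korean text.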


Because XML character data is itself represented in Unicode (in other words, XML is designed so that a single document can contain all the characters specified by the Unicode specification), there is no requirement for specifying a character encoding in XML pipelines. However, when an XML infoset is read or written as a textual XML document, specifying a character encoding may be a useful hint. For example, a URL generator can, with this mechanism, communicate to an HTTP serializer the preferred character encoding obtained when the document was read. The serializer may then use that hint, but it is by no means authoritative.
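
A minimal sketch of such a hint, assuming Python's standard xml.etree.ElementTree module (my choice of tool for illustration, not anything mandated by XML):

import xml.etree.ElementTree as ET

root = ET.fromstring("<greeting>café</greeting>")
# Re-serialize with a preferred, non-authoritative encoding hint
# (the xml_declaration keyword needs Python 3.8 or later):
data = ET.tostring(root, encoding="ISO-8859-1", xml_declaration=True)
print(data)   # bytes starting with <?xml version='1.0' encoding='ISO-8859-1'?>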

In general, XML documents can be read and written using the UTF-8 character encoding, which allows representing all the Unicode characters. However, when dealing with other types of text documents, tools such as text editors may not be able to deal correctly with UTF-8. In such cases, it can be useful to fall back on even more widespread character encodings such as ISO-8859-1 or US-ASCII. The drawback is that such encodings can represent a much smaller set of characters than UTF-8. Before an XML parser can read a document, it must know which character set encoding the document uses. In some cases, external meta information tells the parser what encoding the document uses. For instance, an HTTP response may include a Content-Type header carrying a character set declaration, for example: Content-type: text/html; charset=ISO-8859-1.
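
The sketch below (again Python/ElementTree, assumed only to illustrate the point) shows such external metadata being handed to the parser when the document itself carries no declaration:

import xml.etree.ElementTree as ET

# Bytes received "on the wire" with no XML declaration of their own.
payload = "<note>café</note>".encode("iso-8859-1")

# An external hint such as "Content-type: text/xml; charset=ISO-8859-1"
# can be passed to the parser; without it the parser would assume UTF-8
# and fail on the 0xE9 byte.
parser = ET.XMLParser(encoding="ISO-8859-1")
root = ET.fromstring(payload, parser=parser)
print(root.text)   # café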

Furthermore, an XML document can carry an encoding declaration that tells the parser which character set is used in the document, i.e. <?xml version="1.0" encoding="ISO-8859-1"?>; a document that is not in UTF-8 or UTF-16 must declare its encoding in this way (or have it signalled externally). UTF-8 and UTF-16 are the standard encodings for Unicode text, which every XML parser is required to support, with UTF-8 as the preferred and most used encoding. A parser can handle a number of other encodings too, but having to indicate the encoding separately from the document is a disadvantage, since it influences how the document can be used and exchanged. The example below demonstrates an approach to representing Chinese characters by giving their character codes in hexadecimal, inside a document declared as ISO-8859-1:

<?xml version="1.0" encoding="ISO-8859-1"?>
<decl>Declaration: &#x4EBA; &#x5EBA; &#x861F; &#x899C; &#x81FB;</decl>
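
To answer the question of how I would select a character set to present Chinese, the sketch below (Python again, purely my own illustration) contrasts the two practical options: declaring a Unicode encoding such as UTF-8 and writing the characters literally, or declaring a narrower encoding such as ISO-8859-1 and escaping every Chinese character with a numeric character reference.

import xml.etree.ElementTree as ET

# Option 1: declare UTF-8 (or UTF-16) and write the characters literally.
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><decl>人</decl>'.encode("utf-8")

# Option 2: declare ISO-8859-1 and escape the characters as hexadecimal references.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><decl>&#x4EBA;</decl>'.encode("iso-8859-1")

print(ET.fromstring(utf8_doc).text)    # 人
print(ET.fromstring(latin1_doc).text)  # 人

Both documents parse to the same character data; the Unicode version keeps the text readable and compact, while the escaped form is useful mainly when a limited legacy encoding cannot be avoided.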
