Sunday, September 22, 2013

Understanding Character Sets, Encoding and Unicode

Character Set/charset

- set of characters that may or may not define an encoding
- Examples: ASCII (covers the unaccented English letters, digits, and basic punctuation), ISO/IEC 646, Unicode (aims to cover the characters of all the world's writing systems, living and historic)

Encoding/Character encoding/Character set encoding

- General meaning: a set of rules or a system for representing a character in some form, such as a bit pattern, a sequence of natural numbers, octets, or electrical pulses, e.g. Morse code, Baudot code, ASCII and Unicode
- More strict meaning: a mapping of characters to how they are stored in memory (bit sequence)
- Examples: ASCII encoding, Unicode encodings like UTF-8 and UTF-16
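To make the strict meaning concrete, here is a minimal Java sketch that encodes the same character with three different encodings and prints the resulting bytes. The class name EncodingBytes and the helper hex are made up for this illustration; only the java.nio.charset calls are standard.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Encode the text with the given charset and render the bytes as hex, e.g. "C3 A9".
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00E9"; // é (LATIN SMALL LETTER E WITH ACUTE)
        System.out.println(hex(e, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(hex(e, StandardCharsets.UTF_8));      // C3 A9
        // ASCII has no mapping for é, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(e, StandardCharsets.US_ASCII));   // 3F
    }
}
```

The same character maps to one byte, two bytes, or a replacement byte depending on the encoding, which is why text must always be read with the encoding it was written in.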


Source of Encoding Standards:

  1. Standards bodies
    ANSI (American National Standards Institute)
    - is the U.S. standards organization that publishes standards (such as ASCII) for the computer industry
    ISO (International Organization for Standardization)
    - largest developer of voluntary International Standards
    - adopted ASCII as ISO 646:IRV
  2. Independent software vendors
    IBM
    - developed codepage 437 for DOS, codepage 852 for Eastern European languages that use Latin script, codepage 855 for Russian and some other Eastern European languages that use Cyrillic script, etc.
    Microsoft
    - developed the familiar Windows codepages, such as codepage 1252, alternatively known as "Western", "Latin 1" or "ANSI"


Examples of character sets or encodings


ASCII (American Standard Code for Information Interchange)
- is a 7-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers
- includes definitions for 128 characters
- values 128 to 255 are left undefined, so vendors assigned them differently, resulting in many incompatible 8-bit ASCII extensions
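A quick way to see the 128-character limit is to ask an ASCII encoder whether it can represent a string. This is a small sketch; the class name AsciiCheck and the method isAscii are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {
    // True only if every character in the text is one of the 128 ASCII characters.
    static boolean isAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println((int) 'A');            // 65, the ASCII code for 'A'
        System.out.println(isAscii("hello"));     // true
        System.out.println(isAscii("caf\u00E9")); // false, é is outside ASCII
    }
}
```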


EBCDIC (Extended Binary Coded Decimal Interchange Code)
- is an 8-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems.


Codepage 1252 and ISO 8859-1
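- ISO 8859-1 “Latin 1” is an ISO standard for Western European languages that grew out of a proposal drafted for the American National Standards Institute (ANSI)
- Codepage 1252 is a codepage created by Microsoft for Western European languages, based on an early draft of the ANSI proposal that later became ISO 8859-1 “Latin 1”
- Codepage 1252 was finalised before ISO 8859-1, however, and the two are not the same: Codepage 1252 is a superset of ISO 8859-1 that assigns printable characters (such as curly quotes and the euro sign) to the range 0x80 to 0x9F, in which ISO 8859-1 defines no printable characters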
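The difference between the two shows up when the same byte is decoded with each. The sketch below assumes the JRE ships a charset named "windows-1252" (common JDKs do, but it is not one of the charsets every Java platform is required to support); the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252VsLatin1 {
    // Decode a single byte value with the given charset and return the resulting code point.
    static int decodeSingleByte(int value, Charset charset) {
        String decoded = new String(new byte[] {(byte) value}, charset);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: "windows-1252" is available in this JRE.
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // 0xE9 is é in both encodings.
        System.out.printf("%04X %04X%n",
                decodeSingleByte(0xE9, cp1252), decodeSingleByte(0xE9, latin1)); // 00E9 00E9
        // 0x80 is the euro sign in codepage 1252 but a control code in ISO 8859-1.
        System.out.printf("%04X %04X%n",
                decodeSingleByte(0x80, cp1252), decodeSingleByte(0x80, latin1)); // 20AC 0080
    }
}
```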

ANSI codepage
- Microsoft originally referred to Codepage 1252 as "the ANSI codepage", but around the time of Windows 95 development it began to use the term "ANSI" in a different sense, to mean any of the Windows codepages, as opposed to Unicode
- in the context of Windows today, the terms "ANSI text" or "ANSI codepage" should be understood to mean text that is encoded with any of the legacy Windows codepages rather than Unicode. The term should not be used to mean specifically Codepage 1252, the codepage associated with the US version of Windows.

Other Legacy encoding standards
- most encode each character in terms of a single 8-bit processing unit, or byte
- some are double-byte encodings like Microsoft codepages for Chinese, Japanese and Korean


UTF-8 and Unicode


Unicode
- is a standard developed by the Unicode Consortium that assigns a unique number/identifier for every character, no matter what the platform, no matter what the program, no matter what the language
- In Unicode, every character is assigned a unique number called a "code point", conventionally written in hexadecimal with a "U+" prefix (e.g. U+0041 for "A")
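Code points can be inspected directly in Java. In this sketch the class name CodePoints and the helper notation are hypothetical; String.codePointAt, Character.toChars and String.codePointCount are standard library methods.

```java
class CodePoints {
    // Format a code point in the usual U+XXXX notation.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(notation("A".codePointAt(0)));      // U+0041
        System.out.println(notation("\u20AC".codePointAt(0))); // U+20AC (euro sign)

        // A code point above U+FFFF needs two Java chars (a surrogate pair).
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(notation(emoji.codePointAt(0)));           // U+1F600
        System.out.println(emoji.length());                           // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));  // 1
    }
}
```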

Ways of Encoding Unicode

  1. UCS-2 (2 bytes per character) - the traditional fixed-width, store-it-in-two-bytes method; it can only represent code points U+0000 to U+FFFF
  2. UTF-16 (16-bit code units) - extends UCS-2 with surrogate pairs, so a code point takes 2 or 4 bytes; you have to figure out whether the data is big-endian (most significant byte first) or little-endian (least significant byte first), typically through the BOM (byte-order mark)
  3. UTF-8 (Unicode Transformation Format 8-bit)
    - is a variable-width encoding (1 to 4 bytes per code point) that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32
  4. UTF-7 - a variable-width encoding that guarantees the high bit of every byte is zero, designed for channels such as e-mail that are only 7-bit safe
  5. UCS-4 - stores each code point in 4 bytes (equivalent to UTF-32)
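The encodings in the list above can be compared by encoding the same code points and looking at the bytes. This is a sketch under a few assumptions: the class name UnicodeEncodings and the helper hex are invented, and "UTF-32BE" is looked up by name because it is not among the charsets every JRE must provide (common JDKs include it). UTF-32 uses the same 4-byte layout as UCS-4.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnicodeEncodings {
    // Encode the text with the given charset and render the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC";                                // U+20AC
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600, outside the 2-byte range

        System.out.println(hex(euro, StandardCharsets.UTF_8));      // E2 82 AC
        System.out.println(hex(euro, StandardCharsets.UTF_16BE));   // 20 AC
        System.out.println(hex(euro, StandardCharsets.UTF_16LE));   // AC 20
        // Java's plain "UTF-16" encoder writes a big-endian BOM (FE FF) first.
        System.out.println(hex(euro, StandardCharsets.UTF_16));     // FE FF 20 AC
        // Assumption: this JRE provides a charset named "UTF-32BE".
        System.out.println(hex(euro, Charset.forName("UTF-32BE"))); // 00 00 20 AC

        System.out.println(hex(emoji, StandardCharsets.UTF_8));     // F0 9F 98 80
        System.out.println(hex(emoji, StandardCharsets.UTF_16BE));  // D8 3D DE 00 (surrogate pair)
    }
}
```

The euro sign shows the byte-order issue (20 AC versus AC 20), and the emoji shows why a fixed two-byte scheme is not enough for every code point.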


Other related terms


Code Page

- is a term that originated from IBM that essentially means the same as character set and encoding

Internationalized URL / URL encoding / Percent encoding 

- see https://www.w3.org/International/articles/idn-and-iri/
- see http://www.url-encode-decode.com/
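As a rough illustration of percent encoding, each byte of a character's UTF-8 encoding is written as "%" followed by two hex digits. The sketch below uses java.net.URLEncoder and URLDecoder with the Charset overloads (available since Java 10); the class name PercentEncoding is made up. Note that URLEncoder implements the HTML form variant, so a space becomes "+" rather than "%20".

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PercentEncoding {
    // Percent-encode text using UTF-8 bytes (form-encoding rules: space becomes '+').
    static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // Reverse of encode.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("caf\u00E9 au lait");
        System.out.println(encoded);         // caf%C3%A9+au+lait
        System.out.println(decode(encoded)); // café au lait
    }
}
```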

Sources:
http://www.unicode.org/
http://en.wikipedia.org
http://www.joelonsoftware.com/articles/Unicode.html
http://scripts.sil.org/cms/scripts/page.php?item_id=IWS-Chapter03
http://mikesusan.com/ascii.html
http://www.utf-8.com/
http://kunststube.net/encoding/

How to enable Search Widget/Gadget

If the Search widget or gadget of your Blogger blog is not working but the embedded search box on the Navigation bar at the top is working, the cause could be the robots.txt setting.

  1. View the robots.txt at http://YOURBLOGURL.blogspot.com/robots.txt. If the Disallow property is set to /search, the blog's search pages are ignored.
  2. Go to Blogger Dashboard > Select Blog > Select Settings tab > Search Preferences
  3. Enable Custom robots.txt
  4. Copy the content of the current robots.txt, but leave the value of the Disallow property blank.
  5. Save changes

Sunday, September 8, 2013

Derby

http://db.apache.org/derby/docs/10.9/devguide/cdevdvlp17453.html

Steps involved during Execution of a Java Program

  1. JVM startup
  2. Loading – finding binary representation of class/interface then constructing the Class object
  3. Linking – combining class/interface into the run-time state of the JVM so that it can be executed
    1. Verification - semantic/structure validation
    2. Preparation - storage allocation, all static fields are created and initialized with default values
    3. Resolution – optionally resolve symbolic reference to other classes/interfaces
  4. Initialization – static initialization 
    1. superclass static initialization (superinterfaces are not initialized as part of this step, except, since Java 8, those that declare default methods)
      • superclasses are initialized before subclasses
      • interface initialization does not, by itself, initialize superinterfaces
      • only the class or interface that actually declares a static field is initialized when that field is accessed, even though it might be referred to through the name of a subclass, a subinterface, or a class that implements an interface
    2. all static explicit field initializers and static initialization blocks are executed in textual order
  5. Instantiation - creation of object/class instance
    All the instance variables, including those declared in superclasses, are initialized to their default values first. Then the constructor runs:
    1. start the constructor (arguments are assigned to its parameters)
    2. if the constructor begins with an explicit this(...) call, run that constructor using these same steps, then skip to step 5
    3. otherwise call the explicit or implicit super(...) unless the class is Object – processed recursively using these same steps 1 to 5
    4. execute all non-static field initializers and non-static initialization blocks in textual order
    5. execute the rest of the body of the constructor
  6. Finalization – the finalize() method is called before the storage for an object is reclaimed by the GC
  7. Unloading – a class or interface may be unloaded only if its defining class loader is reclaimed by the GC. Classes loaded by the bootstrap loader may not be unloaded.
  8. Program Exit
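
The initialization and instantiation steps above can be observed with a small experiment. This is only a sketch: the classes InitOrderDemo, Parent and Child and the log helper are invented for the illustration, and each initializer simply records a message so the order can be printed afterwards.

```java
import java.util.ArrayList;
import java.util.List;

class InitOrderDemo {
    // Shared log so the order of events can be inspected afterwards.
    static final List<String> LOG = new ArrayList<>();

    // Records an event; returns 0 so it can be used as a field initializer.
    static int log(String event) {
        LOG.add(event);
        return 0;
    }

    static class Parent {
        static int counter = log("Parent static field");
        static { log("Parent static block"); }

        int a = log("Parent instance field");
        { log("Parent instance block"); }

        Parent() {
            log("Parent constructor body");
        }
    }

    static class Child extends Parent {
        static { log("Child static block"); }

        int b = log("Child instance field");

        Child() {
            this(0); // explicit this(): the initializers run inside Child(int), not again here
            log("Child() body");
        }

        Child(int x) {
            super(); // superclass constructor runs before Child's field initializers
            log("Child(int) body");
        }
    }

    // Creates one Child and returns the events recorded during that creation.
    static List<String> run() {
        LOG.clear();
        new Child();
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

On the first run in a fresh JVM this prints the static events first (Parent static field, Parent static block, Child static block), followed by Parent instance field, Parent instance block, Parent constructor body, Child instance field, Child(int) body and Child() body. Calling run() a second time prints only the instance events, because static initialization happens once per class.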