Byte order mark - Java encoding hazards of Unicode on Windows

We ran into a peculiar situation that had us stuck for quite some time.

There is a HashMap ht = {ShareStatus=True} and an ArrayList att = [ShareStatus]

When I call ht.get((String) att.get(0)); it returns null.

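Upon close inspection, I found that the key in the HashMap is not what it appears to be. Viewed in ISO-8859-1 encoding, it looks like this: {ï»¿ShareStatus=True}. The three extra characters ï»¿ are the bytes EF BB BF, the UTF-8 byte order mark, rendered as ISO-8859-1.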
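The failed lookup can be reproduced in a few lines. This is only a sketch of the situation, not our actual code: the class name BomLookupDemo and the hard-coded byte array are made up for illustration, and the bytes stand in for a key read from a UTF-8 file that begins with a BOM.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; the names are illustrative only.
class BomLookupDemo {
    // UTF-8 BOM (EF BB BF) followed by the ASCII bytes of "ShareStatus".
    static final byte[] RAW_KEY = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
        'S', 'h', 'a', 'r', 'e', 'S', 't', 'a', 't', 'u', 's'
    };

    // Java's UTF-8 decoder keeps the BOM as the character U+FEFF.
    static String keyAsUtf8() {
        return new String(RAW_KEY, StandardCharsets.UTF_8);
    }

    // Decoding the same bytes as ISO-8859-1 makes the BOM bytes visible.
    static String keyAsLatin1() {
        return new String(RAW_KEY, StandardCharsets.ISO_8859_1);
    }

    // The map key carries the invisible BOM, the list entry does not.
    static String lookup() {
        Map<String, String> ht = new HashMap<>();
        ht.put(keyAsUtf8(), "True");
        List<String> att = new ArrayList<>();
        att.add("ShareStatus");
        return ht.get(att.get(0));
    }

    public static void main(String[] args) {
        System.out.println(lookup());             // prints null
        System.out.println(keyAsUtf8().length()); // prints 12, not 11
        System.out.println(keyAsLatin1());        // shows the three BOM bytes as visible characters
    }
}
```

The key prints as ShareStatus in most consoles, but its length is 12 because the first character is U+FEFF.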

This is caused by the zero-width no-break space character (U+FEFF), which serves as the byte order mark (BOM) and is written at the start of certain files.

We generally use Java's String.trim() to remove leading and trailing whitespace from text, but it does not help here. trim() only removes characters with code points up to U+0020 (space), so it leaves the BOM and the many other kinds of spaces and breaks defined in the Unicode standard untouched.
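A small check makes the limitation concrete. This is a sketch with a made-up class name (TrimDemo); it relies only on String.trim() and Character.isWhitespace() from the standard library.

```java
// Hypothetical demo class; the names are illustrative only.
class TrimDemo {
    // A key with a leading BOM (U+FEFF), as it would arrive from a BOM-prefixed file.
    static final String BOM_KEY = "\uFEFFShareStatus";

    // trim() strips only characters <= U+0020, so the BOM survives.
    static boolean trimRemovesBom() {
        return BOM_KEY.trim().equals("ShareStatus");
    }

    // U+FEFF is a format character, so Java does not classify it as whitespace.
    static boolean bomIsWhitespace() {
        return Character.isWhitespace('\uFEFF');
    }

    public static void main(String[] args) {
        System.out.println(trimRemovesBom());  // prints false
        System.out.println(bomIsWhitespace()); // prints false
    }
}
```

Since Character.isWhitespace() also returns false for U+FEFF, whitespace-based cleanup in general will not catch the BOM; it has to be removed explicitly.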

We might have to come up with our own API to handle this, or rely on existing APIs such as Apache Commons Lang.
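If we write our own, a minimal version could look like the sketch below. The class BomUtil and the method stripBom are hypothetical names, not an existing library API, and the sketch assumes the BOM only ever appears as the first character of the string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of the JDK or of any library.
class BomUtil {
    private static final char BOM = '\uFEFF';

    // Returns the input without a leading U+FEFF; other strings (and null) pass through unchanged.
    static String stripBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, String> ht = new HashMap<>();
        // Clean the key before it goes into the map.
        ht.put(stripBom("\uFEFFShareStatus"), "True");
        System.out.println(ht.get("ShareStatus")); // prints True
    }
}
```

Cleaning the key once, at the point where the file is read, keeps the rest of the code free of BOM checks.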

For more details on whitespace, refer to the links in the mail and the ones listed below:

whitespace? what's that? - https://spreadsheets.google.com/pub?key=pd8dAQyHbdewRsnE5x5GzKQ
Unicode Look up - http://unicodelookup.com/#ShareStatus/1
zero-width no-break character's properties - http://www.fileformat.info/info/unicode/char/feff/index.htm
Unicode spaces - http://www.cs.tut.fi/~jkorpela/chars/spaces.html
Byte-order-mark wiki link - https://en.wikipedia.org/wiki/Byte_order_mark
Strange working of java's String.trim() - http://closingbraces.net/2008/11/11/javastringtrim/

Note:

1. This issue almost always arises with files produced on Windows, where many tools write a BOM at the start of UTF-8 files; it rarely appears with files produced on Linux.

2. For customers who use PowerShell, please suggest the following solution, which writes the output file without the BOM character in the first place.

http://stackoverflow.com/questions/5596982/using-powershell-to-write-a-file-in-utf-8-without-the-bom -- powershell soln
