
Byte order mark - Java encoding hazards of Unicode in Windows

We ran into a peculiar situation which left us stuck for quite some time.

There is a HashMap ht = {ShareStatus=True} and an ArrayList att = [ShareStatus].

When I try ht.get((String) att.get(0)); it returns null, even though the key and the list entry look identical when printed.
 
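Upon close inspection, I found that the key of the HashMap is not the same string: viewed in ISO-8859-1 encoding, it carries extra leading characters in front of ShareStatus (a UTF-8 byte order mark, for instance, shows up as ï»¿ in that encoding).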

This is caused by the zero-width no-break space character (U+FEFF), which serves as the byte order mark (BOM) and is placed at the start of certain files.
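The failing lookup can be reproduced with a minimal sketch like the one below. It assumes the key picked up a leading U+FEFF when the file was read; the class name BomLookupDemo and the helper buildMap are made up for this illustration and are not from our code base.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BomLookupDemo {

    // Builds a map whose only key starts with U+FEFF, as if the key had been
    // read from a file that begins with a byte order mark (assumption).
    static Map<String, String> buildMap() {
        Map<String, String> ht = new HashMap<>();
        ht.put("\uFEFFShareStatus", "True");
        return ht;
    }

    public static void main(String[] args) {
        Map<String, String> ht = buildMap();
        List<String> att = new ArrayList<>();
        att.add("ShareStatus");

        // The key and the list entry look the same on most consoles,
        // yet the lookup misses because the strings are not equal.
        System.out.println(ht.get(att.get(0))); // prints null

        // Comparing lengths gives the hidden character away.
        for (String key : ht.keySet()) {
            System.out.println(key.length() + " vs " + att.get(0).length());
        }
    }
}
```

Printing the lengths (or the individual char codes) is usually the quickest way to spot an invisible character like this one.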

We generally use Java's String.trim() to remove leading and trailing whitespace from a text, but that does not help here: trim() only removes characters with code points up to U+0020 (the ASCII space), so it does not handle the many other spaces and breaks defined in the Unicode standard.
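A small check along these lines shows the behaviour. It is only a sketch; the class name TrimDemo and the method trimKeepsBom are invented for this example.

```java
class TrimDemo {

    static final String KEY_WITH_BOM = "\uFEFFShareStatus";

    // Returns true when the string still starts with U+FEFF after trim().
    static boolean trimKeepsBom(String s) {
        return s.trim().startsWith("\uFEFF");
    }

    public static void main(String[] args) {
        // trim() leaves the BOM in place because U+FEFF is above U+0020.
        System.out.println(trimKeepsBom(KEY_WITH_BOM));

        // Java does not classify U+FEFF as whitespace either, so helpers
        // built on Character.isWhitespace() will not catch it.
        System.out.println(Character.isWhitespace('\uFEFF'));
    }
}
```

Because Character.isWhitespace() also reports false for U+FEFF, any whitespace-based cleanup should be tested against this character before relying on it.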

We might have to come up with an API of our own to handle this, or rely on available APIs such as Apache Commons Lang.
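If we go the home-grown route, a helper could be as small as the sketch below. BomUtil and stripLeadingBom are hypothetical names, and the sketch assumes the BOM only ever appears as the very first character of the text that was read.

```java
import java.util.HashMap;
import java.util.Map;

class BomUtil {

    private static final char BOM = '\uFEFF';

    // Hypothetical helper: drops a single leading U+FEFF if present,
    // and returns null or empty input unchanged.
    static String stripLeadingBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, String> ht = new HashMap<>();
        // Clean the key once, at the point where file content enters the program.
        ht.put(stripLeadingBom("\uFEFFShareStatus"), "True");
        System.out.println(ht.get("ShareStatus")); // prints True
    }
}
```

Cleaning the text once where it is read keeps every later lookup free of special cases.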

For more details on whitespace characters, refer to the links in the mail and the ones listed below:


Note:
1. This issue shows up almost exclusively in Windows environments and not on Linux.
2. For customers using PowerShell, please suggest the following solution for removing the no-break BOM character from the output file itself.
