Morphex's blogologue (Life, technology, music, politics, business, mental health and more)

This is the blog of Morten W. Petersen, aka. morphex in various places. I blog about my life, and what I find interesting and/or important. This is a personal blog without any editor or a lot of oversight so treat it as such. :)

My email is morphex@gmail.com.



Testing out Java's Unicode support, UTF-16LE the choice for now

So, in my efforts to create a tool to export my Tweets from Twitter for safe-keeping, and to learn a bit of Java in the process, I today wrote a Java snippet that writes an XHTML file, to get acquainted with Java and its Unicode support.

I've worked with Unicode in Python and in C, and a while ago I started some discussions on comp.lang.c, as I was testing out C to write an XML parser. That project ran out of steam, but I still think today that UTF-32 is the least discriminatory (and least complicated) approach to sharing information like web pages.

Anyway, here's the Java code for writing a test XHTML page, fully validating on the W3C validator:

https://github.com/morphex/twitter-exporter/blob/71a4cdf1a6f...

Well, it was validating on the W3C validator, but once I added code to write a BOM, it no longer validated. 🤔 But it shows up fine in Firefox, and "file test.html" says:

test.html: XHTML document text (version 1.0), Little-endian UTF-16 Unicode text, with no line terminators

So I'm not sure what's going on there.

As I was testing out Java, I was hoping for a Unicode solution that just worked, and I got even more optimistic when I saw that OutputStreamWriter accepts an encoding argument.
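For reference, this is roughly what the OutputStreamWriter approach looks like. It is only a sketch, not the code from the repository: the class name WriterEncodingSketch and the method encodeViaWriter are made up for illustration, and it writes to an in-memory buffer instead of a file so the resulting bytes are easy to inspect.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of the twitter-exporter repository.
public class WriterEncodingSketch {

    // Encode text as UTF-16LE by letting OutputStreamWriter do the conversion.
    static byte[] encodeViaWriter(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // The second constructor argument selects the output encoding.
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_16LE)) {
            writer.write(text);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // U+00E6 (the letter ae) becomes two bytes, low byte first.
        for (byte b : encodeViaWriter("\u00e6")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

One thing to be aware of: Java's UTF-16LE charset does not write a byte order mark by itself, so a BOM has to be written separately if one is wanted.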

But as you can see in the history of WriteXHTML.java, I had to fall back to using just FileOutputStream, encoding the strings myself and writing them as bytes to the test.html file.
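The general idea of encoding strings by hand and writing raw bytes can be sketched like this. Again, this is an illustration under my own assumptions and not the actual WriteXHTML.java: the class name, the helper methods, and the tiny page content are all hypothetical, and the page is not claimed to pass the W3C validator.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not the WriteXHTML.java from the repository.
public class Utf16LeBytesSketch {

    // The byte order mark U+FEFF, encoded little-endian.
    static final byte[] BOM_UTF16LE = { (byte) 0xFF, (byte) 0xFE };

    // Encode text as UTF-16LE and put the BOM in front of it.
    static byte[] encodeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[BOM_UTF16LE.length + body.length];
        System.arraycopy(BOM_UTF16LE, 0, out, 0, BOM_UTF16LE.length);
        System.arraycopy(body, 0, out, BOM_UTF16LE.length, body.length);
        return out;
    }

    // Write the encoded bytes straight to a file, with no Writer in between.
    static void writeFile(String path, String text) throws IOException {
        try (FileOutputStream stream = new FileOutputStream(path)) {
            stream.write(encodeWithBom(text));
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder page content, made up for this example.
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Test</title></head>"
                + "<body><p>\u00e6\u00f8\u00e5</p></body></html>";
        writeFile("test.html", page);
    }
}
```

Running "file test.html" on the result should report little-endian UTF-16 text, since the first two bytes are the BOM.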

So much for that dream. Well, this is all alright. I find Java a bit verbose, but it also reminds me a bit of C, and that's nice.

One thing that surprised me, though, was that Firefox did not support UTF-32LE encoded XHTML. Why not?

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [04 Apr 00:06 Europe/Oslo]