Morphex's blogologue (Life, technology, music, politics, business, mental health and more)


Testing out Java's Unicode support, UTF-16LE the choice for now

So, as part of my effort to create a tool that exports my tweets from Twitter for safekeeping, and to learn a bit of Java in the process, today I wrote a Java snippet that writes an XHTML file, to get acquainted with Java and its Unicode support.

I've worked with Unicode in Python and in C, and a while ago I started some discussions on comp.lang.c, as I was testing out C to write an XML parser. That project ran out of steam, but I still think that UTF-32 is the least discriminatory (and least complicated) approach to sharing information like web pages.

Anyway, here's the Java code for writing a test XHTML page, fully validating on the W3C validator:

https://github.com/morphex/twitter-exporter/blob/71a4cdf1a6f...

Well, it was validating on the W3C validator, but once I added code to write a BOM, it no longer validated. 🤔 It still shows up fine in Firefox, though, and "file test.html" says:

test.html: XHTML document text (version 1.0), Little-endian UTF-16 Unicode text, with no line terminators

So I'm not sure what's going on there.

As I was testing out Java, I was hopeful that it would have a Unicode solution that just worked, and even more so when I saw that OutputStreamWriter accepts an encoding argument.
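For context, here is a minimal sketch of what the OutputStreamWriter route looks like. The class and method names are made up for illustration, and this is not the code from the repository. One detail that is easy to miss: Java's UTF-16LE charset does not emit a BOM on its own, so the sketch writes U+FEFF explicitly when one is wanted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from WriteXHTML.java.
public class OutputStreamWriterSketch {

    // Encodes text as UTF-16LE through an OutputStreamWriter.
    // When withBom is true, U+FEFF is written first, which the
    // UTF-16LE encoder serializes as the bytes FF FE.
    static byte[] encode(String text, boolean withBom) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_16LE)) {
            if (withBom) {
                writer.write('\uFEFF');
            }
            writer.write(text);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encode("A", false).length); // prints 2
        System.out.println(encode("A", true).length);  // prints 4
    }
}
```

Swapping the ByteArrayOutputStream for a FileOutputStream would write the same bytes to disk.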

But as you can see in the history of WriteXHTML.java, I had to fall back to a plain FileOutputStream, encoding the strings myself and writing them as bytes to the test.html file.
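The byte-level approach can be sketched roughly like this. It is a simplified illustration, not the actual WriteXHTML.java: the class name, the helper method and the placeholder markup are all assumptions, and the placeholder page is not claimed to validate.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the code from the repository.
public class Utf16LeBytesSketch {

    // UTF-16LE byte order mark: U+FEFF serialized little-endian.
    static final byte[] BOM_UTF16LE = {(byte) 0xFF, (byte) 0xFE};

    // Writes the BOM, then the text encoded by hand as UTF-16LE bytes.
    static void writeWithBom(OutputStream out, String text) throws IOException {
        out.write(BOM_UTF16LE);
        out.write(text.getBytes(StandardCharsets.UTF_16LE));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder markup for illustration only.
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Test</title></head>"
                + "<body><p>Blåbærsyltetøy</p></body></html>";
        try (OutputStream out = new FileOutputStream("test.html")) {
            writeWithBom(out, page);
        }
    }
}
```

Running "file test.html" on the result is a quick way to check what encoding the bytes on disk look like.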

So much for that dream. Still, that's all right. I find Java a bit verbose, but it also reminds me a bit of C, and that's nice.

One thing that surprised me, though, was that Firefox did not support UTF-32LE-encoded XHTML. Why not?

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [04 Apr 00:06 Europe/Oslo]
