Testing out Java's Unicode support, UTF-16LE the choice for now
So, in my effort to create a tool for exporting my Tweets from Twitter for safe-keeping, and to learn a bit of Java in the process, today I wrote a Java snippet that writes an XHTML file, to get acquainted with Java and its Unicode support. I've worked with Unicode in Python and in C, and a while ago I started some discussions on comp.lang.c while I was testing out C to write an XML parser. That project ran out of steam, but I still think that UTF-32 is the least discriminatory (and least complicated) approach to sharing information like web pages.
Anyway, here's the Java code for writing a test XHTML page, fully validating on the W3C validator:
https://github.com/morphex/twitter-exporter/blob/71a4cdf1a6f...
Well, it was validating on the W3C validator, but once I added code to write a BOM, it no longer validated. 🤔 It still shows up fine in Firefox, though, and "file test.html" says:
test.html: XHTML document text (version 1.0), Little-endian UTF-16 Unicode text, with no line terminators
So I'm not sure what's going on there.
As I was testing out Java, I was hopeful that it would have a Unicode solution that just worked, and even more so when I saw that OutputStreamWriter accepts an encoding argument.
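For reference, here is a minimal sketch of what the OutputStreamWriter approach looks like in general. It is not the code from the repository: the class name WriterEncodingSketch, the helper encodeWithWriter and the sample markup are made up for illustration, and it writes to an in-memory buffer rather than to test.html. The UTF-16LE charset does not emit a BOM on its own, so the sketch assumes the BOM is written as the character U+FEFF at the start of the text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterEncodingSketch {
    // Hypothetical helper: encodes text as UTF-16LE by pushing it through
    // an OutputStreamWriter that was given an explicit charset.
    static byte[] encodeWithWriter(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_16LE)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory buffer; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // U+FEFF at the start becomes the BOM bytes FF FE in little-endian order.
        byte[] bytes = encodeWithWriter("\uFEFF<p>hi</p>");
        System.out.printf("first two bytes: %02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF);
    }
}
```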
But as you can see in the history of WriteXHTML.java, I had to fall back to using a plain FileOutputStream, encoding the strings myself and writing them as bytes to the test.html file.
So much for that dream. Well, that's alright. I find Java a bit verbose, but it also reminds me a bit of C, and that's nice.
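The byte-level approach can be sketched roughly like this. It is not the actual WriteXHTML.java: the class name Utf16LeWriter, the helpers withBom and write, and the sample markup are assumptions made for illustration. The idea is simply to encode each string with String.getBytes and write the resulting bytes, with the two BOM bytes first.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class Utf16LeWriter {
    // U+FEFF encoded in little-endian byte order.
    static final byte[] BOM = {(byte) 0xFF, (byte) 0xFE};

    // Hypothetical helper: returns the BOM followed by the UTF-16LE bytes of text.
    static byte[] withBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] result = new byte[BOM.length + body.length];
        System.arraycopy(BOM, 0, result, 0, BOM.length);
        System.arraycopy(body, 0, result, BOM.length, body.length);
        return result;
    }

    // Hypothetical helper: writes the encoded bytes straight to a file.
    static void write(String path, String text) throws IOException {
        try (OutputStream out = new FileOutputStream(path)) {
            out.write(withBom(text));
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample markup only; the real page content lives in the repository.
        write("test.html", "<p>test</p>");
    }
}
```

Running "file test.html" on the output is a quick way to check that the BOM and byte order are detected as intended.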
One thing that surprised me, though, was that Firefox did not support UTF-32LE encoded XHTML. Why not?
[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [04 Apr 00:06 Europe/Oslo]