

Now making (an almost) complete Twitter archive using Java, validating as UTF-16LE XHTML

So.

I was figuring out the latest bits of my Twitter archival tool today, and once I made a complete dump of all Tweets, I saw that I had forgotten to HTML-encode URLs, so unescaped & characters were breaking the validation.

Boy, was my face red for a split second. Anyway, the nice org.apache.commons folks have created a tool to encode URLs for HTML, so that was an easy fix.
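
Something along these lines does the trick (this sketch uses the commons-text flavour of StringEscapeUtils, and the URL is just an example):

    // Hypothetical example: escaping a URL so its & doesn't break XHTML validation.
    import org.apache.commons.text.StringEscapeUtils;

    public class EscapeExample {
        public static void main(String[] args) {
            String url = "https://example.com/search?q=java&lang=en";
            // escapeHtml4 turns & into &amp; (and < > " into their entities)
            String escaped = StringEscapeUtils.escapeHtml4(url);
            System.out.println("<a href='" + escaped + "'>" + escaped + "</a>");
        }
    }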

However, the W3C validator

https://validator.w3.org/

was still giving weird errors, and after some fiddling around and googling, I found the W3C I18N checker, which explained that my XHTML document couldn't be prefixed with an <?xml declaration. It's been a while since I last worked properly with web documents, so I'd forgotten about that.

So OK, I removed the declaration, and thought I'd add a meta http-equiv or charset tag to the document instead, to explicitly declare the document's encoding.

However, the I18N checker first complained about that, saying I couldn't use UTF-16LE as the encoding and had to say UTF-16 instead. And when that was fixed, it complained again and said that I shouldn't specify the encoding in the document at all.

And that makes sense, because the document starts with a BOM, which indicates both the charset and its endianness.
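
To make that concrete, here's a rough sketch of writing such a file from Java: emit the two BOM bytes first, then the markup encoded as UTF-16LE, with no <?xml declaration and no meta charset tag. The file name and markup below are just placeholders, not the exporter's actual output:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class WriteUtf16LeXhtml {
        public static void main(String[] args) throws IOException {
            try (FileOutputStream out = new FileOutputStream("archive.html");
                 Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16LE)) {
                // UTF-16LE byte order mark: 0xFF 0xFE. The BOM tells parsers both
                // the charset and its endianness, so the document needs neither an
                // <?xml ...?> declaration nor a <meta charset> tag.
                out.write(0xFF);
                out.write(0xFE);
                writer.write("<html xmlns=\"http://www.w3.org/1999/xhtml\">\n");
                writer.write("<head><title>Tweets</title></head>\n");
                writer.write("<body><p class='tweet'>...</p></body></html>\n");
            }
        }
    }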

Anyway, I get a whiff of cultural imperialism when I see that W3C says UTF-8 is the recommended encoding, and the main validator is so unhelpful when giving feedback on what is wrong with the document. :\

So, here's the commit of today's work:

https://github.com/morphex/twitter-exporter/commit/f32b12f42...

A productive day, and

http://blogologue.com/test3.html.bin

shows an almost complete archive of my Tweets. A quick grep counts 3,128 tweets in the archive:

iconv test3.html -f utf-16|grep "class='tweet"|wc -l
iconv: incomplete character or shift sequence at end of buffer
3128

while Twitter says I have well over 4000 tweets:

    Tweets 4,943
    Following 98
    Followers 189

So I guess Twitter caps what you can fetch; as far as I can tell, the standard API only returns roughly the most recent 3,200 tweets from a timeline, which would explain the gap.

I did get some feedback on my blogging about Java and type-casting: using Java generics, it is possible to avoid the tedious type-casting I was talking about earlier. I had seen the syntax for Java generics before and found it to be a bit of an eyesore, but seeing how it saves quite a bit of tedious typing, and that the syntax is straightforward, I guess I can't complain.
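
A small illustration of the difference (with placeholder classes, not code from the exporter):

    import java.util.ArrayList;
    import java.util.List;

    public class GenericsExample {
        public static void main(String[] args) {
            // Raw list: elements come back as Object and have to be cast
            List raw = new ArrayList();
            raw.add("a tweet");
            String first = (String) raw.get(0); // tedious, and unchecked at compile time

            // Generic list: the element type is declared once, no casts needed
            List<String> tweets = new ArrayList<String>();
            tweets.add("a tweet");
            String second = tweets.get(0);

            System.out.println(first + " / " + second);
        }
    }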

It's not Perl after all..

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [23 Apr 16:52 Europe/Oslo]