

Now making (an almost) complete Twitter archive using Java, validating as UTF-16LE XHTML

So.

I was figuring out the latest bits of my Twitter archival tool today, and once I made a complete dump of all Tweets, I saw that I had forgotten to HTML-encode URLs, so unescaped & characters were breaking the validation.

Boy, was my face red for a split second. Anyway, the nice org.apache.commons folks have created a tool to encode URLs for HTML, so that was an easy fix.
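
Something along these lines does the trick (this sketch uses the commons-text flavour of StringEscapeUtils, and the URL is just an example):

    // Hypothetical example: escaping a URL so its & doesn't break XHTML validation.
    import org.apache.commons.text.StringEscapeUtils;

    public class EscapeExample {
        public static void main(String[] args) {
            String url = "https://example.com/search?q=java&lang=en";
            // escapeHtml4 turns & into &amp; (and < > " into their entities)
            String escaped = StringEscapeUtils.escapeHtml4(url);
            System.out.println("<a href='" + escaped + "'>" + escaped + "</a>");
        }
    }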

However, the W3C validator

https://validator.w3.org/

was still giving weird errors, and after some fiddling around and googling, I found the W3C I18N checker, which explained that my XHTML document couldn't be prefixed with an <?xml declaration. It's been a while since I last worked properly with web documents, so I'd forgotten about that.

So OK, I removed the declaration, and thought I'd add a meta http-equiv or charset tag to the document instead, to explicitly declare the document's encoding.

However, the I18N checker first complained about that, saying I couldn't use UTF-16LE as the encoding and had to say UTF-16 instead. And when that was fixed, it complained again and said that I shouldn't specify the encoding in the document at all.

And that makes sense, because the document starts with a BOM, which indicates both the charset and its endianness.
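
To make that concrete, here's a rough sketch of writing such a file from Java: emit the two BOM bytes first, then the markup encoded as UTF-16LE, with no <?xml declaration and no meta charset tag. The file name and markup below are just placeholders, not the exporter's actual output:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class WriteUtf16LeXhtml {
        public static void main(String[] args) throws IOException {
            try (FileOutputStream out = new FileOutputStream("archive.html");
                 Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16LE)) {
                // UTF-16LE byte order mark: 0xFF 0xFE. The BOM tells parsers both
                // the charset and its endianness, so the document needs neither an
                // <?xml ...?> declaration nor a <meta charset> tag.
                out.write(0xFF);
                out.write(0xFE);
                writer.write("<html xmlns=\"http://www.w3.org/1999/xhtml\">\n");
                writer.write("<head><title>Tweets</title></head>\n");
                writer.write("<body><p class='tweet'>...</p></body></html>\n");
            }
        }
    }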

Anyway, I get a whiff of cultural imperialism when I see that W3C says UTF-8 is the recommended encoding, and the main validator is so unhelpful when giving feedback on what is wrong with the document. :\

So, here's the commit of today's work:

https://github.com/morphex/twitter-exporter/commit/f32b12f42...

A productive day, and

http://blogologue.com/test3.html.bin

shows an almost complete archive of my Tweets. A quick grep counts 3,128 tweets in the archive:

iconv test3.html -f utf-16|grep "class='tweet"|wc -l
iconv: incomplete character or shift sequence at end of buffer
3128

while Twitter says I have well over 4000 tweets:

    Tweets 4,943
    Following 98
    Followers 189

So I guess Twitter caps what you can fetch; as far as I can tell, the standard API only returns roughly the most recent 3,200 tweets from a timeline, which would explain the gap.

I did get some feedback on my blogging about Java and type-casting: using Java generics, it is possible to avoid the tedious type-casting I was talking about earlier. I had seen the syntax for Java generics before and found it to be a bit of an eyesore, but seeing how it saves quite a bit of tedious typing, and that the syntax is straightforward, I guess I can't complain.
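
A small illustration of the difference (with placeholder classes, not code from the exporter):

    import java.util.ArrayList;
    import java.util.List;

    public class GenericsExample {
        public static void main(String[] args) {
            // Raw list: elements come back as Object and have to be cast
            List raw = new ArrayList();
            raw.add("a tweet");
            String first = (String) raw.get(0); // tedious, and unchecked at compile time

            // Generic list: the element type is declared once, no casts needed
            List<String> tweets = new ArrayList<String>();
            tweets.add("a tweet");
            String second = tweets.get(0);

            System.out.println(first + " / " + second);
        }
    }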

It's not Perl after all..

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [23 Apr 16:52 Europe/Oslo]