Now making an (almost) complete Twitter archive using Java, validating as UTF-16LE XHTML
So, I was figuring out the latest bits of my Twitter archival tool today, and once I had made a complete dump of all tweets, I saw that I had forgotten to HTML-encode URLs, so & was breaking the validation.
Boy, was my face red for a split second. Anyway, the nice org.apache.commons folks have created a tool to encode URLs for HTML, so that was an easy fix.
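For the record, the fix boils down to something like the sketch below. The post doesn't pin down which Commons class I mean; `StringEscapeUtils.escapeHtml4` in Commons Lang 3 (and Commons Text) is one that does this kind of escaping. To keep the example free of dependencies, the hand-rolled `escapeHtml` method here is only a stand-in for it, and the class name and the example URL are made up for illustration.

```java
// Hand-rolled stand-in for a library escaper; illustration only, not the tool's actual code.
final class HtmlEscapeSketch {

    // Replace the characters that are significant in XHTML text and attribute values.
    static String escapeHtml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical URL with a query string; the bare & is what trips the validator.
        String url = "http://example.com/search?q=java&lang=en";
        System.out.println("<a href='" + escapeHtml(url) + "'>link</a>");
    }
}
```

With the & written as `&amp;` inside the href, the validator should stop tripping over query strings.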
However, the W3C validator
https://validator.w3.org/
was still giving weird errors, and after some fiddling around and googling, I found the W3C I18N checker, which explained that XHTML couldn't be prefixed with an <?xml declaration. It's been a while since I worked properly with web documents I guess, so I'd forgotten about that.
So OK, I removed the declaration, and thought I'd add a meta http-equiv or charset tag to the document, to be declarative about the content of the document.
However, the I18N checker first complained about that, saying I couldn't use UTF-16LE as the encoding name and had to say UTF-16. Once that was fixed, it complained again and said that I shouldn't specify the encoding in the document at all.
And that makes sense, because the document starts with a BOM (byte order mark), which indicates both the charset and its endianness.
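To make the BOM point concrete, here is a minimal sketch of writing and recognizing one in Java. It assumes the output is built as a byte array; the class and method names are invented for this example and are not taken from the exporter. Java's `UTF_16LE` charset does not emit a BOM when encoding, so the sketch prepends the two bytes FF FE by hand.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: names are hypothetical, not from the twitter-exporter code.
final class BomSketch {

    // Encode text as UTF-16LE, prefixed with the little-endian BOM (FF FE).
    static byte[] encodeUtf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE); // encoder writes no BOM itself
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    // Report the encoding implied by a leading BOM, or null if none is recognized.
    // UTF-32 BOMs are deliberately ignored to keep the sketch short.
    static String detectBom(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] doc = encodeUtf16LeWithBom("<html></html>");
        System.out.println(detectBom(doc));
        // Decoding as plain "UTF-16" reads the BOM to pick the byte order and drops it.
        System.out.println(new String(doc, StandardCharsets.UTF_16));
    }
}
```

Since a reader can work out both the charset and the byte order from those first two bytes, an extra declaration inside the document is redundant at best.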
Anyway, I get a whiff of cultural imperialism when I see that W3C says UTF-8 is the recommended encoding, and the main validator is so unhelpful when giving feedback on what is wrong with the document. :\
So, here's the commit of today's work:
https://github.com/morphex/twitter-exporter/commit/f32b12f42...
A productive day, and
http://blogologue.com/test3.html.bin
shows an almost complete archive of my tweets. A grep shows that my archive contains 3128 tweets:
iconv test3.html -f utf-16|grep "class='tweet"|wc -l
iconv: incomplete character or shift sequence at end of buffer
3128
while Twitter says I have well over 4000 tweets:
Tweets 4,943
Following 98
Followers 189
So I guess Twitter has some segmentation of its archive, for unknown reasons.
I did get some feedback on my blogging about Java and type-casting: using Java generics, it is possible to avoid the tedious type-casting I was talking about earlier. I had seen the syntax for Java generics before and found it to be a bit of an eyesore, but seeing how it saves quite a bit of tedious typing and that the syntax is straightforward, I guess I can't complain.
It's not Perl after all..
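A minimal sketch of the difference, using a made-up list of tweet texts rather than anything from the actual tool (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: contrasts a raw list with a generic one.
final class GenericsSketch {

    // Raw list: get() returns Object, so every read needs an explicit cast.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static String firstWithCast() {
        List tweets = new ArrayList();
        tweets.add("first tweet");
        return (String) tweets.get(0);
    }

    // Generic list: the compiler knows the element type, so no cast is needed.
    static String firstWithGenerics() {
        List<String> tweets = new ArrayList<>();
        tweets.add("first tweet");
        return tweets.get(0);
    }

    public static void main(String[] args) {
        System.out.println(firstWithCast());
        System.out.println(firstWithGenerics());
    }
}
```

Both methods return the same string; the generic version also lets the compiler reject an attempt to add something that isn't a String.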
[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [23 Apr 16:52 Europe/Oslo]