Morphex's blogologue (Life, technology, music, politics, business, mental health and more)


Now making (an almost) complete Twitter archive using Java, validating as UTF-16LE XHTML

So.

I was figuring out the latest bits to my Twitter archival tool today, and once I made a complete dump of all Tweets, I saw that I had forgotten to HTML-encode URLs, so & was breaking the validation.

Boy, was my face red for a split second. Anyway, the nice org.apache.commons folks have created a tool to encode URLs for HTML, so that was an easy fix.
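To illustrate the kind of fix involved, here is a minimal hand-rolled sketch of escaping a URL before it goes into an XHTML attribute. It is not the Apache Commons code and not what the exporter actually calls; the class name AttributeEscaper and the example URL are made up for illustration.

```java
// Hand-rolled sketch; AttributeEscaper is a hypothetical name, not a library class.
final class AttributeEscaper {

    /** Escapes the characters that are special in XHTML text and attribute values. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example URL is invented; the bare & in the query string is what trips a validator.
        String url = "https://example.com/search?q=java&lang=en";
        System.out.println("<a href=\"" + escape(url) + "\">link</a>");
    }
}
```

A library routine is probably still the better choice in real code, since it is already tested against the edge cases.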

However, the W3C validator

https://validator.w3.org/

Was still giving weird errors, and after some fiddling around and googling, I found the W3C I18N checker, which explained that XHTML couldn't be prefixed with an <?xml declaration. It's been a while since I worked properly with web documents I guess, so I'd forgotten about that.

So OK, I removed the declaration, and thought I'd add a meta http-equiv or charset tag to the document, to be declarative about the content of the document.

However, the I18N checker first complained about that, saying I couldn't use UTF-16LE as an encoding and had to say UTF-16; when that was fixed, it complained that I shouldn't specify the encoding in the document at all.

And that makes sense, because the document starts with a BOM, which indicates the charset and the endianness of the charset.
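As a rough sketch of what "the BOM indicates the charset and the endianness" means in practice, the snippet below looks at the first bytes of a file and reports which encoding they imply. The class name BomSniffer is hypothetical, and the sketch only knows about UTF-8 and UTF-16; it ignores UTF-32, whose little-endian BOM starts with the same two bytes as UTF-16LE.

```java
// Sketch only: BomSniffer is a made-up helper, not part of the exporter.
final class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or null if none is recognised. */
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE"; // U+FEFF stored low byte first
            }
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE"; // U+FEFF stored high byte first
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] start = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(detect(start));
    }
}
```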

Anyway, I get a whiff of cultural imperialism when I see that W3C says UTF-8 is the recommended encoding, and the main validator is so unhelpful when giving feedback on what is wrong with the document. :\

So, here's the commit of today's work:

https://github.com/morphex/twitter-exporter/commit/f32b12f42...

A productive day, and

http://blogologue.com/test3.html.bin

Shows an almost complete archive of my Tweets. A grep shows that my archive is 3128 tweets:

iconv test3.html -f utf-16|grep "class='tweet"|wc -l
iconv: incomplete character or shift sequence at end of buffer
3128

while Twitter says I have well over 4000 tweets:

    Tweets 4,943
    Following 98
    Followers 189

So I guess Twitter has some segmentation of its archive, for unknown reasons.
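The same count could also be done from Java, which avoids the iconv warning. The following is only a sketch: TweetCounter is a hypothetical class, and it counts occurrences of the class='tweet marker rather than matching lines, so the number would only agree with the grep if every tweet sits on its own line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch only: counts marker occurrences in a UTF-16 encoded file.
final class TweetCounter {

    /** Counts non-overlapping occurrences of marker in text. */
    static int countOccurrences(String text, String marker) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(marker, from)) != -1) {
            count++;
            from += marker.length();
        }
        return count;
    }

    /** Decodes the bytes as UTF-16 (a leading BOM decides the byte order) and counts. */
    static int countInUtf16(byte[] fileBytes, String marker) {
        // The String constructor replaces malformed input instead of failing,
        // so a stray trailing byte does not abort the count.
        String text = new String(fileBytes, StandardCharsets.UTF_16);
        return countOccurrences(text, marker);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(countInUtf16(bytes, "class='tweet"));
    }
}
```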

I did get some feedback on my blogging about Java and type-casting, and using Java generics it is possible to avoid this tedious type-casting I was talking about earlier. I had seen the syntax for Java generics before and found it to be a bit of an eyesore, but seeing how it saves quite a bit of tedious typing and that the syntax is straightforward, I guess I can't complain.
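To make that concrete, here is a sketch of the same "buffer a page, hand out one item at a time" idea as the TwitterStatusFetcher class in the 14 April post further down, but typed with generics so that no cast is needed. It is not the exporter's actual code: PagedFetcher and the pageSource function are hypothetical stand-ins for the twitter4j call, so the example runs without a Twitter account.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.function.IntFunction;

// Sketch only: a generic version of the one-item-at-a-time fetcher.
final class PagedFetcher<T> {
    // Hypothetical stand-in for the remote call: page number in, list of items out.
    private final IntFunction<List<T>> pageSource;
    private final Queue<T> buffer = new ArrayDeque<>();
    private int page = 1;

    PagedFetcher(IntFunction<List<T>> pageSource) {
        this.pageSource = pageSource;
    }

    /** Returns the next item, or null when the source has nothing more to give. */
    T next() {
        if (buffer.isEmpty()) {
            buffer.addAll(pageSource.apply(page));
            page++;
        }
        return buffer.poll(); // already typed as T, so no (T) cast
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("first", "second"),
                Arrays.asList("third"));
        PagedFetcher<String> fetcher = new PagedFetcher<>(
                p -> p <= pages.size() ? pages.get(p - 1) : Collections.<String>emptyList());
        for (String s = fetcher.next(); s != null; s = fetcher.next()) {
            System.out.println(s.toUpperCase()); // String methods work directly
        }
    }
}
```

The angle brackets are the only extra syntax; in return the compiler knows what comes out of the buffer.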

It's not Perl after all..

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [23 Apr 16:52 Europe/Oslo]

Can I Java cup of coffee? Isn't that valid?

Yesterday I made some progress on the twitter-exporter tool, working and thinking quite a bit in one day.

I'm surprised I am as productive as I am in Java, since I haven't used it in work-related projects. Yesterday it felt like I had a couple of cups of coffee too many, maybe it's the spring, but I was cranking out code and dealing with Exceptions like I'd been eating a lot of cake and drinking a lot of coffee.

I'm particularly pleased with this piece of code:

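import java.util.ArrayList;

import twitter4j.Paging;
import twitter4j.ResponseList;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;

// Hands out one status at a time, fetching a new page when the buffer runs dry
class TwitterStatusFetcher {
        Twitter twitter;
        ArrayList statuses = new ArrayList(); // raw type, hence the cast below
        Integer page = 1;   // next page to request
        Integer count = 20; // tweets per page

        TwitterStatusFetcher(Twitter twitter_) {
                twitter = twitter_;
        }

        // Returns the next status, or null when no more tweets can be found
        Status getNextStatus() throws TwitterException {
                if (statuses.isEmpty()) {
                        // See if more tweets can be found
                        ResponseList results = twitter.getUserTimeline(
                                new Paging(page, count));
                        page++;
                        statuses.addAll(results);
                }
                if (statuses.isEmpty()) {
                        return null;
                }
                Status status = (Status) statuses.get(0);
                statuses.remove(0);
                return status;
        }
}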

Which works with the twitter4j code and Twitter API to deliver one status at a time to the rest of my app code. Maybe it's not that big of a deal, but I found it elegant, and it does use the object-oriented paradigm that Java is big on.

It seems the most frequent mistake I make coming from Python, is forgetting the semicolon; I like to indent code regardless of language, so it's the semicolon that I forget.

I guess one thing I've found annoying about Java so far, is the type casting that is necessary at different places. I was working on some code which I guess was eventually removed, where I knew the method to call, but because of some abstraction in a method call, the returned objects were java.lang.Object, and calling a method, a String method I think, on that object failed in compilation.

I was also almost smacking my head yesterday, when I discovered that twitter4j returns shortened (and expanded) URLs in their Tweets as well, and thought I'd developed quite a bit of code for no good reason.

However, as it turns out, some tweets do contain shortened URLs that aren't mentioned in the Tweet metadata, so the class I wrote to resolve t.co URLs still has some use. Phew.

Finally, I found that Firefox now displayed the test.html output page as XML, instead of XHTML. So I made the code output a Unicode BOM, and now Firefox says it is in standards compliance mode, and Chrome does not complain. However, the W3C validator, https://validator.w3.org/ - still does not validate the page when it has a UTF-16LE BOM.

I'm not sure what's going on there, but to me it looks like the w3c validator isn't working.

A test.html page I just created is available at http://blogologue.com/test2.html.bin - just rename it to test2.html after download and open it in Firefox to see the rendered archive of my tweets.

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [14 Apr 10:35 Europe/Oslo]

Doodling along with Java development, "unit development" and more

Yeah, I've been developing a Java app to export all my tweets for safe-keeping.

I'm new to Java, so I asked a bit about limiting the number of development tools involved, so that it would be less complicated to see what the problem was when the code failed.

I asked on comp.lang.java.programmer, with the message ID <d25b03ed-607f-45d0-8b45-25f3835d44bb@googlegroups.com>. I did get some useful feedback, but I guess my idea of a simple bootstrapping development environment has something to it. Why not make something less complex if you can?

I worked on the code for resolving Twitter URLs today, and this is the ResolveRedirect.java file as of now:

https://github.com/morphex/twitter-exporter/blob/e6b9c55470f...

Coming from the Python world, I have to say that a lot is the same, although Python is less bureaucratic and easier on the eyes. Maybe to Java's advantage, things are well packaged and one is pretty much forced to think in an object-oriented way.

I can't say that developing in Java is worse than Python, I think they are even in many ways.

The code I've been developing is fairly simple, and the data it works on is also uncomplicated, with standard encodings, data formats and protocols.. so I'm not learning anything new there, and I'm keeping things very simple for whatever I need to store as well, using a plain-text file format. If I was learning some new protocols and software libraries it could be a bit more difficult.

It's nice though, to see that I'm able to be very productive in a language I haven't used in a work-related context, feels like I'm firing on all cylinders again.

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [09 Apr 13:33 Europe/Oslo]

Automating builds using Maven, writing XHTML & refactoring in Java

In my efforts to write a tool for exporting Tweets from Twitter for safe-keeping in Java, I came a bit closer to the goal to create such an archive today, when I managed to automate the build of the app a bit, and make the app create something that resembles the desired end-result.

Earlier I had to download jar files from projects that my project depended on, but with the following commit:

https://github.com/morphex/twitter-exporter/commit/ed8303638...

One jar is built, and it is run using the command

java -cp ./twitter-exporter-1.0-SNAPSHOT-jar-with-dependencies.jar org.morphex.app.App

I would like to run the app in an even simpler way, but it is a big improvement over downloading jar files manually and including them in the "java -cp" command.

I was initially thinking I'd keep things as simple as possible and avoid unnecessary dependencies and systems, but seeing that there are collections of Java classes with useful features, features you'd expect to be in the Java core, it is great to be able to include these using some lines in the pom.xml Maven file.

Secondly I refactored the code a bit, so that the main App class was able to write an XHTML file containing a number of tweets:

https://github.com/morphex/twitter-exporter/commit/093545931...

You can download the generated output here:

http://blogologue.com/test.html.bin

I had to call the file test.html.bin to fool the application server into treating it as a binary file, as UTF-16LE is the encoding of the file and I guess the application server uses UTF-8. Renaming it to test.html and opening it in Firefox should work.

StringEscapeUtils.escapeXml10 is a nice tool, and something you'd assume would be so necessary and standard that it would be included with Java by default. Seeing this is the way it is, I guess there is really no way around using a tool like Maven to build projects.

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [05 Apr 19:36 Europe/Oslo]

Testing out Java's Unicode support, UTF-16LE the choice for now

So, in my efforts to create a tool to export my Tweets from Twitter for safe-keeping, and learning a bit of Java in the process, I today created a Java snippet of code to write an XHTML file, to get acquainted with Java and its Unicode support.

I've worked with Unicode in Python and in C, and a while ago I started some discussions on comp.lang.c, as I was testing out C to write an XML parser. That project ran out of steam, but I still think today that UTF-32 is the least discriminatory (and least complicated) approach to sharing information like web pages.

Anyway, here's the Java code for writing a test XHTML page, fully validating on the W3C validator:

https://github.com/morphex/twitter-exporter/blob/71a4cdf1a6f...

Well, it was validating on the W3C validator, but once I added code to write a BOM, it no longer validated. 🤔 But it shows up fine in Firefox, and "file test.html" says:

test.html: XHTML document text (version 1.0), Little-endian UTF-16 Unicode text, with no line terminators

So I'm not sure what's going on there.

As I was testing out Java, I was hoping and optimistic for having a Unicode solution that just worked, and even more so when OutputStreamWriter accepted an encoding argument.

But as you can see on the history of WriteXHTML.java, I had to revert to using just FileOutputStream, and encoding strings and writing them as bytes to the test.html file.

So much for that dream. Well, this is all alright, I find Java a bit verbose, but it also reminds me a bit of C and that's nice.
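For reference, writing the BOM by hand and then the encoded bytes could look roughly like the sketch below. It is not the WriteXHTML.java code; Utf16LeWriter is a hypothetical name. One thing that might explain the detour: Java's UTF-16LE charset encodes without a BOM, so the two marker bytes have to be written separately. Whether that was the actual stumbling block here is a guess.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Sketch only: BOM first, then the text encoded as UTF-16LE.
final class Utf16LeWriter {

    /** Returns the UTF-16LE bytes of text, prefixed with the little-endian BOM. */
    static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFF); // U+FEFF, low byte first
        out.write(0xFE);
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE); // no BOM added here
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("<html/>");
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            hex.append(String.format("%02X ", bytes[i] & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

The resulting byte array can be handed to a FileOutputStream as-is.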

A thing that surprised me though, was that Firefox did not support UTF-32LE encoded XHTML. Why not?

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [04 Apr 00:06 Europe/Oslo]

Resolving URLs from a URL shortening service in Java

So, in my efforts to create a tool to keep a (complete) archive of my Twitter activity, I today wrote a tool to help replace t.co URLs in my tweets, with the actual URL, here:

https://github.com/morphex/twitter-exporter/commit/3341f7780...

Now I come from the Python world, and lately I've been looking at Java to learn it properly. To improve my chances of having gigs that are interesting, regardless of programming language.

I started looking at URL objects and generating a connection from that, but since I'm going to resolve a lot of t.co addresses, I found that it would be better to keep an HTTPS connection open to the t.co server, and then pass the final part of each URL over to t.co. Another issue here is maybe adding a "resting period" between each resolved URL, so as to not appear "spammy" or "resource hogging" on the t.co server.
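A simplified sketch of the idea follows; it is not the code from the commit. It uses HttpURLConnection with redirects switched off and reads the Location header, and it leaves connection reuse to the JVM's keep-alive handling instead of managing one open connection by hand. The class name ShortUrlResolver, the pause length, the use of HEAD requests and the example short URL are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Sketch only: asks the server where a short URL points, with a pause per request.
final class ShortUrlResolver {
    private final long pauseMillis; // "resting period" between requests

    ShortUrlResolver(long pauseMillis) {
        this.pauseMillis = pauseMillis;
    }

    /** Final part of a short URL, i.e. whatever follows the last slash of the path. */
    static String shortCode(String shortUrl) {
        String path = URI.create(shortUrl).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Returns the redirect target, or the input URL if the server sends no redirect. */
    String resolve(String shortUrl) throws IOException, InterruptedException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(shortUrl).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // we want the Location header itself
        conn.setRequestMethod("HEAD");          // assumption: the server answers HEAD
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            boolean redirected = status >= 300 && status < 400 && location != null;
            return redirected ? location : shortUrl;
        } finally {
            // disconnect() is deliberately not called, so the JVM may reuse the connection
            Thread.sleep(pauseMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println(shortCode("https://t.co/abc123")); // invented example URL
            return;
        }
        ShortUrlResolver resolver = new ShortUrlResolver(500);
        for (String url : args) {
            System.out.println(url + " -> " + resolver.resolve(url));
        }
    }
}
```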

Anyway, luckily I found an example online that I could adapt for my purposes, and I'd have to say, compared to Python, Java is a bit bureaucratic with its classes and types, but other than that, Java is just fine.

[Permalink] [By morphex] [A Java-based tool to export tweets from Twitter for safe keeping (Atom feed)] [01 Apr 12:05 Europe/Oslo]

A Java-based tool to export tweets from Twitter for safe keeping

So, I was looking into exporting all my tweets the other day, to keep a copy just in case something happened to my Twitter account.

I tried exporting from Twitter (using a desktop browser as they asked on the export page), but I have so far not seen a dump of all my tweets.

I see there are other tools available for exporting Tweets, such as websites, but I thought to myself, why not create a simple, safe tool for exporting and safe-keeping Tweets?

So, after a bit of rummaging I found twitter4j, and started to build a Twitter export client using a text editor, a Makefile and javac/java.

However, I quickly ran into a NoClassDefFoundError, and after googling a bit, I felt an obscure-configuration-detail-headache coming on, and decided I could try to use Maven, which is mentioned on the Twitter4j website for the build process.

The Maven quickstart guide instructed me to use this command to generate a Maven project:

mvn archetype:generate -DgroupId=org.morphex.app -DartifactId=twitter-exporter -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

And after downloading some components etc. - Maven had a file hierarchy for me.

I then added the following section to pom.xml

    <dependency>
      <groupId>org.twitter4j</groupId>
      <artifactId>twitter4j-core</artifactId>
      <version>4.0.4</version>
    </dependency>

To include the twitter4j components in the Maven project. I then proceeded to run "mvn package" and then entered the target directory, and with the command

java -cp ./twitter-exporter-1.0-SNAPSHOT.jar org.morphex.app.App

I got the result

Hello World!

Which is what one can expect, as the project was just generated. Afterwards I created a new repository on GitHub, and uploaded the project (output of "history|grep git" in bash):

2172  git init
2173  git add pom.xml src
2175  git config --global user.email "morphex@gmail.com"
2176  git config --global user.name "Morten W. Petersen"
2177  git commit
2182  git remote add origin https://github.com/morphex/twitter-exporter.git
2183  git push -u origin master
2184  history|grep git

I started with this before I decided to blog about it, so I guess to set up the environment you have to run "sudo apt install maven openjdk-8-jdk openjdk-8-jre git" on Ubuntu/Debian.

The source code website for the project is here: https://github.com/morphex/twitter-exporter

[Update same day: I had to download the twitter4j jar and add it to the java command above: "java -cp ./twitter-exporter-1.0-SNAPSHOT.jar:./twitter4j-core-4.0.4.jar org.morphex.app.App"]

[Update later same day: twitter4j-core-4.0.4.jar can be downloaded from http://twitter4j.org/maven2/org/twitter4j/twitter4j-core/4.0...]

[And update even later same day: twitter4j-core-4.0.6.jar can be downloaded from https://repo1.maven.org/maven2/org/twitter4j/twitter4j-core/... - it's necessary for the latest code to run]

[Permalink] [By morphex] [Java (Atom feed)] [30 Mar 15:19 Europe/Oslo]