apache spark

Four Fucking Years of Donald Trump

Nearly four years ago I decided to start collecting tweets to Donald Trump out of morbid curiosity. If I were a real archivist, I would have planned this out a little bit better, and started collecting on election night in 2016, or inauguration day 2017. I didn’t. Using twarc, I started collecting with the Filter (Streaming) API on May 7, 2017. That process failed, and I pivoted to using the Search API.

Enhancing Archives Unleashed Toolkit Usability with Spark-Submit

Originally posted here. Over the last month, we have put out several Toolkit releases. The primary focus of the releases has been firming up and improving spark-submit support. What does this mean? The short answer is that it makes the Toolkit easier to use. Think of the “Let’s move tools towards our users” graphic from my “Cloud-hosted web archive data: The winding path to web archive collections as data” post from a few weeks back.

Cloud-hosted web archive data: The winding path to web archive collections as data

Originally posted here. Web archives are hard to use, and while the past activities of Archives Unleashed have helped to lower these barriers, they’re still there. But what if we could process our collections, place the derivatives in a data repository, and then allow our users to directly work with them in a cloud-hosted notebook? 🤔 Last year around this time, the Archives Unleashed team was working on what can now be referred to as our first iteration of notebooks for web archive analysis.

twut. Wait, wut? twut?

Originally posted here. A few of the Archives Unleashed team members have a pretty in-depth background in working with Twitter data. Jimmy Lin spent some time at Twitter during an extended sabbatical, Sam Fritz spent some time working with members of the Social Media Lab team prior to joining the Archives Unleashed Project, and Ian Milligan and I have done a fair bit of analysis and writing on our process of collecting and analyzing Canadian Federal Election tweets.