Collection Process

This section describes in detail how the items were collected. Different crawling and extraction approaches had to be used depending on the source. The section Descriptive Statistics shows which approach was applied to which source.

Process Overview

Figure: Overview of the steps in the crawling process.

Process Steps

  1. Kimono

    For news websites that did not provide their own RSS feed, Kimono by kimonolabs.com was used to create a custom RSS feed. Kimono is a visual crawling solution that can extract recurring parts of websites and convert them into several machine-readable formats. The parts to be extracted were defined using the Kimono Chrome extension, and the websites' item headlines and links were then converted into RSS format once a day. Because the Kimono cloud service was shut down during the crawling period, Kimono Desktop was deployed to a Windows virtual machine and automated using a Python script and the Kimono API.

  2. Java Crawler

    For the continuous crawling, a Java-based crawler was developed. It used ROME to read the RSS feeds and extract the article links; the links were then crawled using Fivefilters (see below), and the article plain text was produced using Jsoup and Unbescape (see XML/HTML to Text Cleanup).
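
    A minimal sketch of this feed-reading step, assuming the current ROME API (package com.rometools.rome), could look as follows; the class and method names other than ROME's own are illustrative:

      import com.rometools.rome.feed.synd.SyndEntry;
      import com.rometools.rome.feed.synd.SyndFeed;
      import com.rometools.rome.io.SyndFeedInput;
      import com.rometools.rome.io.XmlReader;

      import java.net.URL;
      import java.util.ArrayList;
      import java.util.List;

      public final class FeedLinkExtractor {

          // Reads an RSS feed with ROME and returns the article links,
          // which are then handed on to the Fivefilters/Jsoup pipeline.
          public static List<String> extractLinks(String feedUrl) throws Exception {
              SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL(feedUrl)));
              List<String> links = new ArrayList<>();
              for (SyndEntry entry : feed.getEntries()) {
                  links.add(entry.getLink());
              }
              return links;
          }
      }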

  3. Splash

    The source pi-news.net used JavaScript and XHR to fetch the news item from its server instead of delivering it with the first request. These pages therefore had to be rendered using Splash so that the HTML source was complete when the content extraction ran.
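
    Splash exposes an HTTP rendering API (render.html); a sketch of requesting a rendered page, assuming a local instance on Splash's default port 8050 and an illustrative wait time, could look like this:

      import java.net.URI;
      import java.net.URLEncoder;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.nio.charset.StandardCharsets;

      public final class SplashFetcher {

          // Asks the Splash instance to render the page, so that content
          // loaded via JavaScript/XHR is present in the returned HTML.
          public static String renderedHtml(String articleUrl) throws Exception {
              String splashUrl = "http://localhost:8050/render.html?wait=2&url="
                      + URLEncoder.encode(articleUrl, StandardCharsets.UTF_8);
              HttpRequest request = HttpRequest.newBuilder(URI.create(splashUrl)).GET().build();
              HttpResponse<String> response = HttpClient.newHttpClient()
                      .send(request, HttpResponse.BodyHandlers.ofString());
              return response.body();
          }
      }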

  4. Fivefilters

    To extract the plain text from the news websites, it was necessary to remove elements from the HTML source that did not belong to the item content, such as design elements, navigation, and advertising. Fivefilters Full-Text, a heuristic full-text extraction service, was used to extract the item content into an RSS format. Fivefilters can retrieve and combine multi-page articles and, using advanced heuristics, provide a clean XML result comprising all elements of the item. If the heuristic extraction failed for a particular source, for example tagesschau.de, the extraction was configured manually by specifying XPath / CSS selectors.
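
    A sketch of handing an article URL to a self-hosted Full-Text RSS instance follows; the host and the makefulltextfeed.php endpoint reflect the publicly available Fivefilters releases and are assumptions rather than the original configuration:

      import java.net.URI;
      import java.net.URLEncoder;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;
      import java.nio.charset.StandardCharsets;

      public final class FullTextFetcher {

          // Requests the extracted article content as an RSS document from a
          // (hypothetical) local Fivefilters Full-Text RSS installation.
          public static String fullTextFeed(String articleUrl) throws Exception {
              String endpoint = "http://localhost/full-text-rss/makefulltextfeed.php?url="
                      + URLEncoder.encode(articleUrl, StandardCharsets.UTF_8);
              HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint)).GET().build();
              return HttpClient.newHttpClient()
                      .send(request, HttpResponse.BodyHandlers.ofString())
                      .body();
          }
      }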

  5. Scrapy

    For some sources, it was necessary to develop special crawlers. One problem was that foraus.ch changed its website to an AngularJS-based front-end that could not be crawled using the Java-based crawler. Another was that redstate.com used Cloudflare's DDoS protection, which requires CAPTCHAs to be solved before the article content is displayed.

    The Scrapy framework for Python was used in these cases to develop special crawlers. In the case of foraus.ch, the articles were extracted directly from the backend that is called by the AngularJS front-end app. For redstate.com, the CAPTCHAs were solved manually, and the resulting cookie was then used with an Nginx proxy and a Python script to extract as many articles as possible while the cookie remained valid.
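
    The cookie-reuse idea can be illustrated independently of the original Nginx/Python setup; in the following Java sketch, the cookie header value is a purely hypothetical placeholder for the clearance cookie obtained after solving the CAPTCHA in a browser:

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;

      public final class CookieReuseFetcher {

          // Fetches an article while presenting a clearance cookie obtained
          // manually in a browser session; requests succeed only as long as
          // that cookie is still accepted by the DDoS protection.
          public static String fetchWithCookie(String articleUrl, String cookieHeader)
                  throws Exception {
              HttpRequest request = HttpRequest.newBuilder(URI.create(articleUrl))
                      .header("Cookie", cookieHeader)  // e.g. "clearance=<value copied from browser>"
                      .header("User-Agent", "Mozilla/5.0")
                      .GET()
                      .build();
              return HttpClient.newHttpClient()
                      .send(request, HttpResponse.BodyHandlers.ofString())
                      .body();
          }
      }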

  6. Websearch

    washingtonpost.com was not crawled completely because several of its RSS feeds changed during the crawling period, which was only discovered late. Therefore, the item metadata, i.e. date and headline, was extracted from pqarchiver.com using Scrapy. Google and Bing searches with this metadata were then used to find the URLs of the original items on washingtonpost.com. The URL list generated by this procedure was then crawled using another Python/Scrapy script.
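
    A small sketch of this lookup step follows; the query pattern and the filtering of result URLs by domain are illustrative, since the exact search queries used are not documented here:

      import java.util.List;
      import java.util.Optional;

      public final class ArticleUrlLookup {

          // Builds a site-restricted search query from the item metadata.
          public static String buildQuery(String headline, String date) {
              return "\"" + headline + "\" " + date + " site:washingtonpost.com";
          }

          // Picks the first search result that actually points to washingtonpost.com.
          public static Optional<String> pickOriginalUrl(List<String> resultUrls) {
              return resultUrls.stream()
                      .filter(url -> url.contains("washingtonpost.com/"))
                      .findFirst();
          }
      }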

  7. Python Transformation

    Some sources provided item collections in various structured data formats (XML, XLSX, and HTML). These files were parsed and transformed with Python scripts to integrate them into the target database. A JSON-based format was used as an intermediate step before loading the articles into the database.
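
    The intermediate format is not specified further; for illustration only, a record in it might resemble the following, with all field names being hypothetical:

      {
        "source":   "<source name>",
        "url":      "<original article URL>",
        "date":     "<publication date>",
        "headline": "<item headline>",
        "text":     "<plain text of the item>"
      }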

XML/HTML to Text Cleanup

For the extraction of the plain text from the XML/HTML source, the Java libraries Jsoup and Unbescape were used. Because Fivefilters converted HTML into XML, the first step was to reverse the XML escaping using Unbescape's HtmlEscape.unescapeHtml function. After that, the resulting HTML code was parsed using Jsoup, and the Jsoup.clean function was applied. Finally, any remaining HTML entities were removed by applying HtmlEscape.unescapeHtml a second time.

For the preservation of newlines and better readability of the item, newlines were added after headlines and paragraphs, i.e. after <p>, <h1>, <h2>, <h3>, <h4>, <h5> and <h6> tags, and after each line break in the item, i.e. after a <br> tag. This way, line breaks that are naturally displayed by the browser, e.g. after a headline or a paragraph, are also included in the plain text version of the article generated by this step.

The following listing sketches the cleanup function:
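
It is a minimal reconstruction of the steps described above, assuming current Jsoup and Unbescape APIs; the class name and the newline-marker trick are illustrative choices rather than the original code.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.safety.Safelist;
    import org.unbescape.html.HtmlEscape;

    public final class ArticleTextCleanup {

        public static String toPlainText(String escapedXml) {
            // 1. Reverse the XML escaping introduced by Fivefilters.
            String html = HtmlEscape.unescapeHtml(escapedXml);

            // 2. Parse the HTML and insert a "\n" marker after headlines,
            //    paragraphs and <br> tags so the breaks survive cleaning.
            Document doc = Jsoup.parse(html);
            doc.select("p, h1, h2, h3, h4, h5, h6").after("\\n");
            doc.select("br").after("\\n");

            // 3. Strip all markup; Safelist.none() keeps only the text
            //    (this class is called Whitelist in older Jsoup releases).
            String text = Jsoup.clean(doc.body().html(), "", Safelist.none(),
                    new Document.OutputSettings().prettyPrint(false));

            // 4. Remove remaining HTML entities and turn the markers into
            //    real line breaks.
            return HtmlEscape.unescapeHtml(text).replace("\\n", "\n");
        }
    }

The final replace call converts the two-character "\n" markers into actual newlines, which keeps Jsoup's whitespace normalization from discarding the added breaks.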