Additional Notes

This section provides additional notes on the data set.

Collection Errors

For the source Derin Düsünce, only the metadata for each news item was collected, i.e. the content field for those items is empty. The metadata contains the URL of the original item, which could be used to extract the full texts after the fact.

Variation in Item Output

The statistics for the daily item output of each source show some notable variation during the collection phase. Part of this variation can be explained by a combination of extraction and saving errors during the collection process. Where such errors were identified for a particular source, they are noted in the data set overview.

Another explanation is a seasonal trend in the output of the sources: even the data collections provided directly by the publishers, which should not contain any extraction or saving errors, show some variation over time.

Duplicate Items

The collection process used a URL-based item deduplication strategy, i.e. for a given source, an item was only added to the collection if no item with the same URL already existed in it. In some of the sources, especially those from the USA, the same item was published under different URLs on different feeds, e.g. on the main feed and on the politics feed.
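A minimal sketch of this strategy, assuming each item carries its URL in a field named url (the field name is an illustration, not taken from the actual collection code):

    def deduplicate_by_url(items):
        """Keep only the first item for each URL within a single source."""
        seen_urls = set()
        unique_items = []
        for item in items:
            if item["url"] not in seen_urls:
                seen_urls.add(item["url"])
                unique_items.append(item)
        return unique_items

    # A literal re-post under the same URL is dropped; the same article
    # published under a different URL (e.g. on the politics feed) passes.
    feed = [
        {"url": "https://example.com/a", "title": "A"},
        {"url": "https://example.com/a", "title": "A (re-post)"},
        {"url": "https://example.com/b", "title": "B"},
    ]
    assert len(deduplicate_by_url(feed)) == 2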

Therefore, the collection was scanned for duplicates after the fact to ascertain the number of duplicates for each source in the data set. Each document was pre-processed and turned into word vectors as described in section Text Pre-Processing, with the additional creation of n-grams up to length 3. The resulting vectors were then transformed into a TF-IDF model using the scikit-learn package, and each document was compared pairwise with all other documents of a given source using a linear kernel from scikit-learn.
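The following sketch shows how such a comparison can be set up with scikit-learn; it assumes the documents are already pre-processed strings, and TfidfVectorizer's ngram_range=(1, 3) stands in for the separate n-gram creation step described above:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import linear_kernel

    def pairwise_similarities(documents):
        """TF-IDF vectors with n-grams up to length 3, compared pairwise."""
        vectorizer = TfidfVectorizer(ngram_range=(1, 3))
        tfidf = vectorizer.fit_transform(documents)
        # TF-IDF rows are L2-normalized by default, so the linear kernel
        # (dot product) of two rows equals their cosine similarity.
        return linear_kernel(tfidf, tfidf)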

The threshold for flagging a document as a duplicate of another document was set to 75% similarity. For sources with fewer than 10,000 documents, the exact duplicate ratio was calculated. For sources with more than 10,000 documents, the ratio was estimated instead: a random sample was drawn from the corpus and split into 5 sub-samples of 2,000 items each, the number of duplicates was calculated for each sub-sample, and the results were averaged over all 5 sub-samples. The code for the duplicate analysis is attached below.
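As a preview of the attached code, the following sketch illustrates the thresholding and sampling logic. It reuses pairwise_similarities from the sketch above; counting each document as a duplicate at most once is an assumption of the sketch, not taken from the original analysis:

    import random

    import numpy as np

    SIMILARITY_THRESHOLD = 0.75  # 75% similarity flags a duplicate

    def duplicate_ratio(documents):
        """Share of documents flagged as duplicates of an earlier document."""
        sims = pairwise_similarities(documents)
        # Look only at the upper triangle: each pair once, no self-similarity.
        flags = np.triu(sims, k=1) >= SIMILARITY_THRESHOLD
        return flags.any(axis=0).sum() / len(documents)

    def estimated_duplicate_ratio(documents, n_subsamples=5, subsample_size=2000):
        """For large sources: average the ratio over random sub-samples."""
        sample = random.sample(documents, n_subsamples * subsample_size)
        ratios = [
            duplicate_ratio(sample[i * subsample_size:(i + 1) * subsample_size])
            for i in range(n_subsamples)
        ]
        return sum(ratios) / len(ratios)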

The results of this calculation are shown in the table in section Descriptive Statistics; they show that only very few sources contain a significant number of duplicate documents.

Language of the Sources

The language label of the sources was manually determined before the data collection and later cross-checked using the Python langdetect package. Each article with more than 50 words was analyzed with langdetect.
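A minimal sketch of that cross-check, assuming the 50-word cut-off is applied to whitespace-separated tokens (the exact tokenization used is not documented here):

    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # make langdetect's results deterministic

    def detect_article_language(text, min_words=50):
        """Return the detected language of an article, or None if the
        text is too short for a reliable detection."""
        if len(text.split()) <= min_words:
            return None
        return detect(text)  # ISO 639-1 code, e.g. 'de' or 'fr'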

The analysis showed that most of the sources are largely monolingual. The language label on the source level was based on the majority language of each source. However, two sources turned out to be multilingual, namely reimann-blog.ch and forausblog.ch. Items from these sources written in languages other than the project languages were removed from the data set.

Material