Evaluation

This section describes the evaluation process for the thematic classification. First, all Australian, Swiss and Turkish blog posts were manually coded as thematically irrelevant or relevant. We used this human-coded gold standard to evaluate the performance of the classifier (EITM).

Receiver Operating Characteristic (ROC)

A commonly used graphical evaluation method for scoring classifiers is the receiver operating characteristic (ROC). The method originated in signal detection, where it was used to evaluate the trade-off between true positives and false positives when setting the detection threshold of a radio receiver; the same trade-off arises when choosing the decision threshold of a scoring classifier.

When drawing the ROC, the true positive rate (recall) is plotted on the y-axis and the false positive rate on the x-axis. Analogous to recall, the false positive rate is defined as the number of negative instances falsely classified as positive, divided by the total number of negative instances in the dataset.
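As a concrete illustration, here is a minimal sketch of the two rates computed from hypothetical confusion-matrix counts; the function and argument names are placeholders, not taken from the original materials:

```python
# Hypothetical counts: tp = true positives, fn = false negatives,
# fp = false positives, tn = true negatives.

def true_positive_rate(tp, fn):
    # Recall: correctly identified positives over all positive instances.
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    # Negatives falsely classified as positive over all negative instances.
    return fp / (fp + tn)

print(true_positive_rate(80, 20))   # 0.8
print(false_positive_rate(10, 90))  # 0.1
```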

A perfect classifier would yield 100% true positives and 0% false positives, placing a point in the (0, 1) corner of the ROC plot. A classifier that merely guessed at random would yield points along the diagonal.
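The following sketch shows how such a curve could be drawn with scikit-learn and matplotlib; the gold-standard labels in y_true (1 = relevant, 0 = irrelevant) and the classifier scores in y_score are invented for illustration and are not the actual evaluation data:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Hypothetical gold standard and classifier scores.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, label=f"ROC (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()
```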

Precision-Recall Curve (PRC)

Another visual analysis method for the quality of a classifier is the precision-recall curve (PRC). It is mainly used in the field of information retrieval. Precision can be interpreted as the fraction of retrieved documents that are actually relevant, and recall as the fraction of all relevant documents that the information retrieval system manages to retrieve.

The PRC shows how precision and recall develop over all possible thresholds. It plots precision on the y-axis against recall on the x-axis, with each point on the curve representing one threshold value.
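Analogously to the ROC sketch above, the PRC could be drawn as follows; y_true and y_score are again invented placeholder data, not the actual evaluation data:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Hypothetical gold standard and classifier scores.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

# One (recall, precision) point per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```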

The two evaluation methods capture different aspects of a classifier's performance, which makes it necessary to consider both in the evaluation.

Materials

The data and Python code used to generate the ROC and PRC curves can be found below.

Baseline Comparison

For the baseline comparison, we ran a simple keyword-based search on our test corpus. We used the keywords provided by the experts, removed overly broad keywords that would have matched large proportions of the corpus, and then selected all documents containing at least one of the remaining keywords.
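A minimal sketch of such a selection step, assuming the corpus is a list of text strings and using case-insensitive substring matching; the variable names, example data, and matching rule are assumptions, not the original code:

```python
# Hypothetical corpus and (already pruned) expert keyword list.
documents = [
    "A blog post about climate policy in Switzerland ...",
    "A recipe for apple pie ...",
]
keywords = ["climate", "policy"]

# Keep every document that contains at least one keyword.
selected = [
    doc for doc in documents
    if any(kw.lower() in doc.lower() for kw in keywords)
]
print(len(selected))  # number of documents matched by the baseline
```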

Materials

The keyword lists used and the code for the document selection can be found below.