Model Selection and Validation

This section explains how the number-of-topics parameter of the LdaModel was optimized.

Model Selection based on Topic Coherence

The LdaModel was trained with the number-of-topics parameter set to 100, 500, and 1000. The algorithm therefore produced three candidate models for each country, which had to be evaluated and from which the best model had to be selected. To do so, human coders evaluated the top 20% of topics (by CMP) for the coherence of the words in each topic, rating each topic on a 3-point scale. The ratings were then averaged to obtain a mean topic coherence for each model. Finally, the topic model with the highest mean coherence was selected.
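The averaging step can be sketched as follows. This is a minimal illustration, not the procedure actually used: the rating values, the number of coders, and the names `ratings`, `mean_coherence`, and `select_most_coherent` are all hypothetical, and averaging first per topic and then across topics is an assumption about the order of aggregation.

```python
from statistics import mean

# Hypothetical ratings: number of topics -> topic id -> scores from each coder
# (3-point scale, 3 = most coherent). All values are made up for illustration.
ratings = {
    100: {0: [3, 3], 1: [2, 3], 2: [3, 2]},
    500: {0: [2, 3], 1: [2, 2], 2: [1, 2]},
    1000: {0: [1, 2], 1: [2, 1], 2: [1, 1]},
}


def mean_coherence(topic_ratings):
    """Average the coders' scores per topic, then average across topics."""
    per_topic = [mean(scores) for scores in topic_ratings.values()]
    return mean(per_topic)


def select_most_coherent(ratings_by_model):
    """Return the model size with the highest mean coherence and all scores."""
    scores = {size: mean_coherence(r) for size, r in ratings_by_model.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = select_most_coherent(ratings)
print(best, scores)
```

When every topic is rated by the same number of coders, averaging per topic first gives the same result as averaging all ratings of a model at once.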

Figure C2.1 - The mean topic coherence for each candidate model in each of the countries.

Figure C2.1 shows the mean topic coherence for each candidate model by country. For all countries, the topic models with 100 topics had the highest coherence. Coherence was lower for the models with 500 topics and lowest for the models with 1000 topics. Thus, topic models with more topics are harder to interpret than topic models with fewer topics.


Furthermore, we looked at the overlap between the topics generated by the topic model and selected by the human coders on the one hand and the topics named by the experts on the other. For this task, we used the measures BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) to judge the coverage of the expert topics within the trained models. The measures can be interpreted as the precision (BLEU) and recall (ROUGE) of the topic models in comparison with the expert-generated topics. BLEU was originally developed to determine the quality of computer-generated translations when compared with a reference translation. ROUGE was created to judge how well an automatically generated summary covers a set of reference summaries.

In our case, we used ROUGE-1 to calculate the coverage of the expert topics by the topic models, as we did not include n-grams in the training of the topic models. Additionally, we used BLEU to offset the effect that models with more topics have a larger set of words and therefore automatically exhibit a larger overlap. This means that a good topic model needs a high recall of expert topics together with a high precision, so that its topics do not include a lot of noise.

For calculating the ROUGE-1 and BLEU scores of the topics in each of the topic models, we used the following definitions:

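(1) $\mathrm{modelKeywords} = \bigcup_{i=1}^{|\mathrm{ldaTopics}|} \mathrm{modelKeywords}(\mathrm{ldaTopic}_i)$
(2) $\mathrm{recall}(\mathrm{topic}_x) = \text{rouge-1}(\mathrm{topic}_x) = \frac{|\mathrm{expertKeywords}(\mathrm{expertTopic}_x) \cap \mathrm{modelKeywords}|}{|\mathrm{expertKeywords}(\mathrm{expertTopic}_x)|}$
(3) $\mathrm{precision}(\mathrm{topic}_x) = \mathrm{bleu}(\mathrm{topic}_x) = \frac{|\mathrm{expertKeywords}(\mathrm{expertTopic}_x) \cap \mathrm{modelKeywords}|}{|\mathrm{modelKeywords}|}$
(4) $\mathrm{expertKeywords} = \bigcup_{i=1}^{|\mathrm{expertTopics}|} \mathrm{expertKeywords}(\mathrm{expertTopic}_i)$
(5) $\mathrm{overallRecall} = \text{overallRouge-1} = \frac{|\mathrm{expertKeywords} \cap \mathrm{modelKeywords}|}{|\mathrm{expertKeywords}|}$
(6) $\mathrm{overallPrecision} = \mathrm{overallBleu} = \frac{|\mathrm{expertKeywords} \cap \mathrm{modelKeywords}|}{|\mathrm{modelKeywords}|}$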

For the set of keywords in an ldaTopic, we used the top 20 keywords of the topic. We then collected a set of keywords that contained the top 20 keywords of all relevant topics (see definition 1 above). For each expert topic, we compared those keywords with the keywords given by the experts and calculated the coverage of the expert keywords by the topic model (2) as well as the precision of the keywords returned by the topic model, i.e. the proportion of the model keywords that are found among the keywords given by the experts (3). For the overall recall and precision of the topic models, we collected the expert keywords for all of the topics in a country (4), similar to (1), and then calculated the proportions of keywords shared with the expert topics (5) and (6).
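The set-based recall and precision described here can be sketched in a few lines. This is only an illustration under stated assumptions: definitions (1) and (4) are read as pooling keywords across topics (a set union), as the paragraph above describes, and the function names `pooled_keywords` and `recall_precision` as well as the toy topics and keywords are hypothetical.

```python
def pooled_keywords(keyword_lists, top_n=None):
    """Pool (union) the first top_n keywords of every topic into one set."""
    pooled = set()
    for words in keyword_lists:
        pooled.update(words[:top_n])  # top_n=None keeps all keywords
    return pooled


def recall_precision(expert_keywords, model_keywords):
    """Return (recall, precision) of the model keywords against the expert keywords."""
    overlap = expert_keywords & model_keywords
    recall = len(overlap) / len(expert_keywords) if expert_keywords else 0.0
    precision = len(overlap) / len(model_keywords) if model_keywords else 0.0
    return recall, precision


# Hypothetical toy data: ranked keywords per LDA topic and keywords per expert topic.
lda_topics = [
    ["tax", "budget", "economy", "growth"],
    ["school", "teacher", "education", "budget"],
]
expert_topics = {
    "economy": ["tax", "economy", "jobs"],
    "education": ["school", "university"],
}

# Definition (1): pooled top-20 keywords of the relevant LDA topics.
model_keywords = pooled_keywords(lda_topics, top_n=20)

# Definitions (2) and (3): recall and precision per expert topic.
per_topic = {
    name: recall_precision(set(words), model_keywords)
    for name, words in expert_topics.items()
}

# Definitions (4) to (6): pooled expert keywords, then overall recall and precision.
expert_keywords = pooled_keywords(expert_topics.values())
overall_recall, overall_precision = recall_precision(expert_keywords, model_keywords)
print(per_topic, overall_recall, overall_precision)
```

Because the model keyword set can only grow as more topics are pooled, recall under these definitions cannot decrease when topics are added, while precision is divided by an ever larger set.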

Figure C2.2 - ROUGE-1 and BLEU scores for each of the topic model candidates in each of the countries

Figure C2.2 shows the ROUGE-1 and BLEU scores for each of the topic model candidates in each of the countries. While ROUGE-1 generally increases with the number of topics learned, BLEU decreases as the number of topics in the topic model grows.

Thus, there is a trade-off between the recall of the expert topics on the one hand and the precision of the topic model and the coherence of its topics on the other (see the section "Model Selection based on Topic Coherence" and Figure C2.1).

Our goal was to maximize both the recall of the expert topics and the topic coherence. Therefore, we selected the topic model candidate with 500 topics in each of the countries, as it represents a good balance between the measures.



Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Philadelphia, Pennsylvania, USA: Association for Computational Linguistics.

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens & S. Szpakowicz (Eds.), Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (pp. 74–81). Barcelona, Spain: Association for Computational Linguistics.