Model / Topic Evaluation

This section explains how relevant topics were selected for the relevance scoring.

Topic Selection

All topics mentioned by the experts surveyed in each country were combined into a list of keywords. This list was pre-processed with the same text pre-processing steps as the documents used to train the topic model.

Cumulative Match Probability (CMP) Ranking

The output of the LdaModel is a set of topics, each of which is a probability distribution over words. Thus, a topic is a set of words, each with a certain probability of occurring in that topic. Given the list of relevant keywords generated by the experts, this set of words can be divided into a set of relevant words, i.e. words that appear on the expert list, and a set of non-relevant words. Summing the probabilities of the relevant words gives the cumulative match probability (CMP) of the topic. In this way, a CMP was assigned to each topic in the topic model.
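The CMP computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function name `cumulative_match_probability`, the example topic, and the example keywords are hypothetical. The sketch assumes each topic is available as a list of (word, probability) pairs, which is the shape returned by, for example, gensim's `LdaModel.show_topic`.

```python
def cumulative_match_probability(topic_words, relevant_keywords):
    """Sum the probabilities of all topic words that appear on the keyword list."""
    return sum(prob for word, prob in topic_words if word in relevant_keywords)


# Hypothetical topic as (word, probability) pairs; values are made up.
topic = [("church", 0.25), ("state", 0.125), ("football", 0.5), ("mosque", 0.125)]

# Hypothetical pre-processed expert keywords.
expert_keywords = {"church", "mosque", "religion"}

cmp_score = cumulative_match_probability(topic, expert_keywords)
print(cmp_score)  # 0.375 (0.25 for "church" + 0.125 for "mosque")
```

Using a set for the keyword list keeps each membership check constant-time, which matters when topics are evaluated over the full vocabulary.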

The topics were then ranked by CMP and the top 20% of topics in each topic model were selected for further review. In the review process, human coders decided on the relevance of a topic by determining its significance for debates on the public role of religion in societal life, i.e. the topical domain of the study.
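The ranking and selection step could look roughly like the following sketch. The function name `select_top_share` and the CMP values are hypothetical. How a fractional number of topics is rounded and how ties are broken is not stated in this section; the sketch assumes rounding up and keeps the original order for ties.

```python
import math


def select_top_share(cmp_by_topic, percent=20):
    """Return the ids of the top `percent` of topics, ranked by CMP (descending)."""
    ranked = sorted(cmp_by_topic.items(), key=lambda item: item[1], reverse=True)
    # Assumption: round up, so that at least one topic is kept for small models.
    n_selected = math.ceil(len(ranked) * percent / 100)
    return [topic_id for topic_id, _ in ranked[:n_selected]]


# Hypothetical CMP values for a topic model with ten topics.
cmp_scores = {0: 0.02, 1: 0.31, 2: 0.11, 3: 0.05, 4: 0.27,
              5: 0.01, 6: 0.09, 7: 0.18, 8: 0.03, 9: 0.07}

top_topics = select_top_share(cmp_scores)
print(top_topics)  # [1, 4]
```

The selected topic ids are the candidates that would then be passed on to the human coders.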

Human Validation of Topic Relevance

Three human coders were instructed to code the relevance of each topic as well as the certainty of their judgment. The topic selection protocol defines the criteria for this task.

Afterwards, reliability scores for each of the following three coding schemes were calculated:

  • liberal: treat the topic as relevant if the coder was uncertain whether it is relevant or not
  • tendency: ignore the uncertainty and take the coders’ tendency to classify a topic as relevant or not at face value
  • conservative: treat the topic as irrelevant if the coder was uncertain whether it is relevant or not
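
The three schemes can be read as rules that turn one coder judgment into a binary relevance code. The sketch below is one possible reading, not the coding used in the Stata script: it assumes each judgment is stored as a pair of booleans (the coder's tendency and whether the coder was certain), and the function name `apply_scheme` is hypothetical.

```python
def apply_scheme(relevant, certain, scheme):
    """Map a coder's tendency and certainty to a binary relevance code."""
    if scheme == "liberal":
        # Uncertain judgments count as relevant.
        return relevant or not certain
    if scheme == "tendency":
        # Certainty is ignored.
        return relevant
    if scheme == "conservative":
        # Uncertain judgments count as irrelevant.
        return relevant and certain
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical judgments as (tendency is "relevant", coder was certain).
judgments = [(True, True), (True, False), (False, False), (False, True)]

for scheme in ("liberal", "tendency", "conservative"):
    print(scheme, [apply_scheme(r, c, scheme) for r, c in judgments])
# liberal [True, True, True, False]
# tendency [True, True, False, False]
# conservative [True, False, False, False]
```

The schemes differ only on uncertain judgments; a judgment the coder was certain about receives the same code under all three.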

The highest intercoder reliability was achieved when using the liberal coding scheme. The results of the reliability analysis are documented in the Reproduction Stata Script Topic Selection Human Coding.

On this basis, the liberally coded topics were selected for the subsequent thematic classification of documents. In this way, the topic model augmented the list of relevant keywords with keywords that co-occur in the documents with those generated by the expert input. At the same time, human coders filtered out nonsensical topics surfaced by the CMP ranking.