This section describes the topic modeling pipeline used to score the relevance of the news items. A graphical overview with a brief description of each step is given first, followed by a detailed account of the pre-processing decisions in section Text Pre-Processing.
The first step in the topic modeling pipeline was to pre-process the news items for use with the LDA algorithm. After pre-processing, every item was represented as a word vector containing the significant words in that document. The details are described in section Text Pre-Processing.
The second step combined the document word vectors into a document corpus. Because we implemented the topic modeling pipeline using gensim by Radim Rehurek, the documents were converted into a Matrix Market corpus, the format expected by the next step.
The third step was the training of the topic model, for which we used gensim’s LdaModel implementation. For each language and country, topic models with 100, 500 and 1000 topics were created, each trained with three passes. One of the three models was selected in the next step.
The topic models were then evaluated with the help of human coders. The topics in each topic model were ranked by summing the probabilities that the topic assigns to the expert-generated keywords.
This sum is the cumulative match probability (CMP) of the topic. The topics were then ordered by CMP from high to low, and the top 20% of topics in each topic model were selected. These topics were then cross-checked by human coders for relevance.
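The CMP ranking can be sketched as follows. The function names, the toy topics, and the keyword list are hypothetical, and the sketch assumes that each topic is available as a mapping from word to probability. How the 20% cut-off was rounded is not stated in the text; integer floor division is assumed here, which is exact for 100, 500 and 1000 topics.

```python
def cumulative_match_probability(topic_word_probs, keywords):
    """Sum the probabilities that one topic assigns to the keywords."""
    return sum(topic_word_probs.get(word, 0.0) for word in keywords)


def select_top_topics(topics, keywords, top_percent=20):
    """Rank topics by CMP (high to low) and keep the top share of them."""
    cmp_by_topic = {
        topic_id: cumulative_match_probability(word_probs, keywords)
        for topic_id, word_probs in topics.items()
    }
    ranked = sorted(cmp_by_topic, key=cmp_by_topic.get, reverse=True)
    # Assumed rounding rule: floor, but always keep at least one topic.
    n_selected = max(1, len(ranked) * top_percent // 100)
    return ranked[:n_selected], cmp_by_topic


# Hypothetical topics (word -> probability) and expert-generated keywords.
topics = {
    0: {"election": 0.30, "vote": 0.25, "goal": 0.05},
    1: {"football": 0.40, "goal": 0.30, "vote": 0.02},
    2: {"parliament": 0.20, "coalition": 0.15, "election": 0.10},
    3: {"weather": 0.50, "rain": 0.30},
    4: {"match": 0.35, "election": 0.01},
}
keywords = ["election", "vote", "parliament"]

selected, cmp_by_topic = select_top_topics(topics, keywords)
```

With five toy topics, the top 20% corresponds to a single topic, here the one with the highest CMP.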
Additionally, the human coders scored the coherence of the words in each topic. These scores were then used to calculate a quality score and to select the model with the optimal number of topics (100, 500 or 1000). According to the performance criteria, the models with 500 topics performed best in each case. For a further description, see section Model Evaluation and Topic Selection.
The selected topic model was used to calculate, for each document, the probability that it contains one of the selected relevant topics. This was done by summing the document's probabilities for the relevant topics into a relevance score. The relevance scores for all documents were then used in the Material Identification and Sampling step.
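The relevance score can be illustrated with a short sketch. The function name and the example values are hypothetical; the document's topic distribution is assumed to be a list of (topic_id, probability) pairs, which is the form returned by gensim's get_document_topics.

```python
def relevance_score(doc_topic_probs, relevant_topics):
    """Sum a document's topic probabilities over the relevant topics."""
    return sum(
        prob for topic_id, prob in doc_topic_probs if topic_id in relevant_topics
    )


# Hypothetical topic distribution of one document and a set of relevant topic ids.
doc_topic_probs = [(0, 0.6), (2, 0.1), (3, 0.3)]
relevant_topics = {0, 2}

score = relevance_score(doc_topic_probs, relevant_topics)
```

In this example, topics 0 and 2 are relevant, so the score is the sum of their two probabilities; topic 3 does not contribute.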