MSc Thesis Defense by Marco Johns – Københavns Universitet

Title

Investigating Skip-grams for Language Model Ranking in Information Retrieval

Abstract

Language model ranking is widely used in Information Retrieval, typically with single words, i.e. unigrams, as terms. Unigram-based language models are also known as bag-of-words language models and assume that words appear independently of each other in a document. Language models that do not share this assumption may use n-grams, which take into account the n-1 surrounding words. However, n-grams do not perform nearly as well in ranking. This work investigates the use of k-skip-n-grams, or skip-grams, which allow k words to be skipped in order to widen the context and possibly improve ranking performance. For this purpose, query likelihood ranking with selected smoothing methods, as well as TF-IDF and BM25 for comparison, is applied using skip-grams with different configurations of n and k. An empirical evaluation, focusing on common Information Retrieval metrics (e.g. MAP, MRR, P@10, P@100, nDCG, bpref), was conducted using the CACM dataset and a proof-of-concept implementation. The results show that, for a given length n, performance increases for larger k. In comparison to the baseline of n-grams without skips (k = 0), the Mean Average Precision (MAP) improved by at least +0.01 and the Precision at rank 10 (P@10) by at least +0.03 across all ranking methods used, solely by introducing a single skip (k = 1). Because skip-grams have higher space and computation time requirements, their use is advised where space and computation time are not critical and even small improvements in ranking performance are relevant.
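
To make the notion of a k-skip-n-gram concrete, the following is a minimal Python sketch of one way such terms could be extracted from a token sequence. It is not taken from the thesis: the function name `skipgrams` is hypothetical, and the sketch assumes the common reading in which an n-gram may skip at most k words in total while preserving word order. The thesis implementation may define or count skips differently.

```python
from itertools import combinations


def skipgrams(tokens, n, k):
    """Return k-skip-n-grams as tuples of n tokens in their original order.

    Assumption (not from the thesis): at most k words are skipped in total
    inside each n-gram, so k = 0 yields the ordinary contiguous n-grams.
    """
    grams = []
    for start in range(len(tokens) - n + 1):
        # The remaining n - 1 tokens are chosen from a window that extends
        # at most n - 1 + k positions past the start token.
        window_end = min(start + n + k, len(tokens))
        for rest in combinations(range(start + 1, window_end), n - 1):
            grams.append((tokens[start],) + tuple(tokens[i] for i in rest))
    return grams


tokens = "a b c d".split()
print(skipgrams(tokens, 2, 0))  # contiguous bigrams only
print(skipgrams(tokens, 2, 1))  # bigrams plus pairs that skip one word
```

Under this assumed definition, the four-token example produces three terms for k = 0 and five for k = 1. The growth of the term set with k illustrates why skip-grams need more space and computation time than plain n-grams.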

Supervisors

Jakob Grue Simonsen and Christina Lioma, DIKU