Detecting and Analyzing Text Reuse with BLAST

Vesanto, Aleksi

2019-01-15

Master's thesis

Computer Science

Vesanto_Aleksi_opinnayte.pdf

3.33 MB

Open access

This publication is subject to copyright. The work may be read and printed for personal use. Commercial use is prohibited.

Downloads: 1071

Permanent address

https://urn.fi/URN:NBN:fi-fe201901313724

Abstract

In this thesis, I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse that leverages BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biomedical sequences such as DNA and protein sequences. I explain the original BLAST algorithm in depth, going through it step by step, and describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST-based text reuse detection method by comparing it against the previous state of the art and show that the proposed method outperforms it by a large margin. I apply the method to a dataset of 3 million documents from scanned Finnish newspapers and journals that have been converted into text using OCR (Optical Character Recognition) software. I divide the results into three categories: everyday text reuse, long-term reuse, and viral news. I describe each category, provide examples, and propose a novel method of calculating a virality score for the clusters.
