High Performance Parallelization of Pattern Matching Algorithms
Soroushnia, Shima (2018-10-02)
High Performance Parallelization of Pattern Matching Algorithms
Soroushnia, Shima
(02.10.2018)
Tätä artikkelia/julkaisua ei ole tallennettu UTUPubiin. Julkaisun tiedoissa voi kuitenkin olla linkki toisaalle tallennettuun artikkeliin / julkaisuun.
Turun yliopisto
Tiivistelmä
Pattern discovery is one of the fundamental tasks in bioinformatics and pattern recognition is a powerful technique for searching large sequences of patterns in the huge biological sequence databases. Fast and high performance algorithms are highly demanded in many applications in bioinformatics and computational molecular biology since the significant increase in the number of DNA and protein sequences expands the need for raising the performance of pattern matching algorithms. For this purpose, heterogeneous architectures that provide different processing elements with different characteristics on the same machine, can be a good choice due to their potential for high performance processing and energy efficiency. Such architectures integrate different multicore CPUs with many-core accelerators (GPGPUs) and while running an application, some parts can execute on CPU while the other parts of the same application are running on the GPU side to benefit from the both processing elements features. For implementing the algorithm on heterogeneous architecture, we have used the message passing interface (OpenMP) for parallelizing the building of state machine on shared memory multicore CPU architectures and CUDA for fine grain parallelization of searching on the GPU.
In this thesis we propose different solutions for the aforementioned issues by presenting efficient implementations of Aho-Corasick (AC) algorithm, which is awell known exact multi-pattern matching algorithm with linear complexity, that uses the state machines for locating pattern inside the database. We have also proposed an optimized implementation of Parallel Failureless Aho-Corasick (PFAC) algorithm which is the massively parallelized version of AC algorithm without failure transitions, on a heterogeneous CPU/GPU architecture. We progressively redesigned both algorithms and data structures to fit on the GPU architecture. The result of our implementation was 15% speedup comparing the previous implementations.
We have also covered the topic of approximate pattern matching which is needed for the problems which can not be solved with the normal exact pattern matching algorithms. That means the problems in which the similarity of two strings should be measured, other than finding only the exact matches. Mistyping errors or genes mutation in DNA sequences can be examples of this kind of problems. As a solution to this problem, we have proposed an efficient parallel implementation of fuzzified Aho-Corasick algorithm, in which we have modified the Aho-Corasick algorithm in order to find the fuzzy matches inside the dataset. We have tried the algorithm with different fuzzy algorithms on data sets with different string sizes, in order to find out which algorithm works the best for different type of patterns.
In this thesis we propose different solutions for the aforementioned issues by presenting efficient implementations of Aho-Corasick (AC) algorithm, which is awell known exact multi-pattern matching algorithm with linear complexity, that uses the state machines for locating pattern inside the database. We have also proposed an optimized implementation of Parallel Failureless Aho-Corasick (PFAC) algorithm which is the massively parallelized version of AC algorithm without failure transitions, on a heterogeneous CPU/GPU architecture. We progressively redesigned both algorithms and data structures to fit on the GPU architecture. The result of our implementation was 15% speedup comparing the previous implementations.
We have also covered the topic of approximate pattern matching which is needed for the problems which can not be solved with the normal exact pattern matching algorithms. That means the problems in which the similarity of two strings should be measured, other than finding only the exact matches. Mistyping errors or genes mutation in DNA sequences can be examples of this kind of problems. As a solution to this problem, we have proposed an efficient parallel implementation of fuzzified Aho-Corasick algorithm, in which we have modified the Aho-Corasick algorithm in order to find the fuzzy matches inside the dataset. We have tried the algorithm with different fuzzy algorithms on data sets with different string sizes, in order to find out which algorithm works the best for different type of patterns.