dc.contributor.author | Tajuddin, Farzana | |
dc.date.accessioned | 2022-04-28T21:01:33Z | |
dc.date.available | 2022-04-28T21:01:33Z | |
dc.date.issued | 2022-03-24 | |
dc.identifier.uri | https://www.utupub.fi/handle/10024/153743 | |
dc.description.abstract | Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics that is
concerned with how computers interact with human language. With increasing computational power
and advances in technology, researchers have proposed various NLP tasks, many of which have
already been implemented in real-world applications. Automated text summarization is one of the
tasks that have not yet fully matured, particularly in the health sector. Success in this task would
enable healthcare professionals to grasp a patient's history in minimal time, leading to faster
decisions and better care.
Automatic text summarization is a process that shortens a long text without sacrificing
important information. This can be achieved either by paraphrasing the content, known as the
abstractive method, or by concatenating relevant extracted sentences, known as the extractive
method. In general, the process converts the text into numerical form and then applies a method to
identify and extract the relevant text.
This thesis explores NLP techniques used in extractive text summarization,
particularly in the health domain. The work includes a comparison of basic summarization models
implemented on a corpus of patient notes written by nurses in Finnish. The concepts and
research studies needed to understand the implementation are documented along with a
description of the code.
A Python-based project is structured to build a corpus and execute multiple summarization models. For
this thesis, we observe the performance of two textual embeddings: Term Frequency - Inverse
Document Frequency (TF-IDF), which is based on a simple statistical measure, and Word2Vec, which is
based on neural networks. For both models, LexRank, an unsupervised stochastic graph-based
sentence scoring algorithm, is used for sentence extraction, and random selection is used as the
baseline method for evaluation.
To evaluate and compare the performance of the models, each model's summaries of 15 patient care
episodes were given to two human evaluators for manual evaluation. On this small sample,
both evaluators appear to agree in preferring the
summaries produced by Word2Vec LexRank over those generated by TF-IDF LexRank.
Both evaluators also judged both models to perform better than the baseline of
random selection. | |
dc.format.extent | 72 | |
dc.language.iso | eng | |
dc.rights | fi=Julkaisu on tekijänoikeussäännösten alainen. Teosta voi lukea ja tulostaa henkilökohtaista käyttöä varten. Käyttö kaupallisiin tarkoituksiin on kielletty.|en=This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.| | |
dc.subject | Natural Language Processing, Text Summarization, Nursing Notes, Sentence Level Extraction, Extractive Summarization, Finnish | |
dc.title | Extractive Summarization : Experimental work on nursing notes in Finnish | |
dc.type.ontasot | fi=Pro gradu -tutkielma|en=Master's thesis| | |
dc.rights.accessrights | avoin | |
dc.identifier.urn | URN:NBN:fi-fe2022042831307 | |
dc.contributor.faculty | fi=Teknillinen tiedekunta|en=Faculty of Technology| | |
dc.contributor.studysubject | fi=Tietojenkäsittelytieteet|en=Computer Science| | |
dc.contributor.department | fi=Tietotekniikan laitos|en=Department of Computing| | |
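
The pipeline outlined in the abstract (convert sentences to numerical form, score them with a graph-based method, extract the top ones) can be illustrated with a minimal sketch. This is not the thesis code: it assumes whitespace tokenization, a simple smoothed IDF, the continuous (similarity-weighted) variant of LexRank with a PageRank-style damping factor, and invented English example notes. The function names `tfidf_vectors`, `lexrank_scores`, `summarize` and `random_baseline` are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(tokenized):
    """Sparse TF-IDF dict per sentence, treating each sentence as a document."""
    n = len(tokenized)
    df = Counter(term for sent in tokenized for term in set(sent))
    vectors = []
    for sent in tokenized:
        tf = Counter(sent)
        # Assumed weighting: relative term frequency times (ln(n / df) + 1).
        vectors.append({t: (c / len(sent)) * (math.log(n / df[t]) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(vectors, damping=0.85, iterations=50):
    """Power iteration over the row-normalized cosine-similarity graph."""
    n = len(vectors)
    if n == 0:
        return []
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(sim[j][i] * scores[j] / row_sums[j]
                            for j in range(n) if row_sums[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    tokenized = [s.lower().split() for s in sentences]
    scores = lexrank_scores(tfidf_vectors(tokenized))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def random_baseline(sentences, k=2, seed=0):
    """Random-selection baseline: k sentences chosen uniformly at random."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(sentences)), min(k, len(sentences)))
    return [sentences[i] for i in sorted(picked)]


# Invented example notes (not from the thesis corpus).
notes = [
    "patient reports pain in left knee",
    "pain in left knee eased after medication",
    "medication given for knee pain",
    "family visited in the afternoon",
]
print(summarize(notes, k=2))
print(random_baseline(notes, k=2))
```

In this sketch, sentences that share vocabulary with many others accumulate score, so the unrelated final note is not selected. Swapping `tfidf_vectors` for averaged Word2Vec vectors would give the second model compared in the thesis, but that step is omitted here because it needs a trained embedding model.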