Automated Readability Assessment of German Language
School of Science
Master's thesis
Authors
Date
2020-01-20
Department
Major/Subject
Cloud Computing and Services
Mcode
SCI3081
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
65 + 3
Series
Abstract
Studies have shown that data-driven approaches to readability assessment, using automated linguistic analysis and machine learning (ML), are a viable way forward for readability ranking. This thesis investigates existing text readability techniques for the German language at the sentence level and describes the development of an automated readability assessment estimator. The estimator is developed by applying supervised learning algorithms to German text corpora annotated with grade levels. The thesis systematically explores traditional, lexical and morphological features. Natural language processing tools are used to extract 73 linguistic features, grouped by category. Feature engineering approaches are employed to identify the most informative features. Morphological features are found to constitute 19 of the 20 top-ranked features by importance. Four different supervised learning models are implemented, with the top-ranked features as input. The results show that ensemble machine learning models exhibit the best performance, with the Random Forest Regressor yielding the best values for the evaluation metrics.
Supervisor
Nurminen, Jukka
Thesis advisor
Naderi, Babak
Keywords
readability, assessment, data, machine learning