Automated Readability Assessment of German Language

School of Science (Perustieteiden korkeakoulu) | Master's thesis

Date

2020-01-20

Major/Subject

Cloud Computing and Services

Mcode

SCI3081

Degree programme

Master's Programme in ICT Innovation

Language

en

Pages

65 + 3

Abstract

Studies have shown that data-driven approaches to readability assessment, which combine automated linguistic analysis with machine learning (ML), are a viable way forward for readability ranking. This thesis investigates existing text readability techniques for the German language at the sentence level and describes the development of an automated readability estimator. The estimator is developed by applying supervised learning algorithms to German text corpora annotated with grade levels. The thesis systematically explores traditional, lexical, and morphological features. Natural language processing tools are used to extract 73 linguistic features, grouped by category. Feature engineering approaches are employed to identify the most informative features. Morphological features are found to constitute 19 of the 20 top-ranked features by importance. Four different supervised learning models are implemented, with the top-ranked features as input. The results show that ensemble machine learning models perform best, with the Random Forest Regressor yielding the best values for the evaluation metrics.
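
To make the notion of sentence-level features concrete, the following is a minimal sketch of how a few surface-level ("traditional" and lexical) features might be computed for a single German sentence. It is not the thesis's feature set: the function name `sentence_features`, the four features chosen, and the six-character threshold for "long" words (borrowed from the LIX convention) are all assumptions made for illustration. The morphological features that the abstract reports as most informative would require a proper NLP toolkit and are not covered here.

```python
import re


def sentence_features(sentence: str) -> dict:
    """Compute a few illustrative surface features for one sentence.

    Hypothetical helper for illustration only; not the thesis's extractor.
    """
    # Match runs of Unicode letters, so umlauts and "ß" stay inside tokens
    # while digits, underscores, and punctuation are excluded.
    words = re.findall(r"[^\W\d_]+", sentence)
    n = len(words)
    if n == 0:
        return {
            "word_count": 0,
            "avg_word_length": 0.0,
            "long_word_ratio": 0.0,
            "type_token_ratio": 0.0,
        }
    lowered = [w.lower() for w in words]
    return {
        # Traditional features: sentence length and word length.
        "word_count": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        # Assumed threshold: words longer than six characters count as long.
        "long_word_ratio": sum(len(w) > 6 for w in words) / n,
        # Simple lexical diversity measure: distinct words / total words.
        "type_token_ratio": len(set(lowered)) / n,
    }


features = sentence_features("Der Hund läuft schnell über die Straße.")
print(features)
```

In a pipeline like the one the abstract describes, feature dictionaries of this kind would be computed for every annotated sentence and passed to a supervised regressor; this sketch stops at feature extraction.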

Supervisor

Nurminen, Jukka

Thesis advisor

Naderi, Babak

Keywords

readability, assessment, data, machine learning
