Improved algorithms for string searching problems

dc.contributorAalto-yliopistofi
dc.contributorAalto Universityen
dc.contributor.authorSalmela, Leena
dc.contributor.departmentTietotekniikan laitosfi
dc.date.accessioned2012-08-21T12:43:43Z
dc.date.available2012-08-21T12:43:43Z
dc.date.issued2009
dc.description.abstractWe present improved practically efficient algorithms for several string searching problems, where we search for a short string called the pattern in a longer string called the text. We are mainly interested in the online problem, where the text is not preprocessed, but we also present a light indexing approach to speed up exact searching of a single pattern. The new algorithms can be applied e.g. to many problems in bioinformatics and other content scanning and filtering problems. In addition to exact string matching, we develop algorithms for several other variations of the string matching problem. We study algorithms for approximate string matching, where a limited number of errors is allowed in the occurrences of the pattern, and parameterized string matching, where a substring of the text matches the pattern if the characters of the substring can be renamed in such a way that the renamed substring matches the pattern exactly. We also consider searching multiple patterns simultaneously and searching weighted patterns, where the weight of a character at a given position reflects the probability of that character occurring at that position. Many of the new algorithms use the backward matching principle, where the characters of the text that are aligned with the pattern are read backward, i.e. from right to left. Another common characteristic of the new algorithms is the use of q-grams, i.e. q consecutive characters are handled as a single character. Many of the new algorithms are bit parallel, i.e. they pack several variables to a single computer word and update all these variables with a single instruction. We show that the q-gram backward string matching algorithms that solve the exact, approximate, or multiple string matching problems are optimal on average. We also show that the q-gram backward string matching algorithm for the parameterized string matching problem is sublinear on average for a class of moderately repetitive patterns. All the presented algorithms are also shown to be fast in practice when compared to earlier algorithms. We also propose an alphabet sampling technique to speed up exact string matching. We choose a subset of the alphabet and select the corresponding subsequence of the text. String matching is then performed on this reduced subsequence and the found matches are verified in the original text. We show how to choose the sampled alphabet optimally and show that the technique speeds up string matching especially for moderate to long patterns.en
dc.format.extentVerkkokirja (1023 KB, 141 s.)
dc.format.mimetypeapplication/pdf
dc.identifier.isbn978-951-22-9888-4
dc.identifier.isbn978-951-22-9887-7 (printed)#8195;
dc.identifier.issn1797-6936
dc.identifier.urihttps://aaltodoc.aalto.fi/handle/123456789/4613
dc.identifier.urnURN:ISBN:978-951-22-9888-4
dc.language.isoenen
dc.publisherTeknillinen korkeakouluen
dc.relation.ispartofseriesTKK research reports in computer science and engineering, A, 1/09en
dc.subject.keywordstring matchingen
dc.subject.keywordapproximate string matchingen
dc.subject.keywordmultiple string matchingen
dc.subject.keywordparameterized string matchingen
dc.subject.keywordweighted string matchingen
dc.subject.keywordq-gramsen
dc.subject.keywordbit parallelismen
dc.subject.keywordtext indexingen
dc.subject.otherComputer scienceen
dc.titleImproved algorithms for string searching problemsen
dc.typeG4 Monografiaväitöskirjafi
dc.type.dcmitypetexten
dc.type.ontasotVäitöskirja (monografia)fi
dc.type.ontasotDoctoral dissertation (monograph)en
local.aalto.digiauthask
local.aalto.digifolderAalto_67980

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
isbn9789512298884.pdf
Size:
999.27 KB
Format:
Adobe Portable Document Format