Algorithms for XML filtering

 |  Login

Show simple item record

dc.contributor Aalto-yliopisto fi
dc.contributor Aalto University en
dc.contributor.advisor Sippu, Seppo
dc.contributor.advisor Soisalon-Soininen, Eljas, Prof. Silvasti, Panu 2012-08-31T07:41:20Z 2012-08-31T07:41:20Z 2011
dc.identifier.isbn 978-952-60-4286-2 (PDF)
dc.identifier.isbn 978-952-60-4285-5 (printed)
dc.identifier.issn 1799-4942
dc.description.abstract In a publish-subscribe system based on XML filtering, the subscriber profiles are usually specified by filters written in the XPath language. The system processes the stream of XML documents and delivers to subscribers a notification or the content of those documents that match the filters. The number of interested subscribers and their stored profiles can be very large, thousands or even millions. In this case, the scalability of the system is critical. In this thesis, we develop several algorithms for XML filtering with linear XPath expressions. The algorithms are based on a backtracking Aho-Corasick pattern-matching automaton (PMA) built from "keywords" extracted from the filters, where a keyword is a maximal substring consisting only of XML element names. The output function of the PMA indicates which keyword occurrences of which filter are recognized at a given state. Our best results have been obtained by using a dynamically changing output function, which is dynamically updated during the processing of the input document. We have conducted an extensive performance study in which we compared our filtering algorithms with YFilter and the lazy DFA, two well-known automata-based filtering methods. With a non-recursive XML data set, PMA-based filtering is tens of times more efficient than YFilter and also significantly more efficient than the lazy DFA. With a slightly recursive data set PMA-based filtering has the same performance as the lazy DFA and it is significantly more efficient than YFilter. We have also developed an optimization method called filter pruning. This method improves the performance of filtering by utilizing knowledge about the XML document type definition (DTD) to simplify the filters. The optimization algorithm takes as input a DTD and a set of linear XPath filters and produces a set of pruned linear XPath filters that contain as few wildcards and descendant operators as possible. With a non-recursive data set and with a slightly recursive data set the filter-pruning method yielded a tenfold increase in the filtering speed of the PMA-based algorithms and a hundredfold increase with YFilter and the lazy DFA. Filter pruning can also increase the filtering speed in the case of highly recursive data sets. en
dc.format.extent Verkkokirja (1129 KB, 136 s.)
dc.format.mimetype application/pdf
dc.language.iso en en
dc.publisher Aalto University en
dc.relation.ispartofseries Aalto University publication series DOCTORAL DISSERTATIONS , 85/2011 en
dc.subject.other Computer science
dc.title Algorithms for XML filtering en
dc.type G4 Monografiaväitöskirja fi Perustieteiden korkeakoulu fi
dc.contributor.department Tietotekniikan laitos fi
dc.contributor.department Department of Computer Science and Engineering en
dc.subject.keyword XML stream processing en
dc.subject.keyword string processing en
dc.subject.keyword XPath en
dc.subject.keyword automata-based filtering en
dc.identifier.urn URN:ISBN:978-952-60-4286-2
dc.type.dcmitype text en
dc.type.ontasot Väitöskirja (monografia) fi
dc.type.ontasot Doctoral dissertation (monograph) en

Files in this item

This item appears in the following Collection(s)

Show simple item record

Search archive

Advanced Search

article-iconSubmit a publication


My Account