Compressed suffix trees: design, construction, and applications
FacultiesFakultät für Ingenieurwissenschaften und Informatik
LicenseStandard (Fassung vom 01.10.2008)
In sequence analysis it is often advantageous to build an index data structure for large texts, as many tasks -- for instance repeated pattern matching -- can then be solved in optimal time. The most popular and important index data structure for many years was the suffix tree, which was proposed in the early 1970s by Weiner and was later mproved by McCreight. Gusfield showed in his textbook the power of the suffix tree by presenting over 20 problems which can be solved in optimal time complexity with ease by using the suffix tree. Despite its good properties -- the suffix tree can be constructed in linear time for a text over a constant size alphabet, and most operations can be performed in constant time -- it has a severe drawback. Its space consumption in practice is at least 17 times the text size. Therefore compressed suffix trees (CSTs) were proposed. In theory CSTs use at most two times the text size and provide all functionality of its uncompressed counterpart in essentially the same time complexity. In this work we focus on the practical implementation of CSTs, but also provide new theoretical suggestions for CST components. The presented design makes it possible to get myriads of CSTs with different time-space trade-offs. We show in an experimental study that the space of our CSTs is close to the theoretically expected space, and the runtime of the operations is faster -- sometimes by orders of magnitudes -- than the runtime of previous implementations. We have also improved the runtime of the construction process by an order of magnitude by using state of the art construction algorithms and a new cache-friendly algorithm for the construction of the longest common prefix array. Finally, we show that our CST implementation can replace the use of uncompressed suffix trees in some applications.
Subject HeadingsAlgorithmus [GND]
Data compression: Computer science [LCSH]
Data structures: Computer science [LCSH]