TY - JOUR

T1 - Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices

AU - Arroyuelo, Diego

AU - Navarro, Gonzalo

PY - 2010/12/1

Y1 - 2010/12/1

N2 - Given a text T[1..n] over an alphabet of size σ, the full-text search problem consists in locating the occ occurrences of a given pattern P[1..m] in T. Compressed full-text self-indices are space-efficient representations of the text that provide direct access to and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4nH_k(T) + o(n log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T). In practice, the average locating complexity of the LZ-index is O(σ m log_σ n + occ σ^(m/2)), where occ is the number of occurrences of P. It can extract text substrings of length l in O(l) time. This index outperforms competing schemes both to locate short patterns and to extract text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use nH_k(T) + o(n log σ) bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require 2(1 + ε)nH_k(T) + o(n log σ) bits of space, for any 0 < ε < 1. They have an average locating time of O((1/ε)(m log n + occ σ^(m/2))), while extracting takes O(l) time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index by a factor of 2/3, that is, around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of the text per second, being twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more.

AB - Given a text T[1..n] over an alphabet of size σ, the full-text search problem consists in locating the occ occurrences of a given pattern P[1..m] in T. Compressed full-text self-indices are space-efficient representations of the text that provide direct access to and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4nH_k(T) + o(n log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T). In practice, the average locating complexity of the LZ-index is O(σ m log_σ n + occ σ^(m/2)), where occ is the number of occurrences of P. It can extract text substrings of length l in O(l) time. This index outperforms competing schemes both to locate short patterns and to extract text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use nH_k(T) + o(n log σ) bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require 2(1 + ε)nH_k(T) + o(n log σ) bits of space, for any 0 < ε < 1. They have an average locating time of O((1/ε)(m log n + occ σ^(m/2))), while extracting takes O(l) time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index by a factor of 2/3, that is, around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of the text per second, being twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more.

U2 - 10.1145/1883683.1883684

DO - 10.1145/1883683.1883684

M3 - Article

JO - ACM Journal of Experimental Algorithmics

JF - ACM Journal of Experimental Algorithmics

SN - 1084-6654

ER -