Optimizing ahocorasick for word counting.

LUCENA, E. L.; http://lattes.cnpq.br/5944567562075735; LUCENA, Emerson Leonardo.

URI: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128

Date: 2020

Abstract:

The Aho-Corasick algorithm is used to recognize all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length, and 𝑜 is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the 𝑜 term dominates and the algorithm's performance decreases. In many domains, such as information retrieval, natural language processing, and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of a set of words in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100MB and dictionaries of sizes 1KB, 1MB, and 10MB. The new algorithm performed better in every experiment, running 50% to 300% faster than the original Aho-Corasick.
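The abstract does not spell out how the counting is done, so the following is only a minimal sketch of one well-known way to count keyword occurrences without enumerating every match, and it may differ from the paper's algorithm. It assumes non-empty keywords: the scan records how often each automaton state is visited, and the visit counts are then pushed along the failure links once, in reverse breadth-first order. The function name `count_words` and all variable names are hypothetical.

```python
from collections import deque


def count_words(keywords, text):
    """Count occurrences of each keyword in text (illustrative sketch)."""
    # Build the trie. goto[s] maps a character to the child state of s.
    goto = [{}]
    fail = [0]
    end_state = {}  # keyword -> state where it ends
    for word in keywords:
        if not word:
            continue  # assumption: empty keywords are ignored
        s = 0
        for ch in word:
            nxt = goto[s].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[s][ch] = nxt
                goto.append({})
                fail.append(0)
            s = nxt
        end_state[word] = s

    # Compute failure links breadth-first and remember the BFS order.
    order = []
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)

    # Scan the text once, counting visits per state instead of
    # reporting each individual match.
    visits = [0] * len(goto)
    s = 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        visits[s] += 1

    # Propagate counts from deeper states to their failure targets, so
    # every state ends up with the number of text positions at which
    # its string occurs as a suffix of the text read so far.
    for s in reversed(order):
        visits[fail[s]] += visits[s]

    return {word: visits[state] for word, state in end_state.items()}
```

For instance, `count_words(["a", "aa"], "aaaa")` yields 4 for `"a"` and 3 for `"aa"`, overlapping matches included. In this sketch the scan costs time proportional to the text length and the propagation step costs time proportional to the number of trie states, so the running time does not grow with the total number of occurrences.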
