Please use this identifier to cite or link to this item: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
Title: Optimizing Aho-Corasick for word counting.
Other Titles: Otimizando Aho-Corasick para contagem de palavras.
Author: LUCENA, Emerson Leonardo.
Advisor: GHEYI, Rohit.
Referee 1: MONTEIRO, João Arthur Brunet.
Referee 2: MASSONI, Tiago Lima.
Keywords: Aho-Corasick algorithm; Pattern matching; Correspondência de padrões; Filtrage; Coincidencia de patrones; Word counting; Recuento de palabras; Comptage de mots; Contagem de palavras; Algoritmo offline; Algorithme hors ligne; Algoritmo sin conexión; Offline algorithm; Processamento de textos; Processing of texts; Procesamiento de textos; Traitement des textes
Issue Date: 2020
Publisher: Universidade Federal de Campina Grande
Citation: LUCENA, E. L. Optimizing Aho-Corasick for word counting. 12 f. Trabalho de Conclusão de Curso - Artigo (Curso de Bacharelado em Ciência da Computação) Graduação em Ciência da Computação, Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande - Paraíba - Brasil, 2020. Disponível em: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
Abstract: The Aho-Corasick algorithm is used to recognize all occurrences of a set of strings in a text. Its time complexity is O(m + n + o), where m is the sum of the lengths of the keywords, n is the text length and o is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the algorithm's performance decreases. In many domains, such as information retrieval, natural language processing and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of a set of words in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is O(m + n + u), where u is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100MB and dictionaries of sizes 1KB, 1MB and 10MB. The new algorithm performed better in every experiment, running 50% to 300% faster than Aho-Corasick.
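The abstract does not describe how the new algorithm works, so the following is only a minimal sketch of one well-known offline counting technique, not necessarily the author's method. It assumes distinct, non-empty dictionary words. Instead of reporting every match during the scan (the source of the o term), it records how often each automaton state is reached and then pushes those tallies along the failure links once, deepest states first. The function name `count_words` and all variable names are hypothetical. This sketch runs in O(m + n) after the scan-and-propagate step, which is not the same bound as the O(m + n + u) stated in the abstract.

```python
from collections import deque


def count_words(words, text):
    """Count occurrences (overlaps included) of each word in text.

    Illustrative sketch only; assumes the words are distinct and non-empty.
    """
    # Trie stored as parallel lists; state 0 is the root.
    goto = [{}]        # goto[s]: dict mapping a character to the next state
    fail = [0]         # fail[s]: failure link of state s
    word_at = [None]   # word_at[s]: dictionary word ending at s, if any

    for word in words:
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                word_at.append(None)
            state = nxt
        word_at[state] = word

    # Breadth-first traversal computes failure links.
    # 'order' lists the non-root states by increasing depth.
    order = []
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        order.append(state)
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            queue.append(nxt)

    # Scan the text once, tallying visits instead of reporting matches.
    visits = [0] * len(goto)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        visits[state] += 1

    # Offline step: every visit to a state is also an occurrence of the
    # string at its failure link, so add tallies from deepest to shallowest.
    for state in reversed(order):
        visits[fail[state]] += visits[state]

    return {word_at[s]: visits[s]
            for s in range(len(goto)) if word_at[s] is not None}


if __name__ == "__main__":
    print(count_words(["he", "she", "his", "hers"], "ushers"))
```

Because the counts only become available after the whole text has been scanned, the approach is offline in the same sense as the abstract uses the term: it cannot report a match at the moment it occurs.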
CNPq Subject: Computer Science
URI: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
Appears in Collections: Trabalho de Conclusão de Curso - Artigo - Ciência da Computação

Files in This Item:
File: EMERSON LEONARDO LUCENA - TCC CIÊNCIA DA COMPUTAÇÃO 2020.pdf
Size: 1.49 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.