
dc.creator.ID LUCENA, E. L. pt_BR
dc.creator.Lattes http://lattes.cnpq.br/5944567562075735 pt_BR
dc.contributor.advisor1 GHEYI, Rohit.
dc.contributor.advisor1ID GHEYI, R. pt_BR
dc.contributor.advisor1Lattes http://lattes.cnpq.br/2931270888717344 pt_BR
dc.contributor.referee1 MONTEIRO, João Arthur Brunet.
dc.contributor.referee2 MASSONI, Tiago Lima.
dc.publisher.country Brazil pt_BR
dc.publisher.department Centro de Engenharia Elétrica e Informática - CEEI pt_BR
dc.publisher.initials UFCG pt_BR
dc.subject.cnpq Computer Science (Ciência da Computação) pt_BR
dc.title Optimizing Aho-Corasick for word counting. pt_BR
dc.date.issued 2020
dc.description.abstract The Aho-Corasick algorithm finds all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length, and 𝑜 is the total number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the 𝑜 term grows and the algorithm's performance degrades. In many domains, such as information retrieval, natural language processing, and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm for counting the number of occurrences of each word of a set in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100 MB and dictionaries of 1 KB, 1 MB, and 10 MB. The new algorithm performed better in every experiment, running from 50% to 300% faster than the original Aho-Corasick. pt_BR
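The counting idea in the abstract can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the author's implementation: it uses a well-known offline technique (count how often each automaton state is visited during the scan, then push those counts along failure links in reverse breadth-first order), which may differ in detail from the algorithm in the paper. The function name `count_words` and all variable names are hypothetical.

```python
from collections import deque


def count_words(keywords, text):
    """Count occurrences (overlaps included) of each keyword in text.

    Illustrative sketch only; not taken from the paper.
    """
    # Trie stored as parallel lists; state 0 is the root.
    goto = [{}]        # goto[state][char] -> next state
    fail = [0]         # failure link of each state
    word_at = [None]   # keyword ending exactly at this state, if any

    for word in dict.fromkeys(keywords):  # ignore duplicate keywords
        if not word:
            continue  # the empty string is not a meaningful keyword
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                word_at.append(None)
            state = nxt
        word_at[state] = word

    # Breadth-first construction of failure links.
    # Depth-1 states keep fail = 0 (the root).
    order = []
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        order.append(state)
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            queue.append(nxt)

    # Scan the text once, recording only how often each state is reached.
    # No output links are followed here, so the scan cost does not grow
    # with the number of matches.
    visits = [0] * len(goto)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        visits[state] += 1

    # Offline step: a visit to a state is also a visit to every state on
    # its failure chain, so push counts from deeper states to shallower ones.
    for state in reversed(order):
        visits[fail[state]] += visits[state]

    # Report only the keywords that actually occur.
    return {
        word_at[s]: visits[s]
        for s in order
        if word_at[s] is not None and visits[s]
    }
```

For example, `count_words(["a", "aa"], "aaaa")` counts four occurrences of "a" and three overlapping occurrences of "aa". The propagation pass in this sketch touches every automaton state, so it runs in time proportional to 𝑚 + 𝑛 regardless of how many matches the text contains; how the paper arrives at its 𝑢 term is not reproduced here.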
dc.identifier.uri http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
dc.date.accessioned 2021-07-20T13:15:32Z
dc.date.available 2021-07-20
dc.date.available 2021-07-20T13:15:32Z
dc.type Undergraduate final project (Trabalho de Conclusão de Curso) pt_BR
dc.subject Aho-Corasick algorithm pt_BR
dc.subject Pattern matching pt_BR
dc.subject Correspondência de padrões pt_BR
dc.subject Filtrage pt_BR
dc.subject Coincidencia de patrones pt_BR
dc.subject Word counting pt_BR
dc.subject Recuento de palabras pt_BR
dc.subject Comptage de mots pt_BR
dc.subject Contagem de palavras pt_BR
dc.subject Algoritmo offline pt_BR
dc.subject Algorithme hors ligne pt_BR
dc.subject Algoritmo sin conexión pt_BR
dc.subject Offline algorithm pt_BR
dc.subject Processamento de textos pt_BR
dc.subject Processing of texts pt_BR
dc.subject Procesamiento de textos pt_BR
dc.subject Traitement des textes pt_BR
dc.rights Open Access pt_BR
dc.creator LUCENA, Emerson Leonardo.
dc.publisher Universidade Federal de Campina Grande pt_BR
dc.language eng en
dc.title.alternative Otimizando Aho-Corasick para contagem de palavras. pt_BR
dc.identifier.citation LUCENA, E. L. Optimizing Aho-Corasick for word counting. 12 f. Trabalho de Conclusão de Curso - Artigo (Curso de Bacharelado em Ciência da Computação) Graduação em Ciência da Computação, Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande - Paraíba - Brasil, 2020. Disponível em: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128 pt_BR

