Please use this identifier to cite or link to this item: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128
Full metadata record
| DC Field | Value | Language |
| --- | --- | --- |
| dc.creator.ID | LUCENA, E. L. | pt_BR |
| dc.creator.Lattes | http://lattes.cnpq.br/5944567562075735 | pt_BR |
| dc.contributor.advisor1 | GHEYI, Rohit. | |
| dc.contributor.advisor1ID | GHEYI, R. | pt_BR |
| dc.contributor.advisor1Lattes | http://lattes.cnpq.br/2931270888717344 | pt_BR |
| dc.contributor.referee1 | MONTEIRO, João Arthur Brunet. | |
| dc.contributor.referee2 | MASSONI, Tiago Lima. | |
| dc.publisher.country | Brasil | pt_BR |
| dc.publisher.department | Centro de Engenharia Elétrica e Informática - CEEI | pt_BR |
| dc.publisher.initials | UFCG | pt_BR |
| dc.subject.cnpq | Ciência da Computação | pt_BR |
| dc.title | Optimizing Aho-Corasick for word counting. | pt_BR |
| dc.date.issued | 2020 | |
| dc.description.abstract | The Aho-Corasick algorithm is used to recognize all occurrences of a set of strings in a text. Its time complexity is O(m + n + o), where m is the sum of the lengths of the keywords, n is the text length, and o is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the algorithm's performance decreases. In many domains, such as information retrieval, natural language processing, and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of a set of words in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is O(m + n + u), where u is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100 MB and dictionaries of 1 KB, 1 MB, and 10 MB. The new algorithm performed better in every experiment, running from 50% to 300% faster than Aho-Corasick. | pt_BR |
| dc.identifier.uri | http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128 | |
| dc.date.accessioned | 2021-07-20T13:15:32Z | |
| dc.date.available | 2021-07-20 | |
| dc.date.available | 2021-07-20T13:15:32Z | |
| dc.type | Trabalho de Conclusão de Curso | pt_BR |
| dc.subject | Aho-Corasick algorithm | pt_BR |
| dc.subject | Pattern matching | pt_BR |
| dc.subject | Correspondência de padrões | pt_BR |
| dc.subject | Filtrage | pt_BR |
| dc.subject | Coincidencia de patrones | pt_BR |
| dc.subject | Word counting | pt_BR |
| dc.subject | Recuento de palabras | pt_BR |
| dc.subject | Comptage de mots | pt_BR |
| dc.subject | Contagem de palavras | pt_BR |
| dc.subject | Algoritmo offline | pt_BR |
| dc.subject | Algorithme hors ligne | pt_BR |
| dc.subject | Algoritmo sin conexión | pt_BR |
| dc.subject | Offline algorithm | pt_BR |
| dc.subject | Processamento de textos | pt_BR |
| dc.subject | Processing of texts | pt_BR |
| dc.subject | Procesamiento de textos | pt_BR |
| dc.subject | Traitement des textes | pt_BR |
| dc.rights | Acesso Aberto | pt_BR |
| dc.creator | LUCENA, Emerson Leonardo. | |
| dc.publisher | Universidade Federal de Campina Grande | pt_BR |
| dc.language | eng | en |
| dc.title.alternative | Otimizando Aho-Corasick para contagem de palavras. | pt_BR |
| dc.identifier.citation | LUCENA, E. L. Optimizing Aho-Corasick for word counting. 12 f. Trabalho de Conclusão de Curso - Artigo (Curso de Bacharelado em Ciência da Computação) Graduação em Ciência da Computação, Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande - Paraíba - Brasil, 2020. Disponível em: http://dspace.sti.ufcg.edu.br:8080/jspui/handle/riufcg/20128 | pt_BR |
Appears in Collections: Trabalho de Conclusão de Curso - Artigo - Ciência da Computação
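
The abstract contrasts reporting every match with merely counting words. As a rough illustration only, the sketch below shows one well-known way to count with an Aho-Corasick automaton without enumerating each occurrence: record how often the scan ends in each state, then push those tallies along the failure links once, deepest states first. This is not claimed to be the algorithm of the work itself, whose stated bound is O(m + n + u); the function name `count_words` and the list-based trie layout are assumptions made for the example.

```python
from collections import deque


def count_words(words, text):
    """Count occurrences (overlaps included) of each word in text.

    Illustrative sketch: deferred counting over failure links.
    Not the thesis's algorithm; names and layout are assumptions.
    """
    goto = [{}]      # goto[s][ch] -> child state; state 0 is the root
    fail = [0]       # failure link of each state
    visits = [0]     # number of text positions at which the scan ended in s
    end_state = {}   # word -> state at which that word ends

    # Build the trie of dictionary words (empty words are skipped).
    for w in set(words):
        if not w:
            continue
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                visits.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        end_state[w] = s

    # Breadth-first pass to set failure links; `order` lists states by depth.
    order = []
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)

    # Scan the text once, tallying only the state reached at each position.
    s = 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        visits[s] += 1

    # Propagate tallies to suffix states, deepest first, in a single pass.
    for s in reversed(order):
        visits[fail[s]] += visits[s]

    return {w: visits[s] for w, s in end_state.items()}


if __name__ == "__main__":
    print(count_words(["he", "she", "his", "hers"], "ushers"))
```

With hash-map transitions, building the automaton, scanning the text, and the final propagation each take time roughly linear in the dictionary size or the text length, so the running time of this sketch does not grow with the total number of matches.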

Files in This Item:
| File | Description | Size | Format |
| --- | --- | --- | --- |
| EMERSON LEONARDO LUCENA - TCC CIÊNCIA DA COMPUTAÇÃO 2020.pdf | | 1.49 MB | Adobe PDF |

