BISPO, M. C. T.; http://lattes.cnpq.br/3907917269744642; BISPO, Magna Celi Tavares.
Abstract:
Term ambiguity is one of the factors that hinder document indexing and the
retrieval of the information a user wants. This work is based on the hypothesis
that part of this problem can be minimized by knowing beforehand the domain of the
document that contains the ambiguous terms. To determine this domain, typical
vocabularies were created by extracting terms from documents of predetermined
knowledge domains using syntactic rules. Wikipedia was used as the reference base
because it is a digital encyclopedia whose categories are defined similarly to
those of the Universal Decimal Classification (UDC), and each category contains a
large number of specific documents, a feature essential for building a
domain-specific vocabulary. The categories were chosen according to the UDC,
which comprises 10 domains and their respective subdomains. The
vocabularies obtained, called Thematic Domain Vectors (TDVs), served as
the basis for classifying new documents. To validate the TDVs,
three different experiments were performed. The first classified new
documents using the vector model, with the TDVs as the reference base. The
second used another classifier, the Intellexer Categorizer. For the third,
a vector of terms was created with Weka and used as the reference base to
classify new documents using the vector model. The results were satisfactory,
because they showed that the TDV approach
achieved better classification than the other methods. Of the 14 new documents,
it classified 10 correctly and 4 incorrectly, with a reported accuracy of 80%,
against 57% for the Intellexer Categorizer program and 50% for the classification
using the Weka-created vector of terms.
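
The classification step described above can be illustrated with a minimal sketch. It assumes that each TDV is a weighted bag of terms and that a new document is assigned to the domain whose vector has the highest cosine similarity with the document's term-frequency vector. The abstract does not specify the similarity measure, the tokenization, or the term weights, so the cosine measure, the `TDVS` toy vocabularies, and the function names below are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy domain vocabularies (term -> weight). The actual TDVs were
# built from Wikipedia categories and are not reproduced here.
TDVS = {
    "medicine": Counter({"virus": 3, "cell": 2, "infection": 3, "treatment": 2}),
    "computing": Counter({"virus": 1, "software": 3, "network": 2, "file": 2}),
}


def term_vector(text):
    """Lowercase the text, split on non-letters, and count term frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, tdvs=TDVS):
    """Return the best-matching domain and the similarity score for every domain."""
    doc = term_vector(text)
    scores = {domain: cosine(doc, vec) for domain, vec in tdvs.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    label, scores = classify(
        "The virus spread through the network and corrupted every file"
    )
    print(label, scores)
```

In this sketch the ambiguous term "virus" appears in both vocabularies, and the surrounding terms of the document decide which domain wins, which is the intuition behind using the document's domain to reduce ambiguity.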