CUNHA, M. Q.; http://lattes.cnpq.br/6398825826705311; CUNHA, Mateus Queiroz.
Abstract:
The legal domain is a promising application field for Natural Language Processing.
Official Journals contain exceptionally relevant information across various legal subdomains,
with significant implications for both public and private sectors. This study used a text classification
approach to identify tax-related publications within the Brazilian Official Journal.
While analyzing the tax-related context, we addressed the challenge of highly imbalanced
data. Our investigation culminated in the creation of an automatically annotated dataset.
Our experiments with transformer-based Large Language Models (LLMs) underscored
their suitability for classifying tax-related data in the Brazilian Official Journal. Our
study also produced evidence that introducing imbalance into the training set can lead to better
results in highly imbalanced contexts. Findings from our study indicate that encoder LLMs
remain an efficient choice, offering speed and compatibility with consumer-grade hardware.
These models remain effective even as the prevailing trend leans towards large decoder
LLMs.