ARRUDA, M. M.; ARRUDA, MILENA M.; http://lattes.cnpq.br/3299838657781132; ARRUDA, Milena Marinho.
Abstract:
The growth of biological databases and the need to understand how the many
components present in a living cell are interacting and working together to perform cellular
functions are reasons that justify the interdisciplinary application of mathematical,
statistical and computational theories for the analysis and processing of genomic
information. The genetic information of an organism is encoded in deoxyribonucleic
acid molecules (DNA) by means of units called bases. The analysis and processing of
DNA sequences to obtain biological knowledge constitute the domain of this document.
The research developed aims to integrate the theory and methods of signal processing
and information theory to extract genomic information. One of the main challenges
is, therefore, to define a mapping rule to represent DNA sequences that are initially
in a symbolic domain, taking them to a numerical domain. The first result considers a
bijective one-dimensional mapping onto elements of a finite field in order to test the
hypothesis that DNA acts as a linear code in the transmission of stored information.
Under this hypothesis, an error-correcting code would underlie the DNA sequences. In this
context, a new algorithm is proposed to search for BCH codes whose codewords are
at a Hamming distance of at most one from the numerical vector resulting from the
mapping of a given DNA sequence. Furthermore, it is shown that the DNA sequences
are approximately uniformly distributed, under the Hamming metric, in a vector space
of dimension n. Therefore, the generator polynomials of the codes that identify collections
of taxonomically close sequences do not provide enough biological information to group
and classify them. The second result is based on the hypothesis that, when considering a
fixed mapping for all DNA sequences, it is not possible to guarantee that the intrinsic
characteristics of each sequence will be properly extracted. Therefore, two new algorithms
are proposed: SNR-SE and TBP-SE, both based on the spectral envelope theory to
calculate these mappings. The applicability of these methods in the context of spectral
analysis to discriminate coding and non-coding sequences of proteins is analyzed and
compared with other mappings already consolidated in the literature. In this scenario,
the proposed algorithm, TBP-SE, had the highest accuracy and sensitivity among all
mappings evaluated. This stands out because, in this application, sensitivity is especially
important: high sensitivity means a low probability that a coding sequence goes
unidentified. In addition, TBP-SE performed well even when detecting regions with shorter
coding sequences.
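
As a rough illustration of the ideas summarized above, the following Python sketch shows a symbolic-to-numerical mapping, the Hamming distance between mapped vectors, and a simple period-3 spectral measure of the kind commonly used to separate coding from non-coding regions. It is not the SNR-SE or TBP-SE algorithm, nor the BCH search procedure. The label assignment in `BASE_TO_LABEL`, the function names, and the use of binary indicator sequences for the spectrum are all assumptions made for the example.

```python
import cmath

# Hypothetical assignment: any bijection from {A, C, G, T} onto the four
# elements of a finite field would serve; the mapping used in the work
# itself is not specified in this abstract.
BASE_TO_LABEL = {"A": 0, "C": 1, "G": 2, "T": 3}


def map_sequence(seq):
    """Map a DNA string to a list of numeric labels (symbolic -> numerical)."""
    return [BASE_TO_LABEL[base] for base in seq.upper()]


def hamming_distance(u, v):
    """Count the positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over four binary indicator
    sequences (one per base) and normalized by the sequence length.

    Assumption: indicator sequences stand in for the sequence-specific
    mappings that the spectral-envelope methods would compute.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the indicator sequence at frequency 1/3.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, b in enumerate(seq)
            if b == base
        )
        total += abs(coeff) ** 2
    return total / len(seq)
```

For example, `hamming_distance(map_sequence("ACGT"), map_sequence("ACGA"))` counts a single differing position, and a sequence with strict codon-like periodicity such as `"ATG" * 10` yields a much larger `period3_power` than a homopolymer such as `"A" * 30`.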