All genome sequence data contain inherent information.
Shannon's uncertainty theory can be used to measure how
much information a sequence has. Here we show that the amount of
information in a sequence correlates with whether similar sequences
will be found in the database by search algorithms (BLAST).
Hence, a sequence with more information (higher uncertainty) has
a higher probability of being significantly similar to other sequences
in the database. Measuring uncertainty may be a rapid way to screen
for sequences likely to be similar to entries in the database, and
also to show which sequences with no known similarities are likely
to be false negatives. We also present an analysis of the amino
acid composition of each of the complete bacterial genome sequences.
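
As an illustration of the uncertainty measure described above, the following minimal Python sketch computes Shannon uncertainty, H = -sum(p_i * log2(p_i)), in bits per symbol from the symbol frequencies of a single sequence. The function name `shannon_uncertainty` is hypothetical, and the exact procedure used in this work (for example, any small-sample correction or windowing) is not stated here, so this should be read as one plausible formulation rather than the method itself.

```python
import math
from collections import Counter


def shannon_uncertainty(sequence):
    """Shannon uncertainty of a sequence, in bits per symbol.

    Assumption: each symbol's probability is estimated by its observed
    frequency in the sequence, with no small-sample correction.
    """
    seq = sequence.upper()
    if not seq:
        return 0.0
    counts = Counter(seq)
    total = len(seq)
    # -p * log2(p) written as p * log2(1/p), with p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# A sequence of one repeated symbol carries no uncertainty;
# four equally frequent symbols give the 2-bit maximum for a 4-letter alphabet.
print(shannon_uncertainty("AAAA"))  # 0.0
print(shannon_uncertainty("ACGT"))  # 2.0
```

The same function applies to protein sequences, where the maximum for the 20 standard amino acids would be log2(20), about 4.32 bits per residue.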