An Improved Hashing Approach for Biological Sequence to Solve Exact Pattern Matching Problems

File version

Version of Record (VoR)

Author(s)
Mahmud, P
Rahman, A
Hasan Talukder, K
Editor(s)

Haldorai, Anandakumar

Date
2023
Abstract

Pattern matching algorithms have become important in computer science because they are used in domains such as computational biology, video retrieval, intrusion detection systems, and fraud detection. Pattern matching is the task of finding one or more patterns in a given text. Two measures are commonly used to judge how well exact pattern matching algorithms work: the total number of attempts and the number of character comparisons made during the matching process. The primary focus of our proposed method is to reduce both quantities wherever possible. Hash-based pattern matching algorithms are fast, but they may suffer from hash collisions. This research improves the Efficient Hashing Method (EHM) algorithm. Although the EHM algorithm is effective, its preprocessing phase takes a lot of time and it generates some hash collisions. We propose a novel hashing method that reduces both the preprocessing time and the hash collisions of the EHM algorithm. We devised the Hashing Approach for Pattern Matching (HAPM) algorithm by taking the best parts of the EHM and Quick Search (QS) algorithms and adding a way to avoid hash collisions. The preprocessing step of this algorithm combines the bad character table from the QS algorithm, the hashing strategy from the EHM algorithm, and the collision-reducing mechanism. To analyze the performance of our HAPM algorithm, we used three types of datasets: E. coli, DNA sequences, and protein sequences. We compared our proposed method with six algorithms discussed in the literature. The Hash-q with Unique FNG (HqUF) algorithm was compared only on the E. coli and DNA datasets because it creates unique bits for DNA sequences. Our proposed HAPM algorithm also overcomes the problems of the HqUF algorithm. The new method outperforms the older ones in average runtime, number of attempts, and number of character comparisons for both long and short text patterns, although it performed worse on some short patterns.
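
As a rough illustration of the ideas named in the abstract, the following Python sketch combines a Quick Search bad character table with a hash check before character-by-character verification, and counts attempts and character comparisons. It is not the authors' HAPM or EHM algorithm: the additive `window_hash`, the function names, and the way the counters are kept are assumptions made for illustration only, and the toy hash is recomputed for every window rather than optimized.

```python
def quick_search_shift_table(pattern):
    """Quick Search bad character table.

    Maps each pattern character to the shift applied when that character
    is the text character immediately after the current window.
    """
    m = len(pattern)
    # Later assignments overwrite earlier ones, so the rightmost
    # occurrence of each character determines its shift.
    return {ch: m - i for i, ch in enumerate(pattern)}


def window_hash(s):
    """Toy additive hash (an assumption, not the EHM/HAPM hash function)."""
    return sum(ord(c) for c in s)


def search(text, pattern):
    """Return (match positions, attempts, character comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0, 0

    shift = quick_search_shift_table(pattern)
    target = window_hash(pattern)
    matches, attempts, comparisons = [], 0, 0

    pos = 0
    while pos <= n - m:
        attempts += 1
        # Compare characters only when the hashes agree; equal hashes
        # can still be a collision, so verification is required.
        if window_hash(text[pos:pos + m]) == target:
            i = 0
            while i < m:
                comparisons += 1
                if text[pos + i] != pattern[i]:
                    break
                i += 1
            if i == m:
                matches.append(pos)
        if pos + m >= n:
            break  # no character after the window to drive the shift
        # Characters absent from the pattern allow the maximum shift m + 1.
        pos += shift.get(text[pos + m], m + 1)

    return matches, attempts, comparisons


if __name__ == "__main__":
    found, attempts, comparisons = search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")
    print(found, attempts, comparisons)
```

In this sketch a hash collision shows up as a window whose hash equals the pattern's hash but whose verification fails, which adds character comparisons without producing a match. That is the cost a collision-reducing mechanism, as described in the abstract, aims to lower.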

Journal Title

Applied Computational Intelligence and Soft Computing

Volume

2023

Rights Statement

© 2023 Prince Mahmud et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Citation

Mahmud, P; Rahman, A; Hasan Talukder, K, An Improved Hashing Approach for Biological Sequence to Solve Exact Pattern Matching Problems, Applied Computational Intelligence and Soft Computing, 2023, 2023, pp. 278505
