Tsinghua Science and Technology


Hidden Markov Model, hierarchical clustering, sequential motif, bioinformatics


Protein sequence motif extraction is an important field of bioinformatics because of its relevance to structural analysis. Two major problems are related to this field: (1) searching for motifs within the same protein family; and (2) assuming a window size for the motif search. This work proposes the Hierarchically Clustered Hidden Markov Model (HC-HMM) approach, which represents the behavior and structure of proteins in terms of a Hidden Markov Model (HMM) chain and hierarchically clusters the chains by minimizing the distance between the structure and behavior of any two given chains. It is well known that HMMs can be used for clustering; however, methods for clustering the HMMs themselves are rarely studied. In this paper, we develop a hierarchical-clustering-based algorithm for HMMs to discover protein sequence motifs that transcend family boundaries, with no assumption on the length of the motif. We examine the effectiveness of this approach for motif extraction on 2593 proteins that share no more than 25% sequence identity. Many interesting motifs are generated. Three example motifs generated by the HC-HMM approach are analyzed and visualized with their tertiary structures. We believe the proposed method provides a unique protein sequence motif extraction strategy. Related data mining fields that use Hidden Markov Models may also benefit from this approach of clustering the HMMs themselves.
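
The idea of hierarchically clustering HMMs by a pairwise distance can be illustrated with a minimal sketch. It is not the HC-HMM algorithm itself: the distance used here (a Frobenius distance over transition and emission matrices), the average-linkage rule, the `threshold` stopping parameter, and the names `hmm_distance` and `cluster_hmms` are all assumptions made for illustration, and the sketch further assumes that every HMM has the same number of states and symbols with states already aligned across models.

```python
from itertools import combinations
from math import sqrt

# Hypothetical representation: an HMM is a pair (transition, emission) of
# row-stochastic matrices stored as nested lists with identical shapes.

def hmm_distance(a, b):
    """Frobenius distance over both matrices (an illustrative stand-in metric)."""
    total = 0.0
    for mat_a, mat_b in zip(a, b):
        for row_a, row_b in zip(mat_a, mat_b):
            total += sum((x - y) ** 2 for x, y in zip(row_a, row_b))
    return sqrt(total)

def cluster_hmms(hmms, threshold):
    """Average-linkage agglomerative clustering; returns lists of HMM indices."""
    clusters = [[i] for i in range(len(hmms))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(hmm_distance(hmms[i], hmms[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        # Stop merging once the closest pair is farther apart than the threshold.
        if linkage(clusters[p], clusters[q]) > threshold:
            break
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters

# Toy two-state, two-symbol models (made-up numbers, not data from the paper).
hmm_a = ([[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]])
hmm_b = ([[0.85, 0.15], [0.25, 0.75]], [[0.7, 0.3], [0.1, 0.9]])
hmm_c = ([[0.1, 0.9], [0.9, 0.1]], [[0.2, 0.8], [0.8, 0.2]])
clusters = cluster_hmms([hmm_a, hmm_b, hmm_c], threshold=0.5)
```

In this toy run the two similar models end up in one cluster and the third stays on its own. A practical metric would also have to handle the fact that HMM states carry no canonical ordering, which this sketch sidesteps by assumption.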


Tsinghua University Press