Detect methylated DNA-bound TFs

Abstract Transcription factors (TFs) are proteins that regulate gene expression. It is generally accepted in epigenetics that methylated nucleotides can prevent TFs from binding to DNA fragments. However, recent studies have shown that some TFs can interact with methylated DNA fragments. Although biochemical experiments can identify TFs that bind to methylated DNA sequences, these wet-lab methods are time-consuming and require expensive consumables. Machine learning methods offer a fast alternative for identifying these TFs without experimental materials. This study therefore aims to design a powerful predictor for detecting methylated DNA-bound TFs. We first propose describing protein samples with tripeptide word-vector features. We then design a two-step computational model based on a recurrent neural network (RNN) with long short-term memory (LSTM) units. The first-step predictor, which uses a single bidirectional LSTM layer, discriminates TFs from other proteins (non-TFs). Once a protein is predicted to be a TF, the second-step predictor, which uses two unidirectional LSTM layers, judges whether the TF can bind to methylated DNA. On an independent test dataset, the first-step and second-step predictors achieved accuracies of 86.63% and 73.59%, respectively. In addition, we statistically analyzed the distribution of tripeptides in the training samples and inferred that the position and number of certain tripeptides in a sequence affect the binding of TFs to methylated DNA.
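The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the overlapping sliding window, and the 0.5 decision threshold are assumptions made for illustration, and the two trained LSTM predictors (together with the word-vector embedding) are represented only as placeholder callables that return a probability.

```python
from typing import Callable, List

# A hypothetical scorer: takes tripeptide tokens, returns a probability in [0, 1].
Scorer = Callable[[List[str]], float]


def tripeptide_tokens(sequence: str, stride: int = 1) -> List[str]:
    """Split a protein sequence into 3-residue "words".

    Assumption: a sliding window is used; stride=1 gives overlapping
    tripeptides, stride=3 gives non-overlapping ones.
    """
    seq = sequence.strip().upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, stride)]


def two_step_predict(
    sequence: str,
    is_tf: Scorer,
    binds_methylated: Scorer,
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> str:
    """Run the cascade: step 1 filters TFs, step 2 runs only on predicted TFs."""
    tokens = tripeptide_tokens(sequence)
    if is_tf(tokens) < threshold:
        return "non-TF"
    if binds_methylated(tokens) >= threshold:
        return "TF, binds methylated DNA"
    return "TF, does not bind methylated DNA"
```

In practice, each placeholder scorer would wrap an embedding lookup followed by the corresponding trained network (one bidirectional LSTM layer for step one, two unidirectional LSTM layers for step two). The second scorer is never called for proteins rejected in the first step.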