Large-Margin Learning of Transcription Factor Binding Models from High-Resolution Data | Department of Mathematics

Event Information
Event Location: GAB 461, 4-5 PM; refreshments: GAB 472, 3:30 PM
Event Date: Monday, March 5, 2012 - 4:00pm

Abstract: Gene regulatory programs are orchestrated by proteins called transcription factors (TFs), which coordinate the expression of target genes both through direct binding to genomic DNA and through interaction with cofactors. Accurately modeling the DNA sequence preferences of TFs and predicting their genomic binding sites are key problems in genomics. These efforts have long been frustrated by the limited availability and accuracy of TF binding site motifs. Today, protein binding microarray (PBM) experiments and chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments are generating unprecedented high-resolution data on in vitro and in vivo TF binding, respectively. PBM data are available for hundreds of mammalian TFs, including broad surveys of diverse TFs and deep studies of numerous TFs in the same structural family. Meanwhile, genome-wide data profiling cell-type specific chromatin state provide additional information for predicting the genomic binding locations of TFs. We will describe flexible large-margin algorithmic approaches for representing and learning TF binding preferences from these massive data sets.

First, we investigate how DNA sequence and chromatin information can be used to predict the cell-type specific binding of TFs. We train string kernel-based SVM models directly on TF ChIP-seq data to learn in vivo TF sequence models and to investigate the cell-type specificity of TF binding profiles. In a large-scale evaluation on 184 TF ChIP-seq experiments from the human ENCODE project (comprising data from 64 TFs across two human hematopoietic cell lines), we confirmed that SVM sequence models significantly outperform existing motif discovery algorithms. We also found that SVM spatial signatures trained on cell-type specific DNase-seq data further improve TF binding prediction within the same cell type. Finally, we identified several TFs with strongly cell-type specific binding profiles that display distinct sequence signatures between cell types, showing that the DNA sequences recognized by a TF can indeed depend on cellular context.

Second, we address the classical problem of modeling how changes in the amino acid sequence of a TF's binding domain affect which DNA sequences the TF binds. We have developed a novel statistical method for predicting protein domain-DNA binding affinities by training on a set of PBM experiments for a family of structurally related TFs. Our method, called affinity regression, is a very general new approach for learning interpretable models that explain the observed data as interactions between two kinds of inputs: in this case, protein domain sequences and short DNA sequences. We applied the model to a large PBM data set measuring the binding preferences of 178 mouse homeodomain factors. We were able not only to infer the binding motifs of held-out members of the TF family but also to correctly identify which residues from different proteins are in contact with DNA, as observed in crystallographic structures.

The talk will briefly present the necessary background on biology and experimental data types in order to make the machine learning problems understandable to a mathematics and computer science audience.
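For readers unfamiliar with string kernels, the following minimal sketch shows one common example, the k-mer spectrum kernel, which scores two DNA sequences by the inner product of their k-mer count vectors. It is an illustration only: the abstract does not say which string kernel the speaker uses, and the function names, the choice k = 3, and the toy sequences are all assumptions made here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared
    # k-mers contribute to the sum.
    return sum(n * cb[kmer] for kmer, n in ca.items())


def gram_matrix(seqs, k=3):
    """Pairwise kernel values for a list of sequences."""
    return [[spectrum_kernel(s, t, k) for t in seqs] for s in seqs]


# Toy sequences invented for illustration; they are not real binding sites.
seqs = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
K = gram_matrix(seqs, k=3)
for row in K:
    print(row)
```

A Gram matrix of this kind can be handed to any SVM implementation that accepts a precomputed kernel, with bound and unbound sequences as the two classes.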