March 23, 2019, Filed Under: 2019

SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing

Citation: Wylie DC, Hofmann HA, Zemelman BV. SArKS: de novo discovery of gene expression regulatory motif sites and domains by suffix array kernel smoothing. Bioinformatics [Internet]. btz198.

Abstract

Motivation: We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score (fold-change, test-statistic, P-value) comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies non-parametric kernel smoothing to uncover promoter motif sites that correlate with elevated differential expression scores. SArKS detects motif k-mers by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motif sites can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing.

Results: We applied SArKS to published gene expression data representing distinct neocortical neuron classes in Mus musculus and interneuron developmental states in Homo sapiens. When benchmarked against several existing algorithms using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power.

Availability and implementation: https://github.com/denniscwylie/sarks

Contact: denniswylie@austin.utexas.edu or zemelmanb@mail.clm.utexas.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

wylie_et_al._2019.pdf
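
The core idea in the abstract, smoothing per-sequence scores over sequence similarity and then checking the result by permutation, can be illustrated with a toy sketch. This is not the SArKS implementation from the repository above; it is a simplified illustration under stated assumptions. It uses a naive suffix array, a uniform (boxcar) kernel, and the maximum smoothed score as the permutation statistic. It omits the second, spatial smoothing round and the merging of motifs within MMDs. All function names, the `half_width` parameter, and the toy sequences are hypothetical.

```python
import random


def suffix_array(text):
    # Naive construction (sort all suffixes); adequate only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def smooth_over_suffixes(seqs, scores, half_width):
    """Average sequence scores over neighbours in suffix-array order.

    Suffixes that sort next to each other share a prefix, so a uniform
    window over the sorted suffixes acts as a kernel over sequence similarity.
    """
    text, owner = "", []
    for idx, seq in enumerate(seqs):
        text += seq + "$"                      # "$" separates sequences
        owner += [idx] * len(seq) + [None]     # None marks a separator
    order = [p for p in suffix_array(text) if owner[p] is not None]
    ranked_scores = [scores[owner[p]] for p in order]
    smoothed = {}
    for rank, pos in enumerate(order):
        lo = max(0, rank - half_width)
        hi = min(len(order), rank + half_width + 1)
        window = ranked_scores[lo:hi]
        smoothed[pos] = sum(window) / len(window)
    return text, smoothed


def top_kmers(text, smoothed, k, threshold):
    # Report k-mers starting at positions whose smoothed score reaches the
    # threshold, skipping k-mers that would run across a sequence boundary.
    found = []
    for pos, value in smoothed.items():
        kmer = text[pos:pos + k]
        if value >= threshold and len(kmer) == k and "$" not in kmer:
            if kmer not in found:
                found.append(kmer)
    return found


def permutation_p_value(seqs, scores, half_width, n_perm=200, seed=0):
    # Shuffle scores across sequences to build a null distribution for the
    # maximum smoothed score (an assumed, simplified test statistic).
    rng = random.Random(seed)
    _, observed = smooth_over_suffixes(seqs, scores, half_width)
    observed_max = max(observed.values())
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        _, null = smooth_over_suffixes(seqs, shuffled, half_width)
        if max(null.values()) >= observed_max:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: the two high-scoring sequences share the substring "GTGGC".
seqs = ["ACGTGGCA", "TTGTGGCC", "ACACACAC", "TTTTCCCC"]
scores = [1.0, 1.0, 0.0, 0.0]
text, smoothed = smooth_over_suffixes(seqs, scores, half_width=1)
candidates = top_kmers(text, smoothed, k=5, threshold=1.0)
p_value = permutation_p_value(seqs, scores, half_width=1)
```

With only four toy sequences the permutation p-value is not informative; it is included to show where the shuffling step sits relative to the smoothing step.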