Abstract
G-quadruplex (GQ) is a four-stranded DNA structure that can be formed in guanine-rich sequences. GQ structures have been proposed to regulate diverse biological processes including transcription, replication, translation and telomere maintenance. Recent studies have demonstrated the existence of GQ DNA in live mammalian cells and a significant number of potential GQ forming sequences in the human genome. We present a systematic and quantitative analysis of GQ folding propensity on a large set of 438 GQ forming sequences in double-stranded DNA by integrating fluorescence measurement, single-molecule imaging and computational modeling. We find that short minimum loop length and the thymine base are two main factors that lead to high GQ folding propensity. Linear and Gaussian process regression models further validate that the GQ folding potential can be predicted with high accuracy based on the loop length distribution and the nucleotide content of the loop sequences. Our study provides important new parameters that can inform the evaluation and classification of putative GQ sequences in the human genome.