Combining TSS-MPRA and sensitive TSS profile dissimilarity scoring to study the sequence determinants of transcription initiation

Abstract Cis-regulatory elements (CREs) can be classified by the shapes of their transcription start site (TSS) profiles, which are indicative of distinct regulatory mechanisms. Massively parallel reporter assays (MPRAs) are increasingly being used to study CRE regulatory mechanisms, yet the degree to which MPRAs replicate individual endogenous TSS profiles has not been determined. Here, we present a new low-input MPRA protocol (TSS-MPRA) that enables measuring TSS profiles of episomal reporters as well as after lentiviral reporter chromatinization. To sensitively compare MPRA and endogenous TSS profiles, we developed a novel dissimilarity scoring algorithm (WIP score) that outperforms the frequently used earth mover's distance on experimental data. Using TSS-MPRA and WIP scoring on 500 unique reporter inserts, we found that short (153 bp) MPRA promoter inserts replicate the endogenous TSS patterns of ∼60% of promoters. Lentiviral reporter chromatinization did not improve fidelity of TSS-MPRA initiation patterns, and increasing insert size frequently led to activation of extraneous TSS in the MPRA that are not active in vivo. We discuss the implications of our findings, which highlight important caveats when using MPRAs to study transcription mechanisms. Finally, we illustrate how TSS-MPRA and WIP scoring can provide novel insights into the impact of transcription factor motif mutations and genetic variants on TSS patterns and transcription levels.


INTRODUCTION
Recent advances in next-generation sequencing have broadened the molecular toolkit available to study the link between cis -regulatory grammar --the set of rules that govern the process by which DNA-encoded information is interpreted by transcription factors and other regulatory proteins --and regulatory element function. In particular, ne w methods capab le of detecting transcriptional start sites (TSSs) such as CAGE ( 1 ), Start-seq ( 2 ), GRO-cap ( 3 , 4 ), csRNA-seq ( 5 ) and STAP-seq (6)(7)(8) have allowed r esear chers to discover and investigate previously underapprecia ted fea tures of regula tory grammar. For instance, base-resolution identification of TSSs have re v ealed that enhancers, which ar e fr equently transcribed ( 9 , 10 ), strongly resemble gene promoters at the sequence level ( 4 ).
Promoters can be classified by the shape of their transcription initiation patterns. Focused (sharp) promoters have a single, well-defined TSS, while dispersed (broad) promoters have multiple TSSs that cluster closely together in linear space ( 11 ). Whereas focused promoters tend to dri v e the transcription of cell type-or stimulus-specific genes, dispersed promoters are often associated with housekeeping genes ( 12 ). Focused and dispersed initiation patterns have been linked to distinct families of transcription factors and core promoter elements ( 11 , 13 ). Furthermore, changes in transcription factor binding near TSSs have been shown to affect TSS profile shape and nucleosome positioning at promoters and to result in aberrant gene expression ( 14 ). Similarly, studies of quantitati v e trait loci (QTL) impacting TSS patterns in Drosophila melanogaster suggest that genomic variants that affect promoter TSS profile shape often increase transcriptional noise ( 15 ). Taken together, these results provide evidence that TSS profiles reflect regulatory element function ( 15 ) and that differences between transcription initiation patterns point to differences in regulatory mechanisms ( 7 , 16-17 ).
Interroga ting the rela tionship between cis -regula tory sequence and function in vivo remains laborious and low throughput, largely due to the difficulties posed by editing genomic sequences in living cells. Massively parallel reporter assays (MPRAs) overcome these limitations by enabling functional screening of thousands of sequences for regulatory activity in parallel, making them increasingly popular tools to analyze cis-regulatory element (CRE) function ( 6-8 , 18-33 ).
Broadly speaking, MPRAs either rely on RNA-seq or DNA-seq after fluorescence-activated cell sorting (FACS) to measure transcriptional output of reporter constructs. RN A-seq-based MPRA a pproaches quantify CRE activity by counting unique CRE-associated sequences encoded in the reporter RNA transcript (either synthetic barcodes ( 18 ) or the transcribed CRE sequence itself ( 23 )). FACS-based Sort-seq approaches ( 19 ) measure CRE activity by FACSsorting cells into bins based on reporter gene expression and counting relati v e plasmid CRE abundance in each bin after high-throughput sequencing. RNA-seq MPRAs are more prevalent than Sort-seq MPRAs, but Sort-seq MPRAs have played and continue to play important roles in the quantitati v e study of transcriptional regulation (e.g. ( 22 , 30 )).
One benefit of RNA-seq-based MPRAs is that they can be modified to provide data on reporter transcript initiation patterns ( 6 , 25-26 , 33 ), which are indicati v e of regulatory mechanisms ( 7 , 17 , 34 ). This opens the door for using transcription initiation profiles to study the grammar of genomic promoter sequences and to distinguish the effects of human genetic variants on transcription initiation from the effects on transcription le v els. Ther efor e, it is essential to kno w ho w well MPRAs recapitulate endogenous initiation profiles when using them to stud y regula tory mechanisms. The only prior study that touched on this subject has suggested that TSS patterns in an MPRA match endogenous TSS patterns, based on the analysis of averaged initiation signals across thousands of Drosophila promoters ( 6 ). Howe v er, it is well known that promoters exhibit a wide variety of TSS patterns ( 11 ), and no study to date has performed a detailed analysis of individual promoters. Hence, the degree to which individual MPRA constructs recapitulate the transcription initia tion pa tterns of the corresponding endogenous promoters still remains unexplored.
Her e, we pr esent the first analysis of TSS pattern fidelity of individual promoters in an MPRA compared to the genome. In the process of analyzing our data we found that previously used computational approaches to compare TSS patterns did not adequately quantify the magnitude of the observed TSS pattern differences. In an effort to better capture the degree to which MPRA inserts replicate the transcription initiation landscape of endogenous sequences, we de v eloped a windowed initiation profile (WIP) score that takes full advantage of the single-nucleotide resolution transcription initia tion da ta genera ted by 5 RNA sequencing methods. Using it to compare TSS-MPRA-deri v ed and endogenous initia tion pa tterns of individual human promoters as determined by csRNA-seq, we find that MPRA initiation profiles of short 153-bp promoter inserts correctly replicate ∼60% of the endogenous initia tion pa tterns of the human promoters we tested. Unexpectedly, for the remaining promoters, neither increasing the insert size nor lentiviral reporter chroma tiniza tion improved the fidelity of their MPRA-deri v ed transcription initiation profiles. Instead, longer inserts frequently gave rise to additional, extraneous initiation e v ents, decreasing ov erall TSS profile fidelity when comparing TSS-MPRA to endogenous initiation profiles. We discuss the putati v e causes for this observa tion and demonstra te how TSS-MPRA and TSS profile scoring can be used to study the transcriptional effects of human genetic variants. We anticipate that our new lowinput protocol, the lentiviral TSS-MPRA vector and novel sensiti v e TSS profile analysis approach will broaden the utility of TSS-aware MPRAs for studying the effect of human genetic variation. In addition, our results highlight important caveats to consider when using MPRAs to decipher DNA transcription regulatory function.

Experimental methods
TSS-MPRA plasmid designs. The episomal TSS-MPRA plasmid is based on the background-reduced pGL4.10 luciferase reporter plasmid (Promega). To allow directional cloning of oligonucleotide libraries via Gibson assembly, we replaced the multiple cloning site and Luciferase 2 gene of pGL4.10 with a new eGFP reporter cassette to quantify insert-initiated transcription by fluorescence microscopy and flow cytometry and an upstream cloning site containing dual BsaI restriction sites. The latter are flanked by an arbitrary upstream 18-nt sequence for Gibson assembly (5 -GGTAACCGGT CCAGCT CA-3 ) and downstream binding sites for Illumina TruSeq Read 2 (5 -AGACGT GT GCT CTT CCGAT CT-3 ) and RS2 re v erse transcription (RT) primers (5 -AGCGGA TAACAA TTT CACACAGGA-3 ) for r eporterspecific re v erse tr anscription and gener ation of 5 -RNA-seq libraries, respecti v ely. The resulting pTSS-MPRA-Empty plasmid was sequence-verified by Sanger sequencing.
The pLenti-TSS-MPRA plasmid for the lentiviral integration experiments was generated by replacing the multiple cloning site of pLS-SceI ( 35 ) between the SbfI and AgeI sites with a cassette containing dual BsmBI restriction sites and flanking upstream cloning and downstream TruSeq Read 2 and RS2 RT primer landing sites described above by Gibson assembly. The resulting pLenti-TSS-MPRA-Empty plasmid was sequence-verified by Sanger sequencing. The 200-bp PCR product was gelextracted from a 2% TBE / agarose gel and isolated using a PureLink Quick Gel Extraction Kit (Invitrogen; K210012). 400 ng of pTSS-MPRA-Empty or pLenti-TSS-MPRA-Empty plasmid was digested with BsaI or BsmBI-v2, respecti v ely, in 20 l [1 l BsaI (10 U), 2 l CutSmart Buffer (NEB) or 1 l BsmBI-v2 (10 U), 2 l NEB3.1 (NEB)], respecti v ely, at 37 • C for 1 h. Linearized plasmid was gelextracted and cleaned up using PureLink Quick Gel Extraction Kit (Invitrogen; K210012). Amplified oligo library was Gibson-assembled into cut plasmid using NEBuilder HiFi DN A Assembl y Master Mix (NEB) with a 5-fold molar excess of the library at 50 • C for 1 h in 4 l total volume.

Supplementary Note 1:
The amount of insert oligonucleotide pool used as starting material for PCR depends on the sequence complexity and size of the pool. The 2000-insert pool analyzed here is complex enough and small enough that the c loning pr ocedur e described is sufficient to g ener ate a plasmid library pool that faithfully represents the diversity of the synthesized oligonucleotide pool. However, when amplifying lar g e, lo w-complexity oligonuc leotide pools (containing e.g. lar g e number s of sequences that differ only in a handful of positions, such as when mutating transcription factor motifs) (e.g. 18 000 oligonucleotides), with increasing PCR cy c les and DNA amounts, partially elongated PCR products accumulate that can cross-hybridize to highly similar oligonuc leotides . Elongation of these cross-hybrids leads to chimera formation, w her e the upstr eam sequence does not match the do wnstr eam barcode of the original oligonuc leotides . This occur s mor e fr equently at high DNA concentr ations, lik el y when the PCR reaction goes into saturation earl y. Theref ore, when amplifying lar g er oligo pools with many similar sequences, it is recommended to either start with much less template, e.g. 10 pg oligo pool / 50 μl reaction and a single-step or twostep PCR, or emulsion PCR ( 36 ). Either appr oach lo w er s the star ting concentr ation of the PCR template and will r equir e scaling up the PCR volume to g ener ate sufficient amounts of DNA for cloning plasmid libraries with good representation of the synthetic oligo pool. In either case, w e r ecommend confirming a low chimera level in the cloned oligo pool plasmid libr ary by pair ed-end sequencing and aligning the cloned sequences to the original sequences before proceeding .
350-bp insert library cloning. One-hundred nanograms of a 350-nt DNA oligonucleotide pool containing 500 insert / barcode combinations (IDT) (Supplementary Table  S2) were PCR-amplified in 50 l volume [25 l NEBNext Ultra II Q5 Master Mix (NEB), 0.25 l 100 M pMPRA1-LH (5 -GGTAACCGGT CCAGCT CA-3 ), 0.25 l 100 M pMPRA1-RH (5 -CGT GT GCT CTT CCGAT CT-3 )] with the following conditions: 30 s at 98 • C, followed by 15x cycles of [10 s a t 98 • C , 20 s at 65 • C and 15 s at 72 • C], and finally 2 min at 72 • C. The PCR product (350 bp) was run on a 2% TBE / agarose gel, then gel-extracted and cleaned up using PureLink Quick Gel Extraction Kit (Invitrogen; K210012). Four-hundred nanograms pTSS-MPRA-Empty plasmid were digested with BsaI in 20 l [1 l BsaI, 2 l CutSmart Buf fer (NEB)] a t 37 • C for 1 h. Cut plasmid was gel-extracted and cleaned up using PureLink Quick Gel Extraction Kit (Invitrogen; K210012). Amplified library was Gibson-assembled into cut plasmid using NEBuilder HiFi DN A Assembl y Master Mix (NEB) with a 5-fold molar excess of the library at 50 • C for 1 hour in 4 l total volume (higher complexity library assembly should be carried out in higher volumes).  Plasmid electr opor ation. Three million K562 cells per electroporation were centrifuged (6 min, 300 g, RT), supernatant discarded and the cell pellet resuspended and centrifuged twice in 10 ml reduced-serum Opti-MEM medium (Thermo) with the same centrifugation conditions. Cells were then resuspended in 200 l Opti-MEM per electroporation and mixed with 10 g of plasmid insert library. The Opti-MEM + insert library was deposited into a 4 mm cuvette and electroporated with a BioRad GenePulser Xcell with a single exponential decay pulse at 200 V, 1000 uF, ∞ Ohm resistance. Electroporated cells were quickly transferred into 2.8 ml of 10% FBS / RPMI 1640 and incubated a t 37 • C , 100% humidity, 5% CO 2 for 24 h.

450-bp
Lentivir al tr ansduction. HEK293T cells wer e cultur ed in 6-well pla tes tha t had been coa ted with pol y-D -l ysine for 20 min at room temperature (RT), washed 4 × with 1 ml PBS and filled with 1 ml of media to avoid drying out the plates. The da y bef ore transfection, 600k HEK293T cells were plated per well. HEK293T cells were transfected the next morning with Lipofectamine 3000 using insert library, pMD2.G, and psPAX2 plasmids at a ratio of 4:1:3 (1.5 g of insert library). Media was changed ∼18 h after transfection. Viral supernatant was collected in 15 ml centrifuge tubes 48 h after replacing media and replaced with 3 ml of fresh media per well and stored a t 4 • C . Viral supernatant was collected a second time 24 h later (72 h after first media change). Twenty-four h before transduction, 200k K562 cells per well were plated in a 6-well plate in 3 ml of 10% FBS / IMDM. Lentiviral supernatant was centrifuged at 1000 g for 5 min and the supernatant was passed through a 0.45 m syringe filter to remove cell debris. K562 cells were transduced by adding 1.5 ml of unconcentr ated vir al supernatant and 8 g of polybrene to each well, and spinoculating at 600 × g for 90 min at 37 • C. After spin-fection, 1.5 ml of fresh media was added to each well and the cells cultured for 24 h in a humidified incubator at 37 • C, 5% CO 2 . The virus-containing media was replaced with 3 ml of fresh media and cells were cultured for another 48 h before being collected for downstream experiments.
RNA extraction. Cells were collected and resuspended in 250 l ice-cold PBS. Trizol LS (750 l) was added to each sample and incubated for 5 min at RT. 230 l of chloroformisoamyl alcohol was added to each and incubated for 5 min at RT. Samples were then centrifuged for 15 min (13 000 × g, 4 • C). Supernatant containing the RNA was transferred to a new 1.5 ml tube and 1 l of GlycoBlue (Thermo) + 1 / 10th volume of 3 M sodium acetate was added. After mixing quickly, 1 × volume of isopropanol was added and inverted 10 ×, then spun down briefly. Samples were incubated overnight a t −20 • C , then centrifuged for 30 min (25 000 × g, 4 • C) the next day. Supernatant was discarded, RNA pellet w as w ashed 1 × with 75% ethanol, and then air-dried for 3 min at RT. The RNA pellet was resuspended in 30 l of TLoET [0.05% Tween, 0.1 mM EDTA, 10 mM Tris pH 7.5].
DNA extraction. The lower Trizol phase from RNA extraction was spun down at 12 000 × g for 10 min and any remaining aqueous layer overlaying the interphase was removed. DNA was precipitated by adding 400 l of 100% ethanol to each sample, inverting tubes 10 times, incubating samples at RT for 3 min and 5-min centrifugation at 2000 × g, 4 • C. The interphase and phenol layer were removed and the DNA pellet was washed by incubation with 1 ml of 0.1 M sodium citrate in 10% ethanol (pH 8.5) for 30 min. The pellet was centrifuged for 5 min at 2000 × g, 4 • C and the supernatant was discarded. The sodium citrate wash was repeated once before washing the pellet in 1.5 ml 75% ethanol for 20 min. The pellet was centrifuged for 5 min at 2000 × g, 4 • C, the supernatant was discarded, and the DNA pellet was air-dried for 10 min. The pellet was dissolved in 100 l 8 mM NaOH and the solution centrifuged for 10 min at 12 000 × g, 4 • C to remove insolubles. The supernatant containing the DNA was transferred to a clean tube and its pH was adjusted to a pH of 8.3-8.4 using HEPES pH 8.0.

PAGE 5 OF 20
Nucleic Acids Research, 2023, Vol. 51, No. 15 e80 Capped MPRA 5 RNA-seq. Total RNA (5-15 g) in 15 l TLoET was denatured at 75 • C for 90 s, and ra pidl y chilled on wet ice. To dephosphorylate non-5 -capped 5 ends and sim ultaneousl y degrade carried-over DNA, we added 35 l calf intestinal phosphatase mix [25.25 l 0.05% Tween 20, 5 l CutSmart Buffer (NEB), 0.75 l SUPERase In ™ RNase Inhibitor (20 U / l, Thermo Fisher Scientific), 0.5 l RQ1 DNase (Promega), 2 l Quick CIP (NEB)], mixed well and incuba ted a t 37 • C for 1 h. To improve 5 enrichment, RNA was briefly denatured a second time at 75 • C for 30 s, quickly chilled on ice for 2 min and incubated at 37 • C for an additional 30 min. Following dephosphorylation of uncapped RNAs, r eactions wer e inactiv ated b y mixing with 500 l Trizol LS, then 140 l TLoET and 140 l CHCl 3 + IAA (24:1, Sigma) were added. Samples wer e vortex ed vigorously and centrifuged for 10 min at 12 000 g, RT (21 • C). Following phase separation, the upper layer was transferred to a new tube and RNA precipitated by mixing with 1 / 10 volume 3 M NaOAc pH 5.5 and 1 volume 100% isopropanol. The mixture was incubated at −20 • C for at least 20 min to overnight, and the RNA precipitated by centrifuging for 30 min at 21 000 g, 4 • C. The supernatant was discarded, remaining liquid collected by brief centrifugation and removed prior to washing the RNA pellet with 400 l 75% EtOH by inversion. Samples were centrifuged for 5 min at 21000 g, 4 • C, the supernatant completely removed and the pellet air-dried at RT for 5 min.
RNA pellets were resuspended in 5 l TLoET, denatured at 75 • C for 90 s, then quickly chilled on ice. RNA was decapped by adding 10 l Decapping Master Mix l Sera-Mag SpeedBeads (Cytiva), 34 l 5 M NaCl, 37.5 l 40% PEG], beads were collected on a magnet, washed 2x with 80% ethanol and air-dried at RT for 12 min. DNA was eluted by resuspending the beads in 10 l LoTE [1:4diluted Tris-EDTA buffer pH 7.5]. DNA size distribution was checked by running the DNA on a 10% PAGE / TBE gel (Nov e x) and staining with SYBR Gold (Invitro gen). DN A larger than 125 bp up to a size of (insert length + 121 bp (adapter size)) were size-selected and isolated using Zymo Clean & Concentrator-5 columns (Zymo Research) and sequenced paired-end on an Illumina NextSeq 500. An
Quantification. TSS frequencies for both TSS-MPRA and csRNA-seq data were calculated by counting the number of times the 5 end of read 1 overlapped each nucleotide. Total RNA abundance was calculated as the total number of TSS-MPRA-seq r eads (r ead 2) that mapped to the unique barcode associated with each sequence. DNA abundance was calculated as the total number of DNA-seq reads (read 2) that mapped to the unique barcode associated with each sequence.
RNA abundance normalization. After quantification, total RN A and DN A abundances were transformed into counts per million reads sequenced (CPM) to account for differences in sequencing depth. CPM-transformed insert RNA values were then normalized by dividing RNA abundance by the corresponding DN A abundance (RN A / DN A), i.e. transcript le v els of each insert were expressed as the ratio between its total CPM-transformed RNA abundance divided by total CPM-transformed DNA abundance. T r ack visualization. TSS frequencies for all inserts were quantified as described above, and single-nucleotide bedgraph files were generated using a custom script (see GitHub). All bedgraph files were visualized using CoolBox ( 37 ).
Barcode replicate variance analysis. To measure the degree of dispersion in transcription le v els of barcode replicates we zero-centered transcription le v els by subtracting the median Lo g2 DN A-normalized RN A count of each insert group (each insert group includes 4 inserts made up of barcode replicates) from each insert's RNA count value within the group. The standar d de viation of the resulting value distribution of all zero-centered insert values was then calculated and outliers determined as those inserts that had DNAnormalized transcription le v els at least 3 standard deviations from the mean of the insert group.
Calculating WIP similarity scores. The windowed initiation profile (WIP) scoring metric quantifies the dissimilarity between two transcription initia tion pa tterns by comparing the position-specific initiation signal between two data sets. The algorithm employs a sliding window approach to scan across two gi v en TSS distributions and then sums up the absolute differences in relati v e initiation signal for each window. This is done for increasing window sizes, starting with a size of 1 (here: ranging from 1 to 5 base positions). The sum of the combined dif ferences a t each window size multiplied by the window size makes up the WIP score, which sensiti v ely quantifies the degree of di v ergence between two TSS profiles (Supplementary Figure S4).
Computationally, TSS distributions ar e r epr esented as 1D arrays of identical length. Each value in the array r epr esents the position-specific raw signal (here: 5 RNA-seq sequencing reads that start at that nucleotide position within the TSS distribution). Raw values are transformed to counts per million (CPM) to account for sequencing depth differences between experiments. To ensure that comparisons between arrays are independent of overall signal intensity, arrays are further normalized so that all values within the array sum up to one. Dissimilarity between two 1D arrays is computed by sliding a window of length k across both arrays, calculating the absolute differences in normalized array values at each window and adding them to a running k-diff score. Each final k-diff score is multiplied by the length of the window ( k ) and then added to a diffsum score. This process is repeated a total of k times ( k = 5 in this manuscript). The final diffsum score computed after this iterati v e process is equivalent to the WIP score.
WIP scor es ar e a useful metric to quantify how different two (experimentally defined) TSS profiles are, but by themselves are not sufficient to determine whether two distributions are significantly different in relation to the other experimental r esults. Ther efor e, to determine whether two TSS distributions are significantly different from each other, we used csRNA-seq replicate datasets as ground truth to fit a novelty detection model. Our assumption was that there should be some minimal variation in TSS distribution between identical gene promoters in any two replicates, and that any large differences outside of that variation could be considered as outliers. We first examined the 300 bp sur-rounding annotated transcription start sites ( −150 / +150 from the TSS) and quantified TSS frequencies as described above. We calculated WIP scores for identical gene promoters between two replicates (after filtering out promoters with < 25 reads). We then trained a one-class support vector machine (OCSVM) model ( 38 ) on the WIP scores and mean quantile-normalized RNA abundance counts from each promoter replicate. Using this novelty detection model we can determine whether a new data point (WIP score and mean quantile-normalized RNA abundances between any two TSS distributions) is a member of a specific class or not (i.e. significantly different from the normal variation observed in replicate csRNA-seq datasets or not). The output of our model is the classification score of a new data point being an outlier (where 0 is completely within the normal variation observed between replicate datasets, and 1 is completel y outside), w herein data points closer to 1 r epr esent two significantly different TSS profiles. For most purposes in this manuscript, unless stated otherwise, we assume that any two TSS profiles with an outlier score greater than 0.5 ar e consider ed to be significantly differ ent.
Initiation pattern r epr oducibility analysis . To calcula te the reproducibility of TSS-MPRA initiation patterns across r eplicates, we compar ed the WIP scor es of thr ee groups: (A) scores between identical promoters in two TSS-MPRA replicate experiments (true positives), (B) average scor es between TSS-MPRA bar code r eplicates (quadruplicates within each experiment) and (C) average scores between non-bar code-r eplicates (all against all within experiment) (true negati v es).
WIP scor e vs ear th mover's distance comparisons . We first calculated the WIP scores and Earth Mover's Distance between each insert in the TSS-MPRA against all 2000 inserts (500 sequences barcoded 4-fold) within the same experiment, in two replicate experiments, for a total of 4000 separate comparisons. We then ranked the dissimilarity score lists for each sequence against all sequences in each experiment and counted the number of times that a gi v en sequence's bar code r eplicates appear ed in the top 4 most similar (lowest) scores in each list. To calculate Earth Mover's Distance, we used three different methods: (i) EMD using the Python Optimal Transport (POT) library ( https://pythonot.github.io/ ), (ii) EMD using the pyEMD library ( https://github.com/laszukdawid/PyEMD ), EMD using Scipy's Wasserstein Distance function and (iii) a custom function. We observed that EMD calculations between POT, pyEMD and Scipy vary widel y, w hile results from Scipy and our custom EMD calculation were identical.
OCSVM model accurac y anal ysis. To measure the accuracy of our OCSVM model, we w ork ed under the assumption that there should be little differences between the WIP scores of identical promoters in replicate csRNA-seq and TSS-MPRA experiments. WIP scores between csRNA-seq vs TSS-MPRA were calculated for the same promoters in each replicate separately and used as input, along with the median RNA expression of each promoter, for our OCSVM model. We then defined the accuracy of our model as the

PAGE 7 OF 20
Nucleic Acids Research, 2023, Vol. 51, No. 15 e80 frequency in which the same promoter is labeled differently (e.g. inlier in replicate 1, but outlier in replicate 2) by the model.

Sequencer size bias normalization.
To account for the bias in sequencing shorter inserts versus longer inserts we fit a spline to the discrete NextSeq size bias profiles from Gohl et al. ( 39 ) and used the interpolated size bias as scaling function to normalize the read counts at each bp for track visualization.
Calculating focus ratios. To quantify the spread of a TSS profile, focus ratios are calculated as the number of reads within 10 bp of a TSS location of interest divided by the total number of reads within a region (here: TSS-MPRA insert size). Focus ratios range from 0 to 1 where 0 indicates an extremely dispersed TSS profile while 1 indicates a highly focused (single-nucleotide) TSS profile.

TSS-MPRA, a simplified TSS-a w ar e MPRA protocol with low input r equir ements
Differ ent transcription r egulatory mechanisms ar e associated with distinct promoter transcription initia tion pa tterns ( 7 , 16 , 17 ). Yet, despite widespread use of MPRAs to study transcription regulation for over a decade, it remains unclear how well transcription initiation patterns of promoters in MPRA reporter constructs recapitulate those of the endogenous promoters in vivo . Only a single study has attempted to compare transcription initiation in vivo with initiation at corresponding inserts in an MPRA. Arnold et al. ( 6 ) analyzed averaged initiation signals of thousands of drosophila genome fragments in STAP-seq and found that the main averaged initiation site aligned with gene TSS annotated in FlyBase. While this indica tes tha t the averaged main TSS of the majority of promoters in MPRAs coincides with the main TSS positions of genes in vivo , this analysis does not address whether the initia tion pa tterns of individual MPRA inserts recapitulate the corresponding initiation patterns --and hence regulatory mechanisms --of the corresponding endogenous sites.
To determine TSS profiles of synthetic DNA inserts and assess promoter transcription initiation fidelity of individual inserts relati v e to the genome in an MPRA, we devised an MPRA insert barcoding scheme with 11-mer barcodes immediately downstream of the genomic sequence that are synthesized together as part of one oligonucleotide (Supplementary Figure S1A). Cloned into a reporter plasmid, these bar codes ar e transcribed as part of the 5 UTR of the reporter transcript. This enables interrogating the effect of DNA sequence variants upstream of the reporter transcript on the location of transcription initiation (Figure 1 , Supplementary Figure S1C). A barcoded variant of STAP-seq with an almost identical barcoding approach was published while this manuscript was under re vie w ( 8 ) (Supplementary Figure S1B). This type of 5 reporter transcript barcoding is broadly similar to two previously published methods ( 25 , 26 ) but with important differences. Lubliner et al. encoded barcodes as silent mutations in the N-terminal 12 codons of the reporter gene, taking up a 36-nt portion of each designed oligonucleotide insert used to clone each promoter construct. Together with a 10-nt A-rich tract, this limited the effecti v e insert size of the promoter portion of each synthetic 200-nt oligonucleotide to 118 nucleotides (Supplementary Figure S1B). In contrast, the MASTER method ( 26 , 32 ) uses randomized oligonucleotides (7-10 nt random sequence at the TSS of a minimal promoter and 15-20 nt random sequence downstream of the TSS) (Supplementary Figure S1B). This provides full coverage of all possible sequence variants but is only feasible for short stretches of random sequence (at 100x RNA-seq coverage, the diversity of e.g. a 15-mer random region (1.07 × 10 9 × 100 = 107 × 10 9 ) would already exceed the capabilities of the most recent Illumina sequencer (52 × 10 9 reads)) and does not permit introducing defined nucleotide changes throughout larger inserts.
For high-throughput cloning of DNA inserts, we generated promoter-less deri vati v es of the commercial pGL4. To enable performing TSS-MPRA from limited amounts of RNA after e.g. lentiviral transduction, we devised a library prep procedure similar to the one used by Lubliner et al. ( 40 ), which generates a sequencing library from total RNA by ligating a 5 adapter selecti v ely to capped RNAs followed by reporter-specific re v erse transcription and amplification (Supplementary Figure S1C). This allows TSS-MPRA to be performed from 5-10 g total RNA, or 2-3 million cells, which compares favorably to the ≥ 40 million cells r equir ed to isolate the 10 g of mRNA starting material needed for STAP-seq library preparation ( 7 ).

Moder ate agr eement betw een MPRA-derived and endogenous initiation profiles and transcription levels of human promoters in an episomal MPRA setting
To test TSS-MPRA performance, we designed a library of acti v e TSSs as determined by csRNA-seq ( 5 ) in K562 erythroleukemia cells. Half of the sequences were randomly chosen regulatory elements covering a range of transcriptional activities and TSS profile shapes, the other half were sequences designed to test the effect of mutating transcription factor or core promoter motifs and single nucleotide polymorphisms (SNPs) on initiation patterns and transcription le v els. Each 200-bp synthetic sequence in the library contained a 153-bp genomic sequence insert (from −110 to + 43 bp around the most acti v e TSS within a gi v en window), a unique 11-mer barcode ( 41 ) as well as flanking 18-bp homology arms for directional cloning via Gibson assembly (Figure 1 A). To account for potential barcode-specific effects, each sequence was barcoded 4-fold redundantly, for a total of 2000 inserts. We refer to this oligonucleotide pool as epi-short throughout the text.
Epi-short inserts were PCR-amplified and shotguncloned into the pTSS-MPRA plasmid. The plasmid pool was electroporated into K562 cells in duplicate and electropora tion ef ficiency / viability verified by fluorescence microscopy ( > 80% cells eGFP + with ∼60% cell viability as measured by trypan b lue e xclusion). Total RNA was isolated after 24 h, and reporter-specific 5 RNA-seq libraries were prepared from 10 g total RNA. Briefly, nonca pped RN A was dephosphorylated with calf intestinal phosphatase, mRNAs were decapped with RppH 5 pyr ophosphohydr olase to leave a 5 monophosphate, and an Illumina-compatible 5 RN A ada pter was ligated to these previousl y ca pped RN A 5 ends. Plasmid-derived transcripts were selecti v ely re v erse-transcribed with a reporter transcript-specific re v erse transcription primer and the cDNA PCR-amplified with Illumina-compatible primers (Supplementary Figure S1C). In parallel, we generated input DNA-seq libraries from DNA isolated from the same cell lysates by PCR to normalize TSS-MPRA RNA signal to the plasmid le v els in the cells. After sequencing, 96% of TSS-MPRA RNA and 99% of DNA reads aligned to library inserts, with 95% of inserts exhibiting quantifiab le le v els of transcription (defined as having > 25 mapped reads). As exemplified for the sequence surrounding the HBE1 promoter locus, TSS-MPRA can reliably capture transcription initiation frequencies and locations from DNA inserts (Figure 1 A). RNA, DNA and DNAnormalized RNA le v els between replicate experiments were highly reproducible ( > 0.99 for both; Figure 1 B, Supplementary Figure S2), with little transcript-le v el variance between bar code r eplicates within insert groups (SD = 1.4-fold around the median; see Methods) (Supplementary Figure S3).
Next, we determined how promoter activity measured by TSS-MPRA compares to a previously published non-TSS-aware promoter MPRA that employs 3 barcoding, SuRE-seq ( 42 ). While promoters in a reporter plasmid lack their nati v e chr omatin envir onment, the only previous comparison of SuRE-seq and nascent transcription initia tion da ta measured by GRO-cap has shown that MPRA da ta replica tes endogenous promoter activity surprisingly well ( r = 0.43) ( 42 )). The transcription le v els of the 250 random regulatory elements in our epi-short plasmid pool measured by TSS-MPRA and genomic transcription initiation as measured by csRNA-seq correlated to the same degree (Spearman = 0.43, Peason r = 0.44; Figure 1 C), suggesting that 5 barcoding of reporter tr anscripts yields compar a ble information a bout promoter activity as the more widely used 3 barcoding MPRA strategy.
The randomly chosen inserts of the epi-short pool contain genomic loci annotated as either promoters or putati v e enhancers. Segr egating TSS-MPRA r esults by promoter and enhancer annotation re v ealed that promoters had higher le v els of correlation with csRNA-seq than enhancers (Promoter = 0.30, Enhancer = 0.16; Figure 1 C).
Taken together, our data indicate that TSS-MPRA recapitulates transcription le v els of promoters to a similar degree as a published, non-5 -aware MPRAs, while proper enhancer transcription may be more frequently dependent on additional sequence features not encoded within the short, 153-bp pieces of genomic sequence tested.

Windowed initiation profiling scores sensitively quantify differ ences betw een TSS profiles
Next, we set out to compare TSS-MPRA and endogenous transcription initiation profiles. Historically, two different approaches have been used to char acterize tr anscription initia tion pa tterns: promoters are either classified based on the 'shape' of their TSS distribution using rule-based methods, or pairwise comparisons are carried out using a dissimilarity metric to quantify the difference between the TSS distributions of individual promoters. The first global analysis of TSS patterns by Carninci et al. used a rule-and quantilebased classification scheme on CAGE data to stratify promoter TSS patterns into four classes: single-peak patterns were classified as 'sharp', while broad TSS patterns were further separated into three subclasses (without peak, with one dominant peak or multimodal) ( 11 ). While this type of analysis helps cluster promoters with similar distributions together, it does not offer an easy way to quantify how different the TSS patterns of individual promoters are from one another. In 2011, Zhao et al. extended Carninci's classification system by modifying a non-parametric distance metric based on the Minimum Difference of Pair Assignments (MDPA) ( 43 ) to measure how dissimilar one TSS distribution is from another, which they called the Generalized Minimum Distance of Distributions (GM-distance) ( 44 ). This approach offers a quantifiable metric that can be used to describe how similar or different the transcription initia tion pa tterns of two promoters are. Since then, the majority of analyses comparing TSS distributions have used some version of MDPA or the similar Earth Mover's Distance (EMD) (45)(46)(47)(48).
We initially benchmar ked se v eral distance similarity metrics, including EMD ( 45 ), GMDP ( 49 ), Pearson Correlation, Cosine Similarity, Dynamic Time Warp, and Root Mean Squared De viation. Howe v er, none of them performed satisfactorily to capture the often subtle but clearly visible differences in TSS patterns. To more accurately assess the degree to which TSS-MPRA mirrors endogenous transcription initiation patterns, we de v eloped a ne w metric called the 'windowed initiation profile' (WIP) score. WIP scores quantify the degree of dissimilarity between two initia tion pa tterns by summing up the weighted differences in relati v e magnitude of the initiation signal at each nucleotide position within sliding windows of increasing size across a region. By summing up weighted differences at different scales, this approach takes the discontinuous, but hierarchical nature of transcription initiation patterns into account (Supplementary Figure S4). Benchmarking the WIP score against EMD, a popular metric used to compare TSS profiles ( 44 , 48 ), on TSS-MPRA replicate data, we found the WIP score metric to more frequently identify differently barcoded insert replicates within each MPRA experiment (Supplementary Figures S5-S7).
While the WIP score provides an excellent metric for initiation profile dissimilarity, whether two TSS profiles can be considered 'significantly' different from each other depends on how replicable profile shapes are in general, which is dependent on signal and noise le v els and ov erall assay variability. To determine when two TSS shapes should be considered 'significantly' different, we trained a novelty e80 Nucleic Acids Research, 2023, Vol. 51, No. 15 PAGE 10 OF 20 detection model based on a one-class support vector machine (OCSVM) on replicate genome-wide transcription initia tion da ta set measured by csRNA-seq, using the WIP scores and average transcription le v els for all RefSeqannotated gene promoters to experimentally define dispersion between biological replicates (Figure 2 A; see Materials and Methods). Based on the score distribution, we call TSS profile shapes different with an SVM outlier score of < 0.5.
A ppl ying WIP score anal ysis to TSS profiles of TSS-MPRA inserts and their corresponding endogenous profiles as measured by csRNA-seq, we found that 63% of the 250 randomly chosen regulatory elements in the epi-short pool recapitulated the corresponding endogenous transcription patterns. Analysis of the csRNA-seq TSS profiles of the 37% regula tory elements tha t did not recapitula te endogenous TSS profile shapes showed that they had more dispersed initia tion pa tterns and lower av erage le v els of transcription (Figure 2 B, C).
To gain insights into the sequence features of the two different classes of inserts, we analyzed their nucleotide frequency distributions. Cumulati v e TSS-centered analysis of the nucleotide frequencies surrounding each TSS re v ealed that the TSSs within TSS-MPRA inserts that recapitulated endogenous TSS profile shapes exhibited nucleotide frequency patterns consistent with Initiator and upstream T AT A box motif enrichment at 0 nt and −30 nt relati v e to the TSS, respecti v ely (Figure 2 D, upper right panel). These two core promoter elements are also enriched at regions surrounding endogenous sites of transcription initiation (Figure 2 D, upper left panel) and are known to be associated with focused, high-le v el transcription initiation ( 12 ). Similarly, individual TSSs that wer e mor e active in TSS-MPRA inserts with TSS patterns different from the endogenous TSS profiles were also enriched for T AT A and Inr-like nucleotide frequencies (Figure 2 D, lower left panel). In contrast, individual TSSs that were more acti v e in csRNA-seq than in TSS-MPRA and tha t were loca ted in inserts that did not recapitulate the endogenous TSS profile in TSS-MPRA had an overall mor e p yrimidine-rich profile upstr eam, including at the T AT A box location at −30 nt, and a bias for G as the first tr anscribed nucleotide, r ather than the typical A nucleotide (Figure 2 D, lower right panel).
We also found that the transcription le v els of inserts with lower WIP scores (i.e. with initia tion pa tterns more similar to the endogenous locus) correlated better with endogenous transcription le v els than sequences with higher WIP scores (Supplementary Figure S8), indicating a relationship between TSS profile shape and transcription le v els. While the majority of sequences in our TSS-MPRA dri v e accurate initia tion pa tterns, a significant fraction of tested sequences gave rise to divergent TSS shapes. These sequences likely use alternati v e regulatory mechanisms when in the context of a reporter plasmid, and mechanistic studies of their function in a reporter may lead to misleading results.

Increasing the size of oligonucleotide inserts in the MPRA does not lead to more accurate initiation patterns or expression levels
The influence of insert size on MPRA-deri v ed TSS profiles or TSS strength has not been studied before. Gi v en that core promoter sequences ( −40 to + 40 from the TSS) are typically sufficient to determine where transcription initiates ( 50 ), we hypothesized that epi-short TSS-MPRA inserts that did not replicate endogenous initiation profiles and had weaker evidence for T AT A and Inr elements lacked important sequence features outside of the -110-to + 43-nt sequences captured by the epi-short plasmid pool.
To determine whether increasing the insert size of cloned sequences would lead to a more accurate transcription initiation landscape we designed a second oligonucleotide pool ( epi-long ) with longer versions of the same 250 randomly chosen promoters as the epi-short pool. By taking advantage of the longest commercially available oligonucleotide synthesis process (350 nt length), we designed an epi-long library with 303 nt of genomic sequence. Each 350 nt insert consisted of 75 nt of additional flanking genomic sequence on either side of the original epi-short insert sequence for 303 nt total insert length, plus 47 nt for downstream 11-mer barcodes and flanking homology arms for Gibson assembly as above. Each genomic insert sequence was barcoded 2-fold redundantly, for a total of 500 inserts.
Performing TSS-MPRA, we found that the increased insert size of the epi-long plasmid pool did not improve the TSS-MPRA initia tion pa tterns of the 37% of promoter inserts that did not previously match their endogenous profiles in the epi-short libr ary. Additionally, the over all fr action of promoters that recapitulated endogenous initiation patterns by WIP / SVM scoring decreased from 63% in the epi-short pool to 25% in the epi-long pool. This drop in fidelity was mostly the result of additional TSSs that were not present in the corresponding csRNA-seq data. These exogenous TSSs occurred outside and mostly downstream of the 153-bp 'core' insert sequence present in the epi-short pool (Figure 3 A). Consequently, restricting the WIP analysis to the 153-bp core region of epi-long sequences increased the percentage of inserts with similar initiation patterns to their endogenous loci from 25% to 41%, with the majority of core initiation patterns not being affected by the genomic sequence added to the epi-short inserts. Decrease in initiation pattern fidelity was accompanied by a drop in correlation between transcription le v els of the epi-long constructs and csRNA-seq ( = 0.35; Figure 3 B), which remained the same w hen onl y quantifying initiation in the 'cor e' r egions of the long inserts (Figure 3 C).
W hen stra tifying inserts by their genomic annota tion we observed that with increasing insert size, the transcription le v els of enhancers, but not of promoters, correlated better with endogenous transcription le v els (Figures 1 C, 3B ). This indica tes tha t accura te enhancer transcription (and potentially function) relies on additional sequence information encoded within longer stretches of genomic DNA than promoters, such as bidirectional initiation sites ( 51 ).

Chromatinization by lentiviral transduction affects transcription levels of reporter constructs but not initiation patterns
Studies comparing transiently transfected episomal against lentivirally chromatinized enhancer reporters have suggested that lentivirally integrated reporters are more reflecti v e of endogenous cis -regulatory activity ( 52 , 53 ). To test whether chroma tiniza tion of promoters in TSS-MPRA   Figure S9).
WIP analysis of lenti-TSS-MPRA data indicated that 62% of sequences in our lenti-short experiments mirrored their endogenous initiation patterns, which dropped to 28% in lenti-long libraries. These r esults wer e similar to the observations seen with episomal constructs (63% and 25% respecti v ely), with episomal and chromatinized reporters sharing the majority of inserts tha t recapitula te endogenous TSS profiles (Figure 4 C) .
In contrast to the similarity of TSS profiles between episomal and lentivirally integrated reporters, genomic integration decreased the correlation between lenti-TSS-MPRA and endogenous transcription activity when compared to episomal experiments ( lenti-short = 0.29, lentilong = 0.17; Figure 4 D-F). Taken together, our results suggest that while reporter chromatinization affects the ov erall acti vity of some promoters, it has little effect on reporter insert TSS profiles.

Large (up to 950 bp) TSS-MPRA promoter inserts do not impr ov e MPRA initiation pattern fidelity
Gi v en that neither increasing the insert size to the maximum length possible due limitations in oligonucleotide synthesis ( ∼350 bp) nor lentiviral chroma tiniza tion increased the number of TSS-MPRA inserts whose initiation patterns e80 Nucleic Acids Research, 2023, Vol. 51, No. 15 PAGE 14 OF 20 matched their endogenous patterns, we next asked whether the previously tested DNA inserts were simply too short to contain the necessary features to full y ca pture some of the corresponding endogenous initiation patterns. For example, the 303-nt inserts of the epi-and lenti-long TSS-MPRA constructs only spanned from −185 nt to + 118 nt relati v e to the main TSS of each promoter, too short to capture the potential effects of putati v e nucleosome positioning sequences downstream of promoters ( 54 , 55 ), which are located at + 50 nt to + 200 nt from the TSS ( 56 ).
To test the degree to which e v en longer inserts affect TSS profile fidelity, we chose 3 r epr esentati v e genomic loci each for each of four categories that previousl y anal yzed inserts fell into (12 loci in total): (i) inserts that did not match the endogenous genomic initiation profile, regardless of insert size, (ii) inserts tha t ma tched the genomic initiation profile at 153 bp but not at 303 bp length, (iii) sequences in which the core 150-bp initiation profiles differed between 153 bp and 303 bp insert size and (iv) sequences where the reporterderi v ed initiation profile always mirrored the endogenous initia tion pa ttern, regardless of insert size. For each genomic locus, we PCR-amplified three inserts of increasing size (450 bp , 700 bp , 950 bp), for a total of 36 sequences and cloned them into either the pTSS-MPRA or lenti-TSS-MPRA vector.
Analysis of the TSS profiles of these inserts by TSS-MPRA in both episomal and lentivirally chromatinized settings showed that additional DNA sequence context did not improve the fidelity of MPRA-dri v en transcription initiation in any of the twelve cases we studied. Visualization of each TSS profile (Supplementary Figures  S10-S21; see Supplementary Table S4 for locus characteristics) confirmed our previous observations that longer insert sequences gi v e rise to more non-endogenous transcription initiation e v ents than shorter sequences. Even after focusing on the 153-bp core of each sequence, we did not find evidence that additional DNA sequence leads to more endogenous-like TSS profiles. In fact, we rarely saw any changes in TSS profiles in the core regions with different insert sizes, as exemplified by sequences that ne v er reca pitulate endo genous initiation (Group 1; Supplementary Figures S10-S12) or sequences that always recapitulate endogenous initiation ( Overall, our da ta indica tes tha t increasing insert sizes to up to almost 1 kb did not improve the fidelity of the insert core initia tion pa tterns rela ti v e to the initiation profiles observed at the endogenous loci.

TSS-MPRA can be used to study the effect of motif mutations on initiation patterns and transcription levels
The function of regulatory sequences is thought to be encoded by short DNA sequence motifs that recruit transcription factors. How exactly the molecular machinery interprets the motif grammar inscribed in the genome remains poorly understood. To demonstrate the utility of TSS-MPRA in decoding regulatory grammar, we used it to analyze the effects of mutating core promoter and transcription factor motifs on the initia tion pa tterns and transcription le v els of cis -regulatory elements. We focused on a subset of 21 regulatory motifs enriched in K562-transcribed or accessible regulatory elements (as determined by csRNAseq ( 5 ) and DNase-seq ( 57 ) respecti v el y). Alto gether, the epi-short library contained 47 control sequences that harbored various combinations of core promoter and transcription factors motifs, and 71 mutated sequences where all instances of a gi v en motif in the control sequence were replaced with a sequence that does not match any known transcription factor motif. Each sequence was barcoded 4f old redundantly f or a total of 188 control and 284 mutated inserts.
Despite the small overall number of sequences studied, DN A motif m utations induced clearl y visible and significant initia tion pa ttern changes in 25% of regulatory elements w hen m utating all instances of the respecti v e DNA motif (Figure 5 A). Most of these changes were the result of mutating a small number of motifs (Elk1, YY1, T AT A-box; Figure 5 B), while mutations in other motifs caused either minimal or varied effects. In contrast, we found that motif m utations onl y modestl y affected insert transcription le v els, with mutations in AP-1 and p63 motifs having the largest impacts ( Figure 5 C).
We observed that reporter chroma tiniza tion increased the fraction of inserts with significantly transformed TSS profiles from 25% to 40% (Figure 5 D). Interestingly, we found tha t muta tions in motifs for the Ets factor PU.1 led to larger WIP scores (indicati v e of more significant changes in initiation pattern) only when the reporter library was properly chromatinized (Figure 5 E, H), consistent with its known r ole in chr omatin opening ( 58 , 59 ). Howe v er, similar to the results in our episomal experiments, motif mutations had a minimal effect on transcription le v els of chromatinized inserts ( Figure 5 F). The large impact of T AT A-box mutations on initia tion pa tterns in both episomal and chromatinized contexts corroborates its well-known role in positioning the pre-initiation complex and RNAPII to initiate transcription ( 60 ) (Figure 5 G). These results highlight the value of TSS-MPRA to investiga te cis -regula tory grammar, to gain an understanding of how transcription factor binding sites contribute to both activity levels and initiation site usage of regulatory sequences.

TSS-MPRA can be used to study the effect of singlenucleotide polymorphisms on initiation patterns and transcription levels
Gi v en that greater than 80% of disease-associated genetic variants reside within non-coding regions with likely generegulatory activity, it is essential for their interpretation to better understand how these SNPs impact regulatory function ( 61 ). Recent studies have used MPRAs to evaluate how SNPs in putati v e enhancers affect transcription. To test the degree to which TSS-MPRA can be used to study the influence of SNPs on transcription initiation patterns and le v els, we analyzed the effect of 45 SNPs from the GWAS catalog tha t overlapped regula tory elements transcribed in K562 cells on the transcription initiation landscape of episomal and chromatinized reporters. and core promoter element motifs in episomal plasmids. Scatterplot comparing the changes in initiation patterns (WIP score, y-axis) and transcription le v els (fold change, x-axis) between control and mutated inserts in episomal plasmids. Red dots signify inserts with significantly changed TSS shapes after m utation. ( B ) TSS sha pe changes after motif mutation in episomal constructs. The y-axis r epr esents the mean WIP score between all inserts (and their barcode replicates) containing a particular motif and the corresponding insert with the mutated motif. Colors correspond to motif identities. ( C ) Transcription le v el changes associated with motif mutation in episomal constructs. The y-axis r epr esents the mean fold transcription change of the inserts (and their barcode replicates) containing a particular wild-type or mutated motif (x-axis). ( D ) Tr acking tr anscription initiation changes caused by tr anscription factor and core promoter element motif mutations after lentivir al integr ation into the genome. Scatterplot comparing the changes in initiation patterns (WIP score, y-axis) and transcription le v els (fold change, x-axis) between control and mutated inserts in lentiviral plasmids. Red dots signify inserts with significantly changed TSS shapes after m utation. ( E ) TSS sha pe changes after motif mutation in lentiviral constructs. The y-axis r epr esents the mean WIP score between all inserts (and their barcode replicates) containing a particular motif and the corresponding insert with the mutated motif. Colors correspond to motif identities. ( F ) Transcription le v el changes associated with motif mutation in lentiviral constructs. The y-axis r epr esents the mean fold transcription change of the inserts (and their barcode replicates) containing a particular wild-type or mutated motif (x-axis). Three of the forty-fiv e SNPs tested ( ∼7%) were associa ted with drama tic dif ferences in TSS profile in episomal TSS-MPRA experiments (Figure 6 A). Reporter chromatinization increased the number of SNPs that led to significantly dif ferent initia tion pa tterns to 6 ( ∼14%) (Figure  6 B). As an example, the SNP rs1991401, which is associated with myopia, asthma and blood traits (62)(63)(64)(65), affects the T AT A-box in the promoter of the DDX5 gene. TSS-MPRA results of inserts harboring the rs1991401 variant show a pronounced shift to a dispersed initiation phenotype that is consistent with the loss of T AT A-box function ( Figure  6 C), with little effect on transcript le v els. On the other hand, SNP rs131804, which is also associated with blood traits, causes a mutation 2 bp upstream of the primary TSS in the SCO2 gene, leading to a substantial change in initiation site usage and higher overall levels of TSS-MPRA activity (Figure 6 D).
Our results demonstrate that TSS-aware MPRAs can identify variants associated with extreme changes in initiation patterns, which provides additional context and insight into their potential regulatory impact that may be missed when using conventional MPRA approaches that only measure transcription le v els.

DISCUSSION
The principle of quantifying reporter transcripts by sequencing to determine the effect of sequence changes in the promoter regions has enab led massi v ely parallel analysis of sequence features in promoters and enhancers ( 18 , 20 , 21 ). The first and most-widely used implementation of this approach distinguishes reporter transcripts based on unique barcodes at their 3 ends. One drawback of 3 barcoding when studying promoter elements is that the resulting data carries no information about where transcription initiates, which is an important promoter feature that directly rela tes to regula tory mechanisms ( 7 , 8 , 17 , 25 ). Mor e r ecently, 5 end-aware methods have been developed, which sequence reporter transcript 5 ends to provide data on both transcription initia tion loca tion as well as transcription le v els ( 6 , 25 , 26 ).
With MPRAs increasingly being used to study the functional consequences of genetic variation and the fact that transcription initia tion pa tterns are linked to mechanisms of transcription regulation ( 7 , 8 , 17 , 25 ), it is essential to kno w ho w well promoters in an MPRA replicate their initia tion pa tterns in the genome. The only comparison of this kind to date ( 6 ) suggested that MPRA and endogenous initia tion pa tterns agree. Howe v er, as this analysis was done by comparing averaged initiation signals across many promoters, and in drosophila , where TSS profiles may ( 66 ) or may not be largely focused ( 34 ), the question whether the TSS patterns of individual human promoters in the MPRA match their endogenous counterparts remained unanswered.
To address this issue, we used TSS-MPRA to compare indi vidual MPRA-deri v ed and endogenous TSS profiles of randomly chosen human promoters. For sensiti v e quantification of TSS profile differences, we additionally de v eloped a new computational approach (WIP score) that outperforms the dissimilarity metric most commonly used for this purpose, EMD ( 48 ) (Supplementary Figures S5-S7) on experimental data. Using WIP scores to quantify TSS-MPRA and endogenous profile differences, we found that 37% of MPRA inserts did not reca pitulate endo genous TSS profiles w hen anal yzing short 153 bp promoter inserts in episomal reporters.
While episomal reporter plasmids offer some le v el of chroma tiniza tion ( 67 ), integra ting reporters into the genome via lentiviral transduction is thought to provide a more nati v e chroma tin sta te and has been shown to af fect enhancer activity in the context of an MPRA ( 52 ). To our surprise, lentivir al integr ation of TSS-MPRA constructs into the genome did not change TSS fidelity of the MPRA (38% of endogenous profiles not recapitulated), with TSS patterns in the lentiviral MPRA being very similar to the episomal TSS profiles (Figure 4 C). In contrast, correlation between relati v e insert and endogenous transcription le vels ( Episomal = 0.43, Figure 1 C) was indeed affected by lentiviral chroma tiniza tion ( Lenti = 0.29, Figure 4 D), similar to the effect observed in lentivir ally integr ated enhancer MPRAs ( 52 ).
Another parameter that has not been characterized for promoter MPRAs is the effect of insert size. In enhancer MPRAs, insert size has been shown to affect enhancer activity ( 53 ), but as the activity of the corresponding enhancers in their nati v e environment is unknown, it is difficult to draw meaningful conclusions from this finding. While the same applies to promoter MPRA transcription le v els, endogenous TSS profiles can serve as ground truth to determine how insert size affects fidelity of MPRA TSS profiles. Our initial results comparing 303-bp and 153-bp inserts centered on the same endogenous TSS indicated that while the shorter libraries recapitulated the majority of endogenous TSS profiles (63%), the longer inserts had lower fidelity (25% recapitulated overall, 41% when only assessing the core 153-bp region of the long inserts). Testing the effect of e v en longer inserts of up to 950 bp centered on the TSS on transcription initiation profiles, we found that increasing the amount of sequence context around the TSS in the MPRA did not increase fidelity of the resulting TSS profiles. On the contrary, for inserts with TSS profiles that di v erged from the endogenous profile or where the TSS pattern differed between 153 bp and 303 bp inserts, further increasing the insert length to up to 950 bp further increased the dissimilarity of the overall pattern ( Supplementary Figures S10-S21). This was largely due to additional initiation sites mostly located downstream of the original TSS, rather than changes to the core initiation pattern.
The di v ergent and size-associated changes in episomal TSS-MPRA TSS patterns wer e mirror ed in lentivirally chromatinized longer-insert reporter constructs. Possible causes for this phenomenon may include the lack of DNA methylation of the reporter gene in reporter constructs, w hich normall y suppresses spurious initiation in gene bodies of endogenous genes ( 68 ) and a lack of proper chroma tiniza tion and chroma tin conte xt caused by the lenti viral MPRA backbone that harbors insulators to dampen effects from random integration into chromatin ( 52 ). Another reason for poor TSS fidelity could be the absence of dif ferentia tion-associa ted chroma tin changes in rela ti v ely transient experiments in cell lines: the studies where (large) Figure 6. Assessing the effects of single nucleotide polymorphisms on reporter-dri v en initiation patterns and transcription le v els. ( A ) Allele-specific transcription initia tion dif ferences caused by known GWAS SNPs in episomal plasmids. Scatterplot comparing the changes in initiation patterns (y-axis) and transcription fold change (x-axis) between control and variant inserts in episomal constructs. Red dots signify inserts that had significant changes to their TSS shapes. ( B ) Allele-specific transcription initia tion dif ferences caused by known GWAS SNPs in lentiviral constructs. Sca tterplot comparing the changes in initiation patterns (y-axis) and transcription fold change (x-axis) between control and variant inserts in lentivirally integrated constructs. Red dots signify inserts that had significant changes to their TSS shapes. ( C ) SNP rs1991401 is associated with TSS shape changes. Track showing csRNA-seq (top), TSS-MPRA, and Lenti-TSS-MPRA output of the T (black) and G (red) allele. A blue ' ∧ ' symbol indicates the location of the SNP. ( D ) TSS shape dif ferences associa ted with SNP rs131804. Track showing csRNA-seq (top), TSS-MPRA, and Lenti-TSS-MPRA output of the C (black) and A (red) allele. A blue ' ∧ ' symbol marks the SNP location.
transgenes and enhancer reporters exhibit high transcriptional and epigenetic fidelity were done in pluripotent stem cells (69)(70)(71)(72) and in whole animals (73)(74)(75)(76), where the transgenic DNA undergoes dif ferentia tion-associa ted activity progressions that mirror those of the endogenous sequence ( 77 ). Thus, while additional sequence information surrounding the promoter might be necessary to help dri v e the correct TSS profile, it may only be doing so during cellular dif ferentia tion.
Overall, our results suggest that in the context of MPRAs, factors other than DNA sequence can affect reporter fidelity of transcription initia tion pa tterns. Further work will be necessary to uncover the determinants of the differences in TSS patterns between reporters and endogenous loci.
Our data shows that it is essential to assess promoter TSS pattern fidelity in the MPRA before embarking on mechanistic studies. To this end, we have demonstrated how a TSS-aware MPRA and single-nucleotide resolution dissim-ilarity scoring can be used to select promoters for further stud y whose TSS pa tterns in the MPRA replica te the endogenous TSS profile, and how TSS-MPRA can be used to characterize the effects of promoters promoter SNP on transcription initiation. Further, we anticipate that our simplified TSS-MPRA protocol with its 10-20-fold lower input r equir ements compar ed to STAP-seq ( 7 , 8 ) together with the lenti viral reporter v ector we adapted from lentiMPRA ( 35 ) will enable 5 -aware MPRA in cells that are not easily electroporated with DNA, and will open the door for 5 -aware MPRAs in vivo .