Integration site abundance near epigenetic marks and genomic features. (A) Integration frequency near sites of histone posttranslational modification or bound chromatin proteins. Integration frequency is quantified relative to genome-wide mapping data from a study of CD34+ hematopoietic stem cells.23 The integration frequency scale is shown along the bottom of the panel. Increasingly intense shades of yellow indicate negative correlation of the experimental dataset with the matched random control, and increasingly intense shades of blue indicate positive correlation. The scale is generated using the ROC (receiver operating characteristic) area method.18,19 CTCF is a DNA-binding protein proposed to be associated with chromatin boundaries. H2AZ is a histone variant associated preferentially with promoters. For both panels, the asterisks in each tile indicate the significance of any departure from random integration (*P < .05, **P < .01, ***P < .001). The datasets marked “Retro SIN” and “Retro WT” are for gammaretroviral integration in CD34+ cells, as reported previously.25 (B) Integration frequency near annotated sequence features, quantified using the ROC area method.18 Increased integration near the indicated feature compared with random distribution is shown in red, decreased integration in blue. For many of the features, the strength of the trend was examined over several genomic length intervals. The interval lengths are shown to the right of the feature name (eg, for GC content, 1 kb indicates that intervals of 1 kb around each integration site were used for analysis). Intervals marked “<” indicate measures of integration within the indicated distance of that feature. Intergenic width indicates the length of the interval between transcription units for those sites that lie outside transcription units. The short intergenic regions (gene-dense regions), indicated in blue, were favored for integration. Effects of gene activity are captured in the expression intensity measure.
Affymetrix expression data for lymphoid cells were used to annotate genes, and the density of genes at different expression levels was then used to annotate integration sites, as in the gene density analysis. For example, for the top 1/2 expression, the density of genes was analyzed at each integration site or random control, but only the most active 50% of genes were scored. For the top 1/16 expression, only the most active 1/16th of genes were scored. Because the datasets are large, in a few cases statistically significant differences were achieved for tiles where little color is evident. One anomalous dataset was excluded from the analysis as an extreme outlier (BstYI for patient no. 6).
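
As a rough illustration of the ROC area idea behind each tile, the sketch below scores one feature by the probability that its value at an integration site exceeds its value at a random control site, with ties counted as one half. A value of 0.5 corresponds to no difference from random, values above 0.5 to enrichment, and values below 0.5 to depletion. This is a simplified sketch, not the published procedure: the function name `roc_area` and the input lists are hypothetical, and the sketch pools all controls rather than comparing each integration site only with its own matched controls.

```python
from typing import Sequence


def roc_area(site_values: Sequence[float], control_values: Sequence[float]) -> float:
    """Estimate the ROC area for one genomic feature.

    Returns the fraction of (site, control) pairs in which the feature value
    at the integration site is larger than at the control; ties count as 0.5.
    Assumption: controls are pooled, not matched per site.
    """
    if not site_values or not control_values:
        raise ValueError("both inputs must contain at least one value")

    score = 0.0
    for s in site_values:
        for c in control_values:
            if s > c:
                score += 1.0      # site ranks above control
            elif s == c:
                score += 0.5      # tie contributes half a pair
    return score / (len(site_values) * len(control_values))


if __name__ == "__main__":
    # Hypothetical feature values (eg, gene density in a 1 kb window).
    sites = [3.0, 4.0, 5.0]
    controls = [1.0, 2.0, 3.0]
    print(f"ROC area: {roc_area(sites, controls):.3f}")
```

The pairwise loop is quadratic in the number of sites; for datasets as large as those in the figure, a rank-based calculation would be the more practical choice.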