Project number 5: DNA replication introduces copy number variations that are detectable in RNA-seq

We load the packages used throughout the analysis for data manipulation (dplyr, data.table), string handling (stringr), scaling utilities (scales), and tidy outputs (tibble).

#install.packages('lubridate')
pkgs <- c("dplyr","data.table","stringr","scales","tibble")
for (p in pkgs) if (!requireNamespace(p, quietly=TRUE)) install.packages(p)
library(dplyr); library(data.table); library(stringr); library(scales); library(tibble)

## 
## Adjuntando el paquete: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## 
## Adjuntando el paquete: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

File Handling

We first load a gene-by-sample expression matrix in logTPM scale, where the first column contains gene identifiers and the remaining columns correspond to samples, and a metadata table with sample-level annotations.

expr_df <- read.csv("log_tpm.csv", check.names = FALSE)  # first col = gene_id
meta_df <- read.csv("sample_table.csv", check.names = FALSE)
colnames(expr_df)[1:10]

##  [1] ""          "p1k_00001" "p1k_00002" "p1k_00003" "p1k_00004" "p1k_00005"
##  [7] "p1k_00006" "p1k_00007" "p1k_00008" "p1k_00009"

colnames(meta_df)[1:10]

##  [1] ""                   "sample_id"          "study"             
##  [4] "project"            "condition"          "rep_id"            
##  [7] "Strain Description" "Strain"             "Culture Type"      
## [10] "Evolved Sample"

Mapping

We download the E. coli genome annotation (GFF) from NCBI to obtain genomic coordinates for each gene. These coordinates allow us to map expression measurements onto positions along the chromosome.

gff_url  <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz"
gff_file <- "GCF_000005845.2_ASM584v2_genomic.gff.gz"
if (!file.exists(gff_file)) download.file(gff_url, gff_file, mode="wb")
gff <- fread(gff_file, sep="\t", header=FALSE, quote="", comment.char="#", fill=TRUE) #fread help us to manage the information from the NCBI File
setnames(gff, c("seqid","source","type","start","end","score","strand","phase","attr")) #We set some known names to the columns to manage more easy the
genes_main <- copy(gff)
table(gff$type) #We make some expoloration of the data in gff

## 
##                    CDS                   exon                   gene 
##                   4340                    216                   4506 
## mobile_genetic_element                  ncRNA  origin_of_replication 
##                     50                    108                      1 
##             pseudogene                 region                   rRNA 
##                    145                      1                     22 
##       sequence_feature                   tRNA 
##                     48                     86

Extraction usefull information:

In this step we transform the raw NCBI GFF annotation into a clean gene-coordinate table that can be joined with the expression matrix. We first restrict the annotation to features of type gene and extract the locus_tag from the attribute field (attr) to obtain a gene identifier compatible with our dataset. Because the GFF may contain multiple sequences (e.g., chromosome and plasmids), we identify the main chromosome as the seqid with the largest coordinate span and store it in genes_main for downstream origin (oriC) detection. We then keep only the essential columns (gene_id, start, end), enforce integer coordinates, and estimate the chromosome length as \(L=\max(end)\). Finally, each gene is represented by a single position \(pos=(start+end)/2\) (midpoint), which provides a unique coordinate per gene for subsequent binning and circular mapping.

gff <- gff[type == "gene"] #We remain just with the gene information
gff[, gene_id := sub(".*locus_tag=([^;]+).*", "\\1", attr)] #We extract from the NCBI attr the identificator of the gene that correspond to the name of the gene in our dataset
seq_spans <- gff[, .(span = max(end, na.rm=TRUE) - min(start, na.rm=TRUE)), by=seqid]
main_seqid <- seq_spans[which.max(span), seqid]
genes_main <- gff[seqid == main_seqid]
gff <- gff[, .(gene_id, start, end)]
gff[, start := as.integer(start)]
gff[, end   := as.integer(end)]    #We extract the positions of the gene
typeof(gff$start)

## [1] "integer"

L <- max(gff$end, na.rm=TRUE) 
#Sets L to the maximum end coordinate on the main chromosome: 
#your estimate of the genome length (useful for building circular coordinates later). 
#L should be approximately the chromosome length (e.g., ~4.6e6 for E. coli K12).
gene_coords <- gff[grepl("^b\\d{4}$", gene_id),
                          .(gene_id, start, end, pos=(start+end)/2)]
gene_coords <- unique(gene_coords, by="gene_id")
head(gene_coords)

##    gene_id start   end    pos
##     <char> <int> <int>  <num>
## 1:   b0001   190   255  222.5
## 2:   b0002   337  2799 1568.0
## 3:   b0003  2801  3733 3267.0
## 4:   b0004  3734  5020 4377.0
## 5:   b0005  5234  5530 5382.0
## 6:   b0006  5683  6459 6071.0

Map expression matrix to genes with coords

At this stage, we align the expression matrix with the genomic coordinate table using a common gene identifier. As a basic integrity check, we confirm that gene IDs in the expression input are not duplicated, since repeated identifiers would break the one-to-one mapping between expression rows and genomic positions. We then build gene_info by joining the expression gene list with gene_coords, yielding a per-gene table that includes genomic coordinates (and later-derived quantities such as bin and angle). In parallel, we convert the expression values into a numeric matrix (expr_mat) and reorder/subset it to produce expr_mat2, ensuring that its row order matches exactly the order of gene_info. This guarantees that downstream binning and circular modeling are performed on expression values consistently matched to the correct genomic locations.

expr_gene_id <- expr_df[[1]] #Expression values themselves have no genomic meaning unless you know which gene each row corresponds to. So we extract the genes id of our dataset.
#This vector becomes the backbone for joining expression ↔ coordinates
anyDuplicated(expr_gene_id) #In our sanity check we have not repeated genes.

## [1] 0

#“take genes in expression order, then attach coordinates if available.”
#We create then "gene_info", that contains the genes of our dataset with its respective position.
gene_info <- tibble(gene_id = expr_gene_id) %>%
  left_join(as.data.frame(gene_coords) %>% mutate(gene_id=as.character(gene_id)),
            by="gene_id")


expr_mat <- as.matrix(expr_df[, -1]) #We transform our data into a matrix to apply it in the linear model.
rownames(expr_mat) <- expr_df[[1]]
storage.mode(expr_mat) <- "numeric"
expr_mat2 <- expr_mat[gene_info$gene_id, , drop=FALSE] #We organize our matrix to keep only rows corresponding to genes with coordinates, reorders it to match exactly the order of gene_info.
head(expr_mat2[, 1:6, drop=FALSE])

##         p1k_00001   p1k_00002    p1k_00003  p1k_00004   p1k_00005   p1k_00006
## b0002 -0.05399314  0.05399314  0.879043050  1.0896003  0.18245342 -0.07887270
## b0003 -0.06197325  0.06197325  1.063329885  1.4778890  0.39257269  0.23467136
## b0004 -0.03697197  0.03697197  0.003131557  0.3567021 -0.09675615 -0.33491258
## b0005 -0.10496714  0.10496714 -1.118144530 -1.2779633 -0.18154598 -0.36405654
## b0006  0.04224167 -0.04224167 -0.123591888 -0.1077775 -0.17317704 -0.08988823
## b0007  0.05465326 -0.05465326  0.327976479  0.2759481  0.10268599  0.11661840

Genes grouped by Bins

To represent the circular chromosome at a coarser (and less noisy) resolution, we partitioned the genome into fixed windows of 100 kb (“bins”). Each gene was assigned to a bin using its genomic midpoint, and because the chromosome is circular we used the wrapped coordinate \(pos\_{mod} = pos \bmod L\) to keep positions inside \([0, L)\).

For each bin we defined a single representative angular coordinate \(\theta\) using the bin midpoint. This midpoint was mapped onto \([0,2\pi)\) as: \[\theta = \frac{2\pi \cdot \text{bin\_mid}}{L}.\]

This bin-level angle is the input for the downstream circular regression model, \[y = \alpha + \beta\cos(\theta) + \gamma\sin(\theta),\] where \(y\) is the mean log-expression per bin. We also tracked the number of genes per bin (\(n\_{genes}\)), which can later be used as regression weights to account for unequal gene counts across windows.

window_size <- 100000
#E. coli genome ≈ 4.6 Mb → ~46 bins This is a tradeoff:
   #Smaller bins → noisier but higher resolution
   #Larger bins → smoother but may wash out signal
L / window_size    #46.41628

## [1] 46.41628

gene_info <- gene_info %>%
  mutate(
    pos_mod   = pos %% L,
    bin       = floor(pos_mod / window_size),
    bin_mid   = ((bin + 0.5) * window_size) %% L,
    theta_bin = 2*pi*bin_mid/L
  )
#y = α + β cos(θ) + γ sin(θ)

idx_by_bin <- split(seq_len(nrow(gene_info)), gene_info$bin)
n_genes_in_window <- sapply(idx_by_bin, length)
# this counts how many genes are in each bin;
# 0  1  2  3  4  5 
#86 85 97 88 93 89 

theta_by_bin <- gene_info %>%
  distinct(bin, theta_bin) %>%
  mutate(bin = as.integer(bin)) %>%
  arrange(bin)

Mean logTPM per bin per sample

To fit the circular model, we first aggregated gene-level expression into bin-level expression. For each sample and each genomic bin, we computed the mean of the logTPM values across all genes assigned to that bin: \[y_{b,s}=\frac{1}{|G_b|}\sum_{g\in G_b} y_{g,s},\] where \(G_b\) is the set of genes in bin \(b\). Because the expression matrix is already in log scale, averaging within bins provides a stable summary of the bin’s typical expression level.

We then reshaped the bin-by-sample matrix into a long-format table with one row per \((bin,\ sample)\) pair, and we attached: (i) the bin’s angular coordinate \(\theta_{bin}\) (from the bin midpoint) and (ii) the number of genes in the bin \(n_{genes}=|G_b|\). This final table (bin_expr) is the direct input for the per-sample circular regression, and \(n_{genes}\) can be used as a weight to account for bins containing different numbers of genes.

y_bin_mat <- sapply(idx_by_bin, function(idx) {
  colMeans(expr_mat2[idx, , drop=FALSE], na.rm=TRUE)
}) #Here we build our dataset with the average expression for each bin. 
y_bin_mat <- t(y_bin_mat)  # And as we want to mantain the order we use the transpose so we have in the columns the experiments and in the lines the bin.
bins <- as.integer(names(idx_by_bin)) #Its easier to use the bins as an integer in the future

bin_expr <- data.frame(
  bin       = rep(bins, times = ncol(y_bin_mat)),
  sample_id = rep(colnames(y_bin_mat), each = length(bins)),
  y         = as.vector(y_bin_mat)
) %>%
  left_join(theta_by_bin, by="bin") %>%
  mutate(n_genes = rep(as.integer(n_genes_in_window), times=ncol(y_bin_mat))) #Here we have our output for the model. With the long format required

head(bin_expr)

##   bin sample_id           y  theta_bin n_genes
## 1   0 p1k_00001 -0.05766221 0.06768299      86
## 2   1 p1k_00001 -0.01140104 0.20304897      85
## 3   2 p1k_00001 -0.03812402 0.33841495      97
## 4   3 p1k_00001 -0.01064519 0.47378094      88
## 5   4 p1k_00001 -0.02505665 0.60914692      93
## 6   5 p1k_00001  0.02339887 0.74451290      89

Fit circular model per sample

For each sample, we fit a null model (fit0) that assumes constant expression across the genome (y=α), we will use this later to compare if using circular model (y = α + β cos(θ) + γ sin(θ)) improves it significantly the fit.

We parametrized the circular model using sine and cosine basis functions, allowing the sinusoidal pattern to be estimated within a linear regression framework. This avoids nonlinear optimization (\(y = \alpha + ACos(\theta-\phi)\)) over amplitude and phase parameters.

The non linear model starts by representing the bin as follows: \[A(\cos(\theta-\phi))=A\cos(\theta)\cos(\phi)+A\sin(\theta)\sin(\phi)\] We defined: \[\beta = A\cos{\phi}\] \[\gamma=A\sin(\phi)\] With this transformation we represent the same model with a different parameterization. From the previous analysis we can set as useful information that: \[A=\sqrt{\beta^2+\gamma^2}\] That give us the amplitude And: \[\phi=\operatorname{atan2}(\gamma, \beta)\] That give us the phase, so the position in the circular chromosome in dependence of \(\gamma\) and \(\beta\). We at this point can use also a theoretical treatment that define the phase duration as C= DNA synthesis phase and D= predivisional phase. The doubling time is \(\tau = \frac{2}{growthrate}\) and the position of a locus is \(n_a= 2^{\frac{p_aC+D}{\tau}}\) where \(n_a\) is the copy number of locus \(a\) positioned at a distance \(p_a\) from the terminus of the chromosome. Due to this we can arrive that: \[log_2\frac{n_{ori}}{n_{ter}}=\frac{C}{\tau}=\frac{C\mu}{ln(2)}\]

In our case we can say that, if expression is proportional to copy number, then in \(log_2\) space the difference between maximal and minimal expression across the chromosome corresponds to: \[y_{max}-y_{min}=2A=\delta\] Thus, the fitted amplitude A provides a quantitative estimate of the replication-associated expression gradient. We can retrive the position due to the fact that we have \(\phi\), where the expression is maximum and minimum in radiants; to apply the model to all the samples, we used the fit_one_sample function that we’ve just created. This function operates on one sample’s binned data, so in the “results” variable for each sample, there are all the parameters of the fitted model explained above.
In order to evaluate if there is statistical evidence to reject the null model “fit0”, we used an anova F test to compute p value (p_full), then corrected with BH method for multiple testing correction (q_full)

fit_one_sample <- function(df, use_weights=TRUE) {
  w <- if (use_weights) df$n_genes else NULL
  
  fit0    <- lm(y ~ 1, data=df, weights=w)
  fitFull <- lm(y ~ cos(theta_bin) + sin(theta_bin), data=df, weights=w)
  
  # Reduced models (for later comparison)
  fitCos <- lm(y ~ cos(theta_bin), data=df, weights=w)
  fitSin <- lm(y ~ sin(theta_bin), data=df, weights=w)
  
  p_full <- anova(fit0, fitFull)$`Pr(>F)`[2]
  p_cos  <- anova(fit0, fitCos )$`Pr(>F)`[2]
  p_sin  <- anova(fit0, fitSin )$`Pr(>F)`[2]
  
  coefs <- coef(summary(fitFull))
  alpha <- coefs["(Intercept)", "Estimate"]
  beta  <- coefs["cos(theta_bin)", "Estimate"]
  gamma <- coefs["sin(theta_bin)", "Estimate"]
  
  # amplitude/phase representation:
  # y(theta) = alpha + A cos(theta - phi)
  A     <- sqrt(beta^2 + gamma^2)
  phi   <- atan2(gamma, beta)              # [-pi, pi]
  delta <- 2*A                              # naive: max-min in log space
  
  theta_max <- (phi %% (2*pi))             # where cos(theta-phi) is max
  theta_min <- (theta_max + pi) %% (2*pi)
  
  tibble(alpha, beta, gamma, A, phi, delta, theta_max, theta_min, p_full, p_cos, p_sin)
}

results <- bin_expr %>%
  group_by(sample_id) %>%
  group_modify(~ fit_one_sample(.x, use_weights=TRUE)) %>%
  ungroup() %>%
  mutate(
    q_full = p.adjust(p_full, method="BH"),
    q_cos  = p.adjust(p_cos,  method="BH"),
    q_sin  = p.adjust(p_sin,  method="BH")
  ) %>%
  as_tibble()

sig_q <- !is.na(results$q_full) & results$q_full < 0.05
head(results)

## # A tibble: 6 × 15
##   sample_id   alpha     beta    gamma       A    phi  delta theta_max theta_min
##   <chr>       <dbl>    <dbl>    <dbl>   <dbl>  <dbl>  <dbl>     <dbl>     <dbl>
## 1 p1k_00001 -0.0169 -0.00572  0.00283 0.00638  2.68  0.0128      2.68     5.82 
## 2 p1k_00002  0.0169  0.00572 -0.00283 0.00638 -0.460 0.0128      5.82     2.68 
## 3 p1k_00003  0.168  -0.00762  0.0792  0.0796   1.67  0.159       1.67     4.81 
## 4 p1k_00004  0.160  -0.00587  0.0747  0.0749   1.65  0.150       1.65     4.79 
## 5 p1k_00005 -0.0992 -0.0201  -0.00811 0.0217  -2.76  0.0434      3.52     0.383
## 6 p1k_00006 -0.106  -0.0196  -0.00540 0.0203  -2.87  0.0407      3.41     0.269
## # ℹ 6 more variables: p_full <dbl>, p_cos <dbl>, p_sin <dbl>, q_full <dbl>,
## #   q_cos <dbl>, q_sin <dbl>

Visual Data exploration

The histogram shows the distribution of amplitude values across all samples. Most amplitudes cluster at low values, and the distribution is clearly right-skewed, with a long tail toward larger amplitudes. The rug marks indicate where significant fits (q_full < 0.05) fall along the x-axis: they are sparse at very low amplitudes and become increasingly frequent as amplitude increases, concentrating mainly in the upper part of the distribution.

This scatter plot shows the fitted \(\beta\) (cosine component) against \(\gamma\) (sine component) across samples, with points colored by q_full. The cloud is densest near the origin, indicating that most samples have small estimated coefficients. As expected, the most significant fits concentrate away from (0,0), while points near the origin are predominantly non-significant. The distribution is not isotropic: the sample cloud is elongated along a negative diagonal, suggesting that \(\beta\) and \(\gamma\) may not be independent in practice and tend to co-vary with opposite signs in many samples.

How are these significant values distributed? In the \((\beta,\gamma)\) plane, significant samples are primarily enriched at larger distances from the origin (equivalently, larger \(A\)), indicating that significance largely tracks effect size. In our data, the median \(A\) increases from 0.0859 in non-significant samples to 0.214 in significant samples (q_full < 0.05; Wilcoxon p-value \(= 5.89\times10^{-38}\)). To assess whether there is also a preferred direction (i.e., a consistent peak location across samples), we examine the angular distribution (phi) among significant samples; the resulting histogram shows clear clustering rather than a uniform spread, suggesting that many significant samples share similar peak directions.

What happens if the models account only for the cosine or only for the sine term? The reduced models typically retain significance only for samples whose signal is sufficiently aligned with that single basis term. In our dataset, the full model identifies 95 significant samples, whereas the cosine-only and sine-only models identify 24 and 39, respectively. Importantly, 37 samples are significant under the full model but not under either reduced model, consistent with signals that require both sine and cosine terms (i.e., peak directions not aligned with a single basis). Conversely, only 5 samples are significant under the sine-only model but not under the full model, which is consistent with borderline cases where the 1-df reduced model can be slightly more powerful than the 2-df full model.

What does the estimated parameter mean? The pair \((\beta,\gamma)\) jointly encodes magnitude and direction: the distance to the origin reflects effect size (strength of the fitted pattern), while the angle (phi) reflects where along the circle the maximum/minimum occurs. Therefore, points farther from (0,0) correspond to stronger fitted patterns, while different quadrants/angles correspond to different peak positions.

Do you think one can derive a quantitative relationship of the parameter from the model and the growth rate? Under the replication-driven assumptions, larger effect sizes (e.g., \(A\) or \(2A\)) should tend to be associated with higher growth rate \(\mu\). } Overall, these comparisons support using q_full as the primary significance criterion in the remainder of the analysis. The full model is phase-invariant (it captures signal regardless of whether it aligns with the cosine or sine basis) and therefore detects substantially more significant samples than either reduced model (95 vs 24/39). The presence of many ‘full-only’ cases (37 samples significant only under q_full) indicates that restricting to a single basis term would miss a large fraction of signals whose peak direction is not aligned with that term. While a small number of borderline cases can be recovered by a 1-df reduced model (5 samples significant only under q_sin), the overall gain in sensitivity and robustness makes q_full the most appropriate and conservative choice for downstream analyses.

## # A tibble: 2 × 2
##   group                            median_A
##   <chr>                               <dbl>
## 1 non-significant (q_full >= 0.05)   0.0859
## 2 significant (q_full < 0.05)        0.214

## # A tibble: 1 × 2
##   test                                         p_value
##   <chr>                                          <dbl>
## 1 Wilcoxon rank-sum test: A ~ (q_full < 0.05) 5.89e-38

## # A tibble: 5 × 2
##   category                                                         n
##   <chr>                                                        <int>
## 1 Significant (full model): q_full < 0.05                         95
## 2 Significant (cos-only): q_cos < 0.05                            24
## 3 Significant (sin-only): q_sin < 0.05                            39
## 4 Full-only: q_full < 0.05 AND q_cos >= 0.05 AND q_sin >= 0.05    37
## 5 Sin-only not full: q_sin < 0.05 AND q_full >= 0.05               5

##                               Cos-only significant (q_cos<0.05)
## Full significant (q_full<0.05) FALSE TRUE
##                          FALSE   940    0
##                          TRUE     71   24

##                               Sin-only significant (q_sin<0.05)
## Full significant (q_full<0.05) FALSE TRUE
##                          FALSE   935    5
##                          TRUE     61   34

Merge metadata

i <- which(names(meta_df) == "")[1]
names(meta_df)[i] <- "sample_id"

merged_results <- dplyr::left_join(
  results,
  dplyr::distinct(meta_df, sample_id, .keep_all = TRUE),
  by = "sample_id"
)

Identification of oriC position

To anchor the circular expression model to a biologically meaningful reference point, we determined the genomic position of the replication origin (oriC). In E. coli, the dnaA gene is located immediately adjacent to oriC and was therefore used as a proxy for the origin position. The dnaA locus was identified from the genome annotation, and its midpoint was computed as:

\[ori_{pos} = \frac{\text{start}_{\text{dnaA}} + \text{end}_{\text{dnaA}}}{2}.\]

This genomic coordinate was then mapped onto the circular genome using:

\[ \theta_{\text{ori}} = \frac{2\pi \cdot (ori_{pos} \equiv L)}{L}\]

where \(L\) is the chromosome length. The resulting angular coordinate defines the position of oriC within the circular regression framework and enables computation of the origin–terminus expression contrast.

  ori_row <- genes_main[grepl("(^|;)gene=dnaA(;|$)|(^|;)Name=dnaA(;|$)", attr)]
if (nrow(ori_row) == 0) {
  print(head(genes_main[grepl("dnaA", attr)]$attr, 10))
  stop("dnaA not found with gene=/Name=. Inspect attr above and adjust grepl().")
}
ori_pos <- (ori_row$start[1] + ori_row$end[1]) / 2
theta_ori <- (2*pi*(ori_pos %% L))/L
cat("theta_ori (radians):", theta_ori, "\n")

## theta_ori (radians): 5.256298

Peak angles and directionality check

At this stage we have already identified, for each sample, the peak location of the fitted replication-related pattern (\(\theta_{max}\)) and we restrict the analysis to samples with a significant full model (\(q_{full} < 0.05\)). The key question is whether these peak angles are randomly distributed along the circular chromosome or show a consistent preferred direction relative to the replication origin. The Rayleigh test strongly rejects the hypothesis of a uniform angular distribution, indicating that the peaks are not random but instead concentrate around a specific direction. However, the directed V-tests show that this concentration is not toward our oriC reference angle \(\theta_{ori}\), but rather toward the opposite direction \(\theta_{ori}+\pi\). In other words, the fitted “peak direction” is systematically shifted by approximately half a chromosome relative to the origin anchor. This empirical result matters for the downstream quantification of the origin–terminus contrast: it implies that a naive definition based on the fitted maximum/minimum locations (e.g., using \(\theta_{max}\) and \(\theta_{min}\)) would not reliably represent an ori-centered replication gradient in this dataset. Therefore, before proceeding to any technical derivation, these directionality tests motivate an explicit ori-anchored contrast that compares expression at \(\theta_{ori}\) versus \(\theta_{ori}+\pi\).

if (!requireNamespace("circular", quietly = TRUE)) install.packages("circular")
if (!requireNamespace("CircStats", quietly = TRUE)) install.packages("CircStats")
library(circular)

## 
## Adjuntando el paquete: 'circular'

## The following objects are masked from 'package:stats':
## 
##     sd, var

library(CircStats)

## Cargando paquete requerido: MASS

## 
## Adjuntando el paquete: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Cargando paquete requerido: boot

## 
## Adjuntando el paquete: 'CircStats'

## The following objects are masked from 'package:circular':
## 
##     A1, A1inv, deg, I.0, I.1, I.p, pp.plot, rad, rose.diag, rstable

sig_q <- !is.na(results$q_full) & results$q_full < 0.05 # We use only BH-significant samples from the full model

th <- (results$theta_max[sig_q]) %% (2*pi) # We need to transform the indormation in a circular form.

mu_ori <- (theta_ori) %% (2*pi)
mu_opp <- (mu_ori+pi) %% (2*pi)

ray <- circular::rayleigh.test(circular(th, units = "radians", modulo = "2pi")) #We perform the tests
vt_ori <- CircStats::v0.test(th, mu0 = mu_ori, degree = FALSE)
vt_opp <- CircStats::v0.test(th, mu0 = mu_opp, degree = FALSE)

mu_hat <- atan2(mean(sin(th)), mean(cos(th))) %% (2*pi)
Rbar  <- sqrt(mean(cos(th))^2 + mean(sin(th))^2)

V_ori <- sum(cos(th - mu_ori)); rbar_ori <- V_ori / length(th)
V_opp <- sum(cos(th - mu_opp)); rbar_opp <- V_opp / length(th)

tibble(
  n_sig = length(th),
  theta_ori = mu_ori,
  theta_opp = mu_opp,
  mu_hat = mu_hat,
  Rbar = Rbar,
  rayleigh_p = unname(ray$p.value),
  V_ori = V_ori,
  rbar_ori = rbar_ori,
  p_ori = unname(vt_ori$p.value),
  V_opp = V_opp,
  rbar_opp = rbar_opp,
  p_opp = unname(vt_opp$p.value)
)

## # A tibble: 1 × 12
##   n_sig theta_ori theta_opp mu_hat  Rbar rayleigh_p V_ori rbar_ori p_ori V_opp
##   <int>     <dbl>     <dbl>  <dbl> <dbl>      <dbl> <dbl>    <dbl> <dbl> <dbl>
## 1    95      5.26      2.11   2.82 0.619   1.53e-16 -44.8   -0.471 1.000  44.8
## # ℹ 2 more variables: rbar_opp <dbl>, p_opp <dbl>

Correction step

To obtain a biologically anchored estimate, we redefined the contrast directly on the fitted model as the difference between the predicted expression at the origin and at the antipodal position: \[\Delta_{ori}=y(\theta_{ori})-y(\theta_{ori}+\pi)= 2(\beta\cos{\theta_{ori}}+\gamma\sin{\theta_{ori}})\] which preserves the sign (ori higher vs opposite higher) and enforces a consistent ori-based reference across samples.

results <- results %>%
  mutate(
    delta_ori = 2 * (beta*cos(theta_ori) + gamma*sin(theta_ori)),
    delta_opp = -delta_ori
  )

merged_results <- merged_results %>%
  dplyr::left_join(
    dplyr::select(results, sample_id, delta_ori, delta_opp),
    by = "sample_id"
  )

sig_q <- !is.na(results$q_full) & results$q_full < 0.05

delta_summary <- results %>%
  filter(sig_q) %>%
  summarise(
    n_sig = n(),
    median_delta_naive = median(delta, na.rm=TRUE),     # delta = 2A (max–min)
    median_delta_ori   = median(delta_ori, na.rm=TRUE), # ori-anchored contrast
    cor_naive_vs_ori   = cor(delta, delta_ori, use="complete.obs"),
    prop_delta_ori_pos = mean(delta_ori > 0, na.rm=TRUE),
    prop_delta_ori_neg = mean(delta_ori < 0, na.rm=TRUE)
  )

cat("\n### Delta contrast summary (q_full < 0.05)\n")

## 
## ### Delta contrast summary (q_full < 0.05)

delta_summary

## # A tibble: 1 × 6
##   n_sig median_delta_naive median_delta_ori cor_naive_vs_ori prop_delta_ori_pos
##   <int>              <dbl>            <dbl>            <dbl>              <dbl>
## 1    95              0.428           -0.314           -0.166              0.221
## # ℹ 1 more variable: prop_delta_ori_neg <dbl>

cat("\n### Example rows: naive delta (2A) vs corrected delta_ori\n")

## 
## ### Example rows: naive delta (2A) vs corrected delta_ori

results %>%
  dplyr::filter(sig_q) %>%
  dplyr::select(sample_id, q_full, A, delta, delta_ori) %>%
  dplyr::arrange(q_full) %>%
  head(10)

## # A tibble: 10 × 5
##    sample_id  q_full      A delta delta_ori
##    <chr>       <dbl>  <dbl> <dbl>     <dbl>
##  1 p1k_00494 0.00158 0.222  0.445    -0.437
##  2 p1k_00493 0.00243 0.214  0.428    -0.421
##  3 p1k_00080 0.00324 0.203  0.406    -0.395
##  4 p1k_00475 0.00324 0.192  0.384    -0.380
##  5 p1k_00476 0.00324 0.205  0.409    -0.405
##  6 p1k_00052 0.00841 0.0866 0.173     0.173
##  7 p1k_00139 0.00841 0.193  0.385     0.384
##  8 p1k_00877 0.00841 0.354  0.707    -0.658
##  9 p1k_00288 0.00933 0.151  0.303    -0.254
## 10 p1k_00291 0.00933 0.161  0.321    -0.282

Integration with measured growth rates

To evaluate whether the replication-associated expression gradient scales with bacterial growth, we integrated the growth-rate metadata provided for each sample. Growth rate was taken directly from the metadata column “Growth Rate (1/hr)” and converted to numeric. As indicated in the dataset documentation, samples with growth rate equal to 1 correspond to missing measurements and were excluded from downstream analyses. We further restricted the dataset to samples with finite estimates of the fitted effect size (A), the naive amplitude contrast (\(\Delta=2A\)), and the ori-anchored contrast (\(\Delta_{ori}\)). Finally, we defined the subset of statistically supported patterns using the BH-adjusted significance threshold on the full model (\(q_{full}<0.05\)), which is used in the subsequent association analyses with growth rate.

merged_results2 <- merged_results %>%
  dplyr::mutate(
    growth_rate = as.numeric(`Growth Rate (1/hr)`)
  )

## Keep only samples with measured growth rate (exclude sentinel = 1) and finite contrasts
known <- merged_results2 %>%
  dplyr::filter(
    !is.na(growth_rate),
    growth_rate != 1,
    is.finite(delta_ori),
    is.finite(delta),
    is.finite(A)
  )

## Significant fits (full model BH)
mask_q <- !is.na(known$q_full) & known$q_full < 0.05

## Quick look
known %>%
  dplyr::select(sample_id, growth_rate, q_full, A, delta, delta_ori) %>%
  dplyr::arrange(growth_rate) %>%
  head(10)

## # A tibble: 10 × 6
##    sample_id growth_rate q_full      A delta delta_ori
##    <chr>           <dbl>  <dbl>  <dbl> <dbl>     <dbl>
##  1 p1k_00003           0  0.415 0.0796 0.159 -0.143   
##  2 p1k_00004           0  0.463 0.0749 0.150 -0.134   
##  3 p1k_00007           0  0.190 0.124  0.248 -0.240   
##  4 p1k_00008           0  0.198 0.116  0.233 -0.224   
##  5 p1k_00019           0  0.462 0.0752 0.150 -0.0730  
##  6 p1k_00020           0  0.489 0.0707 0.141 -0.0652  
##  7 p1k_00021           0  0.475 0.0599 0.120  0.000967
##  8 p1k_00022           0  0.553 0.0536 0.107  0.00127 
##  9 p1k_00023           0  0.408 0.0760 0.152 -0.134   
## 10 p1k_00024           0  0.385 0.0789 0.158 -0.130

Correlation analysis between replication signal and growth rate

Using the samples with measured growth rate (known), we tested whether the replication-associated signal scales with growth by correlating growth rate with (i) the naive amplitude contrast \(\Delta=2A\) and (ii) the ori-anchored contrast \(\Delta_{ori}\) defined above. Correlations were computed on all samples with growth-rate measurements and on the subset of samples with a significant circular signal (\(q<0.05\)).

In the left plot, (\(\delta=2A\)) mainly tells you how strong the wave is in each sample. Red points (significant fits) tend to have larger delta, which just means the wave is clearer. But growth rate does not increase in a clear way as delta increases. In the right plot, delta_ori tells you which side is higher. The positive positive side are the ones higher at the origin side, and the negative are higher on the opposite side.

Most non-significant samples sit near 0 (no clear difference). Significant samples spread away from 0, and many are negative, matching the earlier result that the peak is often opposite to the origin proxy. Therefore we keep using delta_ori, because it measures the specific “origin vs opposite” difference, not just the strength of the wave.

## # A tibble: 4 × 4
##   set         metric      pearson spearman
##   <chr>       <chr>         <dbl>    <dbl>
## 1 ALL         delta (=2A) -0.0315 -0.0654 
## 2 ALL         delta_ori    0.383   0.360  
## 3 q_full<0.05 delta (=2A)  0.110  -0.00949
## 4 q_full<0.05 delta_ori    0.816   0.768

Cross-validation of predictive models

To quantify how informative the replication-related metrics are about bacterial growth, we framed the problem as a prediction task: given a sample’s replication signal estimated from the circular model, can we predict its measured growth rate \(\mu\) (1/hr)? We evaluated simple linear regression models because the project’s theoretical expectation suggests an approximately monotonic relationship between copy-number gradients and growth, and linear models are interpretable and allow a clean comparison between alternative replication metrics.

We compared four model specifications:

\(m_A\): \(\mu \sim A\), where \(A=\sqrt{\beta^2+\gamma^2}\) is the fitted amplitude (effect size) of the sinusoid.
\(m_\Delta\): \(\mu \sim \Delta\), where \(\Delta = 2A\) is the naive peak-to-trough contrast of the fitted sinusoid (phase-free).
\(m_{\Delta_{\mathrm{ori}}}\): \(\mu \sim \Delta_{\mathrm{ori}}\), where \[\Delta_{\mathrm{ori}} = y(\theta_{\mathrm{ori}})-y(\theta_{\mathrm{ori}}+\pi)=2\big(\beta\cos\theta_{\mathrm{ori}}+\gamma\sin\theta_{\mathrm{ori}}\big)\] is the ori-anchored contrast (signed), directly addressing the biological question “origin vs opposite/terminus”.
\(m_{\Delta_{\mathrm{ori}}+A}\): \(\mu \sim \Delta_{\mathrm{ori}} + A\), to test whether adding “overall signal strength” (\(A\)) improves prediction beyond the ori-anchored contrast.

Because \(\Delta = 2A\), models \(m_A\) and \(m_\Delta\) are expected to behave identically (up to rescaling), and are effectively two views of the same phase-free effect-size information.

Then, to estimate out-of-sample performance, we used repeated 5-fold cross-validation (50 repetitions). In each repetition, samples were randomly split into 5 folds; each fold was predicted by a model trained on the other 4 folds, and predictions were pooled across folds to compute performance metrics. We report:

RMSE: \(\sqrt{\frac{1}{n}\sum_i(\mu_i-\hat{\mu}_i)^2}\)
MAE: \(\frac{1}{n}\sum_i|\mu_i-\hat{\mu}_i|\)
\(R^2\): \(1-\frac{\sum_i(\mu_i-\hat{\mu}_i)^2}{\sum_i(\mu_i-\bar{\mu})^2}\)

We evaluated performance on two datasets: First , all samples with valid growth-rate measurements (excluding the sentinel value \(\mu=1\) used for missing growth rates), and the second one sig_q: the subset with a statistically supported circular signal (\(q_{\mathrm{full}}<0.05\)). In the full dataset, the ori-anchored contrast \(\Delta_{\mathrm{ori}}\) provides measurable predictive value (positive \(R^2\)), whereas phase-free effect-size predictors (\(A\) or \(\Delta\)) perform worse than predicting the mean growth rate (negative \(R^2\)). This indicates that, once phase is ignored, “how strong the wave is” does not reliably map to growth across heterogeneous conditions; anchoring the contrast to ori carries substantially more growth-relevant information.

Restricting to samples with a clear circular replication signal strengthens the relationship: \(\Delta_{\mathrm{ori}}\) explains a substantial fraction of out-of-sample variance (\(R^2 \approx 0.59\)). Adding \(A\) slightly worsens performance, suggesting that once the contrast is correctly oriented (ori vs opposite), additional “overall amplitude” does not add independent predictive information and may introduce noise given the small sample size.

These results are consistent with the earlier directionality analysis: in this dataset the fitted peak location is not reliably aligned with \(\theta_{\mathrm{ori}}\), so phase-free metrics (\(A\) or \(\Delta\)) are not appropriate proxies for an origin–terminus gradient. In contrast, \(\Delta_{\mathrm{ori}}\) directly measures the fitted expression difference between ori and the opposite genomic position, preserving direction and remaining interpretable even when the peak is shifted. Therefore, in the remainder of the analysis we treat \(\Delta_{\mathrm{ori}}\) as the primary replication signal when relating the fitted sinusoid to growth rate.

# datasets
mask_q <- !is.na(known$q_full) & known$q_full < 0.05
datasets <- list(
  ALL = known,
  sig_q = known[mask_q, , drop = FALSE]
)

# models (keep it small and interpretable)
models <- list(
  delta_ori   = growth_rate ~ delta_ori,
  delta       = growth_rate ~ delta,
  A           = growth_rate ~ A,
  delta_ori_A = growth_rate ~ delta_ori + A
)

# repeated 5-fold CV
cv_rep <- function(df, form, k = 5, R = 50, seed = 1) {
  set.seed(seed)
  n <- nrow(df)
  out <- vector("list", R)

  for (r in seq_len(R)) {
    fold <- sample(rep(seq_len(k), length.out = n))
    y <- df$growth_rate
    pred <- rep(NA_real_, n)

    for (f in seq_len(k)) {
      fit <- lm(form, data = df[fold != f, , drop = FALSE])
      pred[fold == f] <- predict(fit, newdata = df[fold == f, , drop = FALSE])
    }

    res <- y - pred
    rmse <- sqrt(mean(res^2, na.rm = TRUE))
    mae  <- mean(abs(res), na.rm = TRUE)
    r2   <- 1 - sum(res^2, na.rm = TRUE) / sum((y - mean(y, na.rm = TRUE))^2, na.rm = TRUE)

    out[[r]] <- tibble(RMSE = rmse, MAE = mae, R2 = r2)
  }

  dplyr::bind_rows(out)
}

# run CV
cv_all <- list()
for (ds in names(datasets)) {
  for (m in names(models)) {
    cv_all[[paste(ds, m, sep = "__")]] <-
      cv_rep(datasets[[ds]], models[[m]]) %>%
      dplyr::mutate(dataset = ds, model = m, n = nrow(datasets[[ds]]))
  }
}
cv_df <- dplyr::bind_rows(cv_all)

# summary table (main deliverable)
cv_summary <- cv_df %>%
  dplyr::group_by(dataset, model, n) %>%
  dplyr::summarise(
    RMSE = mean(RMSE, na.rm = TRUE),
    MAE  = mean(MAE,  na.rm = TRUE),
    R2   = mean(R2,   na.rm = TRUE),
    .groups = "drop"
  ) %>%
  dplyr::arrange(dataset, RMSE)

cv_summary

## # A tibble: 8 × 6
##   dataset model           n  RMSE   MAE      R2
##   <chr>   <chr>       <int> <dbl> <dbl>   <dbl>
## 1 ALL     delta_ori     353 0.244 0.179  0.134 
## 2 ALL     delta_ori_A   353 0.245 0.179  0.128 
## 3 ALL     A             353 0.265 0.193 -0.0152
## 4 ALL     delta         353 0.265 0.193 -0.0152
## 5 sig_q   delta_ori      27 0.214 0.145  0.594 
## 6 sig_q   delta_ori_A    27 0.224 0.159  0.552 
## 7 sig_q   A              27 0.373 0.300 -0.241 
## 8 sig_q   delta          27 0.373 0.300 -0.241

# RMSE distributions (compact visual comparison)
library(ggplot2)
library(dplyr)

cv_df$model <- factor(cv_df$model,
                      levels = c("A","delta","delta_ori","delta_ori_A"))

p <- cv_df %>%
  filter(dataset %in% c("ALL","sig_q")) %>%
  mutate(dataset = ifelse(dataset=="ALL",
                          "ALL samples",
                          "Significant (q < 0.05)")) %>%
  ggplot(aes(x=model, y=RMSE, fill=model)) +
  geom_boxplot(alpha=0.8, outlier.alpha=0.3) +
  facet_wrap(~dataset, scales="free_y") +
  scale_fill_brewer(palette="Blues") +
  theme_minimal(base_size = 13) +
  theme(
    legend.position="none",
    axis.text.x = element_text(angle=45, hjust=1),
    panel.grid.minor = element_blank()
  ) +
  labs(
    title = "Model comparison via repeated cross-validation",
    y = "RMSE",
    x = "Model"
  )

print(p)

Final model fitting

Based on the cross-validation results, we retained the simplest and best-performing specification and fitted the final model on the full dataset \(growth_rate \sim \Delta_{ori}\).

Across all samples with measured growth rate (\(n=353\)), \(\Delta_{ori}\) shows a clear positive association with growth (slope \(=0.660\), \(p=9.35\times 10^{-14}\)), explaining a modest but non-negligible fraction of variance (\(R^2=0.146\)). This is expected in a heterogeneous compendium where many factors beyond replication-related gene dosage contribute to growth.

We then fitted the same model on a higher-confidence subset restricted to samples with a statistically significant circular signal (\(q_{full}<0.05\); \(n=27\)). In this subset the relationship strengthens substantially (slope \(=0.881\), \(p=2.19\times 10^{-7}\)), and the model explains roughly two thirds of the observed growth-rate variance (\(R^2=0.665\)). Overall, these final fits confirm that the ori-anchored replication contrast \(\Delta_{ori}\) is an interpretable predictor of growth, and that its explanatory power is highest when the replication-associated sinusoidal signal is clearly detectable in the transcriptome.

# Final model (ALL samples with growth rate)
fit_all <- lm(growth_rate ~ delta_ori, data = known)

# High-confidence subset (q_full < 0.05)
mask_q <- !is.na(known$q_full) & known$q_full < 0.05
fit_sigq <- lm(growth_rate ~ delta_ori, data = known[mask_q, , drop = FALSE])

summary(fit_all)

## 
## Call:
## lm(formula = growth_rate ~ delta_ori, data = known)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63878 -0.10467  0.00721  0.11575  0.78458 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.63794    0.01369   46.58  < 2e-16 ***
## delta_ori    0.66021    0.08508    7.76 9.35e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2434 on 351 degrees of freedom
## Multiple R-squared:  0.1464, Adjusted R-squared:  0.144 
## F-statistic: 60.22 on 1 and 351 DF,  p-value: 9.345e-14

summary(fit_sigq)

## 
## Call:
## lm(formula = growth_rate ~ delta_ori, data = known[mask_q, , 
##     drop = FALSE])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30299 -0.09349 -0.01470  0.02464  0.49837 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.74656    0.04079  18.301 5.43e-16 ***
## delta_ori    0.88111    0.12503   7.047 2.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2018 on 25 degrees of freedom
## Multiple R-squared:  0.6652, Adjusted R-squared:  0.6518 
## F-statistic: 49.66 on 1 and 25 DF,  p-value: 2.187e-07

Out-of-fold predictions and regression diagnostics

As an extra check on the final model (\(growth\_rate \sim \Delta_{ori}\)), we generated out-of-fold (OOF) predictions with a single 5-fold split (one prediction per sample, always from a model trained without that sample). This complements the repeated CV summary by showing how the errors look, not only how big they are.

The cloud follows the expected upward trend, so \(\Delta_{ori}\) carries real signal. However, predictions are pulled toward the average: very low growth tends to be overestimated and very high growth tends to be underestimated. This “regression to the mean” is typical for a simple one-feature linear model.

Residuals are centered around zero with no strong curved pattern, which supports a roughly linear relationship at this resolution. The spread is not perfectly constant and a few points sit far from zero, suggesting remaining variability driven by factors not included in the model (condition-to-condition differences, noise, or other biology beyond replication signal).

Residuals are broadly symmetric but not perfectly tight, consistent with a model that captures a real effect while leaving substantial unexplained variation in the full dataset.

Overall, the diagnostics match what we saw in cross-validation: \(\Delta_{ori}\) is a meaningful predictor, but a single linear term cannot fully explain growth across all experimental conditions.

set.seed(123)

df <- known
k  <- 5

# Assign folds once, produce one OOF prediction per sample
fold_id <- sample(rep(1:k, length.out = nrow(df)))
pred_oof <- rep(NA_real_, nrow(df))

for (f in 1:k) {
  fit <- lm(growth_rate ~ delta_ori, data = df[fold_id != f, , drop=FALSE])
  pred_oof[fold_id == f] <- predict(fit, newdata = df[fold_id == f, , drop=FALSE])
}

resid_oof <- df$growth_rate - pred_oof

rmse_oof <- sqrt(mean(resid_oof^2, na.rm=TRUE))
mae_oof  <- mean(abs(resid_oof), na.rm=TRUE)
r2_oof   <- 1 - sum(resid_oof^2, na.rm=TRUE) / 
  sum((df$growth_rate - mean(df$growth_rate, na.rm=TRUE))^2, na.rm=TRUE)

tibble(
  model = "OOF 5-fold | growth_rate ~ delta_ori",
  RMSE = rmse_oof,
  MAE  = mae_oof,
  R2   = r2_oof
)

## # A tibble: 1 × 4
##   model                                 RMSE   MAE    R2
##   <chr>                                <dbl> <dbl> <dbl>
## 1 OOF 5-fold | growth_rate ~ delta_ori 0.244 0.179 0.138

# Plots
cols <- col_q_fun(pmin(pmax(df$q_full, 0), 1))

par(mfrow=c(1,3), mar=c(4,4,2,1))

plot(df$growth_rate, pred_oof,
     pch=16, col=cols,
     xlab="Observed growth rate (1/hr)",
     ylab="OOF predicted growth rate (1/hr)",
     main="OOF predicted vs observed")
abline(0, 1, lty=2, lwd=2)

plot(pred_oof, resid_oof, pch=16, col="grey50",
     xlab="OOF fitted", ylab="Residual",
     main="Residuals vs fitted (OOF)")
abline(h=0, lty=2)

hist(resid_oof, breaks=40, col="grey90", border="grey60",
     main="Residuals (OOF)", xlab="Residual")

par(mfrow=c(1,1))

Classification performance: ROC and AUC analysis

To evaluate whether the replication signal can separate fast vs slow growth, we turned growth rate into a binary label: fast is the top 25% of observed growth rates, slow the remaining 75%.

We then used the OOF predicted growth rates as a continuous score and computed an ROC curve across all thresholds. The AUC summarizes discrimination: it is the probability that a randomly chosen fast sample gets a higher score than a randomly chosen slow sample (\(AUC=0.5\) is random, \(AUC \to 1\) is strong separation).

The ROC curve lies consistently above the diagonal, indicating genuine out-of-sample discriminative signal. However, the separation is only moderate (AUC ≈ 0.69): many fast and slow samples overlap in predicted scores, so achieving high sensitivity requires accepting a non-trivial false-positive rate. Overall, \(\Delta_{ori}\) captures enough replication-related information to rank growth phenotypes better than chance, but it is not a strong standalone classifier in this dataset.

## Type 'citation("pROC")' for a citation.

## 
## Adjuntando el paquete: 'pROC'

## The following object is masked from 'package:circular':
## 
##     var

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

## 
## AUC (fast = top 25% by observed growth): 0.6884576

Prediction of growth rates for unlabeled samples.

Samples with missing growth-rate measurements were identified using the sentinel value \(growth_rate = 1\). For these samples, we predicted growth rate from the ori-anchored replication contrast \(\Delta_{ori}\) using the final regression model fitted on all labeled samples (\(growth_rate \sim \Delta_{ori}\)). We also computed 95% prediction intervals to reflect sample-level uncertainty. For unlabeled samples that also show a statistically supported circular signal (\(q_{full}<0.05\)), we additionally generated predictions from the model fitted only on the significant subset (same specification, trained on high-confidence samples). The output table reports the final prediction, its interval, and which model generated it.

## # A tibble: 20 × 8
##    sample_id q_full delta_ori     A pred_final pred_final_lo pred_final_hi
##    <chr>      <dbl>     <dbl> <dbl>      <dbl>         <dbl>         <dbl>
##  1 p1k_00486 0.0222     0.542 0.297       1.22         0.770          1.68
##  2 p1k_00485 0.0224     0.542 0.296       1.22         0.770          1.68
##  3 p1k_00462 0.0330     0.504 0.319       1.19         0.739          1.64
##  4 p1k_00461 0.0336     0.500 0.317       1.19         0.737          1.64
##  5 p1k_00471 0.0312     0.493 0.308       1.18         0.731          1.63
##  6 p1k_00826 0.157      0.812 0.454       1.17         0.673          1.67
##  7 p1k_00472 0.0378     0.468 0.300       1.16         0.711          1.61
##  8 p1k_00840 0.190      0.753 0.416       1.13         0.637          1.63
##  9 p1k_00841 0.181      0.750 0.418       1.13         0.636          1.63
## 10 p1k_00830 0.299      0.680 0.407       1.09         0.592          1.58
## 11 p1k_00831 0.300      0.670 0.399       1.08         0.586          1.57
## 12 p1k_00839 0.341      0.644 0.366       1.06         0.570          1.56
## 13 p1k_00845 0.361      0.613 0.369       1.04         0.551          1.53
## 14 p1k_00835 0.336      0.605 0.359       1.04         0.545          1.53
## 15 p1k_00834 0.336      0.596 0.356       1.03         0.540          1.52
## 16 p1k_00865 0.0202     0.319 0.306       1.03         0.591          1.46
## 17 p1k_00838 0.377      0.584 0.340       1.02         0.533          1.51
## 18 p1k_00851 0.310      0.582 0.367       1.02         0.531          1.51
## 19 p1k_00776 0.0360     0.308 0.161       1.02         0.582          1.45
## 20 p1k_00850 0.317      0.573 0.363       1.02         0.526          1.51
## # ℹ 1 more variable: pred_source <chr>

## # A tibble: 1 × 6
##   n_unknown n_sig_q n_all pred_min pred_med pred_max
##       <int>   <int> <int>    <dbl>    <dbl>    <dbl>
## 1       682      68   614    0.161    0.577     1.22

Visualization of circular expression profiles

To make the circular model easier to interpret, we plotted three representative samples: (i) a fast-growing sample with a significant circular replication signal, (ii) a slow / weak-signal sample, and (iii) one unlabeled sample (missing growth rate) when available.

For each sample we show three panels:

Expression vs genomic angle (\(\theta\)): binned mean logTPM across the genome with the fitted sinusoid, plus reference lines at \(\theta_{ori}\) and \(\theta_{ori}+\pi\) (opposite side), and the fitted \(\theta_{max}\) / \(\theta_{min}\).
Polar view: the same information mapped to a circle to emphasize genome continuity.
Compact summary: signal strength and orientation (\(A\), \(\Delta=2A\), \(\Delta_{ori}\)), model significance (\(q_{full}\)), observed growth rate when available, and the predicted growth rate with a 95% prediction interval. Predictions follow the same rule used before: if \(q_{full}<0.05\) we use the \(sig\_q\) model; otherwise we use the ALL model.

These plots provide an intuitive link between the fitted circular expression gradient and how \(\Delta_{ori}\) is used to predict growth.

Report Advanced Genomics

Joel Frayle Moreno- Francesco Mazza

2026-02-11