Convolutional Neural Network Insights for Experts

Most people stop at “CNNs are translation invariant.” But that’s only the surface. CNNs don’t just learn pixels—they learn Lie group representations, projecting signals into symmetry-aware spaces. Convolution is group convolution, pooling is projection to invariants, and filters rediscover the optimal bases found in nature. At their core, CNNs are not just neural networks—they are mathematical operators rooted in geometry, representation theory, and physics.

by Amarnath Pandey
7 min read

🔍 Insight #1: CNNs are not learning pixels. They are learning Lie group representations.

Most practitioners stop at "CNNs are translation invariant". But why is translation invariance so important?

Because natural signals (images, videos, audio) live on low-dimensional manifolds invariant under transformations—rotations, translations, scalings.

This brings us to Lie groups—continuous groups of transformations.

Convolution exploits the group structure of translation, i.e., it is equivariant to the action of the translation group $\mathbb{R}^2$.

And CNNs? They are essentially learning representations of signals under a group action: Fourier bases for $\mathbb{R}$, wavelets for $L^2(\mathbb{R})$, and CNNs for 2D images under SE(2) or more complex transformation groups.

In other words:

  • Convolution = group convolution over the translation group

  • Pooling = projection to invariants (e.g., via Haar integration over the group)

This is the mathematical soul of CNNs—rooted in representation theory and harmonic analysis.

This is why generalized convolutions like spherical CNNs, graph CNNs, and group-equivariant CNNs (e.g., by Taco Cohen) exist—they generalize standard CNNs by replacing the translation group with other symmetry groups.
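
To make the equivariance claim concrete, here is a minimal NumPy sketch (my own illustration, not from any library; periodic boundaries are assumed so that translations act as an exact group on the pixel grid): convolving a shifted image equals shifting the convolved image.

```python
import numpy as np

def conv2d_periodic(image, kernel):
    # Cross-correlate `image` with `kernel` under wrap-around boundaries,
    # so translations form an exact group action on the grid.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for u in range(kh):
                for v in range(kw):
                    out[i, j] += kernel[u, v] * image[(i + u) % H, (j + v) % W]
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))
ker = rng.standard_normal((3, 3))

shift = (2, 5)                                                # a translation tau_x
lhs = conv2d_periodic(np.roll(img, shift, axis=(0, 1)), ker)  # T(tau_x f)
rhs = np.roll(conv2d_periodic(img, ker), shift, axis=(0, 1))  # tau_x (T f)
print(np.allclose(lhs, rhs))                                  # True: equivariant
```

Swap the translation group for rotations or permutations and the same check motivates the generalized architectures above.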

🔍 Insight #2: CNNs are localized, shift-equivariant linear operators.

Let’s formalize this.

Any layer in a CNN applies a linear operator $T$ followed by a nonlinearity $\phi$. The operator $T$ satisfies:

$$T(\tau_x f) = \tau_x (T f)$$

where $\tau_x$ is a shift (translation) operator.

Such operators are convolutional. That is:

All linear, shift-equivariant operators are convolutions.

(a classical result of harmonic analysis: shift-equivariant linear operators are diagonalized by the Fourier basis, which is the content of the Convolution Theorem)

This is not a convenient accident—it is a deep algebraic constraint. CNNs are essentially parameter-efficient approximators of a certain class of functions with symmetry constraints.
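
Here is a small self-contained sketch of that constraint (purely illustrative; the operator and sizes are made up): a "black box" linear map that is diagonal in the Fourier basis, and hence commutes with circular shifts, turns out to act exactly as circular convolution with its impulse response $h = T(\delta)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64

# A "mystery" linear operator: scale the signal's spectrum by a fixed mask.
# Being diagonal in the Fourier basis, it commutes with circular shifts.
mask = rng.standard_normal(n) + 1j * rng.standard_normal(n)

def T(f):
    return np.real(np.fft.ifft(np.fft.fft(f) * mask))

def circular_conv(f, h):
    # Direct O(n^2) circular convolution, no FFT shortcuts.
    m = len(f)
    return np.array([sum(h[k] * f[(i - k) % m] for k in range(m)) for i in range(m)])

delta = np.zeros(n)
delta[0] = 1.0
h = T(delta)                                   # impulse response of the operator

f = rng.standard_normal(n)
print(np.allclose(T(f), circular_conv(f, h)))  # True: T acts as a convolution
```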

🔍 Insight #3: CNNs are hierarchical tensor factorizations over spatial statistics.

Forget filters for a moment.

Consider this: images are highly structured. Pixels are not i.i.d.—they have strong local dependencies.

CNNs model this structure by creating low-rank tensor factorizations in space:

  • First layer: local statistics (edges, gradients)

  • Higher layers: compositions of local structures (parts → objects)

If you stack fully connected layers, you need $O(n^2)$ parameters for $n$ input pixels (a quick count is sketched below).

CNNs, by imposing:

  1. locality

  2. parameter sharing

  3. hierarchical composition

...essentially force a low-rank inductive bias. They are not universal approximators in raw form, but they are universal over the class of compositional functions with local structure.
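
For a rough sense of scale (sizes here are hypothetical, chosen only for illustration), compare the weight counts for mapping a single-channel image to a same-sized feature map:

```python
n = 224          # spatial resolution, so n * n input pixels
k = 3            # kernel size

dense_params = (n * n) ** 2   # fully connected: every output pixel sees every input pixel
conv_params = k * k           # convolutional: one shared local kernel

print(f"fully connected: {dense_params:,} weights")  # 2,517,630,976
print(f"convolutional:   {conv_params} weights")     # 9
```

That gap, nine shared weights versus billions, is the low-rank inductive bias in action.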

This aligns with the Manifold Hypothesis: natural data lies on low-dimensional manifolds embedded in high-dimensional space.

CNNs build this manifold progressively, respecting local symmetries.

🔍 Insight #4: CNNs learn causal fields, not just correlations.

Think about what a convolution does.

You take a patch of the input, you apply a filter, and the output represents something caused by that patch.

But this is just correlation, right?

No. Training transforms these filters into approximations of causal fields.

When trained on supervised tasks, filters evolve to model predictive causes of the output—edges that suggest object boundaries, textures that imply semantic categories.

In effect, CNNs under supervision learn to map statistical regularities into causally relevant latent variables.

This connects CNNs to causal inference under certain symmetry assumptions.

🔍 Insight #5: Pooling approximates integration over orbits of the transformation group.

This is deep.

We said convolution is equivariant to translation:

$$T(\tau_x f) = \tau_x (T f)$$

But we want invariance (e.g., classification should not depend on exact location).

Pooling, in this view, acts as an approximate integral over the group orbit. It’s an operation that maps equivariant representations into invariant features.

If $G$ is your group and $f$ your feature map:

$$\text{Invariant feature} = \int_G f(g \cdot x) \, dg$$

CNNs do this approximately via:

  • Max pooling: pick the most responsive element (approx. of supremum over orbit)

  • Average pooling: true group average

So pooling isn’t just downsampling—it’s a crude invariant map under group action.
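
A tiny sketch of this view (assuming global pooling and periodic shifts, where the invariance is exact; strided local pooling only approximates it):

```python
import numpy as np

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 8))               # an equivariant feature map
shifted = np.roll(feat, (3, 1), axis=(0, 1))     # group action: translate the map

# Global average pooling = the group average over the translation orbit.
print(np.isclose(feat.mean(), shifted.mean()))   # True: exactly invariant

# Global max pooling = the supremum over the orbit, also shift-invariant.
print(np.isclose(feat.max(), shifted.max()))     # True
```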

🔍 Insight #6: Filters in early layers converge to Gabor-like functions—not coincidentally.

You’ve seen it. Every trained CNN ends up with first-layer filters that look like:

  • Edge detectors

  • Orientation-sensitive blobs

These are not artifacts.

They are the optimal basis functions for natural image statistics under the assumptions of:

  • Locality

  • Energy minimization

  • Redundancy reduction

This matches Gabor filters, wavelets, and the receptive fields found in V1 of the visual cortex.

That is: CNNs rediscover neuroscience and signal processing under a differentiable optimization landscape.
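
For reference, here is one standard parameterization of a (real) Gabor filter in NumPy; the parameter values are illustrative defaults, not fitted to any trained network:

```python
import numpy as np

def gabor_kernel(size=11, sigma=2.0, theta=0.0, wavelength=4.0, psi=0.0, gamma=0.5):
    # Real part of a 2D Gabor filter: a Gaussian envelope modulating an
    # oriented sinusoidal carrier, much like trained first-layer kernels.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot**2 + (gamma * y_rot)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + psi)
    return envelope * carrier

# A small bank of orientation-selective filters, one per angle.
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
print(bank[0].shape)  # (11, 11)
```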

🔍 Final Insight: CNNs are discrete approximations to continuous differential operators.

In the limit of small kernels and unit stride, a convolution becomes a discrete approximation to a differential operator.

In fact, certain convolutions approximate:

  • Laplacians → image diffusion

  • Sobel filters → spatial gradients

This means CNNs can be understood as learned PDE solvers over image manifolds.

In future theories of AI, CNNs will be seen not as “neural networks” but as numerical solvers on learned geometric spaces.
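
To ground that reading, here is a short sketch using SciPy's `correlate2d` (cross-correlation, which is what deep-learning "convolutions" actually compute): the textbook 5-point Laplacian stencil runs one explicit step of the heat equation (image diffusion), and a Sobel stencil computes a horizontal spatial gradient. These are fixed classics that a learned $3 \times 3$ kernel can approximate.

```python
import numpy as np
from scipy.signal import correlate2d

laplacian = np.array([[0.,  1., 0.],
                      [1., -4., 1.],
                      [0.,  1., 0.]])            # 5-point discrete Laplacian
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])              # horizontal gradient stencil

rng = np.random.default_rng(3)
u = rng.standard_normal((32, 32))                # treat the image as a scalar field

grad_x = correlate2d(u, sobel_x, mode="same", boundary="wrap")

# One explicit Euler step of du/dt = Laplacian(u): isotropic diffusion
# implemented as a convolution over the image.
dt = 0.1
u_next = u + dt * correlate2d(u, laplacian, mode="same", boundary="wrap")
print(grad_x.shape, u_next.shape)                # (32, 32) (32, 32)
```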

🧠 TL;DR: What only top researchers deeply understand

  1. CNNs = group-equivariant operators — They learn structured representations respecting the symmetries of input data.

  2. Filters = basis functions for local patches — They act like learned Fourier/wavelet bases in hierarchical space.

  3. Pooling = projection to invariants under group actions — The key to generalization.

  4. Architecture = inductive bias — Locality + hierarchy enables CNNs to mimic natural signal structures.

  5. Training = energy minimization on manifold — CNNs implicitly encode priors about the world geometry.

  6. Vision = physics — CNNs are learning differential approximations over fields of causal interaction.

If you want to go even further, we can discuss:

  • Gauge-equivariant neural networks

  • Category theory view of deep models

  • Connections to spectral graph theory and Laplacians

  • Information Bottleneck Principle applied to CNNs

Let me know if you want to go there.

You said:

I[i + x − 1, j + y − 1, z]

why are you subtracting 1 from i + x?

ChatGPT said:

Excellent catch. You're referring to this term in the convolution sum:

$$I[i + x - 1,\ j + y - 1,\ z]$$

And you're asking:

Why the -1?

Why are we not simply indexing as $I[i + x, j + y, z]$?

🔎 The Deep Reason: It’s About Centering the Kernel

Let's start from the beginning:

Assume:

  • Input $I$ of size $m_1 \times m_2 \times m_c$

  • Kernel $K$ of size $n_1 \times n_2 \times m_c$

  • You want to compute the convolution output at location $(i, j)$

Now when we apply convolution, we’re placing the center of the kernel at position $(i, j)$ in the input. But arrays in code (and math) index from the top-left corner, not the center.

So to align the center of the kernel over input point $(i, j)$, we must offset all the coordinates.

Here’s the key idea:

Let’s say the kernel has size $n_1 = 3$, $n_2 = 3$, i.e., it spans:

  • from $-1$ to $+1$ in both directions, if we treat its center as the origin

So if your filter is:

$$\begin{bmatrix} K[-1,-1] & K[-1,0] & K[-1,1] \\ K[0,-1] & K[0,0] & K[0,1] \\ K[1,-1] & K[1,0] & K[1,1] \end{bmatrix}$$

You want to compute the sum:

$$f[i, j] = \sum_{x=-1}^{1} \sum_{y=-1}^{1} K[x, y] \cdot I[i + x, j + y]$$

But if your kernel is stored as a 2D array from $K[0,0]$ to $K[2,2]$ (as is common in code), then:

  • $K[0,0]$ refers to the top-left entry

  • $K[1,1]$ is the center

So to convert this to math that uses array indices starting from 0, we need to shift:

  • $x' = x - \text{center}_x = x - \lfloor n_1/2 \rfloor$

  • $y' = y - \text{center}_y = y - \lfloor n_2/2 \rfloor$

This gives:

$$I[i + x - 1,\ j + y - 1]$$

if $n_1 = n_2 = 3$, and the kernel center is at $(1, 1)$.
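
A quick numerical check of the two indexing conventions (single channel and arbitrary values, purely for illustration): summing with center-based offsets $x, y \in \{-1, 0, 1\}$ and summing with corner-based indices $x, y \in \{0, 1, 2\}$ minus one give the same result.

```python
import numpy as np

rng = np.random.default_rng(4)
I = rng.standard_normal((8, 8))      # input (single channel for brevity)
K = rng.standard_normal((3, 3))      # kernel with indices 0..2, center at (1, 1)

i, j = 4, 4                          # an interior output location

# Version 1: kernel indexed from its center, offsets in {-1, 0, +1}.
centered = sum(K[x + 1, y + 1] * I[i + x, j + y]
               for x in (-1, 0, 1) for y in (-1, 0, 1))

# Version 2: kernel indexed from its top-left corner, x, y in {0, 1, 2};
# the "-1" shifts the window so K's center lands on I[i, j].
corner = sum(K[x, y] * I[i + x - 1, j + y - 1]
             for x in range(3) for y in range(3))

print(np.isclose(centered, corner))  # True: the -1 is just the centering offset
```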

✅ TL;DR

  • We subtract 1 from $i + x$ to shift the kernel’s origin from the top-left corner to its center.

  • This centers the kernel correctly over the pixel $I[i, j]$ during convolution.

  • It ensures symmetry and correct alignment during feature extraction.