Methodology

Technical details of the interpretable splicing prediction model

1 Input Features

The model takes a 90-nucleotide sequence (a 70-nt exon plus 10 nt of flanking sequence on each side) and extracts three types of features:

Sequence One-Hot Encoding

Each nucleotide is encoded as a 4-dimensional vector:

A → [1,0,0,0]
C → [0,1,0,0]
G → [0,0,1,0]
T → [0,0,0,1]

Shape: 90 × 4 matrix
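As a concrete illustration, this encoding can be built with a few lines of NumPy. This is a minimal sketch rather than the project's actual code; the function name one_hot_sequence and the NUC_INDEX mapping are illustrative.

import numpy as np

# Map each nucleotide to its one-hot column (order matches the table above).
NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_sequence(seq: str) -> np.ndarray:
    """Encode a 90-nt sequence as a 90 x 4 one-hot matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        mat[i, NUC_INDEX[base]] = 1.0
    return mat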

Structure One-Hot Encoding

RNA secondary structure is predicted using ViennaRNA (RNAfold) and encoded:

. (unpaired) → [1,0,0]
( (paired with a downstream base) → [0,1,0]
) (paired with an upstream base) → [0,0,1]

Shape: 90 × 3 matrix
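The dot-bracket string returned by RNAfold can be encoded the same way. This continues the sketch above; one_hot_structure and DB_INDEX are illustrative names.

# Encode the dot-bracket string with the three-symbol alphabet shown above.
DB_INDEX = {".": 0, "(": 1, ")": 2}

def one_hot_structure(db: str) -> np.ndarray:
    """Encode a 90-character dot-bracket string as a 90 x 3 one-hot matrix."""
    mat = np.zeros((len(db), 3), dtype=np.float32)
    for i, ch in enumerate(db):
        mat[i, DB_INDEX[ch]] = 1.0
    return mat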

Wobble Pair Indicators

G-U (wobble) base pairs are weaker than Watson-Crick pairs. A binary indicator marks positions involved in wobble pairing:

1 if position is in a G-U pair, 0 otherwise

Shape: 90 × 1 vector

Total Input: 90 × 8 features per sequence (4 sequence + 3 structure + 1 wobble)
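The wobble channel requires the pairing partners implied by the dot-bracket string, which can be recovered with a stack. The sketch below builds on the two helpers above and assumes a DNA alphabet, so G-U wobble pairs appear as G-T; build_features then assembles the full 90 × 8 input.

def wobble_indicator(seq: str, db: str) -> np.ndarray:
    """Mark positions that sit in a G-U wobble pair (G-T in the DNA alphabet)."""
    seq = seq.upper().replace("U", "T")
    flags = np.zeros((len(seq), 1), dtype=np.float32)
    stack = []
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)          # opening partner of a base pair
        elif ch == ")":
            j = stack.pop()          # matching opening position
            if {seq[i], seq[j]} == {"G", "T"}:
                flags[i, 0] = flags[j, 0] = 1.0
    return flags

def build_features(seq: str, db: str) -> np.ndarray:
    """Concatenate sequence, structure, and wobble channels into 90 x 8."""
    return np.concatenate(
        [one_hot_sequence(seq), one_hot_structure(db), wobble_indicator(seq, db)],
        axis=1,
    )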

2 Model Architecture

The model uses a dual-branch architecture designed for interpretability: each branch computes per-position "energy" contributions toward exon inclusion or skipping. A code sketch of the full architecture follows the component summaries below.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Input: 90nt sequence                          │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
     ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
     │  Sequence   │  │  Structure  │  │   Wobble    │
     │  (90 × 4)   │  │  (90 × 3)   │  │  (90 × 1)   │
     └─────────────┘  └─────────────┘  └─────────────┘
              │               │               │
              ▼               │               │
     ┌─────────────┐          │               │
     │   Conv1D    │          │               │
     │ 20 filters  │          │               │
     │  width=6    │          ▼               │
     └─────────────┘  ┌─────────────┐         │
              │       │   Conv1D    │         │
              │       │  8 filters  │         │
              │       │  width=30   │         │
              │       └─────────────┘         │
              │               │               │
              └───────────────┼───────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │   Position-Specific Biases    │
              │   (90 inclusion + 90 skip)    │
              └───────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
     ┌─────────────────┐             ┌─────────────────┐
     │   E_inclusion   │             │    E_skipping   │
     │   (sum over     │             │   (sum over     │
     │   positions)    │             │   positions)    │
     └─────────────────┘             └─────────────────┘
              │                               │
              └───────────────┬───────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │      ΔE = E_inc - E_skip      │
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │   Residual Tuner (MLP)        │
              │   Adds non-linear correction  │
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │         σ(ΔE + residual)      │
              │         → PSI (0-1)           │
              └───────────────────────────────┘

Sequence Branch

  • Conv1D with 20 filters
  • Kernel width: 6 nucleotides
  • Detects sequence motifs
  • No activation (linear)

Structure Branch

  • Conv1D with 8 filters
  • Kernel width: 30 nucleotides
  • Captures structural context
  • Wider receptive field

Position Biases

  • 90 inclusion biases
  • 90 skipping biases
  • Learned per position
  • Capture position-specific effects

Residual Tuner

  • Small MLP network
  • Non-linear corrections
  • Captures interactions
  • Improves accuracy
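Putting the components together, the sketch below shows one way to realize this architecture in PyTorch. Layer sizes follow the text above; how the conv outputs and the wobble channel are reduced to per-position inclusion/skipping energies (the 1×1 convolution to_energies), the residual tuner's hidden size, and all names are assumptions rather than the published implementation.

import torch
import torch.nn as nn

class SplicingEnergyModel(nn.Module):
    """Illustrative sketch of the dual-branch energy architecture."""

    def __init__(self, seq_len: int = 90):
        super().__init__()
        # Sequence branch: 20 motif filters of width 6, no activation (linear).
        self.seq_conv = nn.Conv1d(4, 20, kernel_size=6, padding="same")
        # Structure branch: 8 filters of width 30 for a wider receptive field.
        self.struct_conv = nn.Conv1d(3, 8, kernel_size=30, padding="same")
        # Assumed reduction: map branch outputs (+ wobble channel) to
        # per-position inclusion and skipping energies.
        self.to_energies = nn.Conv1d(20 + 8 + 1, 2, kernel_size=1)
        # Learned position-specific biases: 90 inclusion + 90 skipping.
        self.pos_bias = nn.Parameter(torch.zeros(2, seq_len))
        # Residual tuner: small MLP adding a non-linear correction to ΔE.
        self.residual = nn.Sequential(
            nn.Linear(seq_len * 8, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, x):
        # x: (batch, 90, 8) -> channels-first (batch, 8, 90), then split.
        x = x.transpose(1, 2)
        seq, struct, wobble = x[:, :4], x[:, 4:7], x[:, 7:8]
        feats = torch.cat(
            [self.seq_conv(seq), self.struct_conv(struct), wobble], dim=1
        )
        energies = self.to_energies(feats) + self.pos_bias   # (batch, 2, 90)
        e_inc, e_skip = energies[:, 0].sum(-1), energies[:, 1].sum(-1)
        delta_e = e_inc - e_skip
        correction = self.residual(x.flatten(1)).squeeze(-1)
        return torch.sigmoid(delta_e + correction)           # PSI in (0, 1)

In this sketch, the per-position difference energies[:, 0] - energies[:, 1] is exactly what the force plot in the next section visualizes.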

3 Interpretability Features

Unlike black-box models, this architecture provides built-in interpretability through position-specific energy contributions.

Force Plot

The force plot shows how each position contributes to the final PSI prediction:

  • Positive values: promote exon inclusion (higher PSI)
  • Negative values: promote exon skipping (lower PSI)

Energy Calculation

For each position, the model computes:

Contribution(pos) = E_inclusion(pos) - E_skipping(pos)

Total ΔE = Σ Contribution(pos), summed over all positions

PSI = σ(ΔE + residual)

where σ is the sigmoid function mapping to (0, 1)
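Using the SplicingEnergyModel sketch from Section 2, the force-plot values can be read off before the positions are summed; position_contributions below is a hypothetical helper, not part of the released code.

def position_contributions(model, x):
    """Contribution(pos) = E_inclusion(pos) - E_skipping(pos) per position."""
    with torch.no_grad():
        x_t = x.transpose(1, 2)
        seq, struct, wobble = x_t[:, :4], x_t[:, 4:7], x_t[:, 7:8]
        feats = torch.cat(
            [model.seq_conv(seq), model.struct_conv(struct), wobble], dim=1
        )
        energies = model.to_energies(feats) + model.pos_bias  # (batch, 2, 90)
    return energies[:, 0] - energies[:, 1]                    # force-plot values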

Advantages

  • Additive decomposition: Total effect is sum of position effects
  • No post-hoc explanation: Interpretability is built into the model
  • Biologically meaningful: Separate inclusion/skipping pathways
  • Identifies key positions: Find which nucleotides matter most

4 Training Details

Dataset

  • Name: ES7_HeLa
  • Size: ~150,000 sequences
  • Cell type: HeLa cells
  • Method: MPRA
  • Libraries: A, B, C

Training Setup

  • Loss function: binary KL divergence
  • Optimizer: Adam
  • Batch size: 256
  • Validation: 10% holdout
  • Early stopping: yes (patience = 10)
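The binary KL divergence treats the measured and predicted PSI values as Bernoulli parameters. Below is a minimal sketch assuming PyTorch; the clamping constant eps is an assumption to keep the logarithms finite, and the commented lines only indicate how the settings in the table would be wired up.

def binary_kl_loss(psi_pred, psi_true, eps=1e-6):
    """KL divergence between Bernoulli(psi_true) and Bernoulli(psi_pred)."""
    p = psi_true.clamp(eps, 1 - eps)
    q = psi_pred.clamp(eps, 1 - eps)
    kl = p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))
    return kl.mean()

# model = SplicingEnergyModel()
# optimizer = torch.optim.Adam(model.parameters())
# ...train with batches of 256, a 10% validation holdout, and early stopping.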

About MPRA

A Massively Parallel Reporter Assay (MPRA) is a high-throughput technique that measures splicing outcomes for thousands of synthetic exon sequences simultaneously, providing the ground-truth PSI values used for model training.

5 Structure Prediction

RNA secondary structure is predicted using ViennaRNA, a widely-used package for RNA structure prediction.

ViennaRNA (RNAfold)

What it does

Predicts the minimum free energy (MFE) secondary structure for an RNA sequence. This represents the most thermodynamically stable folding configuration.

Example Output

# Sequence

GGTAGTACGCCAATTCGCCG...CTACATATACTACT

# Structure (dot-bracket notation)

...(((....)))........(((...)))......

# Minimum Free Energy

-12.30 kcal/mol
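The same output can be produced programmatically with ViennaRNA's Python bindings (the RNA module); the short fragment below is only illustrative.

import RNA  # ViennaRNA Python bindings

seq = "GGTAGTACGCCAATTCGCCG"     # illustrative fragment, not a full 90-nt input
structure, mfe = RNA.fold(seq)   # MFE structure in dot-bracket notation
print(structure)                 # dot-bracket string, same length as seq
print(f"{mfe:.2f} kcal/mol")     # minimum free energy of the predicted fold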

Why structure matters

RNA secondary structure affects splicing by:

  • Sequestering splice sites in stem-loops
  • Exposing or hiding regulatory motifs
  • Bringing distant elements into proximity
  • Affecting protein binding accessibility