Methodology
Technical details of the interpretable splicing prediction model
1 Input Features
The model takes a 90-nucleotide sequence (a 70-nt exon plus 10 nt of flanking sequence on each side) and extracts three types of features:
Sequence One-Hot Encoding
Each nucleotide is encoded as a 4-dimensional one-hot vector, one channel per base.
Shape: 90 × 4 matrix
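A minimal sketch of the sequence encoding, assuming the channel order A, C, G, U (the order is not stated here) and that DNA-style T is treated as U. The function name `one_hot_sequence` is hypothetical.

```python
import numpy as np

# Assumed channel order; the actual order used by the model is not specified.
NUCLEOTIDES = "ACGU"


def one_hot_sequence(seq: str) -> np.ndarray:
    """Return an (L, 4) one-hot matrix for a nucleotide sequence."""
    seq = seq.upper().replace("T", "U")  # treat DNA T as RNA U
    index = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, nt in enumerate(seq):
        if nt not in index:
            raise ValueError(f"unexpected nucleotide {nt!r} at position {pos}")
        out[pos, index[nt]] = 1.0
    return out


example = one_hot_sequence("GGTAGTACGC")
print(example.shape)  # (10, 4)
```

A full 90-nt input produces the 90 × 4 matrix described above.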
Structure One-Hot Encoding
RNA secondary structure is predicted using ViennaRNA (RNAfold) and encoded as a 3-dimensional one-hot vector per position.
Shape: 90 × 3 matrix
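A minimal sketch of the structure encoding, assuming the three channels correspond to the dot-bracket symbols `(`, `.` and `)`. That mapping is a guess consistent with the 90 × 3 shape, and `one_hot_structure` is a hypothetical name.

```python
import numpy as np

# Assumed channels: opening pair, unpaired, closing pair.
STRUCTURE_SYMBOLS = "(.)"


def one_hot_structure(dot_bracket: str) -> np.ndarray:
    """Return an (L, 3) one-hot matrix for a dot-bracket structure string."""
    index = {sym: i for i, sym in enumerate(STRUCTURE_SYMBOLS)}
    out = np.zeros((len(dot_bracket), 3), dtype=np.float32)
    for pos, sym in enumerate(dot_bracket):
        if sym not in index:
            raise ValueError(f"unexpected structure symbol {sym!r} at position {pos}")
        out[pos, index[sym]] = 1.0
    return out


print(one_hot_structure("((...))").shape)  # (7, 3)
```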
Wobble Pair Indicators
G-U (wobble) base pairs are weaker than Watson-Crick pairs. A binary indicator marks each position that takes part in a wobble pair.
Shape: 90 × 1 vector
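One way to derive the wobble indicator is to match brackets in the predicted structure and flag both partners of every G-U pair. The sketch below assumes a pseudoknot-free dot-bracket string; the helper names `pair_table` and `wobble_indicator` are hypothetical.

```python
import numpy as np


def pair_table(dot_bracket: str) -> list:
    """Return partner[i] = index paired with i, or -1 if i is unpaired."""
    partner = [-1] * len(dot_bracket)
    stack = []
    for i, sym in enumerate(dot_bracket):
        if sym == "(":
            stack.append(i)
        elif sym == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return partner


def wobble_indicator(seq: str, dot_bracket: str) -> np.ndarray:
    """Return an (L, 1) array with 1.0 at positions in a G-U pair."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    out = np.zeros((len(seq), 1), dtype=np.float32)
    for i, j in enumerate(pair_table(dot_bracket)):
        if j >= 0 and {seq[i], seq[j]} == {"G", "U"}:
            out[i, 0] = 1.0
    return out


print(wobble_indicator("GAAAU", "(...)").ravel())  # positions 0 and 4 are flagged
```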
Total Input: 90 × 8 features per sequence (4 sequence + 3 structure + 1 wobble)
2 Model Architecture
The model uses a dual-branch architecture designed for interpretability. Each branch computes "energy" contributions to inclusion or skipping.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│                      Input: 90nt sequence                       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                 ┌───────────────┼───────────────┐
                 ▼               ▼               ▼
          ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
          │  Sequence   │ │  Structure  │ │   Wobble    │
          │  (90 × 4)   │ │  (90 × 3)   │ │  (90 × 1)   │
          └─────────────┘ └─────────────┘ └─────────────┘
                 │               │               │
                 ▼               │               │
          ┌─────────────┐        │               │
          │   Conv1D    │        │               │
          │ 20 filters  │        │               │
          │   width=6   │        ▼               │
          └─────────────┘ ┌─────────────┐        │
                 │        │   Conv1D    │        │
                 │        │  8 filters  │        │
                 │        │  width=30   │        │
                 │        └─────────────┘        │
                 │               │               │
                 └───────────────┼───────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │   Position-Specific Biases    │
                 │   (90 inclusion + 90 skip)    │
                 └───────────────────────────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 ▼                               ▼
        ┌─────────────────┐             ┌─────────────────┐
        │   E_inclusion   │             │   E_skipping    │
        │    (sum over    │             │    (sum over    │
        │   positions)    │             │   positions)    │
        └─────────────────┘             └─────────────────┘
                 │                               │
                 └───────────────┬───────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │      ΔE = E_inc - E_skip      │
                 └───────────────────────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │     Residual Tuner (MLP)      │
                 │  Adds non-linear correction   │
                 └───────────────────────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │       σ(ΔE + residual)        │
                 │          → PSI (0-1)          │
                 └───────────────────────────────┘
Sequence Branch
- Conv1D with 20 filters
- Kernel width: 6 nucleotides
- Detects sequence motifs
- No activation (linear)
Structure Branch
- Conv1D with 8 filters
- Kernel width: 30 nucleotides
- Captures structural context
- Wider receptive field
Position Biases
- 90 inclusion biases
- 90 skipping biases
- Learned per-position
- Position-specific effects
Residual Tuner
- Small MLP network
- Non-linear corrections
- Captures interactions
- Improves accuracy
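The two convolutional branches can be illustrated with a plain NumPy linear convolution. This is only a sketch of the filter shapes listed above. The zero padding that keeps all 90 positions, the random weights, and the helper name `conv1d_same` are assumptions, and how filter outputs are routed into inclusion and skipping energies is not covered.

```python
import numpy as np


def conv1d_same(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Linear 1D convolution (no activation) with zero padding.

    x:       (L, C) input features
    kernels: (W, C, F) filter bank
    returns: (L, F) activations, one row per input position
    """
    length, channels = x.shape
    width, k_channels, n_filters = kernels.shape
    if channels != k_channels:
        raise ValueError("channel mismatch between input and kernels")
    left = (width - 1) // 2
    right = width - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))  # pad positions, not channels
    out = np.empty((length, n_filters), dtype=float)
    for pos in range(length):
        window = padded[pos:pos + width]  # (W, C) slice centered near pos
        out[pos] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return out


rng = np.random.default_rng(0)
seq_x = rng.random((90, 4))      # stand-in for the sequence one-hot matrix
struct_x = rng.random((90, 3))   # stand-in for the structure one-hot matrix
seq_kernels = rng.normal(size=(6, 4, 20))     # 20 filters, width 6
struct_kernels = rng.normal(size=(30, 3, 8))  # 8 filters, width 30

seq_act = conv1d_same(seq_x, seq_kernels)
struct_act = conv1d_same(struct_x, struct_kernels)
print(seq_act.shape, struct_act.shape)  # (90, 20) (90, 8)
```

Because neither branch applies an activation in this sketch, each output is a linear function of the input window, which is what makes per-position contributions additive.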
3 Interpretability Features
Unlike black-box models, this architecture provides built-in interpretability through position-specific energy contributions.
Force Plot
The force plot shows how each position contributes to the final PSI prediction:
Positive contributions promote exon inclusion (higher PSI)
Negative contributions promote exon skipping (lower PSI)
Energy Calculation
For each position, the model computes:
Contribution(pos) = E_inclusion(pos) - E_skipping(pos)
Total ΔE = Σ Contribution(pos) over all positions
PSI = σ(ΔE + residual)
where σ is the sigmoid function, which maps any real value into (0, 1)
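The calculation above can be worked through numerically. The sketch below takes per-position energies as given inputs (the values are made up for illustration) and treats the residual as a single scalar, which is an assumption. `predict_psi` is a hypothetical name.

```python
import numpy as np


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))


def predict_psi(e_inclusion, e_skipping, residual: float = 0.0):
    """Return (per-position contributions, total delta E, PSI)."""
    e_inclusion = np.asarray(e_inclusion, dtype=float)
    e_skipping = np.asarray(e_skipping, dtype=float)
    contribution = e_inclusion - e_skipping  # one value per position
    delta_e = float(contribution.sum())      # additive decomposition
    psi = float(sigmoid(delta_e + residual))
    return contribution, delta_e, psi


# Illustrative energies for a 3-position toy example (not real model output).
contribution, delta_e, psi = predict_psi([0.5, 0.0, 1.0], [0.2, 0.3, 0.0])
top_position = int(np.argmax(np.abs(contribution)))
print(contribution, delta_e, round(psi, 4), top_position)
```

In this toy case the contributions are 0.3, -0.3 and 1.0, so ΔE is 1.0 and the third position dominates the prediction. Sorting positions by absolute contribution gives the ranking that a force plot visualizes.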
Advantages
- Additive decomposition: Total effect is sum of position effects
- No post-hoc explanation: Interpretability is built into the model
- Biologically meaningful: Separate inclusion/skipping pathways
- Identifies key positions: Find which nucleotides matter most
4 Training Details
Dataset
| Name | ES7_HeLa |
| Size | ~150,000 sequences |
| Cell Type | HeLa cells |
| Method | MPRA |
| Libraries | A, B, C |
Training Setup
| Loss Function | Binary KL Divergence |
| Optimizer | Adam |
| Batch Size | 256 |
| Validation | 10% holdout |
| Early Stopping | Yes (patience=10) |
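For reference, a binary KL divergence between a measured PSI and a predicted PSI treats each value as the parameter of a two-outcome (included vs. skipped) distribution. The sketch below assumes the direction KL(measured || predicted), clipping to avoid log(0), and a mean over the batch; none of these details are confirmed by the table above.

```python
import numpy as np


def binary_kl_divergence(psi_true, psi_pred, eps: float = 1e-7) -> float:
    """Mean KL divergence between Bernoulli(psi_true) and Bernoulli(psi_pred)."""
    p = np.clip(np.asarray(psi_true, dtype=float), eps, 1.0 - eps)
    q = np.clip(np.asarray(psi_pred, dtype=float), eps, 1.0 - eps)
    kl = p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
    return float(kl.mean())


measured = np.array([0.9, 0.5, 0.1])
predicted = np.array([0.8, 0.5, 0.3])
print(binary_kl_divergence(measured, predicted))
```

The loss is zero when predictions match the measurements exactly and grows as they diverge.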
About MPRA
A massively parallel reporter assay (MPRA) is a high-throughput technique that measures splicing outcomes for thousands of synthetic exon sequences simultaneously. This provides ground-truth PSI values for model training.
5 Structure Prediction
RNA secondary structure is predicted using ViennaRNA, a widely used package for RNA structure prediction.
ViennaRNA (RNAfold)
What it does
Predicts the minimum free energy (MFE) secondary structure for an RNA sequence. This represents the most thermodynamically stable folding configuration.
Example Output
# Sequence
GGTAGTACGCCAATTCGCCG...CTACATATACTACT
# Structure (dot-bracket notation)
...(((....)))........(((...)))......
# Minimum Free Energy
-12.30 kcal/mol
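The sketch below shows one way to obtain these three values programmatically by calling the `RNAfold` command-line tool and parsing its standard two-line output (sequence, then structure followed by the energy in parentheses). It assumes `RNAfold` is installed and on the PATH; the function names are hypothetical, and the site's actual pipeline may call ViennaRNA differently.

```python
import re
import subprocess

# Matches e.g. "(((...))) ( -1.20)" or "(((...))) (-12.30)".
_STRUCTURE_LINE = re.compile(
    r"^(?P<structure>[().]+)\s+\(\s*(?P<mfe>-?\d+(?:\.\d+)?)\s*\)$"
)


def parse_rnafold_output(text: str):
    """Return (sequence, dot-bracket structure, MFE in kcal/mol)."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith(">")]
    if len(lines) < 2:
        raise ValueError("expected a sequence line and a structure line")
    match = _STRUCTURE_LINE.match(lines[1])
    if match is None:
        raise ValueError(f"could not parse structure line: {lines[1]!r}")
    return lines[0], match["structure"], float(match["mfe"])


def run_rnafold(sequence: str):
    """Fold one sequence with the RNAfold executable (must be installed)."""
    result = subprocess.run(
        ["RNAfold", "--noPS"],  # --noPS suppresses the PostScript drawing
        input=sequence + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rnafold_output(result.stdout)


print(parse_rnafold_output("GGGAAACCC\n(((...))) ( -1.20)\n"))
```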
Why structure matters
RNA secondary structure affects splicing by:
- Sequestering splice sites in stem-loops
- Exposing or hiding regulatory motifs
- Bringing distant elements into proximity
- Affecting protein binding accessibility