Scientific Methodology
An interpretable deep learning model for predicting RNA alternative splicing outcomes
What is PSI?
PSI (Percent Spliced In) measures how often an exon is included in the mature mRNA transcript during alternative splicing. It ranges from 0 to 1:
- PSI = 1: exon always included
- PSI = 0.5: exon included in half of the transcripts
- PSI = 0: exon always skipped
Alternative splicing is a key regulatory mechanism that allows a single gene to produce multiple protein variants. Understanding and predicting splicing outcomes is crucial for studying gene regulation and disease mechanisms.
How It Works
Input the Sequence
Enter a 70-nucleotide exon sequence (A, C, G, T only)
Add Flanking Sequences
The tool adds 10 nucleotides on each side, taken from the original experimental context
Predict RNA Structure
ViennaRNA predicts the secondary structure and identifies wobble base pairs
Neural Network Prediction
The deep learning model predicts PSI based on sequence, structure, and wobble features
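As a rough illustration of the first two steps, the sketch below validates a 70-nucleotide exon and joins it with two 10-nucleotide flanks to form the 90-nucleotide model input. The function name and the flank arguments are hypothetical: the actual flanking sequences come from the experimental construct and are not given on this page.

```python
EXON_LENGTH = 70
FLANK_LENGTH = 10
VALID_BASES = set("ACGT")


def build_model_input(exon: str, upstream: str, downstream: str) -> str:
    """Validate the exon and return upstream + exon + downstream (90 nt).

    The flanks are passed in by the caller; this page does not list the
    real flanking sequences, so none are hard-coded here.
    """
    exon = exon.strip().upper()
    if len(exon) != EXON_LENGTH:
        raise ValueError(f"exon must be {EXON_LENGTH} nt, got {len(exon)}")
    bad = set(exon) - VALID_BASES
    if bad:
        raise ValueError(f"invalid characters in exon: {sorted(bad)}")
    for name, flank in (("upstream", upstream), ("downstream", downstream)):
        if len(flank) != FLANK_LENGTH or set(flank) - VALID_BASES:
            raise ValueError(f"{name} flank must be {FLANK_LENGTH} nt of A/C/G/T")
    return upstream + exon + downstream
```

Rejecting bad input early keeps length or alphabet errors from surfacing later as shape mismatches in the feature matrices.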
Who Should Use This Tool?
Researchers
Studying alternative splicing mechanisms and regulation
Synthetic Biologists
Designing synthetic exons with specific splicing behavior
Clinicians
Investigating potential splicing effects of genetic variants
Educators & Students
Learning about splicing regulation and computational biology
1 Input Features
The model takes a 90-nucleotide sequence (70nt exon + 10nt flanking on each side) and extracts three types of features:
Sequence One-Hot Encoding
Each nucleotide is encoded as a 4-dimensional vector:
Shape: 90 × 4 matrix
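A minimal sketch of sequence one-hot encoding follows. The channel order A, C, G, T is an assumption made for illustration; the order the model actually uses is not stated here.

```python
BASES = "ACGT"  # assumed channel order, not confirmed by this page


def one_hot_sequence(seq: str) -> list[list[int]]:
    """Return one row per nucleotide, with a single 1 in the matching column."""
    rows = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        rows.append([1 if base == b else 0 for b in BASES])
    return rows
```

For a 90-nucleotide input this yields 90 rows of 4 values, matching the 90 × 4 shape above.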
Structure One-Hot Encoding
RNA secondary structure is predicted using ViennaRNA (RNAfold) and encoded:
Shape: 90 × 3 matrix
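The structure encoding can be sketched the same way from a dot-bracket string. The three channels are assumed here to be "(", "." and ")" in that order; the page does not spell out the mapping, so treat it as illustrative.

```python
STRUCT_SYMBOLS = "(.)"  # assumed channels: opening pair, unpaired, closing pair


def one_hot_structure(dot_bracket: str) -> list[list[int]]:
    """Encode each dot-bracket character as a 3-dimensional one-hot row."""
    rows = []
    for symbol in dot_bracket:
        if symbol not in STRUCT_SYMBOLS:
            raise ValueError(f"unexpected structure symbol: {symbol!r}")
        rows.append([1 if symbol == s else 0 for s in STRUCT_SYMBOLS])
    return rows
```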
Wobble Pair Indicators
G-U (wobble) base pairs are weaker than Watson-Crick pairs. A binary indicator marks positions involved in wobble pairing:
Shape: 90 × 1 vector
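One way to derive such an indicator, assuming the structure is available in dot-bracket notation, is to match brackets with a stack and flag every position whose partner forms a G-U pair. The helper names below are hypothetical, and T is treated as U because the tool accepts sequences in the DNA alphabet.

```python
WOBBLE_PAIRS = {("G", "U"), ("U", "G")}


def pair_partners(dot_bracket: str) -> list[int]:
    """Return the partner index for each position, or -1 if unpaired."""
    partners = [-1] * len(dot_bracket)
    stack = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            j = stack.pop()
            partners[i], partners[j] = j, i
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return partners


def wobble_indicator(seq: str, dot_bracket: str) -> list[int]:
    """Mark positions (1) that take part in a G-U wobble pair, else 0."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    rna = seq.upper().replace("T", "U")
    partners = pair_partners(dot_bracket)
    return [
        1 if j >= 0 and (rna[i], rna[j]) in WOBBLE_PAIRS else 0
        for i, j in enumerate(partners)
    ]
```

Both partners of a wobble pair are flagged, so a single G-U pair produces two 1s in the vector.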
Total Input: 90 × 8 features per sequence (4 sequence + 3 structure + 1 wobble)
2 Model Architecture
The model uses a dual-branch architecture designed for interpretability. Each branch computes "energy" contributions to inclusion or skipping.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│                      Input: 90nt sequence                       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                 ┌───────────────┼───────────────┐
                 ▼               ▼               ▼
          ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
          │  Sequence   │ │  Structure  │ │   Wobble    │
          │  (90 × 4)   │ │  (90 × 3)   │ │  (90 × 1)   │
          └─────────────┘ └─────────────┘ └─────────────┘
                 │               │               │
                 ▼               │               │
          ┌─────────────┐        │               │
          │   Conv1D    │        │               │
          │ 20 filters  │        │               │
          │   width=6   │        ▼               │
          └─────────────┘ ┌─────────────┐        │
                 │        │   Conv1D    │        │
                 │        │  8 filters  │        │
                 │        │  width=30   │        │
                 │        └─────────────┘        │
                 │               │               │
                 └───────────────┼───────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │   Position-Specific Biases    │
                 │   (90 inclusion + 90 skip)    │
                 └───────────────────────────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 ▼                               ▼
        ┌─────────────────┐             ┌─────────────────┐
        │   E_inclusion   │             │   E_skipping    │
        │    (sum over    │             │    (sum over    │
        │   positions)    │             │   positions)    │
        └─────────────────┘             └─────────────────┘
                 │                               │
                 └───────────────┬───────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │      ΔE = E_inc - E_skip      │
                 └───────────────────────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │     Residual Tuner (MLP)      │
                 │  Adds non-linear correction   │
                 └───────────────────────────────┘
                                 │
                                 ▼
                 ┌───────────────────────────────┐
                 │       σ(ΔE + residual)        │
                 │          → PSI (0-1)          │
                 └───────────────────────────────┘
Sequence Branch
- Conv1D with 20 filters
- Kernel width: 6 nucleotides
- Detects sequence motifs
- No activation (linear)
Structure Branch
- Conv1D with 8 filters
- Kernel width: 30 nucleotides
- Captures structural context
- Wider receptive field
Position Biases
- 90 inclusion biases
- 90 skipping biases
- Learned per-position
- Position-specific effects
Residual Tuner
- Small MLP network
- Non-linear corrections
- Captures interactions
- Improves accuracy
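The forward pass described above can be sketched in plain Python. Several details are assumptions made only so the example runs, not statements about the trained model: zero ("same") padding, an even split of each branch's filters between the inclusion and skipping energies, additive position biases, a one-hidden-layer tanh tuner that takes ΔE as input, and random weights. The wobble channel is left out because this page does not say how it is combined with the other branches.

```python
import math
import random

SEQ_LEN = 90


def conv1d_same(x, kernels):
    """Linear 1D convolution with zero padding; x is [L][C], output is [L][F]."""
    length, width = len(x), len(kernels[0])
    left = (width - 1) // 2
    out = []
    for pos in range(length):
        row = []
        for kernel in kernels:
            total = 0.0
            for k in range(width):
                src = pos - left + k
                if 0 <= src < length:
                    total += sum(w * v for w, v in zip(kernel[k], x[src]))
            row.append(total)
        out.append(row)
    return out


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def init_params(seed=0, scale=0.05, hidden=4):
    """Random placeholder weights (hypothetical, not the trained model)."""
    rng = random.Random(seed)

    def kernels(n, width, channels):
        return [[[rng.gauss(0, scale) for _ in range(channels)]
                 for _ in range(width)] for _ in range(n)]

    return {
        "seq_kernels": kernels(20, 6, 4),      # 20 filters, width 6
        "struct_kernels": kernels(8, 30, 3),   # 8 filters, width 30
        "bias_inc": [rng.gauss(0, scale) for _ in range(SEQ_LEN)],
        "bias_skip": [rng.gauss(0, scale) for _ in range(SEQ_LEN)],
        "tuner_hidden": [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(hidden)],
        "tuner_out": [rng.gauss(0, scale) for _ in range(hidden)],
    }


def predict_psi(seq_onehot, struct_onehot, params):
    seq_act = conv1d_same(seq_onehot, params["seq_kernels"])
    struct_act = conv1d_same(struct_onehot, params["struct_kernels"])
    e_inc = e_skip = 0.0
    for pos in range(len(seq_onehot)):
        s, t = seq_act[pos], struct_act[pos]
        # Assumed split: first half of the filters feed inclusion, second half skipping.
        e_inc += sum(s[:len(s) // 2]) + sum(t[:len(t) // 2]) + params["bias_inc"][pos]
        e_skip += sum(s[len(s) // 2:]) + sum(t[len(t) // 2:]) + params["bias_skip"][pos]
    delta_e = e_inc - e_skip
    # Assumed residual tuner: tiny MLP applied to delta_e.
    hidden = [math.tanh(w * delta_e + b) for w, b in params["tuner_hidden"]]
    residual = sum(w * h for w, h in zip(params["tuner_out"], hidden))
    return sigmoid(delta_e + residual)
```

Because both convolutions are linear and the energies are plain sums over positions, ΔE can be split back into per-position terms, which is what makes the additive interpretation possible.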
3 Interpretability Features
Unlike black-box models, this architecture provides built-in interpretability through position-specific energy contributions.
Force Plot
The force plot shows how each position contributes to the final PSI prediction:
Promote exon inclusion (higher PSI)
Promote exon skipping (lower PSI)
Energy Calculation
For each position, the model computes:
Contribution(pos) = E_inclusion(pos) - E_skipping(pos)
Total ΔE = Σ Contribution(pos) over all positions
PSI = σ(ΔE + residual)
where σ is the sigmoid function, which maps any real value into the range 0 to 1.
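The calculation can be traced with a small worked example. The function name and the three-position energy values are made up for illustration; a real input has 90 positions and a residual produced by the tuner.

```python
import math


def psi_from_energies(e_inclusion, e_skipping, residual=0.0):
    """Return per-position contributions, total delta E, and PSI."""
    contributions = [inc - skip for inc, skip in zip(e_inclusion, e_skipping)]
    delta_e = sum(contributions)
    psi = 1.0 / (1.0 + math.exp(-(delta_e + residual)))
    return contributions, delta_e, psi


# Toy values for three positions (hypothetical, not model output).
contributions, delta_e, psi = psi_from_energies([1.0, 0.5, 0.0], [0.5, 0.5, 1.0])
# contributions = [0.5, 0.0, -1.0]; delta_e = -0.5; psi is about 0.378
```

Positive contributions push PSI toward inclusion and negative ones toward skipping, which is what the force plot visualizes position by position.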
Advantages
- Additive decomposition: Total effect is sum of position effects
- No post-hoc explanation: Interpretability is built into the model
- Biologically meaningful: Separate inclusion/skipping pathways
- Identifies key positions: Find which nucleotides matter most
4 Training Details
Dataset
| Property | Value |
|---|---|
| Name | ES7_HeLa |
| Size | ~150,000 sequences |
| Cell Type | HeLa cells |
| Method | MPRA |
| Libraries | A, B, C |
Training Setup
| Setting | Value |
|---|---|
| Loss Function | Binary KL Divergence |
| Optimizer | Adam |
| Batch Size | 256 |
| Validation | 10% holdout |
| Early Stopping | Yes (patience=10) |
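For reference, the sketch below computes the standard KL divergence between two Bernoulli distributions, which is assumed here to be what "Binary KL Divergence" refers to. The exact form used in training, including the clipping constant, is not given on this page.

```python
import math


def binary_kl(psi_true: float, psi_pred: float, eps: float = 1e-7) -> float:
    """KL(Bernoulli(psi_true) || Bernoulli(psi_pred)), with clipping for stability."""
    p = min(max(psi_true, eps), 1.0 - eps)
    q = min(max(psi_pred, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def mean_binary_kl(targets, predictions) -> float:
    """Average the per-sequence loss over a batch."""
    losses = [binary_kl(t, y) for t, y in zip(targets, predictions)]
    return sum(losses) / len(losses)
```

The loss is zero when the predicted PSI equals the measured PSI and grows as the two diverge, which suits targets that are themselves probabilities rather than hard labels.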
About MPRA
A massively parallel reporter assay (MPRA) is a high-throughput technique that measures splicing outcomes for thousands of synthetic exon sequences simultaneously. These measurements provide the ground-truth PSI values used for model training.
5 Structure Prediction
RNA secondary structure is predicted using ViennaRNA, a widely used package for RNA structure prediction.
ViennaRNA (RNAfold)
What it does
Predicts the minimum free energy (MFE) secondary structure for an RNA sequence. This represents the most thermodynamically stable folding configuration.
Example Output
# Sequence
GGTAGTACGCCAATTCGCCG...CTACATATACTACT
# Structure (dot-bracket notation)
...(((....)))........(((...)))......
# Minimum Free Energy
-12.30 kcal/mol
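A result like this can be obtained from Python through the ViennaRNA bindings, as sketched below. `RNA.fold` is part of the ViennaRNA Python interface and returns the structure together with its free energy; the wrapper names, the explicit T-to-U conversion, and the balance check are illustrative additions, and whether this tool calls the library in this way is not stated here.

```python
def fold_mfe(sequence: str) -> tuple[str, float]:
    """Return (dot-bracket structure, MFE in kcal/mol) using ViennaRNA."""
    import RNA  # ViennaRNA Python bindings; must be installed separately

    rna = sequence.upper().replace("T", "U")
    structure, mfe = RNA.fold(rna)
    return structure, mfe


def is_balanced(dot_bracket: str) -> bool:
    """Sanity check: every '(' has a matching ')' and only '(', ')', '.' appear."""
    depth = 0
    for symbol in dot_bracket:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:
                return False
        elif symbol != ".":
            return False
    return depth == 0
```

Checking that the structure is balanced and as long as the sequence before encoding catches truncated or malformed output early.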
Why structure matters
RNA secondary structure affects splicing by:
- Sequestering splice sites in stem-loops
- Exposing or hiding regulatory motifs
- Bringing distant elements into proximity
- Affecting protein binding accessibility
6 Model Performance
| Metric | Value | Description |
|---|---|---|
| Test R² | ~0.85 | Variance explained on held-out test set |
| Correlation | ~0.92 | Pearson correlation with experimental PSI |
| Test RMSE | ~0.12 | Root mean squared error on test set |
| Training Data | ~150,000 | Synthetic exon sequences from ES7_HeLa |
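The three metrics in the table follow their usual definitions, sketched below for paired lists of measured and predicted PSI values. The function name and the toy numbers are illustrative and are not taken from the evaluation behind the table.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², Pearson correlation, and RMSE for paired PSI values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "pearson": cov / math.sqrt(ss_tot * ss_pred),
        "rmse": math.sqrt(ss_res / n),
    }


# Toy example: predictions compressed toward 0.5 (hypothetical values).
metrics = regression_metrics([0.1, 0.5, 0.9], [0.2, 0.5, 0.8])
```

In this toy case the Pearson correlation is 1 while R² is below 1, which shows why the two can differ: correlation ignores scale and offset, whereas R² penalizes them.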
7 Limitations
- Fixed sequence length: Only accepts exactly 70-nucleotide exon sequences
- Training data: Model was trained on HeLa cell data from the ES7 library
- Cell type specificity: Predictions may not generalize to all cell types or tissues
- Cis-regulatory only: Does not consider trans-acting factors or cellular context