Scientific Methodology

An interpretable deep learning model for predicting RNA alternative splicing outcomes

What is PSI?

PSI (Percent Spliced In) measures how often an exon is included in the mature mRNA transcript during alternative splicing. Despite the name, it is expressed here as a fraction from 0 to 1:

  • PSI = 1.0: Exon always included
  • PSI = 0.5: 50/50 inclusion
  • PSI = 0.0: Exon always skipped
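
As a rough illustration of the scale above, PSI is commonly estimated as the share of observations that support inclusion. The function name and the example counts below are hypothetical and are not taken from this tool's pipeline.

```python
def percent_spliced_in(inclusion_reads: int, skipping_reads: int) -> float:
    """Estimate PSI as the fraction of reads that support exon inclusion."""
    if inclusion_reads < 0 or skipping_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = inclusion_reads + skipping_reads
    if total == 0:
        raise ValueError("PSI is undefined without any supporting reads")
    return inclusion_reads / total


# Hypothetical counts: 30 reads include the exon, 10 reads skip it.
print(percent_spliced_in(30, 10))  # 0.75
```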

Alternative splicing is a key regulatory mechanism that allows a single gene to produce multiple protein variants. Understanding and predicting splicing outcomes is crucial for studying gene regulation and disease mechanisms.

How It Works

1. Input the Sequence: Enter a 70-nucleotide exon sequence (A, C, G, T only).

2. Add Flanking Sequences: The model adds 10 nucleotides on each side, taken from the original experimental context.

3. Predict RNA Structure: ViennaRNA predicts the secondary structure, and wobble base pairs are identified from it.

4. Neural Network Prediction: The deep learning model predicts PSI from the sequence, structure, and wobble features.
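
The first two steps can be sketched as a small validation helper. This is a minimal sketch, not the tool's actual code: the flank sequences are placeholders (the real 10-nucleotide flanks come from the experimental construct and are not given here), and the names `build_model_input`, `UPSTREAM_FLANK`, and `DOWNSTREAM_FLANK` are hypothetical.

```python
EXON_LENGTH = 70
FLANK_LENGTH = 10

# Placeholder flanks for illustration only. The real flanks come from the
# original experimental context and are NOT reproduced here.
UPSTREAM_FLANK = "A" * FLANK_LENGTH
DOWNSTREAM_FLANK = "C" * FLANK_LENGTH


def build_model_input(exon: str) -> str:
    """Validate a 70-nt exon and return the 90-nt flanked sequence."""
    exon = exon.strip().upper()
    if len(exon) != EXON_LENGTH:
        raise ValueError(f"expected {EXON_LENGTH} nucleotides, got {len(exon)}")
    invalid = set(exon) - set("ACGT")
    if invalid:
        raise ValueError(f"invalid characters: {sorted(invalid)}")
    return UPSTREAM_FLANK + exon + DOWNSTREAM_FLANK


full_sequence = build_model_input("ACGT" * 17 + "AC")
print(len(full_sequence))  # 90
```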

Who Should Use This Tool?

  • Researchers: Studying alternative splicing mechanisms and regulation
  • Synthetic Biologists: Designing synthetic exons with specific splicing behavior
  • Clinicians: Investigating potential splicing effects of genetic variants
  • Educators & Students: Learning about splicing regulation and computational biology

1 Input Features

The model takes a 90-nucleotide sequence (70nt exon + 10nt flanking on each side) and extracts three types of features:

Sequence One-Hot Encoding

Each nucleotide is encoded as a 4-dimensional vector:

A → [1,0,0,0]
C → [0,1,0,0]
G → [0,0,1,0]
T → [0,0,0,1]

Shape: 90 × 4 matrix
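
The mapping above can be written in a few lines of plain Python. This is an illustrative sketch that uses nested lists rather than any particular array library; the function name `one_hot_sequence` is hypothetical.

```python
NUCLEOTIDE_ORDER = "ACGT"  # column order taken from the mapping above


def one_hot_sequence(sequence: str) -> list[list[int]]:
    """Encode each nucleotide as a 4-element row with a single 1."""
    rows = []
    for base in sequence.upper():
        index = NUCLEOTIDE_ORDER.index(base)  # raises ValueError for non-ACGT input
        rows.append([1 if column == index else 0 for column in range(4)])
    return rows


print(one_hot_sequence("ACGT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

A 90-nucleotide input yields 90 rows of 4 values, matching the stated shape.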

Structure One-Hot Encoding

RNA secondary structure is predicted using ViennaRNA (RNAfold) and encoded:

. (unpaired) → [1,0,0]
( (left pair) → [0,1,0]
) (right pair) → [0,0,1]

Shape: 90 × 3 matrix
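
The structure channel can be encoded the same way from a dot-bracket string. The sketch below assumes the structure contains only the three symbols listed above (no pseudoknot brackets); `one_hot_structure` is a hypothetical name.

```python
STRUCTURE_CODES = {
    ".": [1, 0, 0],  # unpaired
    "(": [0, 1, 0],  # left (opening) side of a pair
    ")": [0, 0, 1],  # right (closing) side of a pair
}


def one_hot_structure(dot_bracket: str) -> list[list[int]]:
    """Encode a dot-bracket string as one 3-element row per position."""
    try:
        return [list(STRUCTURE_CODES[symbol]) for symbol in dot_bracket]
    except KeyError as error:
        raise ValueError(f"unsupported structure symbol: {error.args[0]!r}") from None


print(one_hot_structure("(.)"))  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```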

Wobble Pair Indicators

G-U (wobble) base pairs are weaker than Watson-Crick pairs. A binary indicator marks positions involved in wobble pairing:

1 if position is in a G-U pair, 0 otherwise

Shape: 90 × 1 vector
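
One way to derive this indicator is to match brackets in the dot-bracket structure and check the two bases of each pair. The sketch below is an assumption about how such a feature could be computed, not the tool's confirmed implementation; in particular, it treats the DNA letter T as U so that a G-T pair in the input counts as a G-U wobble.

```python
def wobble_indicators(sequence: str, dot_bracket: str) -> list[int]:
    """Return 1 at every position that takes part in a G-U pair, else 0."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    # Assumption: inputs use the DNA alphabet, so T stands in for U.
    rna = sequence.upper().replace("T", "U")
    flags = [0] * len(rna)
    open_positions = []  # stack of '(' positions still waiting for a partner
    for position, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: unmatched ')'")
            partner = open_positions.pop()
            if {rna[partner], rna[position]} == {"G", "U"}:
                flags[partner] = flags[position] = 1
    if open_positions:
        raise ValueError("unbalanced structure: unmatched '('")
    return flags


# Toy hairpin: only the outermost pair (G at 0, T at 8) is a wobble pair.
print(wobble_indicators("GGCAAAGCT", "(((...)))"))  # [1, 0, 0, 0, 0, 0, 0, 0, 1]
```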

Total Input: 90 × 8 features per sequence (4 sequence + 3 structure + 1 wobble)

2 Model Architecture

The model uses a dual-branch architecture designed for interpretability. Each branch computes "energy" contributions to inclusion or skipping.

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Input: 90nt sequence                          │
└─────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
     ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
     │  Sequence   │  │  Structure  │  │   Wobble    │
     │  (90 × 4)   │  │  (90 × 3)   │  │  (90 × 1)   │
     └─────────────┘  └─────────────┘  └─────────────┘
              │               │               │
              ▼               │               │
     ┌─────────────┐          │               │
     │   Conv1D    │          │               │
     │ 20 filters  │          │               │
     │  width=6    │          ▼               │
     └─────────────┘  ┌─────────────┐         │
              │       │   Conv1D    │         │
              │       │  8 filters  │         │
              │       │  width=30   │         │
              │       └─────────────┘         │
              │               │               │
              └───────────────┼───────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │   Position-Specific Biases    │
              │   (90 inclusion + 90 skip)    │
              └───────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
     ┌─────────────────┐             ┌─────────────────┐
     │   E_inclusion   │             │    E_skipping   │
     │   (sum over     │             │   (sum over     │
     │   positions)    │             │   positions)    │
     └─────────────────┘             └─────────────────┘
              │                               │
              └───────────────┬───────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │      ΔE = E_inc - E_skip      │
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │   Residual Tuner (MLP)        │
              │   Adds non-linear correction  │
              └───────────────────────────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │         σ(ΔE + residual)      │
              │         → PSI (0-1)           │
              └───────────────────────────────┘

Sequence Branch

  • Conv1D with 20 filters
  • Kernel width: 6 nucleotides
  • Detects sequence motifs
  • No activation (linear)
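
To make the idea of a linear motif detector concrete, the sketch below slides a single width-6 filter over a one-hot sequence. It is illustrative only: the motif and kernel are hand-built and hypothetical (the model learns 20 real-valued filters), and the padding mode is an assumption, since the page does not state one. No padding is used here.

```python
def one_hot(sequence):
    """One-hot encode a DNA string in A, C, G, T column order."""
    return [[1.0 if base == letter else 0.0 for letter in "ACGT"] for base in sequence]


def linear_conv1d(encoded, kernel):
    """Slide one filter along the sequence with no padding and no activation."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


motif = "GGTAAG"         # hypothetical motif, chosen only for the demo
kernel = one_hot(motif)  # a filter that scores +1 for each matching base
scores = linear_conv1d(one_hot("ACGGTAAGTC"), kernel)
print(scores.index(max(scores)), max(scores))  # 2 6.0
```

Because there is no activation, each filter output is a plain weighted sum of the bases in its window, which is what keeps per-position contributions additive.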

Structure Branch

  • Conv1D with 8 filters
  • Kernel width: 30 nucleotides
  • Captures structural context
  • Wider receptive field

Position Biases

  • 90 inclusion biases
  • 90 skipping biases
  • Learned per-position
  • Position-specific effects

Residual Tuner

  • Small MLP network
  • Non-linear corrections
  • Captures interactions
  • Improves accuracy

3 Interpretability Features

Unlike black-box models, this architecture provides built-in interpretability through position-specific energy contributions.

Force Plot

The force plot shows how each position contributes to the final PSI prediction:

Positive values

Promote exon inclusion (higher PSI)

Negative values

Promote exon skipping (lower PSI)

Energy Calculation

For each position, the model computes:

Contribution(pos) = E_inclusion(pos) - E_skipping(pos)

Total ΔE = Σ Contribution(pos) over all positions

PSI = σ(ΔE + residual)

Where σ is the sigmoid function, which maps any real value into the interval (0, 1)
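
The three formulas above translate directly into code. The sketch below uses toy energies for three positions instead of the model's 90, and the residual is passed in as a plain number because the residual MLP is not specified here; `predict_psi` is a hypothetical name.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_psi(inclusion_energies, skipping_energies, residual=0.0):
    """Return per-position contributions and the resulting PSI."""
    contributions = [
        inc - skip for inc, skip in zip(inclusion_energies, skipping_energies)
    ]
    delta_e = sum(contributions)  # additive over positions
    return contributions, sigmoid(delta_e + residual)


# Toy energies for three positions (illustrative values only).
contributions, psi = predict_psi([0.5, 0.0, 1.0], [0.25, 0.75, 0.0])
print(contributions)  # [0.25, -0.75, 1.0]
print(round(psi, 3))  # 0.622
```

The `contributions` list is what a force plot would display: positive entries push PSI up, negative entries push it down.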

Advantages

  • Additive decomposition: Total effect is sum of position effects
  • No post-hoc explanation: Interpretability is built into the model
  • Biologically meaningful: Separate inclusion/skipping pathways
  • Identifies key positions: Find which nucleotides matter most

4 Training Details

Dataset

Name: ES7_HeLa
Size: ~150,000 sequences
Cell Type: HeLa cells
Method: MPRA
Libraries: A, B, C

Training Setup

Loss Function: Binary KL Divergence
Optimizer: Adam
Batch Size: 256
Validation: 10% holdout
Early Stopping: Yes (patience=10)
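
For reference, a binary KL divergence treats the measured and predicted PSI as two Bernoulli distributions and measures how far the prediction is from the measurement. The sketch below is a generic per-example version; the clipping constant `eps` is an assumption added to avoid log(0) and is not a documented setting of this model.

```python
import math


def binary_kl_divergence(target: float, predicted: float, eps: float = 1e-7) -> float:
    """KL(target || predicted) between two Bernoulli distributions."""
    # Clip both values away from 0 and 1 so the logarithms stay finite.
    p = min(max(target, eps), 1.0 - eps)
    q = min(max(predicted, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


print(binary_kl_divergence(0.5, 0.5))      # 0.0
print(binary_kl_divergence(0.9, 0.5) > 0)  # True
```

The loss is zero only when the prediction equals the measured PSI, and it grows as the two diverge.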

About MPRA

A Massively Parallel Reporter Assay (MPRA) is a high-throughput technique that measures splicing outcomes for thousands of synthetic exon sequences simultaneously. These measurements provide the experimental PSI values used as training targets.

5 Structure Prediction

RNA secondary structure is predicted using ViennaRNA, a widely used package for RNA structure prediction.

ViennaRNA (RNAfold)

What it does

Predicts the minimum free energy (MFE) secondary structure for an RNA sequence. This represents the most thermodynamically stable folding configuration.

Example Output

# Sequence
GGTAGTACGCCAATTCGCCG...CTACATATACTACT

# Structure (dot-bracket notation)
...(((....)))........(((...)))......

# Minimum Free Energy
-12.30 kcal/mol
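
If the structure is obtained from the RNAfold command-line program, its default text output is the sequence on one line, followed by a line containing the dot-bracket structure and the energy in parentheses. The sketch below parses that text; how this tool actually invokes ViennaRNA is not stated, so treat the approach as an assumption. The example string is a hand-written toy, not real RNAfold output, and `parse_rnafold_output` is a hypothetical name.

```python
import re


def parse_rnafold_output(text: str):
    """Return (sequence, structure, mfe) from RNAfold-style text output."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if lines and lines[0].startswith(">"):
        lines = lines[1:]  # drop an optional FASTA header
    if len(lines) < 2:
        raise ValueError("expected a sequence line and a structure line")
    sequence, structure_line = lines[0], lines[1]
    # Structure, whitespace, then the energy in parentheses, e.g. "( -1.20)".
    match = re.fullmatch(r"([.()]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)", structure_line)
    if match is None:
        raise ValueError(f"unrecognized structure line: {structure_line!r}")
    structure, mfe = match.group(1), float(match.group(2))
    if len(structure) != len(sequence):
        raise ValueError("structure and sequence lengths differ")
    return sequence, structure, mfe


# Toy text in the RNAfold layout (values are made up for the demo).
toy_output = """GGGAAAUCCC
(((....))) ( -1.20)
"""
print(parse_rnafold_output(toy_output))  # ('GGGAAAUCCC', '(((....)))', -1.2)
```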

Why structure matters

RNA secondary structure affects splicing by:

  • Sequestering splice sites in stem-loops
  • Exposing or hiding regulatory motifs
  • Bringing distant elements into proximity
  • Affecting protein binding accessibility

6 Model Performance

Metric         Value     Description
Test R²        ~0.85     Variance explained on held-out test set
Correlation    ~0.92     Pearson correlation with experimental PSI
Test RMSE      ~0.12     Root mean squared error on test set
Training Data  ~150,000  Synthetic exon sequences from ES7_HeLa
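
The three metrics in the table have standard definitions, sketched below in plain Python. The PSI values in the example are toy numbers and are unrelated to the reported results; `regression_metrics` is a hypothetical name.

```python
import math


def regression_metrics(observed, predicted):
    """Compute R², Pearson correlation, and RMSE for paired PSI values."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    covariance = sum(
        (o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted)
    )
    return {
        "r2": 1.0 - ss_res / ss_tot,                     # variance explained
        "pearson": covariance / math.sqrt(ss_tot * var_pred),
        "rmse": math.sqrt(ss_res / n),
    }


# Toy measured and predicted PSI values (illustrative only).
metrics = regression_metrics([0.1, 0.5, 0.9, 0.7], [0.2, 0.4, 0.8, 0.75])
print({name: round(value, 3) for name, value in metrics.items()})
```

R² penalizes systematic offsets between prediction and measurement while Pearson correlation does not, which is one reason the two are reported separately.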

7 Limitations

  • Fixed sequence length: Accepts only exon sequences of exactly 70 nucleotides
  • Training data: Model was trained on HeLa cell data from the ES7 library
  • Cell type specificity: Predictions may not generalize to all cell types or tissues
  • Cis-regulatory only: Does not consider trans-acting factors or cellular context