Fast-SeqFunc: Design Document

Overview

Fast-SeqFunc is a Python package for efficient sequence-function modeling of protein and nucleotide sequences. It provides a simple, high-level API that handles sequence embedding and automates model selection and training through the PyCaret framework.

The primary purpose of Fast-SeqFunc is to quickly detect whether there is meaningful "signal" in sequence-function data. By enabling rapid model development, researchers can determine early if predictive relationships exist, opportunistically use these models for ranking candidate sequences, and make informed decisions about investing in more complex modeling approaches when signal is detected.

Design Goals

Simplicity: Provide a clean, intuitive API for training sequence-function models
Flexibility: Support multiple sequence types with custom alphabet capabilities
Automation: Leverage PyCaret to automate model selection and hyperparameter tuning
Performance: Enable efficient processing through lazy imports and a clean architecture
Signal Detection: Rapidly determine if predictive relationships exist in the data
Decision Support: Help users make informed choices about modeling approaches based on signal strength
Candidate Ranking: Enable efficient prioritization of sequences for experimental testing

Architecture

Core Components

The package is structured around these key components:

Core API (core.py)
  High-level functions for training, prediction, and model management
  Handles data loading and orchestration between embedders and models
Embedders (embedders.py)
  OneHotEmbedder: One-hot encoding for protein, DNA, RNA, and custom alphabets
  Factory function get_embedder to create embedder instances
Alphabets (alphabets.py)
  Alphabet class for representing character sets and tokenization rules
  Support for standard alphabets (protein, DNA, RNA) and custom alphabets
  Handles mixed-length tokens and various sequence formats
Models (models.py)
  SequenceFunctionModel: Main model class integrating with PyCaret
  Handles training, prediction, evaluation, and persistence
CLI (cli.py)
  Command-line interface built with Typer
  Commands for training, prediction, and embedding comparison
Synthetic Data (synthetic.py)
  Functions for generating synthetic sequence-function data
  Various task generators for different use cases

Data Flow

User provides sequence-function data (sequences + target values)
Data is validated and preprocessed
Sequences are embedded using the selected embedding method
PyCaret explores various ML models on the embeddings
Best model is selected, fine-tuned, and returned
Results and model artifacts are saved
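
The six steps above can be sketched end to end in plain Python. This is a hypothetical illustration, not Fast-SeqFunc's implementation: every helper name here (validate, one_hot, fit_mean, fit_1nn, r2, run_pipeline) is invented for the example, two toy candidate models stand in for PyCaret's model comparison, and the fine-tuning step is omitted.

```python
import json
from statistics import mean


def validate(rows):
    """Step 2: keep rows with a non-empty sequence and a numeric target."""
    return [(seq.upper(), float(y)) for seq, y in rows
            if seq and isinstance(y, (int, float))]


def one_hot(seq, alphabet="ACGT"):
    """Step 3: flattened one-hot vector (assumes equal-length sequences)."""
    return [1.0 if ch == sym else 0.0 for ch in seq for sym in alphabet]


def fit_mean(X, y):
    """Baseline candidate: always predict the training mean."""
    mu = mean(y)
    return lambda x: mu


def fit_1nn(X, y):
    """Toy candidate: copy the target of the nearest training embedding."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return y[dists.index(min(dists))]
    return predict


def r2(y_true, y_pred):
    """Coefficient of determination (undefined for a constant y_true)."""
    mu = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mu) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def run_pipeline(train_rows, test_rows):
    train, test = validate(train_rows), validate(test_rows)
    X_train, y_train = [one_hot(s) for s, _ in train], [y for _, y in train]
    X_test, y_test = [one_hot(s) for s, _ in test], [y for _, y in test]
    # Step 4: score every candidate on held-out data.
    models = {"mean": fit_mean(X_train, y_train), "1nn": fit_1nn(X_train, y_train)}
    scores = {name: r2(y_test, [m(x) for x in X_test]) for name, m in models.items()}
    # Step 5: keep the best scorer (hyperparameter tuning omitted in this sketch).
    best = max(scores, key=scores.get)
    # Step 6: serialize the results; a real run would also persist the model.
    report = json.dumps({"best_model": best, "scores": scores})
    return models[best], report


# Step 1: user-provided data (made-up toy values; the empty row is dropped).
train_rows = [("AAAA", 4.0), ("AACC", 2.0), ("CCCC", 0.0), ("ACGT", 1.0), ("", 9.0)]
test_rows = [("AAAT", 3.0), ("CCCG", 1.0)]
best_model, report = run_pipeline(train_rows, test_rows)
```

Scoring on held-out data rather than the training set is what lets a comparison like this say anything about signal: a candidate that cannot beat the mean baseline has learned nothing transferable.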

API Design

High-Level API

import pickle

from fast_seqfunc import train_model, predict, load_model

# Train a model
model_info = train_model(
    train_data="train_data.csv",
    test_data="test_data.csv",
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="regression",
    optimization_metric="r2",
)

# Make predictions (new_sequences is a placeholder for the sequences to score)
predictions = predict(model_info, new_sequences)

# Save and load models
with open("model.pkl", "wb") as f:
    pickle.dump(model_info, f)

with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

Command-Line Interface

The CLI provides commands for training, prediction, and embedding comparison:

# Train a model
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --embedding-method one-hot

# Make predictions
fast-seqfunc predict model.pkl new_sequences.csv --output-path predictions.csv

# Compare embedding methods
fast-seqfunc compare-embeddings train_data.csv --test-data test_data.csv

Key Design Decisions

1. Embedding Strategy

One-Hot Encoding: Primary embedding method for all sequence types
Custom Alphabets: Support for user-defined alphabets through the Alphabet class
Auto-Detection: Infers the sequence type (protein, DNA, or RNA) from the sequences themselves
Gap Handling: Configurable padding for sequences of different lengths
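
One plausible way to auto-detect the sequence type is to compare the characters that occur in the data against each standard alphabet. The sketch below is an assumption about how such a heuristic could work, not the package's actual logic. The function name detect_sequence_type, the "custom" fallback, and the rule that ambiguous input (for example, only A, C, and G) resolves to DNA are all hypothetical.

```python
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def detect_sequence_type(sequences, gap="-"):
    """Guess the sequence type from the set of characters observed."""
    chars = set("".join(sequences).upper()) - {gap}
    if not chars:
        raise ValueError("no sequence characters to inspect")
    # Check the smallest alphabets first: every DNA letter is also a valid
    # amino-acid letter, so the order of these tests matters.
    if chars <= DNA:
        return "dna"
    if chars <= RNA:
        return "rna"
    if chars <= PROTEIN:
        return "protein"
    return "custom"
```

Because the nucleotide alphabets are subsets of the protein alphabet, a short peptide made only of A, C, G, and T would be misread as DNA. This is one reason to let users set the sequence type explicitly.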

2. Alphabet Design

Flexible Tokenization: Support for character-based, delimited, and regex-based tokenization
Standard Alphabets: Built-in support for protein, DNA, and RNA
Token Mapping: Bidirectional mapping between tokens and indices
Sequence Padding: Automatic handling of variable-length sequences
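
The three tokenization modes and the bidirectional mapping can be illustrated with a small stand-in class. SimpleAlphabet is a hypothetical name chosen to avoid implying anything about the real Alphabet class. Its constructor arguments and method names are assumptions made for this sketch.

```python
import re


class SimpleAlphabet:
    """Hypothetical stand-in for an alphabet with configurable tokenization."""

    def __init__(self, tokens, delimiter=None, pattern=None):
        self.tokens = list(tokens)
        self.delimiter = delimiter
        self.pattern = re.compile(pattern) if pattern else None
        # Bidirectional mapping between tokens and integer indices.
        self.token_to_idx = {tok: i for i, tok in enumerate(self.tokens)}
        self.idx_to_token = {i: tok for tok, i in self.token_to_idx.items()}

    def tokenize(self, sequence):
        if self.delimiter is not None:   # delimited, e.g. "10,1,11"
            return sequence.split(self.delimiter)
        if self.pattern is not None:     # regex-based, mixed-length tokens
            return self.pattern.findall(sequence)
        return list(sequence)            # character-based

    def encode(self, sequence):
        return [self.token_to_idx[tok] for tok in self.tokenize(sequence)]

    def decode(self, indices):
        joiner = self.delimiter if self.delimiter is not None else ""
        return joiner.join(self.idx_to_token[i] for i in indices)


dna = SimpleAlphabet("ACGT")
integers = SimpleAlphabet(["0", "1", "10", "11"], delimiter=",")
# "mC" is an invented two-character token used to show mixed-length tokens.
modified = SimpleAlphabet(["A", "C", "G", "T", "mC"], pattern=r"m?[ACGT]")
```

In this sketch the regex mode is what makes mixed-length tokens workable: modified.tokenize("AmCG") yields three tokens, whereas character-based splitting would yield four.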

3. Model Integration

PyCaret Integration: Leverage PyCaret for automated model selection
Model Type Flexibility: Support for regression and classification tasks
Performance Evaluation: Built-in metrics calculation based on model type
Serialization: Simple model saving and loading

4. Synthetic Data Generation

Task Generators: Functions for creating various sequence-function relationships
Customization: Configurable difficulty, noise, and relationship types
Data Types: Support for protein, DNA, RNA, and integer sequences

Implementation Details

OneHotEmbedder

Supports protein, DNA, RNA, and custom sequences
Auto-detects the sequence type when it is set to 'auto'
Pads and truncates sequences using a configurable gap character
Provides both flattened and 2D one-hot encodings
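
As a minimal sketch of the encoding described above, the following functions produce 2D and flattened one-hot encodings with padding and truncation. They are illustrative only: the names one_hot_2d and one_hot_flat, the max_length parameter, and the choice to give the gap character its own column are assumptions, not the OneHotEmbedder API.

```python
def one_hot_2d(sequence, alphabet="ACGT", max_length=None, gap="-"):
    """Return a (length x symbols) one-hot matrix as nested lists."""
    symbols = list(alphabet) + [gap]  # assumption: the gap gets its own column
    index = {sym: i for i, sym in enumerate(symbols)}
    length = max_length if max_length is not None else len(sequence)
    # Truncate long sequences, then right-pad short ones with the gap character.
    padded = sequence[:length].ljust(length, gap)
    matrix = []
    for ch in padded:
        row = [0] * len(symbols)
        row[index[ch]] = 1  # raises KeyError for characters outside the alphabet
        matrix.append(row)
    return matrix


def one_hot_flat(sequence, **kwargs):
    """Flatten the 2D encoding row by row into a single feature vector."""
    return [value for row in one_hot_2d(sequence, **kwargs) for value in row]
```

The flattened form is what a tabular learner consumes, which is why every sequence must be padded or truncated to the same length: the feature vector has length times symbols entries.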

Alphabet Class

Represents sets of tokens with various tokenization strategies
Provides factory methods for standard biological alphabets
Supports custom token sets with arbitrary delimiters
Handles sequence padding and token-to-index mappings

SequenceFunctionModel

Integrates with PyCaret for model training
Handles different model types (regression, classification)
Provides model evaluation methods
Supports serialization for saving/loading
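
Evaluation that depends on the model type can be sketched as a simple dispatch. The function name evaluate and the particular metrics returned are assumptions chosen for illustration. The source does not list which metrics SequenceFunctionModel reports.

```python
from math import sqrt


def evaluate(y_true, y_pred, model_type):
    """Return a dict of metrics appropriate to the model type."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    if model_type == "regression":
        errors = [t - p for t, p in zip(y_true, y_pred)]
        return {
            "rmse": sqrt(sum(e * e for e in errors) / n),  # root mean squared error
            "mae": sum(abs(e) for e in errors) / n,        # mean absolute error
        }
    if model_type == "classification":
        return {"accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n}
    raise ValueError(f"unknown model_type: {model_type!r}")
```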

Synthetic Data Generation

Generate random sequences with controlled properties
Create sequence-function datasets with known relationships
Support for various task types (count, position, pattern, etc.)
Configurable noise and complexity levels
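
A count task is the simplest case to illustrate: the target is the number of occurrences of one character, optionally perturbed by Gaussian noise. The generator below is a sketch under stated assumptions. The name generate_count_task and all of its parameters are hypothetical and are not taken from synthetic.py.

```python
import random


def generate_count_task(n=100, length=20, alphabet="ACGT",
                        target_char="G", noise=0.0, seed=0):
    """Return n (sequence, value) pairs where value = count of target_char + noise."""
    rng = random.Random(seed)  # a local generator keeps the output reproducible
    data = []
    for _ in range(n):
        seq = "".join(rng.choice(alphabet) for _ in range(length))
        jitter = rng.gauss(0.0, noise) if noise > 0 else 0.0
        data.append((seq, seq.count(target_char) + jitter))
    return data
```

Because the true relationship is known, a dataset like this gives a reference point: a pipeline that fails to recover the count signal at zero noise is likely misconfigured.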

Dependencies

Core dependencies:
pandas: Data handling
numpy: Numerical operations
pycaret: Automated ML
scikit-learn: Model evaluation metrics
loguru: Logging
typer: CLI
lazy-loader: Lazy imports

Future Enhancements

Advanced Embedders:
  Implement CARP integration for protein embeddings
  Implement ESM2 integration for protein embeddings
Caching Mechanism:
  Add disk caching for embeddings to improve performance on repeated runs
Enhance PyCaret Integration:
  Add more customization options for model selection
  Support for custom models
Expand Data Loading:
  Support for FASTA file formats
  Support for more complex dataset structures
Add Visualization:
  Built-in visualizations for model performance
  Sequence importance analysis

Conclusion

Fast-SeqFunc provides a streamlined approach to sequence-function modeling with a focus on simplicity and automation. The architecture balances flexibility with ease of use, allowing users to train models with minimal code while providing options for custom alphabets and sequence types.

The current implementation focuses on one-hot encoding with strong support for custom alphabets, while laying the groundwork for more advanced embedding methods in the future.