Fast-SeqFunc Documentation

fast-seqfunc is a Python library for quickly building sequence-function models. It uses PyCaret to train and compare machine learning models that predict functional properties from biological sequences.

Getting Started

Quickstart Tutorial - Learn the basics of training and using sequence-function models
Regression Tutorial - Learn how to predict continuous values from sequences
Classification Tutorial - Learn how to classify sequences into discrete categories

Installation

Install fast-seqfunc using pip:

pip install fast-seqfunc

Or directly from GitHub for the latest version:

pip install git+https://github.com/ericmjl/fast-seqfunc.git

Key Features

Easy-to-use API: Train models and make predictions with just a few lines of code
Automatic Model Selection: Uses PyCaret to automatically compare and select the best model
Sequence Embedding: Currently supports one-hot encoding, with more methods planned
Regression and Classification: Supports both continuous and categorical targets
Comprehensive Evaluation: Built-in metrics and visualization utilities
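To make the one-hot encoding feature concrete, here is a minimal sketch of the general technique in plain Python. It is an illustration only and does not claim to match how fast-seqfunc implements its embedder internally; the `ALPHABET` constant and the `one_hot_encode` function are hypothetical names, and the 20-letter amino acid alphabet is an assumption.

```python
from typing import List

# Assumed alphabet: the 20 standard amino acids (an illustrative choice).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence: str, alphabet: str = ALPHABET) -> List[int]:
    """Flatten a sequence into a binary vector of length len(sequence) * len(alphabet)."""
    index = {char: i for i, char in enumerate(alphabet)}
    vector = [0] * (len(sequence) * len(alphabet))
    for position, char in enumerate(sequence):
        if char not in index:
            raise ValueError(f"Unexpected character {char!r} at position {position}")
        # Each position owns a block of len(alphabet) slots; set exactly one of them.
        vector[position * len(alphabet) + index[char]] = 1
    return vector


encoded = one_hot_encode("ACD")
print(len(encoded))  # 3 positions * 20 letters = 60
print(sum(encoded))  # one active slot per position = 3
```

Because every sequence position becomes its own block of features, this flat representation assumes all sequences have the same length; variable-length inputs would need padding or alignment first.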

Why Fast-SeqFunc?

The primary motivation behind Fast-SeqFunc is to quickly answer a crucial question in sequence-function modeling: Is there detectable signal in my data?

In biological sequence-function problems, determining whether a predictive relationship exists is a critical first step before investing significant resources in complex modeling approaches. Fast-SeqFunc allows you to:

Rapidly detect signal: Quickly build baseline models to determine if your sequence data contains predictive information
Make early decisions: Identify promising directions early in your research process
Rank candidates efficiently: Use simple but effective models to score and prioritize candidate sequences for experimental testing
Validate before scaling: Confirm signal exists before investing time in developing more complex neural network models
Iterate strategically: When signal is detected, use that knowledge to guide the development of more sophisticated models

By providing a fast path to baseline model development, Fast-SeqFunc helps you make informed decisions about where to focus your modeling efforts.
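One generic way to ask "is there detectable signal?" is a permutation test: compare how well held-out predictions correlate with the measured values against how well they correlate once the measured values are shuffled. The sketch below uses only the standard library and is not part of fast-seqfunc; `pearson` and `permutation_p_value` are hypothetical helper names, and the example values are placeholders rather than real results.

```python
import random
from statistics import mean
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def permutation_p_value(
    observed: Sequence[float],
    predicted: Sequence[float],
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of label shuffles that correlate at least as well as the real labels."""
    rng = random.Random(seed)
    actual = pearson(observed, predicted)
    shuffled = list(observed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pearson(shuffled, predicted) >= actual:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Placeholder held-out measurements and model predictions.
observed = [1, 2, 3, 4, 5, 6, 7, 8]
predicted = [2, 4, 6, 8, 10, 12, 14, 16]
print(permutation_p_value(observed, predicted))
```

A small p-value suggests the model captures more than chance structure; a large one is a hint to revisit the data or the representation before investing in heavier models.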

Basic Usage

Command-Line Interface

Fast-SeqFunc provides a convenient command-line interface for common tasks:

# Train a model
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --embedding-method one-hot

# Make predictions with a trained model
fast-seqfunc predict-cmd model.pkl new_sequences.csv --sequence-col sequence --output-dir predictions --predictions-filename predictions.csv

# Compare different embedding methods
fast-seqfunc compare-embeddings train_data.csv --test-data test_data.csv

Python API

You can also use Fast-SeqFunc programmatically in your Python code:

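import pandas as pd

from fast_seqfunc import train_model, predict, save_model

# Load training data; the file and column names mirror the CLI example above
train_df = pd.read_csv("train_data.csv")

# Train a model
model_info = train_model(
    train_data=train_df,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="regression",
)

# Make predictions on new sequences
# (assumes predict accepts a column of sequence strings)
new_sequences = pd.read_csv("new_sequences.csv")["sequence"]
predictions = predict(model_info, new_sequences)

# Save the model
save_model(model_info, "model.pkl")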
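Once predictions are available, ranking candidates for experimental testing is a simple sort. The sketch below assumes that the predictions come back as one numeric score per input sequence, in input order; that ordering is an assumption, and `rank_candidates` is a hypothetical helper, not a fast-seqfunc function. The scores shown are made-up placeholders.

```python
from typing import List, Sequence, Tuple


def rank_candidates(
    sequences: Sequence[str],
    scores: Sequence[float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (sequence, score) pairs, highest score first."""
    if len(sequences) != len(scores):
        raise ValueError("sequences and scores must have the same length")
    ranked = sorted(zip(sequences, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


candidates = ["ACDE", "ACDF", "ACDG", "ACDH"]
placeholder_scores = [0.12, 0.87, 0.45, 0.66]  # stand-ins for model predictions

for sequence, score in rank_candidates(candidates, placeholder_scores, top_k=2):
    print(sequence, score)
```

For targets where lower values are better, drop `reverse=True` or negate the scores before ranking.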

Roadmap

Future development plans include:

Additional embedding methods (ESM, CARP, etc.)
Integration with more advanced deep learning models
Enhanced visualization and interpretation tools
Expanded support for various sequence types
Benchmarking against established methods

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue to discuss improvements or feature requests.