Fast-SeqFunc Quickstart

This guide demonstrates how to use fast-seqfunc for training sequence-function models and making predictions with your own sequence data.

Prerequisites

Python 3.11 or higher
The fast-seqfunc package installed

Setup

The fast-seqfunc package comes with a command-line interface (CLI) that makes it easy to train models and make predictions without writing any code.

To see all available commands, run:

fast-seqfunc --help

For help with a specific command, use:

fast-seqfunc [command] --help

Data Preparation

For this tutorial, we assume you already have a sequence-function dataset stored as a CSV file in the following format:

sequence,function
ACGTACGT...,0.75
TACGTACG...,0.63
...

You'll need to split your data into training, validation, and test sets. You can use any CSV manipulation tool for this, or you can use the built-in synthetic data generator to create sample data that is already split:

# Generate synthetic regression data (DNA sequences whose function is their G count)
fast-seqfunc generate-synthetic g_count --output-dir data --total-count 1000 --split-data

This will create train.csv, val.csv, and test.csv files in the data directory.
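
If you are starting from a single CSV of your own rather than synthetic data, the split can be done with a few lines of Python. The sketch below is only one way to do it: the helper names split_rows, read_csv, and write_csv, the 80/10/10 proportions, and the fixed seed are illustrative assumptions and are not part of fast-seqfunc.

```python
import csv
import random


def split_rows(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows and split them into (train, val, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def read_csv(path):
    """Read a CSV file with a header line into a list of dict rows."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def write_csv(path, rows, fieldnames=("sequence", "function")):
    """Write a list of dict rows to a CSV file with a header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, calling split_rows(read_csv("my_data.csv")) and passing each returned list to write_csv with the paths data/train.csv, data/val.csv, and data/test.csv would produce the three files used in the rest of this guide. The input file name my_data.csv is a placeholder.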

Training a Model

With fast-seqfunc, you can train a model with a single command:

# Train and compare multiple models automatically
fast-seqfunc train data/train.csv \
  --val-data data/val.csv \
  --test-data data/test.csv \
  --sequence-col sequence \
  --target-col function \
  --embedding-method one-hot \
  --model-type regression \
  --output-dir outputs

# The model will be saved to outputs/model.pkl by default

The command above will:

Load your training, validation, and test data
Embed the sequences using one-hot encoding
Train multiple regression models using PyCaret
Select the best model based on performance
Evaluate the model on the test data
Save the model and performance metrics
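
To make the embedding step concrete, here is a minimal sketch of one-hot encoding for DNA, written independently of fast-seqfunc. The function name one_hot_encode and the alphabet ordering A, C, G, T are assumptions for illustration; the package's own encoder may order, pad, or flatten its output differently.

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Return a flat list with one 0/1 indicator per (position, letter)."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    encoded = []
    for letter in sequence.upper():
        row = [0] * len(alphabet)
        # Letters outside the alphabet (e.g. N) stay all-zero in this sketch.
        if letter in index:
            row[index[letter]] = 1
        encoded.extend(row)
    return encoded


features = one_hot_encode("ACG")
print(features)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
```

With this scheme, a DNA sequence of length L becomes a vector of length 4L, so sequences need a common length (or padding) before they can be stacked into a single feature matrix.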

Making Predictions

Making predictions on new sequences is straightforward:

# Make predictions on test data
fast-seqfunc predict-cmd outputs/model.pkl data/test.csv \
  --sequence-col sequence \
  --output-dir prediction_outputs \
  --predictions-filename predictions.csv

# Results will be saved to prediction_outputs/predictions.csv
# A histogram of predictions will be generated (if applicable)

This command will:

Load your trained model
Load the sequences from your test data
Generate predictions for each sequence
Save the results to a CSV file with both the original sequences and the predictions

Comparing Embedding Methods

You can also compare different embedding methods to see which works best for your data:

# Compare different embedding methods on the same dataset
fast-seqfunc compare-embeddings data/train.csv \
  --test-data data/test.csv \
  --sequence-col sequence \
  --target-col function \
  --model-type regression \
  --output-dir comparison_outputs

# Results will be saved to comparison_outputs/embedding_comparison.csv
# Individual models will be saved in comparison_outputs/models/

This command will:

Train models using different embedding methods (one-hot, and others if available)
Evaluate each model on the test data
Compare the performance metrics
Save the results and models

Generating Synthetic Data

Fast-SeqFunc includes a synthetic data generator that covers several kinds of sequence-function relationships:

# See available synthetic data tasks
fast-seqfunc list-synthetic-tasks

# Generate data for a specific task
fast-seqfunc generate-synthetic motif_position \
  --sequence-type dna \
  --motif ATCG \
  --noise-level 0.2 \
  --output-dir data/motif_task

# Generate classification data
fast-seqfunc generate-synthetic classification \
  --sequence-type protein \
  --output-dir data/classification_task

The synthetic data generator can create datasets with various sequence-function relationships, including:

Linear relationships (G count, GC content)
Position-dependent functions (motif position)
Nonlinear relationships (length-dependent functions)
Classification problems (presence/absence of patterns)
Other tasks, listed in full by fast-seqfunc list-synthetic-tasks
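
As an illustration of the simplest relationship in this list, the following sketch generates random DNA sequences whose function value is the number of G bases plus optional Gaussian noise. It is a stand-alone approximation of the idea, not the generator's actual code; make_g_count_data and its parameters are hypothetical names.

```python
import random


def make_g_count_data(count=5, length=8, noise=0.0, seed=0):
    """Return (sequence, value) pairs where value = G count + Gaussian noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        # With noise=0.0 the value is exactly the number of G bases.
        value = seq.count("G") + rng.gauss(0.0, noise)
        rows.append((seq, value))
    return rows
```

Because the true relationship is known, data like this is a useful sanity check: a model that cannot recover the G count from one-hot features points to a problem in the pipeline rather than in the data.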

Interpreting Results for Signal Detection

One of the primary purposes of Fast-SeqFunc is to quickly determine if there is meaningful "signal" in your sequence-function data. Here's how to interpret your results:

Evaluating Signal Presence

Check performance metrics:
  For regression: R², RMSE, and MAE values
  For classification: accuracy, F1 score, AUC-ROC
Use visualizations:
  Scatter plots of predicted vs. actual values
  Residual plots, to see whether errors follow a systematic pattern or look like random noise
  ROC curves for classification tasks
Apply rough benchmarks for signal (rules of thumb, not hard thresholds):
  Models that clearly outperform random guessing on held-out data indicate signal
  R² values above roughly 0.3-0.4 suggest a detectable relationship
  AUC-ROC values above roughly 0.6-0.7 indicate useful classification signal
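
For readers who want to see what these metrics measure, the sketch below computes them from scratch in plain Python. The function names regression_metrics and auc_roc are illustrative and are not fast-seqfunc or PyCaret APIs; in practice the values reported by the tool should be used, and this code only spells out the standard definitions.

```python
import math


def regression_metrics(actual, predicted):
    """Return R^2, RMSE, and MAE for two equal-length lists of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    # ss_tot is zero if every actual value is identical; R^2 is undefined then.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

A model that always predicts the mean of the actual values scores R² = 0, and random scores give an AUC-ROC near 0.5, which is why values well above those baselines are read as evidence of signal.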

Leveraging Early Signal

When you detect signal:

Prioritize candidates: Use model predictions to rank and select promising sequences for experimental testing
Iterate experimentally: Test top-ranked sequences and use results to refine your model
Decide on complexity: Strong signal can justify investing in more sophisticated models, such as neural networks
Compare embedding methods: If signal is present, explore whether more complex embeddings (e.g. ESM, CARP) improve performance as support for them becomes available

Remember that even modest performance can be valuable for prioritizing experimental candidates and guiding exploration of sequence space.
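
Prioritizing candidates can be as simple as sorting by predicted value. The sketch below assumes predictions are available as (sequence, prediction) pairs and that higher values are better; top_candidates is a hypothetical helper, the example values are made up, and the column layout of the real predictions file may differ.

```python
def top_candidates(predictions, k=3):
    """Return the k (sequence, prediction) pairs with the highest predictions."""
    ranked = sorted(predictions, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up example values, for illustration only.
predictions = [("ACGT", 0.2), ("GGGT", 0.9), ("GCGT", 0.6), ("ACGG", 0.7)]
print(top_candidates(predictions, k=2))  # [('GGGT', 0.9), ('ACGG', 0.7)]
```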

Next Steps

After mastering the basics, you can:

Try different embedding methods (currently only one-hot is supported, with more coming soon)
Experiment with classification problems by setting --model-type classification
Generate different types of synthetic data to benchmark your approach
Explore the Python API for more advanced customization

For more details, check out the API documentation.