API Reference

This document describes the main functions and classes in the fast-seqfunc package.

Core Functions

`train_model`

from fast_seqfunc import train_model

model_info = train_model(
    train_data,
    val_data=None,
    test_data=None,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="regression",
    optimization_metric=None,
    **kwargs
)

Trains a sequence-function model using PyCaret.

Parameters:

train_data: DataFrame or path to CSV file with training data.
val_data: Optional validation data. Currently not used directly; reserved for future use.
test_data: Optional test data for final evaluation.
sequence_col: Column name containing sequences.
target_col: Column name containing target values.
embedding_method: Method to use for embedding sequences. Currently only "one-hot" is supported.
model_type: Type of modeling problem ("regression" or "classification").
optimization_metric: Metric to optimize during model selection (e.g., "r2", "accuracy", "f1").
**kwargs: Additional arguments passed to PyCaret setup.

Returns:

Dictionary containing the trained model and related metadata.
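As a minimal sketch of preparing input for `train_model`, the example below writes a small CSV using the default column names ("sequence" and "function") and wraps the training call in a function. The helper names `write_training_csv` and `train_from_csv`, the file name, and the toy data are assumptions made for illustration. The `train_model` import is deferred so the data-preparation part runs without the package installed.

```python
import csv
import tempfile
from pathlib import Path


def write_training_csv(rows, path, sequence_col="sequence", target_col="function"):
    """Write (sequence, target) pairs to a CSV with the given column names."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([sequence_col, target_col])
        writer.writerows(rows)
    return Path(path)


def train_from_csv(path):
    """Call train_model on a CSV path (requires fast-seqfunc to be installed)."""
    from fast_seqfunc import train_model  # imported lazily on purpose

    return train_model(
        path,
        sequence_col="sequence",
        target_col="function",
        embedding_method="one-hot",
        model_type="regression",
        optimization_metric="r2",
    )


# Toy data invented for illustration only; real training needs far more rows.
toy_rows = [("ACGT", 0.12), ("ACGA", 0.35), ("TCGT", 0.80)]
csv_path = write_training_csv(toy_rows, Path(tempfile.mkdtemp()) / "train.csv")
```

Calling `train_from_csv(csv_path)` on a realistically sized dataset would then return the model dictionary described above.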

`predict`

from fast_seqfunc import predict

predictions = predict(
    model_info,
    sequences,
    sequence_col="sequence"
)

Generates predictions for new sequences using a trained model.

Parameters:

model_info: Dictionary from train_model containing model and related information.
sequences: Sequences to predict (list, Series, or DataFrame).
sequence_col: Column name containing sequences when sequences is a DataFrame.

Returns:

Array of predictions.

`save_model`

from fast_seqfunc import save_model

save_model(model_info, path)

Saves the model to disk.

Parameters:

model_info: Dictionary containing model and related information.
path: Path to save the model.

Returns:

None

`load_model`

from fast_seqfunc import load_model

model_info = load_model(path)

Loads a trained model from disk.

Parameters:

path: Path to saved model file.

Returns:

Dictionary containing the model and related information.
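One way to combine `predict`, `save_model`, and `load_model` is a round-trip check: predict, save, reload, predict again, and compare. The sketch below is an assumption about usage, not part of the package. The function name `round_trip` and its `api` parameter are hypothetical; `api` exists only so the sketch can be exercised with a stand-in object when fast-seqfunc is not installed.

```python
def round_trip(model_info, sequences, path, api=None):
    """Save, reload, and confirm that predictions are unchanged.

    `api` is a hypothetical hook for this sketch; by default the
    fast_seqfunc package itself is used.
    """
    if api is None:
        import fast_seqfunc as api  # requires fast-seqfunc to be installed

    # Predictions from the in-memory model.
    before = list(api.predict(model_info, sequences))

    # Persist the model, then load it back from disk.
    api.save_model(model_info, path)
    reloaded = api.load_model(path)

    # Predictions from the reloaded model should match exactly.
    after = list(api.predict(reloaded, sequences))
    return before == after
```

With a model dictionary returned by `train_model`, `round_trip(model_info, ["ACGT", "ACGA"], "model.pkl")` would be expected to return `True`; the file name here is only an example.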

Embedder Classes

`OneHotEmbedder`

from fast_seqfunc.embedders import OneHotEmbedder

embedder = OneHotEmbedder(sequence_type="auto")
embeddings = embedder.fit_transform(sequences)

One-hot encoding for protein or nucleotide sequences.

Parameters:

sequence_type: Type of sequences to encode ("protein", "dna", "rna", or "auto").

Methods:

fit(sequences): Determine alphabet and set up the embedder.
transform(sequences): Transform sequences to one-hot encodings.
fit_transform(sequences): Fit and transform in one step.
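To illustrate what the three methods do conceptually, here is a small pure-Python one-hot encoder with the same method names. It is a sketch of the general technique, not the `OneHotEmbedder` implementation: the class name `TinyOneHot` is hypothetical, the alphabet is inferred simply as the sorted set of characters seen during fitting, and each sequence is flattened into one vector of length `len(sequence) * len(alphabet)`.

```python
class TinyOneHot:
    """Illustrative one-hot encoder; not the fast-seqfunc implementation."""

    def fit(self, sequences):
        # The alphabet is every character seen during fitting, in sorted order.
        self.alphabet = sorted(set("".join(sequences)))
        self.index = {ch: i for i, ch in enumerate(self.alphabet)}
        return self

    def transform(self, sequences):
        encoded = []
        for seq in sequences:
            row = []
            for ch in seq:
                # One slot per alphabet letter, with a single 1 at the letter's index.
                slot = [0] * len(self.alphabet)
                slot[self.index[ch]] = 1  # raises KeyError for unseen characters
                row.extend(slot)
            encoded.append(row)
        return encoded

    def fit_transform(self, sequences):
        return self.fit(sequences).transform(sequences)


encoder = TinyOneHot()
vectors = encoder.fit_transform(["ACGT", "AACC"])
```

In this sketch, sequences of different lengths produce vectors of different lengths, so equal-length input is assumed whenever the result feeds a tabular model.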

Helper Functions

`get_embedder`

from fast_seqfunc.embedders import get_embedder

embedder = get_embedder(method="one-hot")

Returns an embedder instance for the given method name.

Parameters:

method: Embedding method (currently only "one-hot" is supported).

Returns:

Configured embedder instance.

`evaluate_model`

from fast_seqfunc.core import evaluate_model

results = evaluate_model(
    model,
    X_test,
    y_test,
    embedder,
    model_type,
    embed_cols
)

Evaluates model performance on test data.

Parameters:

model: Trained model.
X_test: Test sequences.
y_test: True target values.
embedder: Embedder to transform sequences.
model_type: Type of model ("regression" or "classification").
embed_cols: Column names for embedded features.

Returns:

Dictionary containing metrics and prediction data with structure:

{
   "metrics": {metric_name: value, ...},
   "predictions_data": {
       "y_true": [...],
       "y_pred": [...]
   }
}
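The structure above can be illustrated with a short pure-Python function that builds a dictionary of the same shape for a regression problem. This is a sketch only: the function name `regression_report` is hypothetical, and the metric names `"r2"` and `"rmse"` are assumptions, since the exact metrics reported by `evaluate_model` are not listed here.

```python
import math


def regression_report(y_true, y_pred):
    """Build a metrics dictionary shaped like the structure shown above."""
    n = len(y_true)
    mean_true = sum(y_true) / n

    # Residual and total sums of squares; ss_tot must be non-zero,
    # so y_true is assumed not to be constant.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    return {
        "metrics": {
            "r2": 1.0 - ss_res / ss_tot,
            "rmse": math.sqrt(ss_res / n),
        },
        "predictions_data": {
            "y_true": list(y_true),
            "y_pred": list(y_pred),
        },
    }


report = regression_report([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

Keeping the raw `y_true` and `y_pred` values next to the summary metrics lets downstream code recompute or plot results without rerunning the model.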

`save_detailed_metrics`

from fast_seqfunc.core import save_detailed_metrics

save_detailed_metrics(
    metrics_data,
    output_dir,
    model_type,
    embedding_method="unknown"
)

Saves detailed model metrics to files in the specified directory.

Parameters:

metrics_data: Dictionary containing metrics and prediction data from evaluate_model.
output_dir: Directory to save metrics files.
model_type: Type of model ("regression" or "classification").
embedding_method: Embedding method used for this model.

Returns:

None

Output Files:

JSON file with detailed metrics
CSV file with raw predictions and true values
Visualization plots based on model type:
    For regression: scatter plot, residual plot
    For classification: confusion matrix
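As a rough sketch of how the JSON and CSV outputs listed above could be produced from a metrics dictionary, the example below uses only the standard library. It does not reproduce `save_detailed_metrics`: the function name `write_metrics_files` and the file names `metrics.json` and `predictions.csv` are assumptions, and plotting is omitted.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_metrics_files(metrics_data, output_dir):
    """Write metrics to JSON and predictions to CSV; returns both paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # File names are hypothetical; the package may use different ones.
    json_path = out / "metrics.json"
    csv_path = out / "predictions.csv"

    json_path.write_text(json.dumps(metrics_data["metrics"], indent=2))

    preds = metrics_data["predictions_data"]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["y_true", "y_pred"])
        writer.writerows(zip(preds["y_true"], preds["y_pred"]))

    return json_path, csv_path


# Example input invented for illustration.
example = {
    "metrics": {"r2": 0.5},
    "predictions_data": {"y_true": [1.0, 2.0], "y_pred": [1.5, 2.5]},
}
json_file, csv_file = write_metrics_files(example, Path(tempfile.mkdtemp()) / "metrics")
```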