Sequence Regression with Fast-SeqFunc
This tutorial demonstrates how to use fast-seqfunc
for regression problems, where you want to predict continuous values from biological sequences.
Overview
In sequence regression, we want to learn to predict a continuous value (e.g., binding affinity, enzyme efficiency, or protein stability) from a biological sequence (DNA, RNA, or protein). This tutorial will walk you through:
- Setting up your environment
- Preparing sequence-function data
- Training a regression model
- Evaluating model performance
- Making predictions on new sequences
- Visualizing results
Prerequisites
- Python 3.11 or higher
- The following packages:
pip install fast-seqfunc pandas numpy matplotlib seaborn scikit-learn loguru
Setup
First, let's import all necessary packages:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from fast_seqfunc import train_model, predict, save_model, load_model
from loguru import logger
Working with Sequence-Function Data
Sequence-function data typically consists of biological sequences paired with measurements of a functional property. For this tutorial, we'll create synthetic data:
from fast_seqfunc import generate_dataset_by_task
# Generate a GC content dataset as an example
# (where the function is simply the GC content of DNA sequences)
data = generate_dataset_by_task(
task="gc_content",
count=1000, # Number of sequences to generate
length=50, # Sequence length
noise_level=0.1, # Add some noise to make the task more realistic
)
# Examine the data
print(data.head())
print(f"Data shape: {data.shape}")
print(f"Target distribution: min={data['function'].min():.3f}, "
f"max={data['function'].max():.3f}, "
f"mean={data['function'].mean():.3f}")
Preparing Your Own Data
If you have your own data, it should be structured in a DataFrame with at least two columns: - A column containing the sequences (e.g., "sequence") - A column containing the target values (e.g., "function")
For example:
# Load your own data
# data = pd.read_csv("your_sequence_function_data.csv")
Splitting Data for Training and Testing
It's important to evaluate your model on data it hasn't seen during training:
# Split into train and test sets (80/20 split)
train_size = int(0.8 * len(data))
train_data = data[:train_size].copy()
test_data = data[train_size:].copy()
logger.info(f"Data split: {len(train_data)} train, {len(test_data)} test samples")
# Create output directory for results
output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)
Training a Regression Model
Now we can train a regression model using fast-seqfunc
:
# Train a regression model
logger.info("Training regression model...")
model_info = train_model(
train_data=train_data,
test_data=test_data,
sequence_col="sequence", # Column containing sequences
target_col="function", # Column containing target values
embedding_method="one-hot", # Method to convert sequences to numerical features
model_type="regression", # Specify regression task
optimization_metric="r2", # Metric to optimize (r2, rmse, mae)
)
# Display test results
if model_info.get("test_results"):
logger.info("Test metrics from training:")
for metric, value in model_info["test_results"].items():
logger.info(f" {metric}: {value:.4f}")
# Save the model for later use
model_path = output_dir / "regression_model.pkl"
save_model(model_info, model_path)
logger.info(f"Model saved to {model_path}")
Understanding Embedding Methods
The embedding_method
parameter determines how sequences are converted to numerical features:
"one-hot"
: Each position in the sequence is encoded as a one-hot vector indicating which amino acid or nucleotide is present.
Future versions of fast-seqfunc
will include more advanced embedding methods such as ESM2 for proteins and CARP for nucleic acids.
Making Predictions
After training, you can use your model to make predictions on new sequences:
# Generate some new data for prediction
new_data = generate_dataset_by_task(
task="gc_content",
count=200,
length=50,
)
# Make predictions
predictions = predict(model_info, new_data["sequence"])
# Create results DataFrame
results_df = new_data.copy()
results_df["predicted"] = predictions
results_df.to_csv(output_dir / "regression_predictions.csv", index=False)
print(results_df.head())
Evaluating Regression Performance
Let's evaluate our model more thoroughly:
# Calculate regression metrics
true_values = test_data["function"]
predicted_values = predict(model_info, test_data["sequence"])
# Calculate metrics
mse = mean_squared_error(true_values, predicted_values)
rmse = np.sqrt(mse)
r2 = r2_score(true_values, predicted_values)
mae = mean_absolute_error(true_values, predicted_values)
# Print metrics
print(f"Test Set Performance:")
print(f" MSE: {mse:.4f}")
print(f" RMSE: {rmse:.4f}")
print(f" R²: {r2:.4f}")
print(f" MAE: {mae:.4f}")
Interpreting Signal and Making Decisions
A critical purpose of Fast-SeqFunc is to determine if there's meaningful signal in your sequence-function data. Let's look at how to interpret these results and make decisions about next steps:
# Define a threshold for "meaningful signal"
r2_threshold = 0.1 # This is a modest threshold for biological data
signal_detected = r2 > r2_threshold
print(f"Signal detected in data: {signal_detected}")
if signal_detected:
print("Recommendations:")
print(" 1. Use model for candidate ranking in next experiments")
print(" 2. Consider exploring more advanced embedding methods")
print(f" 3. Signal strength (R²={r2:.4f}) suggests {'strong' if r2 > 0.3 else 'moderate' if r2 > 0.1 else 'weak'} relationship")
else:
print("Recommendations:")
print(" 1. Review experiment design and data quality")
print(" 2. Consider adding more data or different sequence features")
print(" 3. Explore if other modeling approaches detect signal")
Using Models for Candidate Ranking
If you detected signal, you can use your model to rank new candidate sequences:
# Generate or import candidate sequences
import random
from fast_seqfunc.synthetic import generate_random_sequences
# Generate 1000 random candidates
candidate_sequences = generate_random_sequences(
1000, length=50, alphabet_type="dna"
)
# Predict their values
candidate_predictions = predict(model_info, candidate_sequences)
# Create DataFrame for ranking
candidates_df = pd.DataFrame({
"sequence": candidate_sequences,
"predicted_value": candidate_predictions
})
# Sort by predicted value (ascending or descending based on your goal)
candidates_df.sort_values("predicted_value", ascending=False, inplace=True)
# Select top candidates for experimental testing
top_candidates = candidates_df.head(10)
print("Top 10 candidates for experimental testing:")
print(top_candidates)
# Save ranked candidates
candidates_df.to_csv(output_dir / "ranked_candidates.csv", index=False)
This approach allows you to prioritize which sequences to test next in your experiments, potentially saving significant time and resources.
Visualizing Results
Visualizations help in understanding model performance:
# Create a prediction vs. actual scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(true_values, predicted_values, alpha=0.6)
plt.plot([true_values.min(), true_values.max()],
[true_values.min(), true_values.max()],
'r--', lw=2)
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Actual vs. Predicted Values")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_scatter_plot.png", dpi=300)
# Plot residuals
residuals = true_values - predicted_values
plt.figure(figsize=(10, 8))
plt.scatter(predicted_values, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_residuals.png", dpi=300)
# Plot distribution of residuals
plt.figure(figsize=(10, 8))
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Distribution of Residuals")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(output_dir / "regression_residuals_distribution.png", dpi=300)
logger.info("Visualizations saved to output directory")
Working with Different Sequence Types
fast-seqfunc
automatically detects and handles different types of biological sequences:
- DNA (containing A, C, G, T)
- RNA (containing A, C, G, U)
- Proteins (containing amino acid letters)
# Example with protein sequences
from fast_seqfunc import generate_random_sequences
# Generate random protein sequences
protein_sequences = generate_random_sequences(
length=30,
count=100,
alphabet="ACDEFGHIKLMNPQRSTVWY", # Protein alphabet
fixed_length=True,
)
# Create a dummy function (e.g., number of hydrophobic residues)
hydrophobic = "AVILMFYW"
function_values = [
sum(seq.count(aa) for aa in hydrophobic) / len(seq)
for seq in protein_sequences
]
# Create dataset
protein_data = pd.DataFrame({
"sequence": protein_sequences,
"function": function_values
})
# Now you could train a model on this protein data using the same workflow
Advanced Model Training Options
fast-seqfunc
uses PyCaret behind the scenes, which allows for customization:
# Example with more options
advanced_model_info = train_model(
train_data=train_data,
test_data=test_data,
sequence_col="sequence",
target_col="function",
embedding_method="one-hot",
model_type="regression",
optimization_metric="r2",
# Additional PyCaret setup options:
n_jobs=-1, # Use all available CPU cores
fold=5, # 5-fold cross-validation
normalize=True, # Normalize features
polynomial_features=True, # Generate polynomial features
feature_selection=True, # Perform feature selection
)
Loading and Reusing Models
You can load saved models for reuse:
# Load a previously saved model
loaded_model_info = load_model(model_path)
# Use the loaded model for predictions
new_predictions = predict(loaded_model_info, new_data["sequence"])
# Verify the predictions match those from the original model
np.allclose(predictions, new_predictions)
Conclusion
You've now learned how to:
1. Prepare sequence-function data
2. Train a regression model using fast-seqfunc
3. Make predictions on new sequences
4. Evaluate model performance
5. Visualize results
For more advanced features and applications, check out the API reference and additional tutorials.
Next Steps
- Try different regression tasks (e.g., "motif_position", "interaction")
- Experiment with different model parameters
- Apply these techniques to your own sequence-function data