Sequence Classification with Fast-SeqFunc
This tutorial demonstrates how to use fast-seqfunc for classification problems, where you want to predict discrete categories from biological sequences.
Overview
In sequence classification, we want to learn to predict discrete categories (e.g., protein function, gene families, or binding/non-binding sequences) from biological sequences. This tutorial will walk you through:
- Setting up your environment
- Preparing sequence classification data
- Training binary and multi-class classification models
- Evaluating model performance
- Making predictions on new sequences
- Visualizing classification results
Prerequisites
- Python 3.11 or higher
- The following packages:
pip install fast-seqfunc pandas numpy matplotlib seaborn scikit-learn loguru
Setup
First, let's import all necessary packages:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_curve,
    auc,
    precision_recall_curve,
    average_precision_score,
)
from fast_seqfunc import train_model, predict, save_model, load_model
from loguru import logger
Working with Classification Data
For classification tasks, each sequence is associated with a discrete class label. Let's create synthetic data for this tutorial:
from fast_seqfunc import generate_dataset_by_task
# Generate a binary classification dataset
# (sequences with or without specific patterns)
binary_data = generate_dataset_by_task(
    task="classification",
    count=1000,  # Number of sequences to generate
    length=30,  # Sequence length
    noise_level=0.1,  # Add some noise to make the task more realistic
)
# Generate a multi-class classification dataset
multi_data = generate_dataset_by_task(
    task="multiclass",
    count=1000,  # Number of sequences to generate
    length=30,  # Sequence length
    noise_level=0.1,  # Add some noise
)
# Examine the data
print("Binary Classification Dataset:")
print(binary_data.head())
print(f"Binary class distribution:\n{binary_data['function'].value_counts()}")
print("\nMulti-class Classification Dataset:")
print(multi_data.head())
print(f"Multi-class distribution:\n{multi_data['function'].value_counts()}")
Preparing Your Own Data
If you have your own data, it should be structured in a DataFrame with at least two columns:
- A column containing the sequences (e.g., "sequence")
- A column containing the class labels (e.g., "class" or "function")
For example:
# Load your own data
# data = pd.read_csv("your_classification_data.csv")
# If your classes are text labels, you might want to convert them to integers
# from sklearn.preprocessing import LabelEncoder
# label_encoder = LabelEncoder()
# data['class_encoded'] = label_encoder.fit_transform(data['class'])
Binary Classification Example
Let's start with a binary classification problem:
# For this tutorial, we'll use our binary dataset
data = binary_data
# Split into train and test sets (80/20 split)
train_size = int(0.8 * len(data))
train_data = data[:train_size].copy()
test_data = data[train_size:].copy()
logger.info(f"Data split: {len(train_data)} train, {len(test_data)} test samples")
logger.info(f"Class distribution in training data:\n{train_data['function'].value_counts()}")
# Create output directory for results
output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)
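Note that the simple slice above assumes the rows are already in random order. If your data is sorted (for example, grouped by class), a stratified split keeps the class balance in both halves; a sketch using scikit-learn on a stand-in frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `data`: 70 negatives followed by 30 positives,
# i.e., exactly the ordered layout where a plain slice would go wrong.
toy = pd.DataFrame({
    "sequence": ["ACGT" * 5] * 100,
    "function": [0] * 70 + [1] * 30,
})

# stratify= preserves the 70/30 class ratio in both train and test
train_df, test_df = train_test_split(
    toy, test_size=0.2, stratify=toy["function"], random_state=42
)
print(train_df["function"].value_counts())  # 56 of class 0, 24 of class 1
print(test_df["function"].value_counts())   # 14 of class 0, 6 of class 1
```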
Training a Binary Classification Model
Now we can train a classification model:
# Train a classification model
logger.info("Training binary classification model...")
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",  # Column containing sequences
    target_col="function",  # Column containing class labels
    embedding_method="one-hot",  # Method to convert sequences to numerical features
    model_type="classification",  # Specify classification task
    optimization_metric="auc",  # Metric to optimize (auc, accuracy, f1, etc.)
)
# Display test results
if model_info.get("test_results"):
    logger.info("Test metrics from training:")
    for metric, value in model_info["test_results"].items():
        logger.info(f"  {metric}: {value:.4f}")
# Save the model for later use
model_path = output_dir / "binary_classification_model.pkl"
save_model(model_info, model_path)
logger.info(f"Model saved to {model_path}")
Making Predictions
With our trained model, we can now make predictions:
# Generate some new data for prediction
new_data = generate_dataset_by_task(
    task="classification",
    count=200,
    length=30,
)
# Make predictions
predictions = predict(model_info, new_data["sequence"])
# Create results DataFrame
results_df = new_data.copy()
results_df["predicted_class"] = predictions
results_df.to_csv(output_dir / "binary_classification_predictions.csv", index=False)
print(results_df.head())
Evaluating Binary Classification Performance
Let's evaluate our model more thoroughly:
# Calculate classification metrics
true_values = test_data["function"]
predicted_values = predict(model_info, test_data["sequence"])
# Print classification report
print("\nClassification Report:")
print(classification_report(true_values, predicted_values))
# Create confusion matrix
cm = confusion_matrix(true_values, predicted_values)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Class 0", "Class 1"],
            yticklabels=["Class 0", "Class 1"])
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.title("Confusion Matrix")
plt.tight_layout()
plt.savefig(output_dir / "binary_confusion_matrix.png", dpi=300)
# ROC and PR curves require class probabilities. The predict function used in
# this tutorial returns class labels only, so we skip those plots here.
logger.warning("ROC and PR curves require probability estimates, which are not available here")
logger.warning("Skipping these visualizations in this tutorial")

# If your underlying model exposes predict_proba, the ROC curve would look like:
# y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
# fpr, tpr, _ = roc_curve(true_values, y_prob)
# roc_auc = auc(fpr, tpr)
# plt.figure(figsize=(8, 6))
# plt.plot(fpr, tpr, color="darkorange", lw=2, label=f"ROC curve (area = {roc_auc:.2f})")
# plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
# plt.xlabel("False Positive Rate")
# plt.ylabel("True Positive Rate")
# plt.title("Receiver Operating Characteristic (ROC)")
# plt.legend(loc="lower right")
# plt.savefig(output_dir / "binary_roc_curve.png", dpi=300)
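As a sanity check on the numbers reported above, precision and recall for the positive class can be read directly off the confusion matrix. A small self-contained sketch with made-up labels (not the tutorial's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of sequences predicted positive, fraction truly positive
recall = tp / (tp + fn)     # of truly positive sequences, fraction we found

# These hand-computed values match scikit-learn's metrics
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
print(f"precision={precision:.2f}, recall={recall:.2f}")
```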
Multi-Class Classification Example
Now let's work with a multi-class problem:
# Switch to multi-class data
data = multi_data
# Split into train and test sets (80/20 split)
train_size = int(0.8 * len(data))
train_data = data[:train_size].copy()
test_data = data[train_size:].copy()
logger.info(f"Multi-class data split: {len(train_data)} train, {len(test_data)} test samples")
logger.info(f"Class distribution in training data:\n{train_data['function'].value_counts()}")
Training a Multi-Class Model
Training a multi-class model is very similar to binary classification:
# Train a multi-class classification model
logger.info("Training multi-class classification model...")
multi_model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="classification",  # Can also use "multi-class" explicitly
    optimization_metric="f1",  # F1 with 'weighted' average is good for imbalanced classes
)
# Display test results
if multi_model_info.get("test_results"):
    logger.info("Multi-class test metrics from training:")
    for metric, value in multi_model_info["test_results"].items():
        logger.info(f"  {metric}: {value:.4f}")
# Save the model
multi_model_path = output_dir / "multiclass_model.pkl"
save_model(multi_model_info, multi_model_path)
logger.info(f"Multi-class model saved to {multi_model_path}")
Evaluating Multi-Class Performance
Evaluation for multi-class problems:
# Calculate multi-class metrics
multi_true_values = test_data["function"]
multi_predicted_values = predict(multi_model_info, test_data["sequence"])
# Print classification report
print("\nMulti-class Classification Report:")
print(classification_report(multi_true_values, multi_predicted_values))
# Create confusion matrix
class_labels = sorted(data["function"].unique())
multi_cm = confusion_matrix(multi_true_values, multi_predicted_values)
plt.figure(figsize=(10, 8))
sns.heatmap(multi_cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_labels,
            yticklabels=class_labels)
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.title("Multi-class Confusion Matrix")
plt.tight_layout()
plt.savefig(output_dir / "multiclass_confusion_matrix.png", dpi=300)
# Create a normalized confusion matrix for better visualization
# with unbalanced classes
multi_cm_normalized = multi_cm.astype('float') / multi_cm.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(10, 8))
sns.heatmap(multi_cm_normalized, annot=True, fmt=".2f", cmap="Blues",
            xticklabels=class_labels,
            yticklabels=class_labels)
plt.xlabel("Predicted Class")
plt.ylabel("True Class")
plt.title("Normalized Multi-class Confusion Matrix")
plt.tight_layout()
plt.savefig(output_dir / "multiclass_normalized_confusion_matrix.png", dpi=300)
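The 'weighted' F1 average mentioned in the training comment weights each class's F1 score by its frequency, so common classes dominate; the macro average treats all classes equally. A small sketch with toy labels (independent of the tutorial data) shows the difference:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# Per-class F1 scores, then two ways of averaging them
per_class = f1_score(y_true, y_pred, average=None)        # [0.8, 0.8, 1.0]
weighted = f1_score(y_true, y_pred, average="weighted")   # (3*0.8 + 2*0.8 + 1*1.0) / 6
macro = f1_score(y_true, y_pred, average="macro")         # (0.8 + 0.8 + 1.0) / 3

print(f"weighted F1 = {weighted:.4f}, macro F1 = {macro:.4f}")
```

With imbalanced classes the two can diverge substantially, so check both when rare classes matter to your application.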
Visualizing Sequence Features by Class
For classification tasks, it can be useful to visualize sequence properties by class:
# Calculate sequence length by class
data["seq_length"] = data["sequence"].str.len()
plt.figure(figsize=(10, 6))
sns.boxplot(x="function", y="seq_length", data=data)
plt.title("Sequence Length Distribution by Class")
plt.xlabel("Class")
plt.ylabel("Sequence Length")
plt.tight_layout()
plt.savefig(output_dir / "seq_length_by_class.png", dpi=300)
# For DNA/RNA sequences, calculate nucleotide composition by class
if any(nuc in data["sequence"].iloc[0].upper() for nuc in "ACGT"):
    data["A_percent"] = data["sequence"].apply(lambda x: x.upper().count("A") / len(x) * 100)
    data["C_percent"] = data["sequence"].apply(lambda x: x.upper().count("C") / len(x) * 100)
    data["G_percent"] = data["sequence"].apply(lambda x: x.upper().count("G") / len(x) * 100)
    data["T_percent"] = data["sequence"].apply(lambda x: x.upper().count("T") / len(x) * 100)

    # Melt the data for easier plotting
    plot_data = pd.melt(
        data,
        id_vars=["function"],
        value_vars=["A_percent", "C_percent", "G_percent", "T_percent"],
        var_name="Nucleotide",
        value_name="Percentage",
    )

    # Plot nucleotide composition by class
    plt.figure(figsize=(12, 8))
    sns.boxplot(x="function", y="Percentage", hue="Nucleotide", data=plot_data)
    plt.title("Nucleotide Composition by Class")
    plt.xlabel("Class")
    plt.ylabel("Percentage (%)")
    plt.tight_layout()
    plt.savefig(output_dir / "nucleotide_composition_by_class.png", dpi=300)
Working with Imbalanced Classes
When dealing with imbalanced class distributions (where one class is much more frequent than others), you can use special techniques:
# Example: Create an imbalanced dataset
from sklearn.utils import resample
# Assume class 0 is much more frequent than class 1
class_0 = binary_data[binary_data['function'] == 0]
class_1 = binary_data[binary_data['function'] == 1]
# Downsample class 0 to match class 1
class_0_downsampled = resample(
    class_0,
    replace=False,  # Don't sample with replacement
    n_samples=len(class_1),  # Match minority class
    random_state=42,  # For reproducibility
)
# Combine the downsampled majority class with the minority class
balanced_data = pd.concat([class_0_downsampled, class_1])
# Now you can train on this balanced dataset
print(f"Original class distribution: {binary_data['function'].value_counts()}")
print(f"Balanced class distribution: {balanced_data['function'].value_counts()}")
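The complementary approach, upsampling, keeps every majority-class row and samples the minority class with replacement until the counts match. A sketch on a toy frame (the column names mirror the tutorial's, the sequences are placeholders):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 90 majority rows, 10 minority rows
toy = pd.DataFrame({
    "sequence": ["ACGT"] * 90 + ["TTTT"] * 10,
    "function": [0] * 90 + [1] * 10,
})
majority = toy[toy["function"] == 0]
minority = toy[toy["function"] == 1]

# Sample the minority class WITH replacement up to the majority count
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["function"].value_counts())  # 90 of each class
```

Either way, resample only the training split: upsampling before splitting leaks duplicated minority rows into the test set and inflates the evaluation scores.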
Alternatively, you can let PyCaret correct the imbalance automatically during training:
# Train with automatic imbalance correction
# (PyCaret's fix_imbalance resamples the training data, by default with SMOTE)
weighted_model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="classification",
    optimization_metric="f1",  # F1 is a good choice for imbalanced classes
    # Additional PyCaret setting for imbalanced data:
    fix_imbalance=True,  # Automatically fix class imbalance
)
Loading and Using a Classification Model
You can load a saved model and use it for predictions:
# Load a previously saved classification model
loaded_model_info = load_model(model_path)
# Use the model to classify new sequences
sequences_to_classify = [
    "ACGTACGTACGTACGTACGTACGTACGTAC",
    "GATAGATAGATAGATAGATAGATAGATA",
    "CTACCTACCTACCTACCTACCTACCTAC",
]
# Make predictions
predictions = predict(loaded_model_info, sequences_to_classify)
# Print results
for seq, pred in zip(sequences_to_classify, predictions):
    print(f"Sequence (first 10 chars): {seq[:10]}... | Predicted class: {pred}")
Advanced Model Training Options
fast-seqfunc uses PyCaret behind the scenes, allowing further customization:
# Example with more options
advanced_model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",
    model_type="classification",
    optimization_metric="f1",
    # Additional PyCaret setup options:
    n_jobs=-1,  # Use all available CPU cores
    fold=5,  # 5-fold cross-validation
    normalize=True,  # Normalize features
    feature_selection=True,  # Perform feature selection
    # Classification-specific options:
    fix_imbalance=True,  # For imbalanced datasets
    remove_outliers=True,  # Remove outliers
)
Conclusion
You've now learned how to:
1. Prepare sequence classification data
2. Train binary and multi-class classification models
3. Evaluate classification performance
4. Make predictions on new sequences
5. Handle special cases like imbalanced classes
For more advanced features and applications, check out the API reference and additional tutorials.
Next Steps
- Try different classification tasks (e.g., protein function prediction)
- Experiment with different model types and parameters
- Apply these techniques to your own sequence classification data