Additional Predictors: Usage Examples
This document provides examples of how to use additional predictor columns with Fast-SeqFunc.
Example 1: Basic Usage with Python API
The following example demonstrates how to train a model with protein sequences and additional predictors (pH and temperature) that may affect protein function:
import pandas as pd
from fast_seqfunc import train_model, predict
# Sample data with sequences, additional predictors, and function values
data = pd.DataFrame({
    'sequence': ['MKALIVLGL', 'MKHPIVLLL', 'MKLIVPMGL', 'MKAIVLELL'],
    'pH': [6.5, 7.0, 7.5, 8.0],
    'temperature': [25, 30, 35, 40],
    'activity': [0.45, 0.62, 0.78, 0.34]
})
# Split into train and test sets
train_data = data.iloc[:3]
test_data = data.iloc[3:]
# Train model using sequence and additional predictors
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col='sequence',
    target_col='activity',
    additional_predictor_cols=['pH', 'temperature'],
    embedding_method='one-hot',
    model_type='regression',
    optimization_metric='r2'
)
# Make predictions on new data
new_data = pd.DataFrame({
    'sequence': ['MKAIVLELL', 'MKLIVLELL'],
    'pH': [7.0, 7.5],
    'temperature': [37, 38]
})
predictions = predict(model_info, new_data)
print(f"Predicted activities: {predictions}")
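To build intuition for what "sequence plus additional predictors" means as model input, here is a minimal sketch of one plausible layout: a one-hot encoding of the sequence with the numeric predictor values appended. This is an illustration under stated assumptions, not Fast-SeqFunc's internal implementation; the helper names `one_hot_sequence` and `build_feature_row` and the 20-letter alphabet are hypothetical choices made for this example.

```python
# Illustrative only: NOT Fast-SeqFunc's internal implementation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot_sequence(sequence):
    """Flatten a sequence into a 0/1 vector with 20 entries per residue."""
    vector = []
    for residue in sequence:
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def build_feature_row(sequence, predictors):
    """Append numeric predictor values to the one-hot sequence encoding."""
    return one_hot_sequence(sequence) + [float(value) for value in predictors]


# First training row from Example 1: sequence, then pH and temperature.
row = build_feature_row("MKALIVLGL", [6.5, 25])
print(len(row))  # 9 residues * 20 letters + 2 predictors
```

In this layout the sequence contributes many sparse columns while each numeric predictor contributes one, which is worth keeping in mind when comparing their importance later.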
Example 2: Using the CLI
You can also use additional predictors via the command-line interface:
# Train a model with additional predictors
fast-seqfunc train protein_data.csv \
    --sequence-col sequence \
    --target-col activity \
    --additional-predictors pH,temperature \
    --embedding-method one-hot \
    --model-type regression
# Make predictions
fast-seqfunc predict model.pkl new_sequences.csv \
    --output-path predictions.csv
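The training CSV should contain the sequence column, the target column, and every additional predictor column specified. The prediction CSV should contain the sequence column and the same additional predictor columns; as with new_data in Example 1, it does not need the target column.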
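A missing column is easy to overlook, so it can help to check a CSV header before running the commands above. The following is a minimal sketch using only the Python standard library; `missing_columns` is a hypothetical helper written for this example and is not part of Fast-SeqFunc.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# In-memory stand-in for a prediction file that lacks the temperature column.
sample = io.StringIO("sequence,pH\nMKAIVLELL,7.0\n")
print(missing_columns(sample, ["sequence", "pH", "temperature"]))
```

For a file on disk, pass an open handle instead, for example `open('new_sequences.csv', newline='')`.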
Example 3: Handling Categorical Predictors
Additional predictors can be numeric or categorical. Fast-SeqFunc encodes categorical predictors automatically, so they can be passed as plain string columns:
import pandas as pd
from fast_seqfunc import train_model
# Data with both numeric and categorical predictors
data = pd.DataFrame({
    'sequence': ['MKALIVLGL', 'MKHPIVLLL', 'MKLIVPMGL', 'MKAIVLELL'],
    'pH': [6.5, 7.0, 7.5, 8.0],
    'buffer_type': ['phosphate', 'tris', 'phosphate', 'tris'],
    'cell_line': ['HEK293', 'CHO', 'HEK293', 'CHO'],
    'activity': [0.45, 0.62, 0.78, 0.34]
})
# Train model with both numeric and categorical predictors
model_info = train_model(
    train_data=data,
    sequence_col='sequence',
    target_col='activity',
    additional_predictor_cols=['pH', 'buffer_type', 'cell_line'],
    embedding_method='one-hot',
    model_type='regression'
)
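The text above does not say which encoding scheme Fast-SeqFunc applies to categorical predictors. As a point of reference, one common scheme is one-hot encoding, where each category becomes its own 0/1 column. The sketch below shows that idea in plain Python; `one_hot_column` is a hypothetical helper, and the library's actual encoding may differ.

```python
def one_hot_column(values):
    """Map each category to a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


# The buffer_type column from the example above.
categories, encoded = one_hot_column(['phosphate', 'tris', 'phosphate', 'tris'])
print(categories)
print(encoded)
```

Under this scheme a categorical column with k distinct values expands into k numeric columns, so high-cardinality predictors can add many features.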
Example 4: Analyzing Feature Importance
When a model uses additional predictors, it is useful to check how much each one contributes relative to the sequence features:
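import matplotlib.pyplot as plt
import pandas as pd
from fast_seqfunc import train_model, feature_importance
# Data containing every predictor column used below
data = pd.DataFrame({
    'sequence': ['MKALIVLGL', 'MKHPIVLLL', 'MKLIVPMGL', 'MKAIVLELL'],
    'pH': [6.5, 7.0, 7.5, 8.0],
    'temperature': [25, 30, 35, 40],
    'buffer_type': ['phosphate', 'tris', 'phosphate', 'tris'],
    'activity': [0.45, 0.62, 0.78, 0.34]
})
# Train model with additional predictors
model_info = train_model(
    train_data=data,
    sequence_col='sequence',
    target_col='activity',
    additional_predictor_cols=['pH', 'temperature', 'buffer_type']
)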
# Get feature importance
importance = feature_importance(model_info)
# Plot feature importance
plt.figure(figsize=(10, 6))
importance.plot(kind='bar')
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
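A bar chart with one bar per encoded feature can make it hard to compare the sequence as a whole against a single predictor such as pH. One option is to sum importances by source first. The sketch below assumes the importances are available as a name-to-value mapping and that encoded predictor columns are named `<predictor>_<category>`; both are assumptions made for illustration, and the feature names and values are invented, not output from Fast-SeqFunc.

```python
from collections import defaultdict


def group_importance(importances, predictor_names):
    """Sum importances per source: each named predictor, else 'sequence'."""
    totals = defaultdict(float)
    for feature, value in importances.items():
        # A feature belongs to a predictor if its name equals the predictor
        # or starts with "<predictor>_" (a common one-hot naming pattern).
        source = next(
            (name for name in predictor_names
             if feature == name or feature.startswith(name + "_")),
            "sequence",
        )
        totals[source] += value
    return dict(totals)


# Hypothetical feature names and made-up importance values.
example = {"pos1_M": 0.10, "pos2_K": 0.20, "pH": 0.40, "buffer_type_tris": 0.30}
print(group_importance(example, ["pH", "temperature", "buffer_type"]))
```

Predictors that never appear among the features, such as temperature here, are simply absent from the result.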
Example 5: Saving and Loading Models
Models with additional predictors can be saved and loaded just like regular models:
from fast_seqfunc import save_model, load_model, predict
# Save the trained model (model_info is the object returned by train_model, as in Example 1)
save_model(model_info, 'protein_activity_model.pkl')
# Load model
loaded_model = load_model('protein_activity_model.pkl')
# Make predictions using the loaded model (new_data as defined in Example 1)
predictions = predict(
    loaded_model,
    new_data,
    sequence_col='sequence'
)