Additional Predictors in Fast-SeqFunc: Design Document
Overview
Fast-SeqFunc is designed to model the relationship between biological sequences and a target function/property. Currently, it supports using only the sequence as the predictor. This design document outlines the necessary changes to expand Fast-SeqFunc to incorporate additional predictor columns alongside the sequence data.
Motivation
In many real-world applications, sequence-function relationships may depend on additional contextual variables or experimental conditions. For example:
- Protein function might depend not only on the amino acid sequence but also on pH, temperature, or salt concentration
- Gene expression levels may depend on the DNA sequence as well as cell type, developmental stage, or treatment conditions
- Binding affinity might depend on sequence and additional information about binding partners
By supporting additional predictor columns, Fast-SeqFunc will become more versatile and applicable to a wider range of biological problems where context matters.
Design Goals
- Maintain API Simplicity: Preserve the current simple API while extending it to handle additional predictors
- Backward Compatibility: Ensure existing code continues to work without changes
- Flexible Integration: Allow for simple integration of sequence embeddings with additional predictors
- CLI Support: Extend the command-line interface to handle additional predictor columns
- Consistent Implementation: Apply changes consistently across both the Python API and CLI
Architecture Changes
Core Components Affected
- Core API (
core.py
): - Extend
train_model
function to accept additional predictor columns -
Modify data processing to incorporate additional predictors with embeddings
-
Models (
models.py
): -
Update
SequenceFunctionModel
to handle additional predictors alongside sequence embeddings -
CLI (
cli.py
): - Add options to specify additional predictor columns
-
Update data processing for CLI commands
-
Documentation:
- Update API reference
- Add examples showing how to use additional predictors
Data Flow Modifications
Current data flow:
- User provides sequences + target values
- Sequences are embedded
- ML models are trained on embeddings
New data flow:
- User provides sequences + additional predictors + target values
- Sequences are embedded
- Embeddings are combined with additional predictors
- ML models are trained on the combined features
API Design
Python API Changes
Current API
model_info = train_model(
train_data=train_data,
test_data=test_data,
sequence_col="sequence",
target_col="function",
embedding_method="one-hot",
model_type="regression",
optimization_metric="r2",
)
Enhanced API
model_info = train_model(
train_data=train_data,
test_data=test_data,
sequence_col="sequence",
target_col="function",
additional_predictor_cols=["pH", "temperature"], # New parameter
embedding_method="one-hot",
model_type="regression",
optimization_metric="r2",
)
Predict Function Changes
Current:
predictions = predict(model_info, new_sequences)
Enhanced:
predictions = predict(
model_info,
new_data, # Now can be DataFrame with sequence and additional predictor columns
sequence_col="sequence"
)
CLI Changes
Current CLI:
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function
Enhanced CLI:
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --additional-predictors pH,temperature
Implementation Strategy
Feature Combination Method
The implementation will use a simple concatenation approach to combine sequence embeddings with additional predictors. This means that additional predictor columns will be appended to the sequence embedding features to create the final feature matrix for model training.
Data Processing Flow
- Load CSV or DataFrame data as currently implemented
- Validate presence of additional predictor columns if specified
- Embed sequence data using the existing embedding pipeline
- Process additional predictors (scaling, handling missing values, etc.)
- Combine sequence embeddings with additional predictors
- Train models on the combined feature set
Data Validation and Preprocessing
Additional predictors may require preprocessing:
- Handling missing values
- Scaling numerical features
- Encoding categorical features
- Type validation
We'll implement a preprocessing pipeline for additional predictors that handles these tasks automatically.
Model Information Enhancement
The model_info dictionary will be enhanced to include:
{
"model": trained_model,
"model_type": model_type,
"embedder": embedder,
"embed_cols": embed_cols,
"additional_predictor_cols": additional_predictor_cols, # New
"additional_predictor_preprocessing": preprocessing_pipeline, # New
"test_results": test_results,
}
This ensures all information needed for making predictions with additional predictors is preserved.
Serialization Changes
The model serialization format will need to include information about additional predictors. We'll maintain backward compatibility by checking for the presence of additional predictor information when loading existing models.
Code Changes Required
In core.py
- Update
train_model
function signature to accept additional predictors - Modify data processing to handle additional predictors
- Update model information dictionary to include additional predictor metadata
- Modify
predict
function to handle additional predictors in prediction data
In models.py
- Update
SequenceFunctionModel
class to handle additional predictors - Modify the
fit
andpredict
methods to incorporate additional predictors - Update serialization methods to handle additional predictor information
In cli.py
- Add new CLI options for specifying additional predictors
- Update data loading and processing in CLI commands
- Update help text and documentation
Backwards Compatibility
To maintain backward compatibility:
- All new parameters will be optional with sensible defaults
- Existing code paths will work without modification
- Model serialization will be backward compatible
Testing Strategy
Tests will be expanded to cover:
- Training with various combinations of additional predictors
- Making predictions with additional predictors
- Serialization and deserialization of models with additional predictor information
- CLI functionality for additional predictors
- Edge cases (missing values, type mismatches, etc.)
Future Enhancements
- Add automated feature selection for additional predictors
- Support for feature importance analysis that includes additional predictors
- Add specialized visualizations for understanding the impact of additional predictors
Conclusion
Adding support for additional predictor columns will significantly enhance the versatility of Fast-SeqFunc, making it applicable to a wider range of biological problems where context matters alongside sequence. The implementation will maintain the simplicity and user-friendliness of the current API while providing powerful new capabilities.