Custom Alphabets Design Document

Overview

This document outlines the design for enhancing fast-seqfunc with support for custom alphabets, particularly focusing on handling mixed-length characters and various sequence storage formats. This feature enables the library to work with non-standard sequence types, such as chemically modified amino acids, custom nucleotides, or integer-based sequence representations.

Current Implementation

The current implementation in fast-seqfunc handles alphabets in a straightforward manner:

Alphabets are represented as instances of the Alphabet class that encapsulate tokens and tokenization rules.
Sequences can be encoded using various tokenization strategies (character-based, delimited, or regex-based).
The OneHotEmbedder uses alphabets to transform sequences into one-hot encodings for model training.
Pre-defined alphabets are available for common sequence types (protein, DNA, RNA).
Custom alphabets are supported through the Alphabet class.
Sequences of different lengths can be padded to the maximum length with a configurable gap character.

Alphabet Class

The Alphabet class is at the core of the custom alphabets implementation:

class Alphabet:
    """Represent a custom alphabet for sequence encoding.

    This class handles tokenization and mapping between tokens and indices,
    supporting both single character and multi-character tokens.

    :param tokens: Collection of tokens that define the alphabet
    :param delimiter: Optional delimiter used when tokenizing sequences
    :param name: Optional name for this alphabet
    :param description: Optional description
    :param gap_character: Character to use for padding sequences (default: "-")
    """

    def __init__(
        self,
        tokens: Iterable[str],
        delimiter: Optional[str] = None,
        name: Optional[str] = None,
        description: Optional[str] = None,
        gap_character: str = "-",
    )

    @property
    def size(self) -> int:
        """Get the number of unique tokens in the alphabet."""

    def tokenize(self, sequence: str) -> List[str]:
        """Convert a sequence string to tokens.

        :param sequence: The input sequence
        :return: List of tokens
        """

    def pad_sequence(self, sequence: str, length: int) -> str:
        """Pad a sequence to the specified length.

        :param sequence: The sequence to pad
        :param length: Target length
        :return: Padded sequence
        """

    def tokens_to_sequence(self, tokens: List[str]) -> str:
        """Convert tokens back to a sequence string.

        :param tokens: List of tokens
        :return: Sequence string
        """

    def indices_to_sequence(
        self, indices: Sequence[int], delimiter: Optional[str] = None
    ) -> str:
        """Convert a list of token indices back to a sequence string.

        :param indices: List of token indices
        :param delimiter: Optional delimiter to use (overrides the alphabet's default)
        :return: Sequence string
        """

    def encode_to_indices(self, sequence: str) -> List[int]:
        """Convert a sequence string to token indices.

        :param sequence: The input sequence
        :return: List of token indices
        """

    def decode_from_indices(
        self, indices: Sequence[int], delimiter: Optional[str] = None
    ) -> str:
        """Decode token indices back to a sequence string.

        This is an alias for indices_to_sequence.

        :param indices: List of token indices
        :param delimiter: Optional delimiter to use
        :return: Sequence string
        """

    def validate_sequence(self, sequence: str) -> bool:
        """Check if a sequence can be fully tokenized with this alphabet.

        :param sequence: The sequence to validate
        :return: True if sequence is valid, False otherwise
        """

    @classmethod
    def from_config(cls, config: Dict) -> "Alphabet":
        """Create an Alphabet instance from a configuration dictionary.

        :param config: Dictionary with alphabet configuration
        :return: Alphabet instance
        """

    @classmethod
    def from_json(cls, path: Union[str, Path]) -> "Alphabet":
        """Load an alphabet from a JSON file.

        :param path: Path to the JSON configuration file
        :return: Alphabet instance
        """

    def to_dict(self) -> Dict:
        """Convert the alphabet to a dictionary for serialization.

        :return: Dictionary representation
        """

    def to_json(self, path: Union[str, Path]) -> None:
        """Save the alphabet to a JSON file.

        :param path: Path to save the configuration
        """

    @classmethod
    def protein(cls, gap_character: str = "-") -> "Alphabet":
        """Create a standard protein alphabet.

        :param gap_character: Character to use for padding (default: "-")
        :return: Alphabet for standard amino acids
        """

    @classmethod
    def dna(cls, gap_character: str = "-") -> "Alphabet":
        """Create a standard DNA alphabet.

        :param gap_character: Character to use for padding (default: "-")
        :return: Alphabet for DNA
        """

    @classmethod
    def rna(cls, gap_character: str = "-") -> "Alphabet":
        """Create a standard RNA alphabet.

        :param gap_character: Character to use for padding (default: "-")
        :return: Alphabet for RNA
        """

    @classmethod
    def integer(
        cls, max_value: int, gap_value: str = "-1", gap_character: str = "-"
    ) -> "Alphabet":
        """Create an integer-based alphabet (0 to max_value).

        :param max_value: Maximum integer value (inclusive)
        :param gap_value: String representation of the gap value (default: "-1")
        :param gap_character: Character to use for padding in string representation
            (default: "-")
        :return: Alphabet with integer tokens
        """

OneHotEmbedder Implementation

The OneHotEmbedder class works with the Alphabet class to create one-hot encodings:

class OneHotEmbedder:
    """One-hot encoding for protein or nucleotide sequences.

    :param sequence_type: Type of sequences to encode ("protein", "dna", "rna",
        or "auto")
    :param alphabet: Custom alphabet to use for encoding (overrides sequence_type)
    :param max_length: Maximum sequence length (will pad/truncate to this length)
    :param pad_sequences: Whether to pad sequences of different lengths
        to the maximum length
    :param gap_character: Character to use for padding (default: "-")
    """

    def __init__(
        self,
        sequence_type: Literal["protein", "dna", "rna", "auto"] = "auto",
        alphabet: Optional[Alphabet] = None,
        max_length: Optional[int] = None,
        pad_sequences: bool = True,
        gap_character: str = "-",
    )

    @property
    def alphabet(self):
        """Get the alphabet, supporting both old and new API."""

    def fit(self, sequences: Union[List[str], pd.Series]) -> "OneHotEmbedder":
        """Determine alphabet and set up the embedder.

        :param sequences: Sequences to fit to
        :return: Self for chaining
        """

    def transform(
        self, sequences: Union[List[str], pd.Series]
    ) -> Union[np.ndarray, List[np.ndarray]]:
        """Transform sequences to one-hot encodings.

        If sequences are of different lengths and pad_sequences=True, they
        will be padded to the max_length with the gap character.

        If pad_sequences=False, this returns a list of arrays of different sizes.

        :param sequences: List or Series of sequences to embed
        :return: Array of one-hot encodings if pad_sequences=True,
            otherwise list of arrays
        """

    def fit_transform(
        self, sequences: Union[List[str], pd.Series]
    ) -> Union[np.ndarray, List[np.ndarray]]:
        """Fit and transform in one step.

        :param sequences: Sequences to encode
        :return: Array of one-hot encodings if pad_sequences=True,
            otherwise list of arrays
        """

Helper Functions

get_embedder

def get_embedder(method: str, **kwargs) -> OneHotEmbedder:
    """Get an embedder instance based on method name.

    Currently only supports one-hot encoding.

    :param method: Embedding method (only "one-hot" supported)
    :param kwargs: Additional arguments to pass to the embedder
    :return: Configured embedder
    """

infer_alphabet

def infer_alphabet(
    sequences: List[str], delimiter: Optional[str] = None, gap_character: str = "-"
) -> Alphabet:
    """Infer an alphabet from a list of sequences.

    :param sequences: List of sequences to analyze
    :param delimiter: Optional delimiter used in sequences
    :param gap_character: Character to use for padding
    :return: Inferred Alphabet
    """

Usage Examples

Creating Custom Alphabets

# Standard alphabets
protein_alphabet = Alphabet.protein()
dna_alphabet = Alphabet.dna()
rna_alphabet = Alphabet.rna()

# Custom alphabet with standard and modified amino acids
aa_tokens = list("ACDEFGHIKLMNPQRSTVWY") + ["pS", "pT", "pY", "me3K"]
mod_aa_alphabet = Alphabet(
    tokens=aa_tokens,
    name="modified_aa",
    gap_character="X"
)

# Integer alphabet (0-29 with -1 as gap value)
int_alphabet = Alphabet.integer(max_value=29, gap_value="-1")

# Custom alphabet from configuration
alphabet = Alphabet.from_json("path/to/alphabet_config.json")

Using the OneHotEmbedder

# Auto-detect sequence type
embedder = get_embedder("one-hot")
embeddings = embedder.fit_transform(sequences)

# Specify sequence type
embedder = get_embedder("one-hot", sequence_type="protein", pad_sequences=True)
embeddings = embedder.fit_transform(sequences)

# Use custom alphabet
embedder = get_embedder("one-hot", alphabet=mod_aa_alphabet)
embeddings = embedder.fit_transform(sequences)

# Control padding behavior
embedder = get_embedder("one-hot", max_length=10, pad_sequences=True, gap_character="X")
embeddings = embedder.fit_transform(sequences)

Working with Sequences of Different Lengths

# Sequences of different lengths
sequences = ["ACDE", "KLMNPQR", "ST"]
embedder = OneHotEmbedder(sequence_type="protein", pad_sequences=True)
embeddings = embedder.fit_transform(sequences)
# Sequences are padded to length 7: "ACDE---", "KLMNPQR", "ST-----"

# Disable padding (returns a list of arrays of different sizes)
embedder = OneHotEmbedder(sequence_type="protein", pad_sequences=False)
embedding_list = embedder.fit_transform(sequences)

Handling Special Sequence Types

Chemically Modified Amino Acids

# Amino acids with modifications
aa_tokens = list("ACDEFGHIKLMNPQRSTVWY") + ["pS", "pT", "pY", "me3K", "X"]
mod_aa_alphabet = Alphabet(
    tokens=aa_tokens,
    name="modified_aa",
    gap_character="X"
)

# Example sequences with modified AAs
sequences = ["ACDEpS", "KLMme3KNP", "QR"]
embedder = OneHotEmbedder(alphabet=mod_aa_alphabet, pad_sequences=True)
embeddings = embedder.fit_transform(sequences)

Integer-Based Sequences

# Integer representation with comma delimiter
int_alphabet = Alphabet.integer(max_value=29, gap_value="-1")

# Example sequences as comma-separated integers
sequences = ["0,1,2", "10,11,12,25,14", "15,16"]
embedder = OneHotEmbedder(alphabet=int_alphabet, pad_sequences=True)
embeddings = embedder.fit_transform(sequences)

Integration with Model Training

# Create a custom alphabet
alphabet = Alphabet.integer(max_value=10)

# Get the embedder with the custom alphabet
embedder = get_embedder("one-hot", alphabet=alphabet)

# Embed sequences
X_train_embedded = embedder.fit_transform(train_df[sequence_col])

# Create column names for the embedded features
embed_cols = [f"embed_{i}" for i in range(X_train_embedded.shape[1])]

# Create DataFrame for model training
train_processed = pd.DataFrame(X_train_embedded, columns=embed_cols)
train_processed["target"] = train_df[target_col].values

# Now train your model with train_processed...

Alphabet Configuration Format

You can save and load alphabet configurations using JSON files:

{
  "name": "modified_amino_acids",
  "description": "Amino acids with chemical modifications",
  "tokens": ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y", "pS", "pT", "pY", "me3K", "-"],
  "delimiter": null,
  "gap_character": "-"
}

For integer-based representations:

{
  "name": "amino_acid_indices",
  "description": "Numbered amino acids (0-25) with comma delimiter",
  "tokens": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "-1"],
  "delimiter": ",",
  "gap_character": "-"
}

Key Features

Flexible Tokenization: Support for single-character, multi-character, and delimited tokens
Custom Alphabets: Define your own token sets for any sequence type
Gap Handling: Configurable padding for sequences of different lengths
Standard Bioinformatics Alphabets: Built-in support for protein, DNA, and RNA
Integer Sequences: Special support for integer-based sequence representations
Serialization: Save and load alphabet configurations as JSON
Automatic Type Detection: Automatically infer sequence type from content

Conclusion

The custom alphabets implementation in fast-seqfunc provides a flexible, robust solution for handling various sequence types and tokenization schemes. This design enables working with non-standard sequence types, mixed-length characters, and integer-based sequences in a clean, consistent way.