written by Eric J. Ma on 2024-07-14 | tags: scipy2024 python data science quarto tutorial llms anywidget large datasets open source llamabot conference activities
In this blog post, I share my enriching experience at SciPy 2024, from attending insightful tutorials on Quarto, LLMs, Anywidget, and handling large datasets, to delivering talks on fostering an open-source culture and LlamaBot. I also highlight the vibrant lightning talks, the collaborative sprints, and the engaging social activities that made this conference memorable. Not to mention the delicious Tacoma cuisine that added flavor to the whole experience. Curious to know which tutorial inspired me to recreate my talks just for fun?
Read on... (2400 words, approximately 13 minutes reading time)

written by Eric J. Ma on 2024-07-02 | tags: software development data science python pickles code security pandas programming best practices version control computational notebooks dependency management software skills
In this blog post, I share a cautionary tale from my work experience about the pitfalls of using pickle files for data storage, particularly highlighting their dependency on the computing environment and specific package versions. I encountered an issue when a notebook failed to run due to a pickle file not being compatible with the updated version of pandas. This experience led me to advocate for using native formats over pickles for better stability and reproducibility, and underscored the importance of software skills like continuous testing. How can we ensure our data storage methods don't become obsolete with evolving dependencies?
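To make the contrast concrete, here's a minimal sketch (the DataFrame and file names are hypothetical stand-ins, not from the post; Parquet support requires pyarrow or fastparquet):

```python
import pandas as pd

# A hypothetical DataFrame standing in for real analysis results.
df = pd.DataFrame({"compound": ["A", "B"], "activity": [0.91, 0.47]})

# Pickles serialize pandas' internal objects directly,
# so reading them back can break when pandas is upgraded.
df.to_pickle("results.pkl")
df_roundtrip = pd.read_pickle("results.pkl")  # fragile across pandas versions

# Native, columnar formats like Parquet store only the data,
# so they remain readable as the computing environment evolves.
df.to_parquet("results.parquet")
df_stable = pd.read_parquet("results.parquet")
```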
Read on... (532 words, approximately 3 minutes reading time)

written by Eric J. Ma on 2024-06-30 | tags: docathon documentation team productivity event planning work culture knowledge sharing writing tips collaboration project management continuous improvement
In this blog post, I reflect on two years of running quarterly docathons at work: dedicated two-day events focused on writing high-quality documentation. I discuss what docathons are, their purpose, and the simple yet effective way to organize them, emphasizing the importance of food, documentation, and optional workshops. Weighing both the cost and the invaluable benefits of improved documentation practices, I also discuss the significant return on investment these docathons have yielded. How can such a straightforward event substantially enhance the quality of documentation and team engagement? Read on to find out.
Read on... (2010 words, approximately 11 minutes reading time)

written by Eric J. Ma on 2024-06-26 | tags: interviews technical hiring communication code review hiring technical skills job interviews collaboration
In this blog post, I share insights from a co-op performance calibration, highlighting the crucial difference between conversational and communication skills in the hiring process. I recount an experience where a candidate's excellent conversational abilities masked their technical skills, leading to a dilemma over whether to hire them. Drawing from Andy Grove's 'High Output Management,' I emphasize the importance of using interviews to gauge technical abilities effectively, advocating for code reviews as a high-bandwidth method to assess candidates' skills. This approach minimizes the risk of being misled by mere conversational charm. How can we better distinguish between conversational prowess and genuine communication skills in interviews?
Read on... (642 words, approximately 4 minutes reading time)

written by Eric J. Ma on 2024-06-18 | tags: data science reproducibility portability open source data management software skills data access version control data patterns technology guardrails
In this blog post, I explore the importance of reproducibility and portability in data science, focusing on data access patterns. I introduce pins, an open-source tool that enables data scientists to reference data from a central source of truth and manage data versions explicitly. By using pins, we can avoid common pitfalls like non-reproducible analyses and streamline the process of accessing and versioning data. This approach not only enhances productivity but also ensures that data is accessed in a consistent and error-free manner. Curious about how pins and analogous tools can robustify your data science workflow?
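As a rough sketch of what this looks like in practice (the board location and pin name below are hypothetical; the calls follow the pins Python package's documented API):

```python
import pandas as pd
import pins

# A board is the central source of truth; here, a shared folder,
# but boards can also live on S3, Azure, or Posit Connect.
board = pins.board_folder("/shared/data-boards", versioned=True)

# Write a dataset as a pin; each write creates a new, explicit version.
df = pd.DataFrame({"sample": ["s1", "s2"], "reading": [1.2, 3.4]})
board.pin_write(df, name="assay_readings", type="csv")

# Anyone with access to the board reads the same data by name...
latest = board.pin_read("assay_readings")

# ...and can list or pin to specific versions for reproducibility.
versions = board.pin_versions("assay_readings")
```
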
written by Eric J. Ma on 2024-06-08 | tags: protein structure autoregressive models neural networks von mises distribution protein backbone generation mixture models dihedral angles machine learning scientific research probability distribution
In this blog post, I do a deep dive into the paper 'The Continuous Language of Protein Structure' by Billera et al., which explores generating protein backbones using autoregressive models and the von Mises mixture model for sampling dihedral angles. This approach challenges the traditional discrete output of autoregressive models by producing continuous values, specifically for modeling protein structures. I discuss the technical and scientific premises, the role of the von Mises distribution, and the potential issue of non-identifiability in mixture models. How does this method open new avenues in protein structural modeling? Read on to find out.
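To illustrate the core sampling idea (not the paper's actual implementation; the mixture weights, means, and concentrations below are made up), here's how one might draw dihedral angles from a von Mises mixture with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical 3-component mixture over a dihedral angle:
# mixture weights, circular means (radians), and concentrations.
weights = np.array([0.5, 0.3, 0.2])
mus = np.array([-2.0, 0.5, 2.5])
kappas = np.array([8.0, 4.0, 12.0])

# Ancestral sampling: pick a component, then draw from its von Mises.
n_samples = 1000
components = rng.choice(len(weights), size=n_samples, p=weights)
angles = stats.vonmises.rvs(
    kappa=kappas[components], loc=mus[components], random_state=rng
)

# Samples are angles in radians on the circle.
print(angles[:5])
```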
Read on... (1452 words, approximately 8 minutes reading time)

written by Eric J. Ma on 2024-06-01 | tags: cuda jax conda environment variables cudnn python gpu dynamic libraries nvidia software installation
In this blog post, I share how to resolve CUDA backend initialization issues when installing JAX with CUDA, specifically addressing outdated cuDNN versions. I detail a method using Conda environments to manage CUDA installations and set environment variables correctly, offering two solutions: configuring LD_LIBRARY_PATH through Conda's activate.d and deactivate.d scripts, or directly within a Python session using a .env file. Both approaches aim to ensure that JAX utilizes the correct CUDA libraries, but each has its tradeoffs regarding portability. Curious about which method might work best for your setup?
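As a taste of the .env approach (the library path below is illustrative, and I'm assuming the python-dotenv package is installed), the key is loading the variable before JAX first touches the CUDA libraries:

```python
# Contents of a hypothetical .env file in the project root:
# LD_LIBRARY_PATH=/home/user/miniconda3/envs/jax-cuda/lib

from dotenv import load_dotenv

# Load the .env file into os.environ *before* importing jax,
# so the CUDA/cuDNN dynamic libraries resolve from the Conda env.
load_dotenv()

import jax

print(jax.devices())  # should list GPU devices if CUDA initializes correctly
```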
Read on... (1344 words, approximately 7 minutes reading time)

written by Eric J. Ma on 2024-05-27 | tags: deep learning multi-modal learning data fusion protein sequences biomedical texts gradient descent semantic alignment masked language modelling model architecture embedding conversion
In this blog post, I explore multi-modal deep learning through two papers from the biomedical world, covering the definition of data modalities, what fusion is and how it takes place within a model, and possible training objectives. I also consider how to utilize these models when only one input modality is available, highlighting the potential for protein function prediction and sequence generation. How can multi-modal deep learning transform our approach to complex data analysis?
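As a toy illustration of the simplest fusion strategy, concatenation (the dimensions, embeddings, and projections below are made up, not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings: a protein sequence embedding
# and a biomedical text embedding, with different native dimensions.
protein_emb = rng.normal(size=(1, 1280))
text_emb = rng.normal(size=(1, 768))

# Learned projection matrices (random stand-ins here) map both
# modalities into a shared 512-dimensional space.
W_protein = rng.normal(size=(1280, 512)) * 0.01
W_text = rng.normal(size=(768, 512)) * 0.01

# Concatenation fusion: downstream layers see one joint representation.
fused = np.concatenate([protein_emb @ W_protein, text_emb @ W_text], axis=-1)
print(fused.shape)  # (1, 1024)
```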
Read on... (2028 words, approximately 11 minutes reading time)

written by Eric J. Ma on 2024-05-16 | tags: pymol jupyter notebooks python scripting protein visualization pdb files data science gpt-4 matplotlib plotting bioinformatics automation
In this blog post, I share my journey of learning to script PyMOL directly from Jupyter notebooks, a skill I picked up with the help of GPT-4. I detail the process of installing PyMOL, setting up the environment, and scripting to process and visualize protein structures, specifically nanobody VHH structures. I also demonstrate how to plot these structures in a grid using matplotlib. But the more important lesson here is how quickly I was able to pick it up, thanks to GPT-4! Are you able to leverage LLMs as a skill-learning hack?
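For a flavor of the pattern (file names below are hypothetical, and this mirrors the general approach rather than the post's exact code):

```python
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import pymol
from pymol import cmd

# Launch PyMOL headless and quiet so it can be driven from a notebook.
pymol.finish_launching(["pymol", "-qc"])

pdb_files = ["vhh1.pdb", "vhh2.pdb", "vhh3.pdb", "vhh4.pdb"]  # hypothetical

# Render each structure to a PNG with PyMOL's command API.
for pdb in pdb_files:
    cmd.reinitialize()
    cmd.load(pdb)
    cmd.show("cartoon")
    cmd.png(pdb.replace(".pdb", ".png"), width=600, height=600, dpi=150, ray=1)

# Tile the rendered images into a grid with matplotlib.
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
for ax, pdb in zip(axes.flat, pdb_files):
    ax.imshow(mpimg.imread(pdb.replace(".pdb", ".png")))
    ax.set_title(pdb)
    ax.axis("off")
plt.tight_layout()
plt.show()
```
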
written by Eric J. Ma on 2024-05-12 | tags: crispr-cas protein language models genomic data machine learning bioinformatics sequence generation protein engineering gene editing dataset curation computational biology data science generative model generative artificial intelligence
In this blog post, I do a deep dive into a fascinating paper on designing CRISPR-Cas sequences using machine learning. The authors develop a generative model to produce novel protein sequences, validated in the lab, aiming to circumvent intellectual property restrictions. They curate a vast dataset, the CRISPR-Cas Atlas, and employ various models and filters to ensure sequence viability. My review highlights the methodology, emphasizing the importance of filtering and the challenges of using 'magic numbers' without justification. How many sequences are enough to train a generative model, and what makes laboratory experiments faster? Curious to find out more?
Read on... (3017 words, approximately 16 minutes reading time)