Eric J Ma's Website

Dashboard-ready data is often machine learning-ready data

written by Eric J. Ma on 2024-02-18 | tags: data science machine learning data engineering python packages chemical screening predictive models data curation

In this blog post, I discuss the overlap between dashboard-ready and machine-learning-ready data. I share an example from a chemical screening campaign, where the same data used for a dashboard can also be used for machine learning models. I explore the reasons behind this from both a statistical and business perspective. How can you get leverage from your data for both purposes?

Read on... (426 words, approximately 3 minutes reading time)
Success Factors for Data Science Teams in Biotech

written by Eric J. Ma on 2024-02-07 | tags: talks conferences slas2024 data science biotech

In this blog post, I shared my insights from the SLAS 2024 conference on how a data science team can deliver long-lasting impact in a biotech research setting. I discussed the importance of bounding work, identifying high-value use cases, possessing technical and interpersonal skills, and having the right surrounding context. I also shared some personal experiences and lessons learned from my work at Moderna. How can these insights help your data science team become successful agents of high-value delivery? Read on to find out!

Read on... (2490 words, approximately 13 minutes reading time)
An (incomplete and opinionated) survey of LLM tooling

written by Eric J. Ma on 2024-02-01 | tags: language model open source api python vector retrieval prompt experimentation ui command line interfaces llms framework zotero

In this blog post, I explore the rapidly evolving landscape of large language model (LLM) tooling, discussing APIs, self-hosting, API switchboards, Python-based LLM Application SDKs, vector-based retrieval, prompt experimentation, evaluation, UI builders, and command-line interfaces. I share my experiences building LlamaBot and offer principles for making smart tech stack choices in this ever-changing field. How can you navigate this dynamic ecosystem and make the best decisions for your LLM projects? Read on to find out!

Read on... (1611 words, approximately 9 minutes reading time)
Exploratory data analysis isn’t open-ended

written by Eric J. Ma on 2024-01-28 | tags: data science eda exploratory data analysis pandas matplotlib seaborn correlations generative models therapeutics biological sequences metadata visualization

In this blog post, I challenge the traditional approach to exploratory data analysis (EDA) in data science. I argue that EDA should be directed and purposeful, not aimless. I share key principles for effective EDA, including falsifying our assumptions, having a clear end purpose, and embracing iteration when purposes are invalidated. I also emphasize the importance of practice and domain expertise in developing this skill. How can we make EDA more purposeful and effective in our data science work? Read on to find out!

Read on... (776 words, approximately 4 minutes reading time)
Your embedding model can be different from your text generation model

written by Eric J. Ma on 2024-01-15 | tags: embedding models retrieval augmented generation semantic search text generation vector databases llamabot documentstore sentence transformer

In this blog post, I debunked the misconception that embedding models must match the text generation model in retrieval augmented generation (RAG). I explained how these models are decoupled, with the choice of embedding affecting only the quality of content retrieved, not the text generation. I also shared my preference for SentenceTransformer due to its cost-effectiveness and performance. Finally, I updated LlamaBot to reflect this understanding, allowing for more flexible model composition. Curious about how this could change your approach to RAG? Read on!
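
To make the decoupling concrete, here is a minimal sketch in plain Python. It is not LlamaBot's implementation: `embed`, `retrieve`, `generate`, and `answer` are hypothetical names, the bag-of-words embedding stands in for a real embedding model such as SentenceTransformer, and the generator is a stub. The point it illustrates is that the generator only ever receives retrieved text, so the embedding model can be swapped without touching text generation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (hypothetical stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query.

    Only this step depends on the embedding model.
    """
    query_vector = embed(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


def generate(prompt: str) -> str:
    """Stub text generator (assumption): it sees only text, never vectors."""
    return f"[generated from a prompt of {len(prompt)} characters]"


def answer(query: str, documents: list[str]) -> str:
    """Retrieve context as plain text, then hand it to the generator."""
    context = "\n".join(retrieve(query, documents))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

Replacing `embed` with a different model changes which documents `retrieve` returns, and nothing else; `generate` is unaffected.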

Read on... (526 words, approximately 3 minutes reading time)
GitHub Actions secrets need to be explicitly declared

written by Eric J. Ma on 2024-01-11 | tags: llamabot mistral gpt-4 api key environment variables github actions repository secret workflow step

In this blog post, I share my experience of debugging GitHub Actions for LlamaBot. I encountered a challenge with setting the Mistral API key as an environment variable in my GitHub action. After hours of frustration, I discovered that GitHub Actions can only read a secret if it's explicitly included in a workflow. I explain how to include it in a workflow step. Curious about how to securely manage your API keys in GitHub Actions? Read on!
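
As a rough illustration of the consumer side, the sketch below shows a Python script that fails loudly when the key is absent. It assumes the workflow step maps the repository secret into the environment with an `env:` entry such as `MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}`; the secret name and the `require_secret` helper are assumptions for illustration, not taken from the post.

```python
import os
from typing import Mapping, Optional


def require_secret(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the named environment variable or raise a descriptive error.

    Hypothetical helper: a secret that the workflow step does not map
    shows up as missing or empty, so both cases are treated as errors.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not set. In GitHub Actions, secrets are not exposed "
            "automatically; map the secret explicitly in the workflow step's "
            "`env:` block."
        )
    return value
```

Calling `require_secret("MISTRAL_API_KEY")` at the top of a test or script turns a confusing downstream authentication failure into an immediate, readable error.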

Read on... (212 words, approximately 2 minutes reading time)
Evolving LlamaBot

written by Eric J. Ma on 2024-01-10 | tags: llamabot api chromadb openai mistral anthropic claude mixtral simplebot chatbot querybot llm large language model

In this blog post, I discuss the major changes I've made to LlamaBot, a project I've been working on in my spare time. I've integrated LiteLLM for text-based models, created a new DocumentStore class, and reimagined the SimpleBot interface. I've also experimented with the Mixin pattern to create more complex bots and switched to character-based lengths for more user-friendly calculations. How did these changes improve the functionality and efficiency of LlamaBot? Read on to find out!

Read on... (2451 words, approximately 13 minutes reading time)
Lessons Learned Optimizing AlphaFold at Moderna

written by Eric J. Ma on 2023-12-13 | tags: alphafold moderna technology cloudcomputing aws infrastructure code optimization refactoring scaling challenges continuous integration

In this blog post, I share my experience of refactoring the AlphaFold execution script at Moderna, which led to significant cost savings and efficiency gains. I discuss the challenges faced, including hitting AWS' GPU instance availability limits, and the lessons learned, such as the importance of static analysis tools, CI/CD caching, reading code before writing, and working openly. Curious about the technical details and the lessons I learned from this experience? Read on!

Read on... (1333 words, approximately 7 minutes reading time)
Classes? Functions? Both?

written by Eric J. Ma on 2023-12-12 | tags: data science programming style function-based programming class-based programming data processing object-oriented data structures neural network data transformation callable objects

In this blog post, I discuss the choice between class- or function-based programming for data scientists. I argue that objects are best for grouping data, while functions are ideal for processing data. However, configurable functions that need to be reused can be implemented both ways. I lean towards a functional programming style, using classes to organize related data. But sometimes, like with callable objects, I adopt a different approach. Curious about when to use each style in your data science projects? Read on!
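
The distinction can be sketched as follows. All names here (`Measurement`, `mean_value`, `Scaler`, `make_scaler`) are hypothetical and chosen only for illustration: a dataclass groups related data, a plain function processes it, and a configurable, reusable transformation is written both as a callable object and as a closure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A class used purely to group related data."""

    sample_id: str
    value: float


def mean_value(measurements: list[Measurement]) -> float:
    """A plain function that processes data."""
    return sum(m.value for m in measurements) / len(measurements)


class Scaler:
    """Configurable function, class style: configuration lives on the instance."""

    def __init__(self, factor: float) -> None:
        self.factor = factor

    def __call__(self, m: Measurement) -> Measurement:
        return Measurement(m.sample_id, m.value * self.factor)


def make_scaler(factor: float):
    """Configurable function, functional style: configuration lives in a closure."""

    def scale(m: Measurement) -> Measurement:
        return Measurement(m.sample_id, m.value * factor)

    return scale
```

Both `Scaler(2.0)` and `make_scaler(2.0)` can be called on a `Measurement` and give the same result; the callable object additionally exposes its configuration through `factor`, which can be handy for inspection.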

Read on... (1177 words, approximately 6 minutes reading time)
Elevating Team Performance: Feedback Strategies for Data Science Leaders

written by Eric J. Ma on 2023-12-11 | tags: data science team management culture feedback coaching code review asynchronous feedback technical feedback team morale continuous improvement

In this blog post, I share my experiences and insights on providing effective feedback in a data science team. I discuss the importance of positivity, specificity, self-reflection, effusiveness, in situ technical feedback, connecting accomplishments to broader impacts, and uplifting when mistakes occur. These strategies foster a supportive environment, promote continuous improvement, and align team members with the broader mission. How can these feedback strategies improve your team's dynamics and performance? I hope the experiences I share here inspire you!

Read on... (1773 words, approximately 9 minutes reading time)