Eric J Ma's Website

« 1 2 3 »

Accurately extract text from research literature PDFs with Nougat-OCR and Docling

written by Eric J. Ma on 2024-12-20 | tags: docling nougat llms document parsing gpu

In this blog post, I explore the challenges of extracting structured text from PDFs, especially when dealing with equations, tables, and figures. I discuss two tools, Nougat-OCR by Facebook Research and Docling by IBM, which I found effective for this task. Nougat-OCR excels at handling equations and tables, while Docling excels on extracting figures. By combining these tools, we can develop a workflow that captures all critical components of a PDF. Want to know how to retain valuable knowledge from complex PDFs?

Read on... (839 words, approximately 5 minutes reading time)
How to thrive, and not just survive, during organizational change

written by Eric J. Ma on 2024-12-17 | tags: professional growth leadership relationships networking organizational change professional development

In this blog post, I explore how to navigate and thrive during organizational changes. I share personal insights and practical strategies, such as focusing on meaningful relationships with colleagues, consistently delivering great work, and proactively building your career path. I also emphasize the importance of staying present and cultivating a 'career committee' of trusted advisors. Change is inevitable in any organization, but how we respond can transform these shifts into growth opportunities. Curious about how to build your own resilience in changing times?

Read on... (592 words, approximately 3 minutes reading time)
5 retrieval strategies to boost your RAG system's performance

written by Eric J. Ma on 2024-12-16 | tags: retrieval augmented generation keyword search fuzzy search vector search knowledge graph large language models

In this blog post, I provide an overview of retrieval methods for Retrieval-Augmented Generation (RAG), exploring various methods like human-curated, exact keyword search, fuzzy keyword search, vector similarity search, and knowledge graph-based retrieval. Each method is dissected to reveal its unique strengths and ideal use cases, providing insights into how they can enhance RAG systems' performance. Curious about how these strategies can be combined for even more robust results?

Read on... (1515 words, approximately 8 minutes reading time)
How LlamaBot's new agent features simplify complex task automation

written by Eric J. Ma on 2024-12-15 | tags: agents llamabot automation interface python analysis workflow

In this blog post, I explore the innovative features of LlamaBot's new AgentBot, designed to simplify complex task automation. These agents operate with goal-oriented non-determinism, decision-making flow control, and natural language interfaces, making them powerful yet user-friendly. I also provide real-world examples, including a detailed walkthrough of a stock market analysis. Curious about how these agents can streamline your workflows and enhance the flexibility of your LLM applications?

Read on... (1344 words, approximately 7 minutes reading time)
A modest proposal for data catalogues at biotechs

written by Eric J. Ma on 2024-11-22 | tags: strategy cloud adoption data catalog data discovery data scientist biotech data governance social graph

Building data platforms at biotechs often fails because we ask scientists to change their workflow and manually catalog data. This leads to poor adoption, wasted engineering effort, and continued data accessibility problems. Instead of building new systems, I propose automatically capturing data sharing patterns that already exist. This approach:

  • Reduces implementation costs by 60-80% compared to traditional platforms
  • Requires zero change in scientist behavior
  • Creates an automatically-maintained data catalog
  • Enables rapid data discovery through social connections
  • Can be implemented incrementally, showing value within 3-6 months

Read on... (2513 words, approximately 13 minutes reading time)
Deploying Ollama on Modal

written by Eric J. Ma on 2024-11-14 | tags: modal deployment open source api cloud gpu software models ollama large language models

In this blog post, I share my journey of deploying Ollama to Modal, enhancing my understanding of Modal's capabilities. I detail the script used, the setup of the Modal app, and the deployment process, which includes ensuring the Ollama service is ready and operational. I also implement an OpenAI-compatible endpoint that makes it easy to use the deployment with existing tools and libraries. This exploration not only expanded my technical skills but also created a practical solution for using open-source models in production. Curious about how this deployment could streamline your projects?

Read on... (1762 words, approximately 9 minutes reading time)
Disposable environments for ad-hoc analyses

written by Eric J. Ma on 2024-11-08 | tags: python tooling data science notebook reproducibility juv uv environment management scripts analysis

In this blog post, I explore the innovative 'juv' package, which simplifies Python environment management for Jupyter notebooks by embedding dependencies directly within the notebook file. This approach eliminates the need for separate environment files, making notebooks easily shareable and reducing setup complexity. I also discuss integrating 'juv' with 'pyds-cli' to streamline ad-hoc data analyses within organizations, enhancing reproducibility and reducing environment conflicts. Curious about how this could change your data science workflow?

Read on... (1328 words, approximately 7 minutes reading time)
Introducing new (local) LlamaBot logging features

written by Eric J. Ma on 2024-11-02 | tags: llamabot logging software development llm large language models web development version control

In this blog post, I share the latest updates to LlamaBot, including automatic logging of LLM interactions, version-controlled prompt logging, and a new web-based UI for visualizing these logs. These features enhance prompt analysis and make prompt engineering more intuitive and data-driven. Additionally, logs can now be exported in OpenAI fine-tuning format for easier sharing and integration. If you're keen on refining your LLM interactions and prompt crafting, these tools might be just what you need. Curious to see how these new features can streamline your workflow?

Read on... (595 words, approximately 3 minutes reading time)
The Human Dimension to Clean, Distributable, and Documented Data Science Code

written by Eric J. Ma on 2024-10-25 | tags: data science coding documentation readability distribution best practices cognition tool making

This blog post, which is my pyOpenSci Fall Training keynote, explores the importance of creating clean, distributable, and well-documented data science code, emphasizing the human dimension of coding practices. In it, I discuss key concepts such as readability, cognitive load, and the toolmaker's mindset, and provide practical insights on how to make code more accessible and impactful for both the creator and other users. I also touch on the role of AI in coding and documentation.

Read on... (3795 words, approximately 19 minutes reading time)
Cursor did a one-shot rewrite of a Panel app I built

written by Eric J. Ma on 2024-10-20 | tags: coding fastapi htmx automation design prototype development patterns assistance productivity

In this blog post, I share my experience using Cursor to transform a sluggish Panel app into a sleek HTMX + FastAPI application. I detail how AI-assisted coding not only sped up the process but also enhanced my understanding of web development. Despite my limited skills in this area, the AI tools helped me prototype faster and learn valuable lessons. I also reflect on the importance of critically assessing AI-generated code to improve one's coding instincts. Curious to see how it turned out and what insights I gained?

Read on... (690 words, approximately 4 minutes reading time)
« 1 2 3 »