## ๐ How LLMs Can Accelerate Data Science [Eric J. Ma](https://ericmjl.github.io/) `ericmjl.github.io/how-llms-can-accelerate-data-science/` --- ## ๐๐ปโโ๏ธ About me - 7 years of GPT-4 prompt engineering experience. - Proficient in telepathy, can communicate with future technologies. - Proficient in sending time-traveling emails. *I jest* --- ## ๐๐ปโโ๏ธ About me (take 2) > "Sr. Principal Data Scientist, DSAI (Research) @ Moderna" - Sr. Principal Data Scientist: Data science team lead - DSAI: Computer nerd - Research: Science nerd --- ## ๐ Our team exists to... > ...make discovery science run at the speed of thought and to quantify the previously unquantified. --- ## 2๏ธโฃ ๐งต --- ## ๐งต Thread 1๏ธโฃ: LLMs are incredibly empowering ---- ### ๐ค Generative AI @ Moderna - Fully embraced with democratized access to GPT4-128K. - Internal academy builds community. - Flowering of use citizen-generated use cases. ---- ### ๐ง LLMs are - your most knowledgeable drafting agent - prone to mistakes *Trust but verify!* --- ## ๐งต Thread 2๏ธโฃ: Data science necessarily involves writing software ---- ### ๐ Data science requires coding
---- ### ๐งโ๐ป Coding requires software skills [![Data Science Software Skills](images/data-science-software-skills.webp)](https://ericmjl.github.io/blog/2024/4/5/how-to-grow-software-development-skills-in-a-data-science-team/) ---- ### ๐ฅ๏ธ Software skills can be grown
--- ## 2๏ธโฃ ๐งต (recap) - LLMs are incredibly empowering - Data science necessarily involves writing software --- ## โจ LLMs Enable DS Teams to Draft Great Software ---- ### ๐ Draft New Code ```python from sklearn.ensemble import RandomForestRegressor import pandas as pd # Read the file `data.csv` and create a Random Forest # regressor model to predict `output` from the rest # of the columns. ``` The more specific, the better the LLM response. ---- ### ๐ Draft Docstrings ```python import pandera as pa from .schemas import raw_data_schema @pa.check_input(raw_data_schema) def preprocess(df: pd.DataFrame): """Preprocess dataframe. :param df: pandas DataFrame """ # Copilot is usually great at drafting out docstrings ``` ---- ### ๐งช Draft Tests ```python @prompt def draft_tests(function_body): """I have the following function {{ function_body }} Draft me three unit tests. """ ``` --- ## ๐ช LLMs Gives Your Data Science Team Superpowers ---- ### ๐ Refactor entire notebooks ```python @prompt def refactor(notebook_json): """Given the following Jupyter notebook code: {{ notebook_json }} Help me propose a suite of functions that can be refactored out from the notebook. """ ``` ---- ### ๐ Debug Stack Trace ```python @prompt def debug(stack_trace): """Given the following stack trace: {{ stack_trace }} Help me debug what's going on here by proposing hypotheses as to why I am encountering this error. """ ``` ---- ### ๐ Draft Long-Form Documentation ```python @prompt def draft_documentation( presentation_transcript, diataxis_spec, documentation_type ): """I have the following meeting transcript: {{ presentation_transcript }} Here is the Diataxis spec for {{ documentation_type }}: {{ diataxis_spec }} Please draft for me the {{ documentation_type }} using the transcript as source material. """ ``` Check out the Diataxis framework [here](https://diataxis.fr)! ---- ## โ๏ธ Write commit messages ```text commit 5ffff99bae4611bdcea09187d839e5b40b9ef630 Author: Eric Ma
Date: Mon Apr 15 15:55:18 2024 -0400 feat(index): update professional title and add about sections - Update the professional title to "Senior Principal Data Scientist" at Moderna. - Add a humorous "About me" section with claims of GPT-4 prompt engineering experience, telepathy, and time-traveling emails. - Introduce a more serious "About me (take 2)" section detailing roles and expertise. - Highlight involvement in Generative AI at Moderna, emphasizing democratized access to GPT, community building through an internal academy, and the development of citizen-generated use cases. ``` ---- ## โ๏ธ If you want the tool... ```bash pip install -U llamabot llamabot git install-commit-message-hook # Ensure you have an OpenAI API Key ``` ---- ### ๐๏ธ Draft Database Queries ```python @prompt def nl2sql(nl_request, schema): """Given the following database schema: {{ schema }} And the following natural language request: {{ nl_request }} Return for me the SQL necessary to satisfy the natural language request. """ ``` ---- ### โ Ask Questions of Code Repos ```bash llamabot repo chat https://github.com/webpro/reveal-md.git \ --checkout main \ --model-name gpt-4-0125-preview \ --panel ``` ---- ---- ### ๐ Learn a New Domain ```python @prompt def learn(concept, education_level): """You are an expert in {{ concept }}. I am looking to understand {{ concept }}. Explain it to me the way Richard Feynman would explain it to a student in {{ education_level }}. """ ``` ---- ---- ### ๐ Add Emojis to Your Talk ```python @prompt def add_emojis(talk_md): """Given the following Markdown slide deck: {{ talk_md }} Add emojis to all of the title headers while preserving the original text. """ ``` ---- ## ๐งฉ The Hardest Routine Parts of Data Science 1. Taking the time to document our work. 2. Checking correctness of our work. 3. Rapidly ramping up into a new knowledge domain. *LLMs can accelerate all of the above.* --- ## ๐ A Productive New World - LLMs are an insanely great productivity tool. - Supercharge your learning and automate the boring stuff. --- ## ๐ค Select reading - [Grow DS team software development skills](https://ericmjl.github.io/blog/2024/4/5/how-to-grow-software-development-skills-in-a-data-science-team/) - [Survey of LLM tooling](https://ericmjl.github.io/blog/2024/2/1/an-incomplete-and-opinionated-survey-of-llm-tooling/) - [Organize and motivate a DS team](https://ericmjl.github.io/blog/2024/3/23/how-to-organize-and-motivate-a-biotech-data-science-team/) - [Keep technically sharp as a DS team lead](https://ericmjl.github.io/blog/2024/2/25/how-to-keep-sharp-with-technical-skills-as-a-data-science-team-lead/) - Data Science Bootstrap Notes [website](https://ericmjl.github.io/data-science-bootstrap-notes/get-bootstrapped-on-your-data-science-projects/) and [ebook](https://leanpub.com/dsbootstrap/) --- ## โญ๏ธ Thank You