Eric J Ma's Website

scikit-learn tutorial

written by Eric J. Ma on 2016-02-09 | tags: data science software carpentry data carpentry


This past Monday, I led a hands-on session at the Broad Institute, showing how to use the scikit-learn API, as well as common coding patterns for running machine learning algorithms on the data.

First off, I was totally surprised at how many people signed up for the event - it "sold out" (tickets were free) within 3 hours of opening registration. I was quite floored - and I know it's not because I'm some famous dude who knows ML algorithms. Rather, it told me and my co-organizers that there is most certainly great demand here for this topic, and it should be run again.

I found it to be a great opportunity to put some of my Software Carpentry/Data Carpentry tools to use. For example, we used sticky notes to indicate class progress (we used blue for "all done", and red for "need help"). I also tried to ensure that participants could walk away from the tutorial knowing how to do something that they could use immediately in their research.

The examples I used (hosted openly on GitHub) were based on transforming sequence information into a sequence feature matrix, which is then fed into a selected machine learning algorithm. This seemed to suit the crowd, but I would love to use different examples as well, such as microbiome data or transcriptomics data.
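To give a flavour of that pattern, here is a minimal sketch rather than the actual workshop code. It assumes equal-length nucleotide sequences, a one-hot encoding, and a random forest classifier; the toy sequences, the labels, and the helper name `one_hot_encode` are all made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

ALPHABET = "ACGT"  # assumed nucleotide alphabet; real data may need a larger one


def one_hot_encode(sequences, alphabet=ALPHABET):
    """Turn equal-length sequences into a 2D binary feature matrix.

    Each sequence position contributes len(alphabet) columns,
    exactly one of which is set to 1.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    index = {letter: i for i, letter in enumerate(alphabet)}
    X = np.zeros((len(sequences), length * len(alphabet)), dtype=int)
    for row, seq in enumerate(sequences):
        for pos, letter in enumerate(seq):
            X[row, pos * len(alphabet) + index[letter]] = 1
    return X


# Made-up toy data, purely for illustration.
sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
labels = [0, 0, 1, 1]  # hypothetical class labels

# Rows are samples, columns are features: the shape scikit-learn expects.
X = one_hot_encode(sequences)

# Any estimator could be swapped in here; they all share fit/predict.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, labels)
predictions = model.predict(one_hot_encode(["ACGT"]))
```

The useful part is that, once the data is a samples-by-features matrix, trying a different algorithm is mostly a matter of changing the line that constructs `model`.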

Feedback from the participants was quite positive overall. Some felt I could have explained things a bit more clearly, which I recognize as an area to improve on for a second round. Others loved the hands-on instruction; in their feedback, it was just the right delivery format for the workshop.

One participant asked me, "Why do you host these workshops?" (I had done one on statistics with some fellow BE colleagues.) I hadn't thought much about that question, so my instinctive response was, "For fun - I'm finding ways to share knowledge, for the benefit of the community." I stand by that thought, as I think it's important for knowledge-bearers to make copies of their knowledge, for the sake of giving back to the community. Yet, stemming from knowledge of myself, I also think that sharing programming knowledge around is an insurance policy against personal stagnation. Put in simple terms, if others know how to wield the tools that I know how to use, then I had better continue levelling up my skills to stand out. It also gets boring, and sometimes frustrating, after a while being the "go-to" person for things ML-related or computing-related; it means I can't devote time to exploring new things. So sharing is fun, and is an insurance policy against boredom and frustration - why not share then? :)

All in all, looking forward to reporting on the workshop at the next Broad NextGen meeting, and hosting another iteration at a later time!

