Order of magnitude is more than accurate enough

written by Eric J. Ma on 2019-07-07

data science estimation statistics

When I was in Science One at UBC in 2006, our Physics professor, Mark Halpern, made a quotable statement that has stuck with me for many years.

Order of magnitude is more than accurate enough.

At the time, that statement rocked the class, myself included. We had been classically taught that significant digits are significant, and that we have to keep track of them. But Mark’s quote seemed to throw all of that caution and precision to the wind. Did what we learned in Physics lab class not matter?

Turns out, there was one highly instructive activity that still hasn’t left my mind. During a recitation, we were asked to estimate how many days the city of Vancouver could be powered if we took a piece of chalk and converted its entire mass into energy. This clearly required estimating the chalk’s mass and Vancouver’s daily energy consumption, neither of which we had any way of knowing accurately.

Regardless, I took it upon myself to carry significant digits in our calculation, while my recitation partner, Charles Au, was fully convinced that this wasn’t necessary, and so did all calculations order-of-magnitude. We debated and agreed upon what assumptions we needed to arrive at a solution, and then proceeded to do the same calculations, one with significant digits, the other without.

We reached the same conclusion.

More precisely, I remember obtaining a result along the lines of $6.2 \cdot 10^3$ days, while Charles obtained $10^4$ days. To within an order of magnitude, more or less equivalent.
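The spirit of that exercise translates directly into code. Here is a minimal sketch with entirely made-up ballpark inputs (the numbers we actually used are long forgotten), comparing a significant-digits calculation against a pure order-of-magnitude one:

```python
import math

# Sketch of the chalk-to-energy estimate, with made-up ballpark inputs.
# The point: the careful answer and the order-of-magnitude answer land
# in the same ballpark, so the significant digits buy very little.
C = 2.998e8          # speed of light, m/s
chalk_mass = 9.7e-3  # kg; assumed mass of a piece of chalk
daily_use = 4.3e13   # J/day; assumed city-wide daily energy use

# "Careful" version: carry the significant digits through E = mc^2.
energy = chalk_mass * C ** 2
careful_days = energy / daily_use

# Order-of-magnitude version: round every input to the nearest power of ten.
def oom(x):
    """Round a positive number to its nearest power of ten."""
    return 10.0 ** round(math.log10(x))

ballpark_days = (oom(chalk_mass) * oom(C ** 2)) / oom(daily_use)

print(f"careful:  {careful_days:.1f} days")   # ~20 days
print(f"ballpark: {ballpark_days:.0f} days")  # ~10 days: same ballpark
```

With these assumptions the two answers differ by a factor of two, well within an order of magnitude of each other, just as in class.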

In retrospect, I shouldn’t have been so surprised. Mark is an astrophysicist, and at that scale, 1 or 2 significant digits might not carry the most importance; rather, getting into the right ballpark might be more important. At the same time, the recitation activity was a powerful first-hand experience of that last point: getting into the right ballpark first.

I was also missing a second perspective, though, which explains my surprise at Mark’s quote. Now that I’ve gone the route of more statistics-oriented work, I see a similar theme showing up. John Tukey said something along these lines:

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

The connection to order-of-magnitude estimates should be quite clear here. If we’re correct to an order of magnitude on the right question, we can always refine the answer further. If we’re precisely answering the wrong question, God help us.

What does this mean for a data scientist? For one, it means that approximate methods are usually good enough, practically speaking, to get us into the right ballpark; we can use pragmatic considerations to decide whether we need a more complicated model. It also means that when we’re building data pipelines, minimum viable products, which help us test whether we’re answering the right question, matter more than the fanciest deep learning model.

So yes, to mash those two quotes together:

Order of magnitude estimates on the right question are more useful than precise quantifications on the wrong question.

Did you enjoy this blog post? Let's discuss more!


SciPy 2019 Pre-Conference

written by Eric J. Ma on 2019-07-07

conferences scipy2019

For the 6th year running, I’m out at UT Austin for SciPy 2019! It’s one of my favorite conferences to attend: the latest in data science tooling is well featured in the conference program, and I get to meet in person many of the GitHub usernames I interact with online.

I will be involved in three tutorials this year, covering what I think has become my data science toolkit: Bayesian stats, network science, and deep learning. I’m really excited to share my knowledge; my hope is that at least a few more people find the practical experience I’ve gained over the years useful, and that they can put it to good use in their own work too. This year is also the first year I’ve submitted a talk on pyjanitor, a package I have developed with others for cleaning data. I’m excited to share it with the broader SciPy community as well!

I’m also looking forward to meeting the conference scholarship recipients. Together with Scott Collis and Celia Cintas, I’ve been managing the FinAid process for the past three years, and each year it heartens me to see the scholarship recipients in person.

Finally, this year’s SciPy is quite unique for me, as it is the first year that I’ll be here with colleagues at work! (In prior years, I came alone, and did networking on my own.) I hope they all have as much of a fun time as I have at SciPy!



Bone Marrow Donations

written by Eric J. Ma on 2019-06-30

personal charity leukemia donations blood donation community service

A friend of mine just reached out to me, saying that he’s been diagnosed with leukemia. Thankfully, he’s not subject to the abysmal state of US healthcare (as he lives in a place where healthcare coverage is great), and so he’s on treatment, progressing, and hopefully has a great shot at beating this cancer.

He definitely knows how to speak to a data scientist: with data. The odds of a match for a patient who needs a bone marrow transplant are 500:1; on average, only about 1 donor in 500 will be a match. Put another way, under certain assumptions, every 500 donors who register mean that, on average, one more life can be saved. I did some digging myself: according to the US Health Resources and Services Administration, nearly every minority ethnic group is underrepresented in donor registry databases.
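To see what 500:1 odds imply as a registry grows, here is a quick back-of-envelope calculation. It assumes, simplistically, that each registered donor independently matches a given patient with probability 1/500:

```python
# Probability that a patient finds at least one match in a registry of
# n donors, assuming each donor matches independently with p = 1/500.
# (A simplification: real match probabilities vary by ethnic group,
# which is exactly why underrepresentation in registries matters.)
p = 1 / 500
for n in [100, 500, 2000, 5000]:
    at_least_one = 1 - (1 - p) ** n
    print(f"{n:>5} donors -> P(at least one match) = {at_least_one:.0%}")
```

Under these assumptions, 100 donors give a patient roughly an 18% chance of at least one match, 500 give about 63%, and a few thousand push it close to certainty.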

As it turns out, signing up to be a donor is quite lightweight. Only one piece of genetic information is needed, the Human Leukocyte Antigen (HLA) type, and it can be obtained non-invasively. If a match is found, the donor still has the option to withdraw if they have any objections; the process is completely voluntary for the donor. There are two types of donation: peripheral blood stem cells (PBSC) and bone marrow, with PBSC donations being lightweight and bone marrow donations more involved. Digging a bit deeper, it seems the only sacrifice a donor makes is that of time and some discomfort.

I’m putting this blog post up as a reminder to myself to register, and to encourage others to do so as well. If you’re in the United States, Be The Match is the organization to get in touch with; if you’re from my home country of Canada, the Canadian Blood Services manages the process.



Graphs and Matrices

written by Eric J. Ma on 2019-06-15

graphs network science data science

Once again, I’m reminded through my research how neat and useful it is to be able to think of matrices as graphs and vice-versa.

I was constructing a symmetric square matrix of values in which multiple cells were empty (i.e. no values present). (Thankfully, the diagonal is guaranteed dense.) From this matrix, I wanted the largest set of rows/columns that formed a fully populated, symmetric square submatrix, subject to a second constraint: the set of rows/columns should also maximally intersect with another set of items.

Having thought about the requirements of the problem, my prior experience with graphs reminded me that every graph has a corresponding adjacency matrix, and that finding the densest symmetric subset of entries in the matrix was equivalent to finding cliques in a graph! My intern and I proceeded to convert the matrix into its graph representation, and a few API calls in networkx later, we found the matrix we needed.
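To make the correspondence concrete, here is a stdlib-only sketch of the idea on a made-up toy matrix, with `None` marking empty cells. (In practice, networkx's `find_cliques` does the heavy lifting; the brute-force search below is just for illustration.)

```python
from itertools import combinations

# Toy symmetric matrix with a dense diagonal and some empty cells (None).
M = [
    [1,    2,    None, 3   ],
    [2,    1,    4,    5   ],
    [None, 4,    1,    None],
    [3,    5,    None, 1   ],
]
n = len(M)

# Adjacency of the corresponding graph: i -- j whenever cell (i, j) is filled.
edges = {(i, j) for i in range(n)
         for j in range(i + 1, n) if M[i][j] is not None}

def largest_dense_block(n, edges):
    """Largest set of rows/columns whose induced submatrix has no empty
    cells, i.e. the largest clique in the corresponding graph.
    Brute force is fine at toy sizes; use networkx for real matrices."""
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(pair in edges for pair in combinations(subset, 2)):
                return subset
    return ()

print(largest_dense_block(n, edges))  # (0, 1, 3)
```

With networkx, the same result would come from building a graph from the edge list and taking the largest of the cliques yielded by `nx.find_cliques`.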

The key takeaway from this experience? Once we find the right representation for a problem, we can often solve it quickly using the appropriate APIs!



Mobile Working on the iPad

written by Eric J. Ma on 2019-06-14

productivity mobile

A few years ago, I test-drove mobile work using my thesis as a case study, basically challenging myself with the question: how much of my main thesis paper could I write on iOS (specifically, on an iPad Mini)? Back then, iOS turned out to be a superb tool for the writing phase (getting ideas into a text editor), and a horrible one for final formatting before submitting a paper to a journal (too inflexible). Don’t get me wrong, though: I would still use it as part of my workflow if I were to do it again!

Fast-forward a few years: I now do more programming in a data science context than formal writing, and the tooling for software development and data analysis on iOS has improved greatly. So, armed with an iPad Pro (11”), I decided to challenge myself with a new experiment: how much development and analytics can one do on iOS?

Software Development

I develop pyjanitor as a tool that I use in my daily work and as part of my open source software portfolio. When I’m on my MacBook or on my Linux desktop at home, I usually work in VSCode for its integrated terminal, superb syntax highlighting, git integration, code completion with Kite, and more.

Moving to iOS, VSCode is not available, which immediately means relying on terminal-based tools to get my work done. I ponied up for Blink Shell, and it paid off immediately. Having enabled remote access on my Linux tower at home, I was thrilled to learn that Blink supports mosh; paired with tmux, it is a superb solution for maintaining persistent shells across spotty internet connections.

A while ago, I also configured nano with syntax highlighting. As it turned out, syntax highlighting had the biggest effect on my productivity of any text editor enhancement (e.g. code completion, git integration, etc.). After I mastered most of nano's shortcut keys, I found I could be productive coding in nano alone. Missing the usual assistive tools meant I coded somewhat slower, but the pace was still acceptable; moreover, relying less on those tools helped me develop muscle memory for certain API calls. I also found myself becoming more effective because the single-window idiom of iOS kept me focused on the programming task at hand, rather than getting distracted while looking at docs in a web browser (a surprisingly common occurrence for me!).

Data Analysis

For data analysis, Jupyter notebooks are my tool of choice, for their interactive nature and the ability to weave a narrative through the computation. Jupyter Lab is awesome for this task, but it’s poorly supported in mobile Safari. The best offering for Jupyter notebooks on iOS at the moment is Juno, which can connect to a Jupyter server accessible through an IP address or URL. This does require payment, and I gladly ponied up for that as well.

I run a Jupyter server on my Linux tower at home. Because it has a GPU installed, accessing the machine through Juno suddenly gives my iPad a full-fledged, fully configured GPU as part of its compute environment! Coupled with the responsiveness of Juno, this makes for a fairly compelling setup for Python programming on an iPad.

Pros and Cons of iPad-based Development

Cons

Overall, the experience has been positive, but there have been some challenges, which I would like to detail here.

Remote server required: Because we are essentially using the iPad as a thin client to a remote server, one must either pay for a development server in the cloud, or go through the hassle of setting up a development machine that one can SSH into. This may turn off individuals who are loath to rent a computer, or who don’t have the experience needed to set up a remote server on their own.

iOS Multi-Windowing: It’s sometimes non-trivial to look up source code or function signatures (API docs, really) in the sidebar browser window on iOS. Unlike macOS, where I have a number of shortcut keys to launch and/or switch between apps, iOS lacks this capability, so I find myself slowed down by the swiping gestures needed to get where I need to be. (Cmd-Tab is the only exception: it activates the app switcher, but the number of apps the switcher remembers is limited.)

Pros

Even with the issues detailed above, there’s still much to love about doing mobile development work on an iOS device.

iOS Speed: On the latest hardware, iOS is speedy; even that is a bit of an understatement. I rarely see lag while typing in Blink or Juno, and when I do, I can usually pin it down to network latency rather than RAM issues.

Focus: This is the biggest win. iOS makes it difficult to switch contexts, which is actually an upside for work that involves creating things. Whether you’re drawing, producing videos, editing photos, writing blog posts, or writing code, the ability to focus on the task at hand is tremendous for producing high-quality work. Here, iOS is a winner for focused productivity.

Mobility: The next upside is battery life, and hence mobility by extension. My 12” MacBook is super mobile, yes, but macOS appears to have trouble restraining battery drain when the lid is closed; iOS has fewer issues with this. Those battery concerns mean I carry my mouse, charger, and dongle with me all the time, and I get the equivalent of range anxiety when I take only my laptop.

Keyboard Experience: The keyboard experience on the Smart Keyboard Folio is surprisingly good! It’s tactile, and it’s fully covered, so dust can’t get underneath the keys, an issue my little MacBook had.

Concluding Thoughts

This test has been quite instructive. As usual, tooling is superbly important for productivity, and investing in the right tools is worthwhile. Granted, none of this comes cheap or free. Given the direction iOS is heading, I think it’s shaping up to be a real contender for productivity!
