Caching Long-Running Function Results

written by Eric J. Ma on 2019-10-18

python tips optimization packages

I found this nifty tool for caching the results of long-running functions: cachier. This is useful when we’re building, say, Python applications for which quick interactions are necessary, or for caching the results of a long database query.

How do we use it? Basically it’s nothing more than a decorator!

Let’s imagine I have a long-running function as below.

def long_running_function(arg1, arg2):
    # stuff happens
    return result

It turns out that if you need to cache the result in a lightweight fashion, you can simply add cachier:

from cachier import cachier

@cachier()
def long_running_function(arg1, arg2):
    # stuff happens
    return result

By default, the result is stored on disk in your home directory (under ~/.cachier/), so the cache persists between sessions and is accessible to you.

One nice thing cachier also offers is the ability to set a time duration after which the cache goes stale. This is useful when you know the cache needs periodic refreshing, such as when new data added to a database would make a cached query result stale. This is done by specifying the stale_after keyword argument:

from cachier import cachier
from datetime import timedelta

# Re-cache result after 1 week.
@cachier(stale_after=timedelta(weeks=1))
def long_running_function(arg1, arg2):
    # stuff happens
    return result

If you need to reset the cache manually, you can always do:

long_running_function.clear_cache()
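Under the hood, this is memoization. If you want a feel for the pattern without installing cachier, the standard library's functools.lru_cache does the same thing in memory (though, unlike cachier, the cache does not persist to disk; the function and timings below are just an illustration):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_square(x):
    # stand-in for an expensive computation
    time.sleep(0.1)
    return x * x

t0 = time.time()
slow_square(12)  # first call: actually computes
first_call = time.time() - t0

t0 = time.time()
slow_square(12)  # second call: served from the in-memory cache
second_call = time.time() - t0

print(second_call < first_call)
```

cachier's value-add over this is the on-disk (and optionally MongoDB-backed) cache, which survives restarts of your application.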

There are other advanced features that cachier provides, and so I’d encourage you to go and take a look at it!

Did you enjoy this blog post? Let's discuss more!


Jupyter Server with HTTPS on Personal Server

written by Eric J. Ma on 2019-10-05

jupyter dataops devops data science

Recording this for myself, since I did it once and probably don't have the brain bandwidth to remember this through repetition.

I have known how to run a "public" Jupyter server (password-protected, naturally), but one thing I've struggled with was getting HTTPS working.

Turns out, the Let's Encrypt instructions in Jupyter's docs aren't that bad. I was just ignorant in the past, and didn't know enough about Linux to get this working right.

The key here is creating a Let's Encrypt certificate and making sure its file permissions are set correctly.

First off, go to the Certbot page. Select the type of website you're running and operating system. For Jupyter, I chose "None of the Above" and "Ubuntu 18.04 LTS (bionic)" (even though I'm technically on Ubuntu 19). (Here's a shortcut link to the instructions if you're in the same situation.)

On my system (Ubuntu-based), I used the following commands to install certbot:

# Add repository
sudo apt-get update
sudo apt-get install software-properties-common
sudo add-apt-repository universe
sudo add-apt-repository ppa:certbot/certbot
sudo apt-get update

# Install certbot
sudo apt-get install certbot

# Run certbot
sudo certbot certonly --standalone

Follow the instructions. certbot will install the certificate files into a protected directory. In my case, it was /etc/letsencrypt/live/<mywebsite>/.

Here, a problem shows up. That directory is not readable by a Jupyter server run under a user other than root. But a desirable property of running Jupyter servers is that we don't have to use sudo to run them. How can we solve this? By making sure that the certificate is readable by a non-root user.

What I did, then, was to copy the files that certbot created into a location under my home directory. (For security by obscurity, I'm naturally not revealing that location.) Then, I changed ownership of those files to my username:

# copy the certbot-created files into a directory you own
sudo cp /etc/letsencrypt/live/<mywebsite>/*.pem /absolute/path/to/your/certificate/
cd /absolute/path/to/your/certificate/
sudo chown <myusername> *.pem  # change ownership of the copies

Finally, I went into my Jupyter config (the standard ~/.jupyter/jupyter_notebook_config.py) and edited the two lines that specify the "certfile" and the "keyfile":

c.NotebookApp.certfile = u'/absolute/path/to/your/certificate/mycert.pem'
c.NotebookApp.keyfile = u'/absolute/path/to/your/certificate/mykey.key'
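If you'd rather not edit the config file, Jupyter's docs for running a public server also document passing the same options as flags at launch time (paths here are the same placeholders as above):

```shell
jupyter notebook --certfile=/absolute/path/to/your/certificate/mycert.pem \
    --keyfile=/absolute/path/to/your/certificate/mykey.key
```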

If this helps you, leave me a note in the comments below. :)



Multiple Coin-Flips vs. One Coin Flip Generalized?

written by Eric J. Ma on 2019-10-05

teaching bayesian statistics

Do people learn better by:

  1. seeing many apparently distinct examples of the same idea, or
  2. seeing one example taken progressively deeper?

I think both are needed, but I am also torn sometimes about whether it's more effective to communicate using the former or the latter.

Case in point: In teaching Bayesian statistics, the coin flip is a particular case of the Beta-Binomial model. However, the Beta-Binomial model can be taken from its most elementary form (estimation on one group) through to its most sophisticated form (hierarchically estimating p).
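For concreteness, the most elementary form is a one-group conjugate update: a Beta prior on p plus binomial data gives a Beta posterior. A minimal sketch (the prior and data values here are made up):

```python
# Conjugate Beta-Binomial update for a single group:
# prior Beta(a, b) + data (h heads in n flips) -> posterior Beta(a + h, b + n - h)
a, b = 1, 1        # uniform prior on p
heads, n = 7, 10   # observed data

a_post, b_post = a + heads, b + (n - heads)
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)  # 8/12, about 0.667
```

The hierarchical version replaces the fixed (a, b) with learned hyperpriors shared across groups, which is where the pedagogical depth comes in.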

I guess if the goal is to show how broadly applicable a given model class (i.e. the Beta-Binomial model) is, a teacher would elect to jump between multiple examples that are apparently distinct. However, if the goal is to build depth (i.e. going from single-group to multiple-group estimation), sticking with one example (classically, baseball players) would be the better strategy.

Both are needed, just at different times. Thinking through this example gives me a first-principles way of deciding which approach to go for.



Dokku: Building an internal Heroku at work

written by Eric J. Ma on 2019-09-07

data apps data science devops deployment

At work, we don't have a service with the simplicity of Heroku. Part of it is that we're still behind the commercial and FOSS offerings that are freely available in my life outside work, and cybersecurity tends to be a gatekeeper against the adoption of new things, which is a reality I have to face at work.

BUT! I am unwilling to simply bow down to this scenario. “There’s got to be a better way.”

What does that mean? It means if we want a Heroku-like thing internally, we have to hack together workarounds.

Enter Dokku!

What is it? It’s a FOSS implementation of the functionality that Heroku provides. It’s only slightly more involved than Heroku, and gives us a really nice taste of what’s possible with Heroku.

Dokku claims to be the “smallest PaaS implementation you’ve ever seen”, and I fully believe it. The maintainers have done a wonderful thing, making the installation process as simple and clean as possible. I’ve successfully installed it on a bare DigitalOcean droplet and on my home Linux tower. I’ve also successfully installed it in EC2 instances at work, albeit needing a few minor modifications to the script they provide.

Why would I want to use Dokku?

Taking Dokku on my DigitalOcean droplet as an example, what it effectively provides is a self-hosted Heroku.

This means you can get 95% of the convenience that Heroku offers, except done in-house. This can be handy if you’ve got cybersecurity standing in the way of awesome convenience, or if finance isn’t willing to shell out the moolah.

What can we do with Dokku?

Here’s a few neat things that we can do.

  1. We can provision a database to run on the same compute node as the app, and then link them together. If your compute node is “beefy” enough (RAM/CPU/storage-wise) to handle both the database and the app (and I mean, I’m confident that most disposable apps aren’t going to be at a large scale), then it can be pretty handy, because it means we save on latency.
  2. We can deploy apps using either Heroku buildpacks (which look for Procfiles) or using Dockerfiles. Docker containers can be easier to maintain if we have a large and/or complex conda environment, in my opinion, as we can reuse the existing environment spec, but Procfiles are much nicer for smaller projects. This fits with the paradigm of “declaring what you need”, rather than “programming what you need”.
  3. Because Dokku is managing everything through isolated Docker containers, we can actually enter into a Docker container and muck around to debug, without worrying about breaking the broader system. I realize now how neat it is to have containerization, but without a unified front-end interface to manage the containers, networking interfaces, and environment variables, it’s tough to keep everything straight. Dokku provides that front-end interface.
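To make point 2 concrete, deploying to a Dokku host looks much like deploying to Heroku: create the app on the host, add the host as a git remote, and push. A sketch (dokku.example.com and the app name are placeholders for your own setup):

```shell
# on the Dokku host: create the app
dokku apps:create minimal-panel

# on your machine, from the project repo:
git remote add dokku dokku@dokku.example.com:minimal-panel
git push dokku main  # or master, depending on your default branch
```

Dokku detects a Dockerfile or a Procfile/buildpack in the pushed repo and builds accordingly.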

What are you deploying right now?

On my DigitalOcean box, which I use for personal projects, I have deployed both the “Getting Started” Ruby app that Heroku provides as well as a minimal app showcasing a minimal dashboard using Panel.

The easy part was getting Dokku up and running. The hard part, though, was getting URLs and DNS right. It took some debugging to get that working correctly.

In particular, Dokku uses a concept called virtual hosts (VHOSTS) to route from the Dokku host to individual containers. For example, to get minimal-panel.ericmjl.com up and running correctly, I had to ensure that *.ericmjl.com was routed to my DigitalOcean box.

How have we used this at work?

At work, I just finished prototyping the use of Dokku on EC2. In particular, I was able to deploy both Dockerfile-based and Procfile-based projects. Once again, getting a domain was the most troublesome part of this project; spinning up an EC2 instance and configuring it became easy using a simple Bash script which we executed on each test spin-up machine.
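For reference, a Procfile-based deploy needs little more than a one-line Procfile declaring the web process. A hypothetical example for a Panel dashboard (app.py is an assumed filename; $PORT is supplied by the platform):

```
web: panel serve app.py --address 0.0.0.0 --port $PORT
```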

What changes between Heroku and Dokku?

The biggest thing I found is that I need to at least have SSH access to the compute box that is running Dokku. This is because what we would usually configure through Heroku’s web interface (e.g. environment variables) is instead configured through Dokku’s command-line interface over SSH. Hence, not being afraid of the CLI is important.

What’s your verdict?

If you know Heroku, Dokku gets you 95% of the convenience you’re used to, plus quite a bit more flexibility to customize it to your own compute environment.



How to be a great code sprinter

written by Eric J. Ma on 2019-07-29

software development sprint code sprint open source community community development

This blog post is the second in a series of two on participating in code sprints. The first one is here. In this post, I will write about how sprinters themselves can also help contribute to a positive sprint experience all around.

Read the docs to understand the scope of the project

As a sprinter, you may have preconceived notions about what a project is about. It helps to have an accurate view of what a project is and isn’t about. This is oftentimes best accomplished by reading the project’s documentation, assuming the docs are well-written. Doing so can help you align what you think should be done on the project with what the package maintainer sees as its priorities.

Be ready to make documentation contributions of any scale

Documentation is oftentimes the hardest thing for a package maintainer to write, because it entails slowing down to a beginner’s speed (an unnatural speed by that point), while recognizing one’s own blind spots about where a beginner would stumble (also challenging to do).

If you are a newcomer sprinter, focusing on the sections of the docs that pertain to “processes” (e.g. getting a development environment set up), slowly working through them, and documenting what’s missing can go a long way toward helping other newcomers get set up as well. Anything the maintainer leaves out may need to be made explicit, and you can help make it clear!

Package maintainers, and prior contributors, are human. That means errors in language may have inadvertently crept into the package’s documentation. Any small patch that fixes the docs, including even small typographical errors, can be very helpful for improving documentation quality.

Don’t be afraid to ask questions...

You will find that asking questions can really accelerate your own progress on the project. This is important for getting unstuck, wherever you might be stuck.

...but also try to keep your questions to the non-obvious things.

That said, asking overly simple questions that a quick Google query could answer is likely to steal time and attention away from other sprinters who might have more substantial questions at hand.

A pet peeve of mine is being asked questions that are answered in the docs. Asking these questions of the maintainer doesn’t reflect positively on you, the sprinter. Whether or not you intend it, what often gets communicated to the maintainer is carelessness and a lack of attention to detail, the opposites of qualities that are generally good to possess and project.

There’s a pretty broad balance point between the two, so don’t feel inhibited by fear of not hitting a precise balance between searching the docs and asking questions.

For any feature requests, try to be ready with a proposed implementation

This one I find very important. Having a proposed implementation on hand for a thing that you think should be in the library goes a long way to helping the package maintainer (or other contributors) see what exactly you’re trying to accomplish with that feature. Having a sketch on-hand makes it much easier for the package maintainer to say “yes” to the new feature, and having written the documentation and a proposed suite of tests for that new feature makes it even easier.

If you aren’t able to propose an implementation, then raising an inquiry rather than a request makes a world of difference in how a package maintainer perceives the communication of the issue at hand.

As an example:

  1. “You should add support for X.” (a request)
  2. “Would support for X fit the scope of this project? Here’s a sketch of how it might work.” (an inquiry with a proposed implementation)

The latter is more thoughtful, and communicates much less of a sense of entitlement on the part of the sprinter.

We’re all building mental maps of each other’s knowledge

When two colleagues meet for the first time, each has to build a mental model of the other’s strengths. At a sprint, the package maintainer has to do this multiplied by however many people are sprinting.

If they are making an effort to map your skills against theirs, they may be very verbose, asking lots of questions to clarify what you do and don’t know. It pays to be patient here.

If they don’t have the bandwidth to do so (and this is a charitable description for some maintainers), then they may be glossing over detail. Rather than being stuck, it pays to interrupt them gently and clarify. (Taking notes is a very good way of communicating that you’re treating this process seriously too!)

Give your sprint leader sufficient context

As mentioned above, the sprint leader will oftentimes be context-switching from person to person. It’s mentally exhausting, so spoon-feeding a bit more context (such as the thing you’re working on), condensing your question to the essentials, and asking it precisely can go a long way toward helping your sprint leader help you better.
