Automation investment and payoff

Posted on 2023-02-11 in story • 4 min read

I often find myself considering whether or not to automate a particular work process. In retrospect, I believe this comes primarily from some experiences very early in my career, as well as an overall lifelong interest in computing and software development.

Recently, I had a pleasant experience with the time invested in automating a process relative to the time needed to see a return on that investment.

Pitfalls

It’s easy to fall into the trap of applying automation to any task, particularly if it involves a lot of data. At the outset, it seems obvious that this approach will ultimately be more efficient. However, that’s rarely the case.

Part of the problem is that you need to understand a process fully before you can automate it efficiently. Many times, what seems to be a straightforward process ends up having twists, turns, and edge cases along the way.

Most importantly, if a task is only going to happen a few times - as opposed to filling an ongoing, repetitive need - then it is a terrible candidate for automation.

Obligatory xkcd reference

Of course, one of the premier geek references also addresses this conundrum with a handy reference table:

xkcd comic 1205 - Is It Worth the Time?

The Problem Statement

The business problem at the heart of this story relates to data published regularly to a website. The website is public and is updated monthly with the data of interest. This data is published as a series of PDF files, with approximately ten files - sometimes more, sometimes less - posted in any given month.

Automated processes for extracting the data were already in place, but the downloads from the website were still manual. When I was keeping up with the downloads every month, things weren’t too bad. However, in the last two years the downloads fell by the wayside and piled up until the end of the year.

When you’re dealing with an entire year’s worth of data, downloading 100+ files manually in one sitting is not nearly as tolerable.

The Solution

Python, of course! Looking purely at the URL patterns for the site, it seemed obvious that the front end was backed by a REST API that returned data according to the query parameters passed in the URL. Reviewing the network traces in the browser confirmed this hunch and also provided the schema for the data returned by the API.

The trickiest part was teasing out the unique ID required in the request for the PDF. Once that was in place, I developed a few routines using the wonderful requests package, and fun was had by all.
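
The details of the site and its API aren’t reproduced here, but the routines ended up looking roughly like the sketch below. The base URL, endpoint paths, query parameters, and field names are all placeholders I’ve made up for illustration; the real values came from the network traces mentioned above.

import requests

# Hypothetical endpoints -- the real site's URLs and schema aren't shown here.
API_URL = "https://example.com/api/documents"
PDF_URL = "https://example.com/api/documents/{doc_id}/pdf"

def list_documents(year, month):
    """Ask the (hypothetical) REST API for the documents posted in a given month."""
    response = requests.get(API_URL, params={"year": year, "month": month})
    response.raise_for_status()
    # Assume the API returns a JSON list of records, each carrying the unique ID
    # needed to request the PDF itself.
    return response.json()

def download_pdf(doc_id, destination):
    """Fetch a single PDF by its unique ID and write it to disk."""
    response = requests.get(PDF_URL.format(doc_id=doc_id))
    response.raise_for_status()
    with open(destination, "wb") as fh:
        fh.write(response.content)

for record in list_documents(2022, 12):
    download_pdf(record["id"], f"{record['id']}.pdf")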

The Calculation

So now, for the good part. As mentioned, this turned out to be the rare instance where the investment in automation paid off immediately. Let’s crunch the numbers. To do so, we’ll use the pint library to track our various denominations of time.

We’ll also define a unit pdf equal to 100 Joules. The scalar amount is arbitrary, but the SI unit of work was selected because in the abstract sense, I want to accomplish a work task.

In [1]:
import pint

ureg = pint.UnitRegistry()
ureg.define("pdf_file = 100 * Joule = pdf")
q = ureg.Quantity(10, "pdf")
q
Out[1]:
10 pdf_file

It turns out that there were 112 total pdf files to be downloaded to cover the entire current year. From timestamps on the files for the previous year, I saw that it took me 34 minutes to download 24 files when I was doing it manually.

In [2]:
manual_files = ureg.Quantity(24, "pdf")
manual_time = ureg.Quantity(34, "minutes")
manual_time = manual_time.to("seconds")
manual_rate = manual_time / manual_files

print(f"Manual rate is {manual_rate}.")
Manual rate is 85.0 second / pdf_file.

After I had developed the automated process, I was able to download the 112 files in 25 minutes.

In [3]:
automated_files = ureg.Quantity(112, "pdf")
automated_time = ureg.Quantity(25, "minutes").to("seconds")
automated_rate = automated_time / automated_files

print(f"Automated rate is {automated_rate}.")
Automated rate is 13.392857142857142 second / pdf_file.

So now we can calculate the savings realized by developing the automated process, which took 1 hour and 45 minutes in total.

In [4]:
development_time = ureg.Quantity(1, "hour") + ureg.Quantity(45, "minutes")

unit_rate_difference = manual_rate - automated_rate  # both are already in second / pdf_file
print(f"Unit rate difference is {unit_rate_difference}")

gross_time_savings = (unit_rate_difference * automated_files)
print(f"Gross time savings is {gross_time_savings.to('minutes')}")

net_time_savings = gross_time_savings - development_time
print(f"Net time savings is {net_time_savings.to('minutes')}")
Unit rate difference is 71.60714285714286 second / pdf_file
Gross time savings is 133.66666666666666 minute
Net time savings is 28.666666666666668 minute
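
As a quick follow-on check in the same pint session, we can also ask how many files it takes just to recoup the development time at the per-file savings computed above. A minimal sketch, using the quantities already defined:

# Continuing the session above: the number of files needed just to recoup the
# development time at the per-file savings computed earlier.
break_even = (development_time / unit_rate_difference).to("pdf")
# With the numbers above this comes out to roughly 88 files -- comfortably fewer
# than the 112 files in the backlog, which is why the payback was immediate.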

So not only did the investment in automation pay itself back immediately, but it also provided an extra half hour of “free” time. Python for the win!

Conclusion

This exercise turned out to be the rare instance where developing automation pays for itself immediately. Even better, this investment will continue to pay itself back over the coming months and years as more data is posted and needs to be extracted, transformed, and loaded.

Why was this a success? As mentioned, one of the pitfalls of automating a process too soon is that you don’t yet understand all the ‘gotchas’. In this case, I had already spent multiple years doing these tasks manually.

Secondly, the development went quickly due to my understanding of web development and REST APIs in general. All of my previous dalliances in web scraping had focused on using Selenium and a headless browser to simulate a user clicking links and navigating through pages. Now that I have a good deal of experience with full-stack development, I was able to shortcut the user interface entirely and go straight to the data exposed by the API.