I’ve heard a lot of positive buzz around the Pandas library for basic data processing with Python. After a few weeks of use, I can confidently say that it is a dramatically better tool than Excel for data-driven tasks and workflows.
Basics of software architecture
Typically when building a software solution, it is important to maintain separation of concerns. This makes the solution more modular and easier to adapt to changing inputs and requirements. In practice, this separation results in divisions between the Data Layer, Business Logic, and User Interface. In Excel, these layers are all mushed together, and it is difficult to isolate particular functionality - especially the processing and business rules that are heaviest in terms of formulas, functions, and algorithms.
One major source of errors when working in Excel is the application of formulas for calculated columns. A typical worksheet usually ends up in a table-like structure, with columns of data alternating with calculated fields based upon that data. However, the worksheet interface makes it very difficult to tell which values are data and which are calculated. Worse, formulas that should almost always be applied uniformly to each row of the table must instead be updated and managed manually by the user. More often than not, Excel workbooks that I’ve found in the wild had multiple issues where a formula was updated for a subset of the table but not copied uniformly throughout. Even worse, it’s very common for a user to unknowingly overwrite a calculated field with discrete data. Unless the end user checks the formula in every cell, there’s no way to catch this sort of critical error hiding in the weeds.
This issue can be mitigated in part by writing custom user functions. This helps in that the formula now lives in a single spot and only needs to be changed in one location to update all calculated instances to the correct algorithm. However, the other issue remains: the calculation can still be silently overwritten with data, and the user interface offers no warning. Worse, writing a custom function means the logic moves from a spreadsheet formula into a VBA macro. As soon as you make this leap, you give up the primary benefit of Excel - namely its ubiquitous install base and user familiarity. In other words, the vast majority of users are comfortable auditing and updating formulas in cell calculations. That percentage likely drops to single digits when we talk about a similar comfort level with reviewing and updating VBA code in macros.
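For contrast, here is how the same idea looks in Pandas: the "formula" for a calculated column is a single vectorized expression, wrapped in one named Python function, and it applies to every row at once - there is no cell-by-cell copying to forget or partially update. The table, column names, and numbers below are purely illustrative, not from any real workbook:

```python
import pandas as pd

# Hypothetical quantity/cost table, standing in for an Excel sheet.
df = pd.DataFrame({
    "item": ["pipe", "manhole", "inlet"],
    "qty": [120, 4, 9],
    "unit_cost": [35.0, 2500.0, 900.0],
})

def extended_cost(frame: pd.DataFrame) -> pd.Series:
    # The calculation lives in exactly one place; changing it here
    # changes it for every row, uniformly.
    return frame["qty"] * frame["unit_cost"]

df["extended_cost"] = extended_cost(df)
```

Because the column is assigned in code rather than typed into cells, it is obvious on inspection which columns are raw data and which are derived.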
Additionally, VBA macro code stays with the workbook it was written in. It is very challenging to manage VBA code in a central location and deploy it across multiple installations of the “software”. This has been a constant Achilles’ heel in my experience: a macro or series of macros is developed in a “template”, then tested and rolled out for widespread use. Invariably, a bug or a request for additional functionality is raised. That code change then needs to be rolled out uniformly across all of the workbooks currently in production. But this is much easier said than done, because those *.xlsm workbooks are now spread across multiple network share folders, email attachments, FTP sites, …
Pandas to the rescue
The name “Pandas” is an abbreviation of sorts for Panel Data. The concept of panel data will be very familiar to Excel users - it is essentially a table of rows and columns. By default, the rows are numbered (starting at 0) and the columns are labeled with headings or field names. Let’s take a look at common tasks in Excel and see how these are easier and more robust when using Pandas.
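As a minimal sketch of that structure (the column names and values here are invented for illustration):

```python
import pandas as pd

# A tiny DataFrame: rows are numbered automatically starting at 0,
# and columns carry labels, much like a header row in Excel.
df = pd.DataFrame({"point": [101, 102], "elevation": [820.5, 821.1]})

print(df.index.tolist())    # → [0, 1]  (default row numbering)
print(df.columns.tolist())  # → ['point', 'elevation']  (field names)
```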
Import / Export CSV
Comma separated values (CSV) are very common for exchanging data between systems or applications. A very common use case in my industry is topographical survey data. When a project is being planned, a field crew will visit the site and use precision measurement tools to record the existing condition of the site. This data is then typically delivered as a plain text file with comma-separated values. Typically the data will come as hundreds or thousands of data points with field headings such as point number, northing, easting, elevation, point code, and description. This corresponds to attributed 3D geometry that can then be used to develop an existing map of the topography, as well as build a Digital Terrain Model (DTM) that can be used for analysis and design.
When processing CSV in Excel, my observation is that users will typically copy and paste from the text file directly into Excel - because this is easiest. But this represents the first deviation from the desired architecture with separation of concerns. The data is coming from an external source, and Murphy’s Law dictates that it will change and/or be appended in the future. Managing updates, particularly when the size (number of rows) changes, should be the responsibility of the software, not the user. I’ve never had much luck setting this up in Excel. My understanding is that this process would be done via links that auto-update, and formulas would (maybe?) auto-update and copy down for new rows as needed. Again, this might be user error on my part, but I tend to be fairly tech-savvy and I’ve never had much luck getting this set up correctly. With Pandas, it’s super-simple and is oriented towards maximizing the possibility of falling into success. In other words, the easiest way is the “right” way:
Point to a file, configure a header row, change headers if necessary, drop columns on input, and drop rows if values are missing.
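Those steps can be sketched with `pandas.read_csv` in a few lines. The survey file contents below are made up, and an in-memory buffer stands in for the file path you would pass in practice:

```python
import io
import pandas as pd

# Simulated survey CSV; in practice this would be a path like "survey.csv".
raw = io.StringIO(
    "pt,northing,easting,elev,code,desc\n"
    "1,5000.10,2000.25,820.50,GS,ground shot\n"
    "2,5001.40,2001.75,,GS,ground shot\n"
    "3,5002.90,2003.10,821.30,TC,top of curb\n"
)

df = pd.read_csv(
    raw,
    header=0,                                       # first line is the header row
    usecols=["pt", "northing", "easting", "elev"],  # drop columns on input
).rename(columns={"pt": "point", "elev": "elevation"})  # change headers

df = df.dropna(subset=["elevation"])  # drop rows with missing values
```

When the source file grows or shrinks, nothing changes: re-running the same script handles however many rows arrive, which is exactly the update problem the spreadsheet version pushes onto the user.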
What a breath of fresh air! Many design engineers spend the majority of their time poring over spreadsheets, chasing stale links, or bemoaning formulas that didn’t get copied correctly across all rows in a given column. I’ve certainly had my share of those moments and that shared pain. I know lots of folks who have a warm place in their heart for Excel, and I can understand that. For me personally, there’s no looking back.