Maybe you don't need pandas

2024-02-01

I don’t want to say that the pandas library is bad and should be avoided at all costs, but in my experience, its API can be confusing and it tends to get used as the Swiss army knife of Python programming because code snippets are copied from a thousand “data science” tutorials online. As a result, code that uses pandas often doesn’t do exactly what people think it does, performs worse than simpler alternatives, and bloats the application with unnecessary dependencies.

In the following, I list a few alternatives for tasks that I have seen pandas being used for without necessity.

Reading and writing CSV

If you just want to read or write CSV files, the standard library has got you covered. Instead of pandas.read_csv, you can use csv.DictReader and csv.DictWriter.

Here is the DictReader example copied from the docs.

import csv

with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])

If you prefer to work with dataclasses, you could do something like this:

import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class Name:
    first_name: Optional[str]
    last_name: Optional[str]

def read_csv(path: Path) -> Iterator[Name]:
    with open(path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield Name(**row)

Not only do you avoid dependencies with this, but you avoid loading the whole file into memory!

Reading and writing Excel files

For reading or writing Excel files, you can have a look at the dependencies that pandas pulls in if you request the excel extra. You probably only need openpyxl.

Here I iterate over the rows of the active work sheet:

from openpyxl import load_workbook

wb = load_workbook(filename = 'excel_file.xlsx')
for row in wb.active.rows:
    print([cell.value for cell in row])

Executing SQL

Pick an SQL client lib for your database, for example, mysql-connector-python for MySQL or psycopg for PostgreSQL (both compliant with PEP 249), get yourself a connection and a cursor, and execute your queries. Just make sure to properly close your resources – psycopg has handy context managers for this starting from version 2.5!

import psycopg

with psycopg.connect(dsn) as conn:
    with conn.cursor() as cur:
        cur.execute(query)
        for record in cur:
            print(record)

If you want to work with an ORM, you can use a library like SQLAlchemy (although I’m personally not a big fan).

Dataframes

If you’re manipulating and calculating with dataframes and you’re comfortable with pandas, by all means, keep using pandas. Especially for explorative scripts, it’s absolutely fine.

On the other hand, if you need dataframes and have never worked with pandas, or if you’re dealing with large amounts of data, maybe give polars a try. It’s a Rust package that comes with Python bindings. The API looks nice, it offers lazy execution, and great performance, according to some benchmarks.