Maybe you don't need pandas
I don’t want to say that the pandas library is bad and should be avoided at all costs, but in my experience, its API can be confusing and it tends to get used as the Swiss army knife of Python programming because code snippets are copied from a thousand “data science” tutorials online. As a result, code that uses pandas often doesn’t do exactly what people think it does, performs worse than simpler alternatives, and bloats the application with unnecessary dependencies.
Below, I list a few alternatives for tasks for which I have seen pandas used without necessity.
Reading and writing CSV
If you just want to read or write CSV files, the standard library has got you covered. Instead of pandas.read_csv, you can use csv.DictReader and csv.DictWriter.
Here is the DictReader example copied from the docs:
import csv

with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
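Writing works analogously with csv.DictWriter. A minimal sketch; the file name and rows here are made up for illustration:

```python
import csv

# Rows to write; the field names double as the CSV header.
rows = [
    {'first_name': 'Ada', 'last_name': 'Lovelace'},
    {'first_name': 'Alan', 'last_name': 'Turing'},
]

with open('names.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['first_name', 'last_name'])
    writer.writeheader()  # emits the header row
    writer.writerows(rows)
```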
If you prefer to work with dataclasses, you could do something like this:
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional

@dataclass
class Name:
    first_name: Optional[str]
    last_name: Optional[str]

def read_csv(path: Path) -> Iterator[Name]:
    with open(path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield Name(**row)
Not only does this avoid the dependency, it also avoids loading the whole file into memory, because read_csv yields one row at a time!
Reading and writing Excel files
For reading or writing Excel files, have a look at the dependencies that pandas pulls in if you request the excel extra. You probably only need openpyxl.
Here I iterate over the rows of the active work sheet:
from openpyxl import load_workbook

wb = load_workbook(filename='excel_file.xlsx')
for row in wb.active.rows:
    print([cell.value for cell in row])
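Writing works just as well with openpyxl. A minimal sketch; the file name and cell contents are invented for illustration:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(['first_name', 'last_name'])  # header row
ws.append(['Ada', 'Lovelace'])
wb.save('excel_file.xlsx')
```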
Executing SQL
Pick an SQL client library for your database, for example, mysql-connector-python for MySQL or psycopg for PostgreSQL (both compliant with PEP 249), get yourself a connection and a cursor, and execute your queries. Just make sure to properly close your resources – psycopg has handy context managers for this starting from version 2.5!
import psycopg

with psycopg.connect(dsn) as conn:
    with conn.cursor() as cur:
        cur.execute(query)
        for record in cur:
            print(record)
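The same PEP 249 interface ships in the standard library as sqlite3, which is handy for trying things out. A sketch with a parameterized query; the table and values are invented for illustration:

```python
import sqlite3

# An in-memory database; the connection context manager commits on success.
with sqlite3.connect(':memory:') as conn:
    cur = conn.cursor()
    cur.execute('CREATE TABLE names (first_name TEXT, last_name TEXT)')
    # Pass values as parameters instead of formatting them into the SQL string.
    cur.execute('INSERT INTO names VALUES (?, ?)', ('Ada', 'Lovelace'))
    records = list(cur.execute('SELECT first_name, last_name FROM names'))
    print(records)
conn.close()  # sqlite3's context manager commits but does not close the connection
```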
If you want to work with an ORM, you can use a library like SQLAlchemy (although I'm personally not a big fan).
Dataframes
If you’re manipulating and calculating with dataframes and you’re comfortable with pandas, by all means, keep using pandas. Especially for explorative scripts, it’s absolutely fine.
On the other hand, if you need dataframes and have never worked with pandas, or if you’re dealing with large amounts of data, maybe give polars a try. It’s a Rust library that comes with Python bindings. The API looks nice, it offers lazy execution, and, according to some benchmarks, great performance.