Just python things

10 Apr 2026
22 minute read

DISCLAIMER: This article was initially written as a set of slides for workshops and presentations, for educational use. It has changed over time and will likely continue to evolve.

A collection of Python features, quirks and gotchas, and how to use and avoid them.


Jan Eefting

Python Philosophy

The Zen of Python

>>> import this
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Readability counts.
Errors should never pass silently.
...

19 aphorisms. One import away.

Run import this in any Python interpreter and you get the Zen of Python, 19 aphorisms written by Tim Peters in 1999 as a kind of unofficial design philosophy for the language. They’re half serious, half tongue-in-cheek, and surprisingly useful as a mental checklist when you’re making design decisions.

You don’t need to memorise all 19. But a few of them come up constantly in code reviews, architectural discussions, and the kind of arguments that happen in pull request comments at 5pm on a Friday.

Explicit is better than implicit is the one I quote most. It’s the reason type hints are worth writing even though Python doesn’t enforce them. It’s the reason get_user_by_id(user_id=42) is better than get_user(42). It’s the reason a function called process is a red flag.

Errors should never pass silently is the one most frequently violated. The pattern of catching an exception, doing nothing with it, and carrying on is responsible for a truly staggering number of production incidents. If you catch an exception, do something with it: log it, re-raise it, or return a meaningful error. Don’t just pass.
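A minimal sketch of the difference, using a hypothetical parse_int helper:

```python
import logging

logger = logging.getLogger(__name__)

def parse_int_bad(value):
    # Anti-pattern: the error vanishes, the caller gets None with no explanation
    try:
        return int(value)
    except ValueError:
        pass

def parse_int_good(value, default=None):
    # The failure is logged and the fallback is explicit
    try:
        return int(value)
    except ValueError:
        logger.warning("could not parse %r as int, using default %r", value, default)
        return default

parse_int_bad("abc")          # silently returns None
parse_int_good("abc", 0)      # logs a warning, returns 0
```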

There should be one obvious way to do it is the aspirational one. Python actually has multiple ways to do most things, which is both its greatest strength and the reason every Python codebase looks slightly different. The Zen is aspirational, not descriptive.

The ones that matter most

  • Explicit is better than implicit - name your variables, use type hints, don’t be clever
  • Readability counts - code is read far more than it’s written
  • Errors should never pass silently - unless explicitly silenced
  • In the face of ambiguity, refuse the temptation to guess
  • There should be one obvious way to do it - though that way may not be obvious at first

These aren’t rules enforced by the language. They’re the reason Python code written by different people tends to look similar.

How Python Works

Interpreted language

Python source code is not compiled to machine code directly.

your_script.py  ->  bytecode (.pyc)  ->  Python VM  ->  execution

The CPython interpreter handles all of this at runtime.

When you run a Python script, CPython, the reference implementation of Python, written in C, compiles your source code to bytecode, an intermediate representation, and then executes that bytecode on the Python Virtual Machine.

This is why Python has a reputation for being slower than compiled languages like C or Go - there’s an extra layer of interpretation at runtime. It’s also why Python is portable: the same .py file runs on Windows, macOS, and Linux without recompilation, as long as the interpreter is installed.

You’ll sometimes see .pyc files appearing in __pycache__ directories. Those are cached bytecode files Python generates so it doesn’t have to re-parse your source code every time. They’re safe to delete and will be regenerated automatically.

CPython, PyPy, and friends

  • CPython is what you almost certainly have installed. The reference implementation, maintained by the Python Software Foundation.
  • PyPy is an alternative interpreter with a JIT compiler. Can be significantly faster for long-running CPU-bound work. Less compatible with C extensions like NumPy.
  • Others: Jython (JVM), IronPython (.NET), MicroPython (embedded) - each exists for a specific use case.

For data work: CPython + NumPy/Pandas is the standard. The C extensions compensate for CPython’s interpreted overhead.

From source to bytecode

The dis module lets you inspect what Python actually compiles your code to.

import dis

def add(a, b):
    return a + b

dis.dis(add)
  2           RESUME                   0
  3           LOAD_FAST                0 (a)
              LOAD_FAST                1 (b)
              BINARY_OP                0 (+)
              RETURN_VALUE

The dis module disassembles Python bytecode into human-readable instructions. You don’t need to understand every instruction, but it’s illuminating to look at occasionally as it tells you exactly what Python is doing when it runs your code.

LOAD_FAST loads a local variable onto the stack. Local variable lookups are faster than global lookups (LOAD_GLOBAL), which is one practical reason to avoid reaching for global state inside functions.

BINARY_OP handles the + operation. Because everything in Python is an object, Python has to look up the __add__ method on the left operand to find out how to perform the addition. This method lookup is part of why Python arithmetic is slower than native C arithmetic.

The Python VM is a stack machine: it pushes values onto a stack, applies operations, and pops the result. This gives a simple, platform-independent instruction set that the interpreter can execute anywhere.

Duck typing

“If it walks like a duck and quacks like a duck, it’s a duck.”

class Dog:
    def quack(self):
        print("Woof!")

class Duck:
    def quack(self):
        print("Quack!")

def make_it_quack(thing):
    thing.quack()

make_it_quack(Dog())   # Woof!
make_it_quack(Duck())  # Quack!

Python doesn’t check the type. It checks whether the behaviour exists.

In statically typed languages like Java or C#, a function that takes a Duck only accepts Duck objects or subclasses. Python doesn’t care about the declared type. It only cares whether the object has the method or attribute you’re trying to use at the moment you use it. If it does, it works. If it doesn’t, you get an AttributeError at runtime.

This is enormously flexible. You can write a function that works with any object that has a .read() method so it can take file objects, StringIO, HTTP response bodies, custom objects, all without them sharing any common base class.

The downside is that you find out about type mismatches at runtime, not at development time. This is where type hints and tools like mypy or Pyright make a difference: you get the flexibility of duck typing and a little bit of the safety of type checking.

The practical guidance: if your function needs an object with a .write() method, document that. Don’t check isinstance(obj, File) unless you genuinely need to differentiate behaviour.

Let the duck be a duck.

Duck typing - Protocols

from typing import Protocol

class Quackable(Protocol):
    def quack(self) -> None: ...

def make_it_quack(thing: Quackable) -> None:
    thing.quack()

Protocol formalises duck typing. No inheritance needed: any class with a matching quack() method satisfies the protocol. Static type checkers understand this.

Python 3.8 introduced Protocol as a way to formally describe duck typing contracts. You define what methods an object needs, and any class that implements them satisfies the protocol without needing to have inherited it, registered with it, or even knowing it exists. This is called structural subtyping.

For shared utilities or library code, Protocol is worth knowing. It lets you be explicit about what a function needs from its arguments without locking callers into an inheritance hierarchy.
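As a sketch of how far this goes: marking the protocol @runtime_checkable also lets you use it in isinstance() checks (which verify method presence only, not signatures):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Quackable(Protocol):
    def quack(self) -> None: ...

class Duck:
    def quack(self) -> None:
        print("Quack!")

class Brick:
    pass

isinstance(Duck(), Quackable)   # True - Duck has a quack() method
isinstance(Brick(), Quackable)  # False - no quack() anywhere
```

Duck never mentions Quackable, yet it satisfies the protocol purely by structure.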

The Global Interpreter Lock (GIL)

Only one thread can execute Python bytecode at a time.

import threading

results = []

def square(n):
    results.append(n * n)  # looks thread-safe, is not

threads = [threading.Thread(target=square, args=(i,)) for i in range(10)]
for t in threads: t.start()
for t in threads: t.join()

The GIL protects the interpreter internals. It does not protect your data structures.

The Global Interpreter Lock is a mutex that CPython uses to protect its internal data structures from concurrent access. It ensures only one thread runs Python bytecode at any given time, which sidesteps a class of memory corruption bug, but also means Python threads don’t give you true CPU parallelism.

This surprises people. You can create threads in Python. They work. But for CPU-bound work like heavy computations or data processing, multiple threads won’t use multiple CPU cores. They’ll take turns holding the GIL.

The GIL also doesn’t protect your own data structures. You can still have race conditions. results.append(n * n) happens to be safe because list.append is a single atomic operation under the GIL, but compound operations like counter += 1 (a read, an add, and a write back) are not. Just because something works most of the time doesn’t mean you should rely on it.
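Where the update is not a single atomic step, a threading.Lock makes the critical section explicit. A minimal sketch:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # read-modify-write is now one atomic unit
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(counter)  # 40000 - without the lock, the result can come up short
```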

The practical rules:

  • I/O-bound work (network, file, database): threads work fine - the GIL is released during I/O
  • CPU-bound work: use multiprocessing - each process has its own interpreter and its own GIL
  • High-concurrency I/O: use asyncio - cooperative, single-threaded, avoids GIL entirely

Python 3.13 introduced an experimental free-threaded build (no GIL). It’s not the default yet, but the direction is clear.
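The I/O rule is easy to see with time.sleep standing in for a blocking network call - like real I/O, it releases the GIL while waiting. A small sketch:

```python
import threading
import time

def wait():
    time.sleep(0.2)   # GIL released while sleeping, like a blocking network call

start = time.perf_counter()
threads = [threading.Thread(target=wait) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s")  # roughly 0.2s total, not 1.0s - the waits overlap
```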

Threading vs Multiprocessing vs Asyncio

                     Threading    Multiprocessing           Asyncio
Use when             I/O-bound    CPU-bound                 I/O-bound, high concurrency
True parallelism?    No (GIL)     Yes                       No (single thread)
Memory               Shared       Separate                  Shared
Overhead             Low          High (process startup)    Very low
Complexity           Medium       Medium                    Higher

There’s no universally correct choice. Know the trade-offs.

Memory management and garbage collection

Python manages memory automatically. Two mechanisms:

  1. Reference counting - every object tracks how many references point to it
  2. Cyclic garbage collector - cleans up reference cycles that counting can’t handle
x = [1, 2, 3]   # reference count: 1
y = x            # reference count: 2
del x            # reference count: 1
del y            # reference count: 0 - object freed immediately

When you write x = [1, 2, 3], Python creates a list object and makes x point to it. The list’s reference count is 1. y = x bumps it to 2. del x brings it back to 1. del y brings it to 0, and Python deallocates the object immediately. This is deterministic and immediate, so no waiting for a garbage collection cycle.

The problem is cycles. If object A holds a reference to object B, and B holds a reference back to A, neither will ever hit zero, even if nothing else in the programme can reach them. Python’s cyclic garbage collector periodically scans for these and cleans them up.

For most data work: just let Python manage memory. One practical point, though: if you’re holding a large DataFrame or array in memory and you’re done with it, del df and let it go out of scope. Don’t hold onto large objects longer than you need to.

Memory - practical tools

import gc
import sys

df = load_large_dataset()
process(df)

del df           # drop the reference - object may be freed
gc.collect()     # force collection of any cycles

# Check reference count
x = [1, 2, 3]
sys.getrefcount(x)  # 2 here - getrefcount's own argument adds a temporary reference
  • del removes a reference, not necessarily the object
  • The object is freed when its reference count reaches 0
  • gc.collect() is rarely needed - but useful in memory-sensitive loops

Awesome Python Features

Mutable default arguments

One of the most famous Python gotchas. It is consistent behaviour though, once you understand it.

def append_to(element, to=[]):
    to.append(element)
    return to

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2]  <- surprise
print(append_to(3))  # [1, 2, 3]  <- now you're upset

Default argument values are evaluated once, when the function is defined, not each time the function is called. The list [] is created once and attached to the function object. Every call that uses the default shares that same list.

This is not a bug. It’s consistent with how Python works: the function object is created at definition time, and its defaults are part of that object. But it regularly bites people who expect a fresh list on every call.

The fix is the standard Python idiom: use None as the default and create the mutable object inside the function.

The gotcha applies equally to dicts, sets, and any other mutable type as a default. Immutable defaults like integers, strings, tuples, are safe because they can’t be modified in place.

Mutable defaults - the fix

# Wrong
def append_to(element, to=[]):
    to.append(element)
    return to

# Right
def append_to(element, to=None):
    if to is None:
        to = []
    to.append(element)
    return to

Rule: Never use a mutable type (list, dict, set) as a default argument. Use None and create it inside.
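The shared default is not hidden, either - it lives on the function object, where you can inspect it:

```python
def append_to(element, to=[]):
    to.append(element)
    return to

print(append_to.__defaults__)  # ([],) - the default list, created at definition time

append_to(1)
append_to(2)
print(append_to.__defaults__)  # ([1, 2],) - the same list, now mutated
```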

*args and **kwargs - unpacking

def log(level: str, *args, **kwargs) -> None:
    message = " ".join(str(a) for a in args)
    context = ", ".join(f"{k}={v}" for k, v in kwargs.items())
    print(f"[{level}] {message} | {context}")

log("ERROR", "connection", "failed", host="db01", port=5432)
# [ERROR] connection failed | host=db01, port=5432

*args captures extra positional arguments into a tuple. **kwargs captures extra keyword arguments into a dict. The names are convention - *things and **options work equally well. The * and ** are what matter.

You can also unpack when calling a function:

def process(source, target, verbose=False):
    ...

config = {"source": "input.csv", "target": "output.csv", "verbose": True}
process(**config)  # same as process(source="input.csv", target="output.csv", verbose=True)

coords = [48.8566, 2.3522]
print(*coords)     # same as print(48.8566, 2.3522)

This is useful for forwarding configuration dicts to functions, writing wrappers that pass arguments through, and anywhere you want to avoid threading individual parameters through multiple layers.
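A sketch of the forwarding pattern, with a hypothetical query function: the wrapper overrides one default and passes everything else through untouched:

```python
def query(sql, *, timeout=30, retries=3):
    # stand-in for a real database call
    return {"sql": sql, "timeout": timeout, "retries": retries}

def patient_query(sql, *args, **kwargs):
    # forward everything, overriding only the timeout default
    kwargs.setdefault("timeout", 300)
    return query(sql, *args, **kwargs)

patient_query("SELECT 1")              # timeout=300 (wrapper default)
patient_query("SELECT 1", timeout=5)   # timeout=5 (caller wins)
```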

Positional-only and keyword-only parameters

def connect(host, port, /, *, timeout=30, retries=3):
    ...

connect("localhost", 5432)                  # OK
connect("localhost", 5432, timeout=10)      # OK
connect(host="localhost", port=5432)        # TypeError - / means positional only
connect("localhost", 5432, 10)              # TypeError - * means keyword only after
  • / marks everything before it as positional-only
  • * marks everything after it as keyword-only
  • Useful in library APIs: you can rename positional params later without breaking callers
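A sketch of that last point: because callers cannot pass positional-only parameters by keyword, renaming them is invisible to callers. The two versions below are hypothetical:

```python
# Version 1
def connect(host, port, /, *, timeout=30):
    return f"{host}:{port} (timeout={timeout})"

# Version 2 - parameters renamed, and no caller can notice
def connect(hostname, portnumber, /, *, timeout=30):
    return f"{hostname}:{portnumber} (timeout={timeout})"

connect("localhost", 5432)  # valid against both versions
```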

List comprehensions and generator expressions

# List comprehension - builds the whole list in memory
squares = [x ** 2 for x in range(1_000_000)]

# Generator expression - computes one value at a time
squares = (x ** 2 for x in range(1_000_000))

print(sum(squares))  # both give the same answer - generator uses a fraction of the memory

List comprehensions build the entire result in memory before you can use any of it. For a million items, that’s a million objects sitting in RAM. If you’re going to iterate through them once and throw them away, you’re wasting memory.

Generator expressions are lazy. They produce one value at a time, on demand. The generator object itself is tiny regardless of the range. Values are computed only when something asks for the next one.

The syntax difference is just [] vs (). The behaviour difference can matter significantly when processing large datasets.
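The gap is easy to measure with sys.getsizeof - the generator object stays tiny regardless of the range:

```python
import sys

squares_list = [x ** 2 for x in range(1_000_000)]
squares_gen = (x ** 2 for x in range(1_000_000))

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a couple of hundred bytes

# getsizeof reports the container only, not the items - the real
# gap is even larger once each int object is counted
```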

You can also write generator functions using yield:

def read_chunks(filepath: str, size: int = 8192):
    with open(filepath, "rb") as f:
        while chunk := f.read(size):
            yield chunk

for chunk in read_chunks("bigfile.bin"):
    process(chunk)

The file is never fully loaded into memory. Each yield suspends the function, hands a chunk to the caller, and resumes when the next chunk is needed. This is the right pattern for large files, database result sets, and any sequence you don’t need all at once.

When to use generators

  • Reading large files line by line
  • Database queries with large result sets
  • Data pipelines where each stage processes one item at a time
  • Any time you’d otherwise load everything into memory just to iterate once
def read_csv_rows(path: str):
    with open(path) as f:
        header = next(f).strip().split(",")
        for line in f:
            yield dict(zip(header, line.strip().split(",")))

active = (row for row in read_csv_rows("users.csv") if row["status"] == "active")

No file is fully loaded. Each row is produced and consumed on demand.

Decorators

A decorator is a function that wraps another function.

import time
from functools import wraps

def timer(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.3f}s")
        return result
    return wrapper

@timer
def slow_query(n):
    return sum(range(n))

slow_query(10_000_000)  # slow_query took 0.412s

@timer is syntactic sugar for slow_query = timer(slow_query). The decorator receives the original function, wraps it in wrapper, and returns the wrapper. From that point on, slow_query is wrapper: calling it calls wrapper, which calls the original function inside.

Decorators are everywhere in Python. @property, @staticmethod, @classmethod are built-in. Flask’s @app.route(), FastAPI’s @app.get(), pytest’s @pytest.mark.parametrize - all decorators.

@wraps(func) copies the original function’s metadata (__name__, __doc__, __module__) onto the wrapper. Without it, debugging becomes painful: stack traces show wrapper instead of the actual function name, and help() shows nothing useful.

Always use it.
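A side-by-side sketch of what @wraps preserves, using two hypothetical no-op decorators:

```python
from functools import wraps

def without_wraps(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def with_wraps(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@without_wraps
def load_users():
    """Load all users."""

@with_wraps
def load_orders():
    """Load all orders."""

print(load_users.__name__)   # wrapper      - metadata lost
print(load_orders.__name__)  # load_orders  - metadata preserved
print(load_orders.__doc__)   # Load all orders.
```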

Decorators with arguments

from functools import wraps

def retry(times=3, exceptions=(Exception,)):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == times - 1:
                        raise
                    print(f"Attempt {attempt + 1} failed: {e}. Retrying...")
        return wrapper
    return decorator

@retry(times=5, exceptions=(ConnectionError, TimeoutError))
def fetch_data(url: str) -> dict:
    ...

A decorator factory: a function that takes arguments and returns a decorator.

Context managers

with open("data.csv") as f:
    data = f.read()
# file is closed here - even if an exception occurred

The with statement guarantees cleanup. Always.

Context managers implement the __enter__ and __exit__ protocol. __enter__ runs when you enter the with block and returns the value bound to the as variable. __exit__ runs when the block exits, whether normally, by exception, or by return.

The key property is guaranteed cleanup. It doesn’t matter how the block exits. The file will be closed, the lock will be released, the database connection returned to the pool, the temporary directory deleted.
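The protocol can also be implemented directly as a class. A sketch of a hypothetical Timer context manager:

```python
import time

class Timer:
    def __enter__(self):
        self.start = time.perf_counter()
        return self                          # bound to the `as` variable

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        return False                         # don't swallow exceptions

with Timer() as t:
    sum(range(1_000_000))

print(f"{t.elapsed:.4f}s")
```

__exit__ runs even if the block raises, so elapsed is always recorded; returning False lets the exception propagate.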

Writing your own is straightforward with contextlib.contextmanager:

from contextlib import contextmanager
import tempfile, shutil

@contextmanager
def temp_directory():
    path = tempfile.mkdtemp()
    try:
        yield path          # caller gets this value
    finally:
        shutil.rmtree(path) # runs no matter what

with temp_directory() as tmpdir:
    process_files(tmpdir)
# directory deleted here

The yield suspends the function and hands control to the with block. When the block exits, execution resumes after yield, inside the finally. For data engineers: context managers are the right pattern for database connections, temporary files, and any resource that needs reliable cleanup.

Context managers - common patterns

# Database connection
with db.connect() as conn:
    result = conn.execute("SELECT * FROM users")

# Multiple context managers (Python 3.10+)
with (
    open("input.csv") as infile,
    open("output.csv", "w") as outfile,
):
    process(infile, outfile)

# Suppressing specific exceptions
import os
from contextlib import suppress
with suppress(FileNotFoundError):
    os.remove("temp.txt")
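One more pattern worth knowing: when the number of resources is only known at runtime, contextlib.ExitStack manages them all and releases them in reverse order. A sketch with temporary files:

```python
from contextlib import ExitStack
import tempfile
import os

# create a few temporary files to work with
paths = []
for _ in range(3):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    paths.append(path)

with ExitStack() as stack:
    # each file is registered with the stack as it's opened
    files = [stack.enter_context(open(p)) for p in paths]
    # ... work with all files ...
# every file is closed here - even if opening the third one had failed

for p in paths:
    os.remove(p)
```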

Python Quirks and Gotchas

Floating point arithmetic

>>> 0.1 + 0.2
0.30000000000000004
>>> 0.1 + 0.2 == 0.3
False

Not a Python bug. An IEEE 754 problem.

Floating-point numbers are stored in binary. Most decimal fractions, including 0.1, cannot be represented exactly in binary, just as 1/3 cannot be represented exactly in decimal. When you add two approximations together, the rounding errors accumulate.

This is not a Python problem. It happens in every language that uses IEEE 754 double precision. C, Java, JavaScript, R, all produce the same result.

What to do about it:

For comparisons, use math.isclose():

import math
math.isclose(0.1 + 0.2, 0.3)  # True

For financial or precision-critical arithmetic, use decimal.Decimal:

from decimal import Decimal
Decimal("0.1") + Decimal("0.2")  # Decimal('0.3')

Note: a non-string argument like Decimal(0.1) still gives you the floating-point approximation. A string argument like Decimal("0.1") gives you exactly one-tenth.

Floating point - what to do

import math
from decimal import Decimal

# Approximate comparison - good for most cases
math.isclose(0.1 + 0.2, 0.3)                        # True
math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)          # configurable tolerance

# Exact decimal arithmetic - good for money, measurements
Decimal("0.10") + Decimal("0.20")                    # Decimal('0.30')
Decimal("1.23") * Decimal("1.05")                    # Decimal('1.2915')

# What NOT to do
0.1 + 0.2 == 0.3       # False
round(0.1 + 0.2, 1) == 0.3  # True here - but unreliable in general
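For money specifically, Decimal.quantize pins results to a fixed number of decimal places with an explicit rounding mode - something plain float rounding can get wrong:

```python
from decimal import Decimal, ROUND_HALF_UP

price = Decimal("2.675")
price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)  # Decimal('2.68')

# the float equivalent rounds the other way, because 2.675 is
# actually stored as 2.67499999...:
round(2.675, 2)  # 2.67
```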

The walrus operator :=

Python 3.8+

# Before - calls read() in two places, or uses a flag
chunk = f.read(8192)
while chunk:
    process(chunk)
    chunk = f.read(8192)

# With walrus - assign and test in one expression
while chunk := f.read(8192):
    process(chunk)

Assign a value and use it in the same expression.

The walrus operator (:=) is an assignment expression. Unlike regular assignment (=), it returns the value being assigned, which means you can embed it inside conditions, comprehensions, and other expressions.

The file-reading loop is the canonical example: you need to call f.read(), check that the result is non-empty, then process it. Without walrus, you either call read() in two places or use a while True with a break.

Another good use: avoiding double evaluation in comprehensions:

# compute expensive_transform once per row, keep only non-None results
results = [
    transformed
    for row in data
    if (transformed := expensive_transform(row)) is not None
]

Without walrus, you’d call expensive_transform twice, or write a longer loop.

When not to use it: don’t use walrus just because you can. If it makes the code harder to read, write the loop. Use it where it genuinely removes repetition.

Walrus - good uses

# File reading
while chunk := f.read(8192):
    process(chunk)

# Avoiding double evaluation in comprehensions
results = [t for row in data if (t := transform(row)) is not None]

# Capturing a regex match
import re
if match := re.search(r"\d+", text):
    print(match.group())

async / await

Concurrency without threads. One event loop, cooperative multitasking.

import asyncio
import httpx

async def fetch(url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        return response.text

async def main():
    urls = ["https://api.example.com/1", "https://api.example.com/2"]
    results = await asyncio.gather(*[fetch(url) for url in urls])
    return results

asyncio.run(main())

async/await lets you write code that looks sequential but actually suspends at await points to let other work happen while waiting for I/O to complete.

The event loop runs one coroutine at a time. When a coroutine hits an await, it suspends and hands control back to the event loop, which can then run other coroutines. When the awaited operation completes (a network response arrives, a file read finishes), the coroutine is resumed.

This is different from threading. There’s no parallelism; one thing is running at any given moment. But for I/O-bound work you spend most of your time waiting anyway. asyncio.gather() lets you wait for multiple things simultaneously: 100 HTTP requests that each take 200ms take ~200ms total, not 20 seconds.
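That timing claim is easy to verify with asyncio.sleep standing in for network latency (standard library only, no HTTP involved):

```python
import asyncio
import time

async def fake_request(i):
    await asyncio.sleep(0.2)    # stand-in for a 200ms network round trip
    return i

async def main():
    # launch all ten "requests" and wait for them together
    return await asyncio.gather(*[fake_request(i) for i in range(10)])

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(f"{len(results)} requests in {elapsed:.2f}s")  # ~0.2s, not 2s
```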

The rules:

  • async def defines a coroutine function
  • await can only be used inside an async def
  • You need async-compatible libraries (httpx, asyncpg, aiofiles) - regular requests or psycopg2 will block the event loop
  • asyncio.run() starts the event loop - call it once at the top level

For data engineering: async is most useful for ingestion pipelines making many parallel API calls or database queries. For CPU-bound transformation work it adds complexity without benefit - use multiprocessing there.

async - the mental model

Thread model:              Async model:
  Thread 1 ========          Coroutine A ==[waiting]==============
  Thread 2 ========                         Coroutine B ==[wait]==
  Thread 3 ========                                  Coroutine C =
  (parallel, GIL limited)    (one thread, interleaved, cooperative)
  • Threading: OS switches between threads. Good for I/O, limited by GIL for CPU.
  • Async: you decide when to yield (at await). Very low overhead, no GIL issues.
  • Neither replaces multiprocessing for CPU-bound work.

Package Management and Virtual Environments

Why virtual environments?

# Without virtual environments:
pip install pandas==1.5.0   # project A needs this
pip install pandas==2.1.0   # project B needs this - project A is now broken

Virtual environments give each project its own isolated Python installation and package set.

When you install a package globally with pip, it goes into the system Python installation. Every project on your machine shares the same packages. The moment two projects need different versions of the same library, and they will, you have a conflict.

Virtual environments solve this by creating a lightweight, isolated Python environment per project. Each environment has its own site-packages directory. Installing into one doesn’t affect any other.

This also means you can pin your dependencies to exact versions, knowing the environment is reproducible. The same setup on a colleague’s machine, a CI server, or a production container installs the same versions.

Creating and using a virtual environment

# Create
python -m venv .venv

# Activate (Windows)
.venv\Scripts\activate

# Activate (macOS/Linux)
source .venv/bin/activate

# Install freely - only affects this environment
pip install pandas requests

# Deactivate
deactivate

The .venv directory contains the environment. Put it in .gitignore - never commit it.

pip and requirements.txt

# Capture current environment
pip freeze > requirements.txt

# Reproduce it elsewhere
pip install -r requirements.txt
# requirements.txt
pandas==2.1.4
numpy==1.26.2
requests==2.31.0

pip freeze dumps every installed package and its exact version. pip install -r requirements.txt reproduces that environment. For small scripts and simple projects, this works fine.

The problems start when your project grows:

  • pip freeze captures everything, including transitive dependencies you didn’t ask for. Your file grows to 80 packages when you only directly depend on 5.
  • There’s no distinction between “I need this” and “this came along because something else needed it”.
  • Upgrading is painful because you don’t know which packages are safe to update.
  • No built-in way to separate dev dependencies (pytest, black) from production ones.

requirements.txt is fine for quick scripts. For anything collaborative or production-facing, you want a better tool.

pyproject.toml - the modern standard

PEP 517 / 518

[project]
name = "my-project"
version = "1.0.0"
requires-python = ">=3.11"
dependencies = [
    "pandas>=2.0",
    "requests>=2.28",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "black>=23.0",
    "mypy>=1.0",
]

pyproject.toml is the modern way to declare a Python project. PEP 517 and 518 standardised it as the place for build system configuration, and it’s since become the home for project metadata, tool configuration (pytest, black, mypy, ruff), and dependency declarations.

The key improvement over requirements.txt is the distinction between what you depend on (declared in pyproject.toml with version ranges, human-maintained) and exactly what was installed (recorded in a lockfile, machine-generated). You commit both.

This means:

  • You express intent (pandas>=2.0) rather than pinning everything manually
  • Dev dependencies are explicit and separate
  • Your package manager resolves the best compatible set and writes an exact lockfile
  • Updating a dependency is a deliberate, tracked operation

Most modern Python tooling - Poetry, PDM, Hatch, and UV - centres on pyproject.toml.

Tar, wheel, zip, and other packaging formats

When you pip install, what exactly is being downloaded?

Format        Extension   Type     What it is
Wheel         .whl        Binary   Pre-built, installs fast. The modern standard.
Source dist   .tar.gz     Source   Raw source code. Needs building on install.
Legacy        .egg        Binary   Old format. You’ll still see it. Ignore it.
pip install pandas          # downloads a .whl if available, .tar.gz fallback
pip download pandas         # saves the file - lets you see what you got

When pip install runs, it contacts PyPI and downloads a distribution file. There are two main formats you’ll encounter.

Wheel (.whl) is a zip file with a specific naming convention: pandas-2.1.4-cp311-cp311-win_amd64.whl tells you the package name, version, Python version (cp311 = CPython 3.11), and platform (win_amd64). Wheels are pre-built - they contain compiled code ready to unpack directly into site-packages. Installation is fast because there’s no build step.

Source distributions (.tar.gz) contain the raw source code. When pip installs one, it first runs the build process - which may compile C extensions, generate code, or run other setup scripts. This is slower and can fail if you’re missing a C compiler or system headers. You’ll hit this most often with packages that have C extensions (like psycopg2 before the psycopg2-binary package existed).

The naming also matters for compatibility. A wheel tagged py3-none-any is pure Python and works everywhere. A wheel tagged cp311-cp311-win_amd64 only works on CPython 3.11 on Windows 64-bit. If pip can’t find a matching wheel, it falls back to the source distribution.

.egg is an older format from the setuptools era. You’ll still encounter it in legacy projects or old packages. It works, but the ecosystem has largely moved to wheels.

For most people this is all invisible. But it matters when:

  • A package fails to install because it can’t build from source - look for a -binary variant or check if you need system libraries
  • You’re packaging your own code - build wheels so your users don’t have to build from source
  • You’re working offline or in an air-gapped environment - pip download lets you fetch wheels ahead of time

UV - package management that doesn’t make you want to quit

Rust-based. Extremely fast. Replaces pip, pip-tools, virtualenv, and more.

# Create a project
uv init my-project
cd my-project

# Add a dependency
uv add pandas

# Add a dev dependency
uv add --dev pytest black

# Run a script in the project environment
uv run python main.py

# Reproduce the exact environment from the lockfile
uv sync

UV is a Python package manager written in Rust by Astral (the people behind Ruff). It is a huge improvement. I say this having used pip, conda, and pipenv over the years.

It’s fast. Dramatically faster than pip. An install that takes 30 seconds with pip takes 2 seconds with UV, because it parallelises downloads and has an efficient resolver.

It manages virtual environments automatically. uv run creates one if it doesn’t exist, installs the right dependencies, and runs your command without you having to remember to activate anything.

It generates a uv.lock file. Exact, reproducible, cross-platform. uv sync recreates the exact environment from the lockfile on any machine.

It works with or without pyproject.toml, for scripts as well as packages.

As of 2026 it’s become the default recommendation for new projects. If you’re starting something new, start with UV. If you’re on pip and requirements.txt, it migrates cleanly.

UV - the cheat sheet

uv init my-project          # new project
uv add pandas               # add dependency
uv add --dev pytest black   # add dev dependency
uv remove requests          # remove dependency
uv sync                     # recreate env from lockfile
uv run pytest               # run in project env
uv run python script.py     # run script in project env
uv lock                     # update lockfile
uv pip install ...          # drop-in for pip when needed

uv.lock goes in version control. .venv does not.

Thanks for stopping by!

Questions? Reach out!

Get in touch →