Just python things
10 Apr 2026 22 minute read

DISCLAIMER: This article was initially written as a set of slides for educational workshops and presentations. It has changed over time and will likely continue to evolve.
A collection of Python features, quirks and gotchas, and how to use and avoid them.
Python Philosophy
Run import this in any Python interpreter and you get the Zen of Python, 19 aphorisms written by Tim Peters in 1999 as a kind of unofficial design philosophy for the language. They’re half serious, half tongue-in-cheek, and surprisingly useful as a mental checklist when you’re making design decisions.
You don’t need to memorise all 19. But a few of them come up constantly in code reviews, architectural discussions, and the kind of arguments that happen in pull request comments at 5pm on a Friday.
Explicit is better than implicit is the one I quote most. It’s the reason type hints are worth writing even though Python doesn’t enforce them. It’s the reason get_user_by_id(user_id=42) is better than get_user(42). It’s the reason a function called process is a red flag.
Errors should never pass silently is the one most frequently violated. The pattern of catching an exception, doing nothing with it, and carrying on is responsible for a truly staggering number of production incidents. If you catch an exception, do something with it: log it, re-raise it, or return a meaningful error. Don’t just pass.
There should be one obvious way to do it is the aspirational one. Python actually has multiple ways to do most things, which is both its greatest strength and the reason every Python codebase looks slightly different. The Zen is aspirational, not descriptive.
How Python Works
When you run a Python script, CPython (the reference implementation of Python, written in C) compiles your source code to bytecode, an intermediate representation, and then executes that bytecode on the Python Virtual Machine.
This is why Python has a reputation for being slower than compiled languages like C or Go - there’s an extra layer of interpretation at runtime. It’s also why Python is portable: the same .py file runs on Windows, macOS, and Linux without recompilation, as long as the interpreter is installed.
You’ll sometimes see .pyc files appearing in __pycache__ directories. Those are cached bytecode files Python generates so it doesn’t have to re-parse your source code every time. They’re safe to delete and will be regenerated automatically.
The dis module disassembles Python bytecode into human-readable instructions. You don’t need to understand every instruction, but it’s illuminating to look at occasionally as it tells you exactly what Python is doing when it runs your code.
LOAD_FAST loads a local variable onto the stack. Local variable lookups are faster than global lookups (LOAD_GLOBAL), which is one practical reason to avoid reaching for global state inside functions.
BINARY_OP handles the + operation. Because everything in Python is an object, Python has to look up the __add__ method on the left operand to find out how to perform the addition. This method lookup is part of why Python arithmetic is slower than native C arithmetic.
The Python VM is a stack machine, meaning it pushes values onto a stack, applies operations, and pops the result. This gives the interpreter a simple, platform-independent instruction set it can execute anywhere.
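You can see all of this for yourself. Here is a minimal sketch: disassembling a trivial (illustrative) function shows the load and add instructions described above.

```python
import dis

def add(a, b):
    return a + b

# On Python 3.11+ this prints LOAD_FAST for each argument,
# BINARY_OP for the +, and RETURN_VALUE at the end.
# (Older versions show BINARY_ADD instead of BINARY_OP.)
dis.dis(add)
```

Each printed line is one bytecode instruction the VM executes, in order.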
In statically typed languages like Java or C#, a function that takes a Duck only accepts Duck objects or subclasses. Python doesn’t care about the declared type. It only cares whether the object has the method or attribute you’re trying to use at the moment you use it. If it does, it works. If it doesn’t, you get an AttributeError at runtime.
This is enormously flexible. You can write a function that works with any object that has a .read() method so it can take file objects, StringIO, HTTP response bodies, custom objects, all without them sharing any common base class.
The downside is that you find out about type mismatches at runtime, not at development time. This is where type hints and tools like mypy or Pyright make a difference: you get the flexibility of duck typing and a little bit of the safety of type checking.
The practical guidance: if your function needs an object with a .write() method, document that. Don’t check isinstance(obj, File) unless you genuinely need to differentiate behaviour.
Let the duck be a duck.
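As an illustration of that guidance, this hypothetical count_lines function works with any object that has a .read() method, no shared base class required:

```python
import io

def count_lines(source) -> int:
    # duck typing: we only care that `source` has a .read() method
    return source.read().count("\n")

# works with an in-memory stream...
print(count_lines(io.StringIO("a\nb\nc\n")))  # 3
# ...and would work equally well with an open file or an HTTP response body
```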
Python 3.8 introduced Protocol as a way to formally describe duck typing contracts. You define what methods an object needs, and any class that implements them satisfies the protocol without needing to inherit from it, register with it, or even know it exists. This is called structural subtyping.
For shared utilities or library code, Protocol is worth knowing. It lets you be explicit about what a function needs from its arguments without locking callers into an inheritance hierarchy.
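A minimal sketch of a Protocol (the SupportsWrite and MemorySink names are illustrative):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsWrite(Protocol):
    def write(self, data: str) -> int: ...

class MemorySink:
    """Never inherits from SupportsWrite - it just has the right method."""
    def __init__(self):
        self.chunks = []
    def write(self, data: str) -> int:
        self.chunks.append(data)
        return len(data)

def save(out: SupportsWrite, text: str) -> None:
    out.write(text)

sink = MemorySink()
save(sink, "hello")
# structural, not nominal: isinstance works because of @runtime_checkable
print(isinstance(sink, SupportsWrite))  # True
```

A type checker like mypy or Pyright verifies the save() call statically, with no inheritance anywhere.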
The Global Interpreter Lock is a mutex that CPython uses to protect its internal data structures from concurrent access. It ensures only one thread runs Python bytecode at any given time, which sidesteps a class of memory corruption bug, but also means Python threads don’t give you true CPU parallelism.
This surprises people. You can create threads in Python. They work. But for CPU-bound work like heavy computations or data processing, multiple threads won’t use multiple CPU cores. They’ll take turns holding the GIL.
The GIL also doesn’t protect your own data structures. You can still have race conditions. results.append(n * n) is often safe in practice because append is fast. But just because it works most of the time doesn’t mean you should rely on it.
The practical rules:
- I/O-bound work (network, file, database): threads work fine - the GIL is released during I/O
- CPU-bound work: use multiprocessing - each process has its own interpreter and its own GIL
- High-concurrency I/O: use asyncio - cooperative, single-threaded, avoids the GIL entirely
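The I/O-bound case is easy to demonstrate. In this sketch, fetch is a hypothetical stand-in for a network call (time.sleep releases the GIL just as real I/O does), so ten 100ms "requests" overlap instead of queueing:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url: str) -> str:
    time.sleep(0.1)  # stand-in for a network call; sleeping releases the GIL
    return f"response from {url}"

urls = [f"https://example.com/{i}" for i in range(10)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

print(f"{len(results)} responses in {elapsed:.2f}s")  # ~0.1s: the waits overlap
```

Swap the sleep for a CPU-heavy loop and the threads would take turns holding the GIL, giving you no speedup at all.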
Python 3.13 introduced an experimental free-threaded build (no GIL). It’s not the default yet, but the direction is clear.
When you write x = [1, 2, 3], Python creates a list object and makes x point to it. The list’s reference count is 1. y = x bumps it to 2. del x brings it back to 1. del y brings it to 0, and Python deallocates the object immediately. This is deterministic and immediate, so no waiting for a garbage collection cycle.
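You can watch the reference count move with sys.getrefcount (note it always reports one more than you might expect, because the call itself holds a temporary reference to its argument):

```python
import sys

x = [1, 2, 3]
baseline = sys.getrefcount(x)
print(baseline)            # x, plus the temporary argument reference

y = x
print(sys.getrefcount(x))  # one higher: y points at the same list

del y
print(sys.getrefcount(x))  # back to the baseline
```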
The problem is cycles. If object A holds a reference to object B, and B holds a reference back to A, neither will ever hit zero, even if nothing else in the program can reach them. Python’s cyclic garbage collector periodically scans for these and cleans them up.
For most data work: just let Python manage memory. One practical point, though: if you’re holding a large DataFrame or array in memory and you’re done with it, del df or let it go out of scope. Don’t hold onto large objects longer than you need to.
Awesome Python Features
Default argument values are evaluated once, when the function is defined, not each time the function is called. The list [] is created once and attached to the function object. Every call that uses the default shares that same list.
This is not a bug. It’s consistent with how Python works: the function object is created at definition time, and its defaults are part of that object. But it regularly bites people who expect a fresh list on every call.
The fix is the standard Python idiom: use None as the default and create the mutable object inside the function.
The gotcha applies equally to dicts, sets, and any other mutable type used as a default. Immutable defaults like integers, strings, and tuples are safe because they can’t be modified in place.
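Both sides of the gotcha in one sketch (the function names are illustrative):

```python
def append_bad(item, items=[]):      # the [] is created ONCE, at definition time
    items.append(item)
    return items

append_bad(1)
print(append_bad(2))  # [1, 2] - both calls shared the same list

def append_good(item, items=None):   # the standard idiom
    if items is None:
        items = []                   # fresh list on every call that uses the default
    items.append(item)
    return items

append_good(1)
print(append_good(2))  # [2]
```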
*args captures extra positional arguments into a tuple. **kwargs captures extra keyword arguments into a dict. The names are convention - *things and **options work equally well. The * and ** are what matter.
You can also unpack when calling a function:
```python
def process(source, target, verbose=False):
    ...

config = {"source": "input.csv", "target": "output.csv", "verbose": True}
process(**config)  # same as process(source="input.csv", target="output.csv", verbose=True)

coords = [48.8566, 2.3522]
print(*coords)  # same as print(48.8566, 2.3522)
```
This is useful for forwarding configuration dicts to functions, writing wrappers that pass arguments through, and anywhere you want to avoid threading individual parameters through multiple layers.
List comprehensions build the entire result in memory before you can use any of it. For a million items, that’s a million objects sitting in RAM. If you’re going to iterate through them once and throw them away, you’re wasting memory.
Generator expressions are lazy. They produce one value at a time, on demand. The generator object itself is tiny regardless of the range. Values are computed only when something asks for the next one.
The syntax difference is just [] vs (). The behaviour difference can matter significantly when processing large datasets.
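The memory difference is easy to measure. The generator object stays tiny no matter how large the range, while the list scales with the number of items:

```python
import sys

squares_list = [n * n for n in range(1_000_000)]   # a million objects, built up front
squares_gen = (n * n for n in range(1_000_000))    # a tiny object, computes on demand

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a couple of hundred bytes, regardless of range

# generators feed straight into sum(), max(), any(), etc.
print(sum(n * n for n in range(10)))  # 285
```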
You can also write generator functions using yield:
```python
def read_chunks(filepath: str, size: int = 8192):
    with open(filepath, "rb") as f:
        while chunk := f.read(size):
            yield chunk

for chunk in read_chunks("bigfile.bin"):
    process(chunk)
```
The file is never fully loaded into memory. Each yield suspends the function, hands a chunk to the caller, and resumes when the next chunk is needed. This is the right pattern for large files, database result sets, and any sequence you don’t need all at once.
@timer is syntactic sugar for slow_query = timer(slow_query). The decorator receives the original function, wraps it in wrapper, and returns the wrapper. From that point on, slow_query is wrapper: calling it calls wrapper, which calls the original function inside.
Decorators are everywhere in Python. @property, @staticmethod, @classmethod are built-in. Flask’s @app.route(), FastAPI’s @app.get(), pytest’s @pytest.mark.parametrize - all decorators.
@wraps(func) copies the original function’s metadata (__name__, __doc__, __module__) onto the wrapper. Without it, debugging becomes painful: stack traces show wrapper instead of the actual function name, and help() shows nothing useful.
Always use it.
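A minimal version of the timer decorator described above might look like this (the slow_query body is an illustrative stand-in for a real database call):

```python
import time
from functools import wraps

def timer(func):
    @wraps(func)  # copy __name__, __doc__, __module__ onto the wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {elapsed:.3f}s")
        return result
    return wrapper

@timer                     # equivalent to: slow_query = timer(slow_query)
def slow_query():
    time.sleep(0.1)        # stand-in for a slow database call
    return "rows"

slow_query()
print(slow_query.__name__)  # 'slow_query', thanks to @wraps
```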
Context managers implement the __enter__ and __exit__ protocol. __enter__ runs when you enter the with block and returns the value bound to the as variable. __exit__ runs when the block exits, whether normally, by exception, or by return.
The key property is guaranteed cleanup. It doesn’t matter how the block exits. The file will be closed, the lock will be released, the database connection returned to the pool, the temporary directory deleted.
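A class-based sketch makes the protocol explicit; this hypothetical Timer records how long its block took, whether or not the block raised:

```python
import time

class Timer:
    def __enter__(self):
        self.start = time.perf_counter()
        return self  # bound to the `as` variable

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self.start
        return False  # don't suppress exceptions

with Timer() as t:
    sum(range(100_000))  # any work, fast or slow

print(f"{t.elapsed:.4f}s")
```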
Writing your own is straightforward with contextlib.contextmanager:
```python
from contextlib import contextmanager
import tempfile, shutil

@contextmanager
def temp_directory():
    path = tempfile.mkdtemp()
    try:
        yield path  # caller gets this value
    finally:
        shutil.rmtree(path)  # runs no matter what

with temp_directory() as tmpdir:
    process_files(tmpdir)
# directory deleted here
```
The yield suspends the function and hands control to the with block. When the block exits, execution resumes after yield, inside the finally. For data engineers: context managers are the right pattern for database connections, temporary files, and any resource that needs reliable cleanup.
Python Quirks and Gotchas
Floating-point numbers are stored in binary. Most decimal fractions, including 0.1, cannot be represented exactly in binary, just as 1/3 cannot be represented exactly in decimal. When you add two approximations together, the rounding errors accumulate.
This is not a Python problem. It happens in every language that uses IEEE 754 double precision. C, Java, JavaScript, R, all produce the same result.
What to do about it:
For comparisons, use math.isclose():
```python
import math
math.isclose(0.1 + 0.2, 0.3)  # True
```

For financial or precision-critical arithmetic, use decimal.Decimal:

```python
from decimal import Decimal
Decimal("0.1") + Decimal("0.2")  # Decimal('0.3')
```

Note that a float argument like Decimal(0.1) still gives you the floating-point approximation, while a string argument like Decimal("0.1") gives you exactly one-tenth.
The walrus operator (:=) is an assignment expression. Unlike regular assignment (=), it returns the value being assigned, which means you can embed it inside conditions, comprehensions, and other expressions.
The file-reading loop is the canonical example: you need to call f.read() once, check that the result is non-empty, then process it. Without walrus, you either call read() twice or use a while True with a break.
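A self-contained sketch of that loop, using an in-memory stream so it runs anywhere (checksum and the byte values are illustrative):

```python
import io

def checksum(stream, size: int = 4) -> int:
    total = 0
    # walrus: assign the chunk AND test it for truthiness in one expression;
    # read() returns b"" at end-of-stream, which is falsy, ending the loop
    while chunk := stream.read(size):
        total += sum(chunk)
    return total

print(checksum(io.BytesIO(bytes(range(10)))))  # 45
```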
Another good use: avoiding double evaluation in comprehensions:
```python
# compute expensive_transform once per row, keep only non-None results
results = [
    transformed
    for row in data
    if (transformed := expensive_transform(row)) is not None
]
```
Without walrus, you’d call expensive_transform twice, or write a longer loop.
When not to use it: don’t use walrus just because you can. If it makes the code harder to read, write the loop. Use it where it genuinely removes repetition.
async/await lets you write code that looks sequential but actually suspends at await points to let other work happen while waiting for I/O to complete.
The event loop runs one coroutine at a time. When a coroutine hits an await, it suspends and hands control back to the event loop, which can then run other coroutines. When the awaited operation completes (a network response arrives, a file read finishes), the coroutine is resumed.
This is different from threading. There’s no parallelism; one thing is running at any given moment. But for I/O-bound work you spend most of your time waiting anyway. asyncio.gather() lets you wait for multiple things simultaneously so 100 HTTP requests that each take 200ms take ~200ms total, not 20 seconds.
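The gather() claim in miniature: in this sketch, fetch is a hypothetical stand-in for an HTTP request (asyncio.sleep suspends just as a real network await does), so a hundred 100ms waits overlap:

```python
import asyncio
import time

async def fetch(i: int) -> str:
    await asyncio.sleep(0.1)  # stand-in for a 100ms HTTP request
    return f"response {i}"

async def main():
    # all 100 sleeps overlap: ~0.1s of wall time, not ~10s
    return await asyncio.gather(*(fetch(i) for i in range(100)))

start = time.perf_counter()
results = asyncio.run(main())
print(len(results), f"in {time.perf_counter() - start:.2f}s")
```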
The rules:
- async def defines a coroutine function
- await can only be used inside an async def
- You need async-compatible libraries (httpx, asyncpg, aiofiles) - regular requests or psycopg2 will block the event loop
- asyncio.run() starts the event loop - call it once at the top level
For data engineering: async is most useful for ingestion pipelines making many parallel API calls or database queries. For CPU-bound transformation work, it adds complexity without benefit; use multiprocessing there.
Package Management and Virtual Environments
When you install a package globally with pip, it goes into the system Python installation. Every project on your machine shares the same packages. The moment two projects need different versions of the same library, and they will, you have a conflict.
Virtual environments solve this by creating a lightweight, isolated Python environment per project. Each environment has its own site-packages directory. Installing into one doesn’t affect any other.
This also means you can pin your dependencies to exact versions, knowing the environment is reproducible. The same setup on a colleague’s machine, a CI server, or a production container installs the same versions.
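The standard-library workflow, sketched with venv (the .venv directory name is a common convention, not a requirement):

```shell
# create an isolated environment in ./.venv
python3 -m venv .venv

# the interpreter inside it has its own site-packages;
# sys.prefix confirms which environment you're in
.venv/bin/python -c "import sys; print(sys.prefix)"
# on Windows: .venv\Scripts\python instead of .venv/bin/python

rm -rf .venv  # clean up the demo environment
```

Activating the environment (source .venv/bin/activate) just puts that interpreter first on your PATH.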
pip freeze dumps every installed package and its exact version. pip install -r requirements.txt reproduces that environment. For small scripts and simple projects, this works fine.
The problems start when your project grows:
- pip freeze captures everything, including transitive dependencies you didn’t ask for. Your file grows to 80 packages when you only directly depend on 5.
- There’s no distinction between “I need this” and “this came along because something else needed it”.
- Upgrading is painful because you don’t know which packages are safe to update.
- There’s no built-in way to separate dev dependencies (pytest, black) from production ones.
requirements.txt is fine for quick scripts. For anything collaborative or production-facing, you want a better tool.
pyproject.toml is the modern way to declare a Python project. PEP 517 and 518 standardised it as the place for build system configuration, and it’s since become the home for project metadata, tool configuration (pytest, black, mypy, ruff), and dependency declarations.
The key improvement over requirements.txt is the distinction between what you depend on (declared in pyproject.toml with version ranges, human-maintained) and exactly what was installed (recorded in a lockfile, machine-generated). You commit both.
This means:
- You express intent (pandas>=2.0) rather than pinning everything manually
- Dev dependencies are explicit and separate
- Your package manager resolves the best compatible set and writes an exact lockfile
- Updating a dependency is a deliberate, tracked operation
Most modern Python tooling - Poetry, PDM, Hatch, and UV - centres on pyproject.toml.
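A minimal pyproject.toml along these lines (the project name is illustrative) keeps direct dependencies and dev tooling clearly separated:

```toml
[project]
name = "my-pipeline"              # illustrative project name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pandas>=2.0",                # intent: any compatible 2.x release
]

# PEP 735 dependency groups - dev tools that never ship to production
[dependency-groups]
dev = ["pytest", "ruff"]
```

Your package manager resolves this into a lockfile with exact pinned versions; you commit both files.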
When pip install runs, it contacts PyPI and downloads a distribution file. There are two main formats you’ll encounter.
Wheel (.whl) is a zip file with a specific naming convention: pandas-2.1.4-cp311-cp311-win_amd64.whl tells you the package name, version, Python version (cp311 = CPython 3.11), and platform (win_amd64). Wheels are pre-built - they contain compiled code ready to unpack directly into site-packages. Installation is fast because there’s no build step.
Source distributions (.tar.gz) contain the raw source code. When pip installs one, it first runs the build process - which may compile C extensions, generate code, or run other setup scripts. This is slower and can fail if you’re missing a C compiler or system headers. You’ll hit this most often with packages that have C extensions (like psycopg2 before the psycopg2-binary package existed).
The naming also matters for compatibility. A wheel tagged py3-none-any is pure Python and works everywhere. A wheel tagged cp311-cp311-win_amd64 only works on CPython 3.11 on Windows 64-bit. If pip can’t find a matching wheel, it falls back to the source distribution.
.egg is an older format from the setuptools era. You’ll still encounter it in legacy projects or old packages. It works, but the ecosystem has largely moved to wheels.
For most people this is all invisible. But it matters when:
- A package fails to install because it can’t build from source - look for a -binary variant or check if you need system libraries
- You’re packaging your own code - build wheels so your users don’t have to build from source
- You’re working offline or in an air-gapped environment - pip download lets you fetch wheels ahead of time
UV is a Python package manager written in Rust by Astral (the people behind Ruff). It is a huge improvement. I say this having used pip, conda, and pipenv over the years.
It’s fast. Dramatically faster than pip. An install that takes 30 seconds with pip takes 2 seconds with UV, because it parallelises downloads and has an efficient resolver.
It manages virtual environments automatically. uv run creates one if it doesn’t exist, installs the right dependencies, and runs your command without you having to remember to activate anything.
It generates a uv.lock file. Exact, reproducible, cross-platform. uv sync recreates the exact environment from the lockfile on any machine.
It works with or without pyproject.toml, for scripts as well as packages.
As of 2026 it’s become the default recommendation for new projects. If you’re starting something new, start with UV. If you’re on pip and requirements.txt, it migrates cleanly.