Overview

Profile before optimizing. Most Python performance problems are algorithmic (O(n^2) where O(n log n) exists), I/O-bound (waiting, not computing), or caused by unnecessary object allocation. Reach for the profiler before adding numpy, multiprocessing, or Cython. See python for the project baseline; this page covers the optimization sequence.

Profile with cProfile and py-spy before changing code

Never guess where the bottleneck is. Run cProfile for whole-program deterministic profiling; use py-spy to attach to a running process without restarting it.

# cProfile: deterministic profiler, per-function call counts and cumulative time
python -m cProfile -o profile.out myscript.py
python -m pstats profile.out
 
# py-spy: low-overhead sampling profiler, works on running processes
uv run py-spy record -o profile.svg --pid <PID>

Look at the top 5 functions by cumulative time. Fix the worst one. Re-profile. Repeat. Do not optimize functions outside the top 5; the return is negligible.
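
To pull that top-5 view out of the saved profile programmatically rather than through the interactive browser, pstats can sort and truncate the report; a minimal sketch, assuming the profile.out file produced above:

import pstats
 
# Show the five worst offenders by cumulative time from the cProfile output
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)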

Fix the algorithm before tuning the implementation

An O(n^2) loop is not fixed by rewriting it in Cython. Replace it with an O(n) or O(n log n) algorithm first.

# Bad: O(n^2) membership test
result = [x for x in big_list if x in other_list]
 
# Good: O(n) after converting to a set
other_set = set(other_list)
result = [x for x in big_list if x in other_set]

Common improvements: use dict and set for membership and lookup, use collections.Counter for frequency counts, use heapq for priority queues, use bisect for sorted-list insertion.
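
A rough illustration of those stdlib tools; the values are placeholders:

from collections import Counter
import bisect
import heapq
 
counts = Counter(["a", "b", "a"])          # frequency counts: Counter({'a': 2, 'b': 1})
top_two = heapq.nlargest(2, [5, 1, 9, 3])  # priority-queue helper: [9, 5]
sorted_vals = [1, 3, 7]
bisect.insort(sorted_vals, 5)              # keeps the list sorted: [1, 3, 5, 7]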

Vectorize numerical work with numpy

Pure Python loops over numbers are slow because each integer is a boxed object. numpy operates on unboxed C arrays and hands linear algebra off to BLAS/LAPACK.

import numpy as np
 
# Bad: pure Python loop
def dot_slow(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))
 
# Good: numpy vectorized
def dot_fast(a: np.ndarray, b: np.ndarray) -> np.float64:
    return np.dot(a, b)

Prefer numpy array operations over element-wise loops. Avoid np.vectorize; it is not faster than a Python loop, just shorter to write.
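
The same pattern applies to conditional element-wise work, which collapses into a single array call; a sketch with illustrative random data:

import numpy as np
 
values = np.random.rand(1_000_000)
 
# Bad: Python-level loop over boxed floats
clipped_slow = [v if v < 0.5 else 0.5 for v in values]
 
# Good: one vectorized call, the loop runs in C
clipped_fast = np.minimum(values, 0.5)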

Understand the GIL and choose the right concurrency model

The GIL (Global Interpreter Lock) prevents two Python threads from executing bytecode simultaneously. This means:

  • I/O-bound work: threads work because the GIL is released while waiting on I/O.
  • CPU-bound work: threads do not parallelize; use multiprocessing or ProcessPoolExecutor.
  • Network/async I/O: use asyncio (single thread, cooperative scheduling, no GIL contention).

from concurrent.futures import ProcessPoolExecutor
import math
 
def compute(n: int) -> int:
    return math.factorial(n)
 
# CPU-bound: use processes, not threads; the __main__ guard is required under the spawn start method
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(compute, range(1000, 1100)))

Python 3.13's experimental free-threaded build disables the GIL; it is not production-ready as of 2026-05-14. Check the current status before relying on it.
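
If you do experiment with it, you can verify at runtime whether the GIL is actually off; a sketch, assuming a CPython 3.13+ interpreter (the helper does not exist on earlier versions):

import sys
 
# sys._is_gil_enabled() exists on CPython 3.13+; it returns False when the GIL is disabled
gil_check = getattr(sys, "_is_gil_enabled", None)
print(gil_check() if gil_check else "GIL always enabled on this interpreter")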

Use asyncio for high-concurrency I/O

For services that handle many simultaneous network connections, asyncio is more efficient than threads because it uses a single OS thread with cooperative scheduling. See python-async for the full async playbook.

import asyncio
 
# Handles thousands of concurrent connections without thread overhead;
# fetch() is assumed to be an async helper that downloads one URL.
async def handle_requests(urls: list[str]) -> list[bytes]:
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(fetch(u)) for u in urls]
    return [t.result() for t in tasks]

Reserve asyncio for I/O. CPU work inside a coroutine still blocks the event loop; offload it with asyncio.to_thread or ProcessPoolExecutor.
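
A minimal sketch of the offload pattern, assuming a CPU-heavy _digest helper (a hypothetical name for illustration):

import asyncio
import hashlib
 
def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
 
async def checksum(data: bytes) -> str:
    # The hash runs in a worker thread, so the event loop keeps serving other coroutines
    return await asyncio.to_thread(_digest, data)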

Accelerate hot paths with Cython, mypyc, or Rust

When profiling shows that a pure-Python function is the bottleneck after algorithmic improvements, consider native compilation.

  • mypyc: compiles type-annotated Python to C extensions. No API or structural changes required: add full type hints and run mypyc module.py. See python-typing for the annotation patterns mypyc requires.
  • Cython: the established option for C-level performance; requires .pyx syntax.
  • PyO3 (Rust): write a Rust crate and expose it via maturin. Best for algorithmically complex, allocation-heavy code.

Start with mypyc when you already have pyright-clean, fully typed code; it is the lowest-friction path to a 2-10x speedup without changing the Python interface.
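
A sketch of that path, using a hypothetical fib.py module that is already fully typed:

# fib.py
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

Compiling it with uv run mypyc fib.py builds a C extension that imports under the same name, so callers do not change.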

Profile memory with tracemalloc

High memory use causes GC pressure and swap, which hurt throughput. Use tracemalloc to find the allocating lines.

import tracemalloc
 
tracemalloc.start()
run_workload()  # placeholder for the code under investigation
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)

Common memory wins: use __slots__ on dataclasses with millions of instances, use generators instead of building full lists, and avoid holding large datasets in memory when streaming is possible.
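
A sketch of the first two wins, using a hypothetical Point record and a comma-separated input file:

from dataclasses import dataclass
from typing import Iterator
 
# slots=True drops the per-instance __dict__ (Python 3.10+), shrinking each object
@dataclass(slots=True)
class Point:
    x: float
    y: float
 
def read_points(path: str) -> Iterator[Point]:
    # Generator: yields one Point at a time instead of materializing a full list
    with open(path) as f:
        for line in f:
            x, y = map(float, line.split(","))
            yield Point(x, y)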