If you’re using Python in the world of data science or scientific computing, you will soon discover that Python has two different packaging systems: pip and Conda. Which raises some questions:
- How are they different?
- What are the tradeoffs between the two?
- Which should you use?
While it’s not possible to answer these questions for every situation, in this article you will learn the basic differences, constrained to:
- Python only; Conda has support for other languages but I won’t go into that.
- Linux, including running on Docker, though with some mention of macOS and Windows.
- Focusing on the Conda-Forge package repository; Conda has multiple package repositories, or “channels”.
By the end you should understand why Conda exists, when you might want to use it, and the tradeoffs of choosing one over the other.
The starting point: which kind of dependencies?
The fundamental difference between pip and Conda packaging is what they put in packages.
- Pip packages are Python libraries like NumPy or matplotlib.
- Conda packages include Python libraries (NumPy or matplotlib), C libraries (libjpeg), and executables (like C compilers, and even the Python interpreter itself).
Pip: Python libraries only
For example, let’s say you want to install Python 3.9 with NumPy, Pandas, and the gnuplot rendering tool, a tool that is unrelated to Python.
Here’s what the pip requirements.txt would look like:

    numpy
    pandas
Installing Python and gnuplot is out of scope for pip. You as a user must deal with this yourself. You might, for example, do so with a Docker image:
    FROM ubuntu:20.04
    RUN apt-get update && apt-get install -y gnuplot python3.9
    COPY requirements.txt .
    RUN pip install -r requirements.txt
Both the Python interpreter and gnuplot need to come from system packages, in this case Ubuntu’s packages.
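Because pip cannot supply non-Python tools, a pip-based project may want to check at startup that the operating system actually provides them. Here is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical and not part of pip:

```python
import shutil


def missing_tools(tool_names):
    """Return the tools from tool_names that cannot be found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


if __name__ == "__main__":
    # gnuplot has to come from the operating system (e.g. apt-get), not pip.
    missing = missing_tools(["gnuplot"])
    if missing:
        print("Install these with your system package manager:", missing)
    else:
        print("All external tools found.")
```

With Conda this kind of check is less necessary, since the tool can be listed as a dependency like any other package.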
Conda: Any dependency can be a Conda package (almost)
With Conda, Python and gnuplot are just more Conda packages, no different than NumPy or Pandas.
An environment.yml that corresponds (somewhat) to the requirements.txt we saw above will include all of these packages:
    name: myenv
    channels:
      - conda-forge
    dependencies:
      - python=3.9
      - numpy
      - pandas
      - gnuplot
Conda only relies on the operating system for basic facilities, like the standard C library. Everything above that is Conda packages, not system packages.
We can see the difference in the corresponding Dockerfile; there is no need to install any system packages:
    FROM continuumio/miniconda3
    COPY environment.yml .
    RUN conda env create
This base image ships with Conda pre-installed, but we’re not relying on any existing Python install, we’re installing a new one in the new environment.
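To check from inside Python which interpreter is actually running, something like the following sketch can help. It assumes a Conda environment can be recognized by the `conda-meta` directory that Conda keeps in each environment's root; the helper name `is_conda_env` is hypothetical.

```python
import os
import sys


def is_conda_env(prefix):
    """Heuristic: Conda environments contain a conda-meta directory."""
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Environment prefix:", sys.prefix)
    print("Looks like a Conda environment:", is_conda_env(sys.prefix))
```

Run inside the new environment, the interpreter path should point into that environment rather than at a system-wide Python.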
Note: Outside the very specific topic under discussion, the Dockerfiles in this article are not examples of best practices, since the added complexity would obscure the main point of the article.
Why Conda packages everything
Why did Conda make the decision to package everything, Python interpreter included? How does this benefit you? In part it’s about portability and reproducibility.
- Portability across operating systems: Instead of installing Python in three different ways on Linux, macOS, and Windows, you can use the same environment.yml on all three.
- Reproducibility: It’s possible to pin almost the whole stack, from the Python interpreter upwards.
- Consistent configuration: You don’t need to install system packages and Python packages in two different ways; (almost) everything can go in one file, the environment.yml.
But it also addresses another problem: how to deal with Python libraries that require compiled code. That’s a big enough topic that it gets a whole new section, next.
Beyond pure Python: Packaging compiled extensions
In the early days of Python packaging, a package included just the source code that needed to be installed. For pure Python packages, this worked fine, and still does. But what happens when you need to compile some Rust or C or C++ or Fortran code as part of building the package?
Solution #1: Compile it yourself
The original solution was to have each user compile the code themselves at install time. This can be quite slow, wastes resources, is often painful to configure, and still doesn’t solve a big part of the problem: shared library dependencies.
The Pillow image graphics library, for example, relies on third party shared libraries like libpng and libjpeg.
In order to compile Pillow yourself, you have to install all of them, plus their development headers.
On Linux or macOS you can install the system packages or the Homebrew packages; for Windows this can be more difficult.
But you’re going to have to write different configuration for every single OS and even Linux distribution.
Solution #2: Pip wheels
The way pip solves this problem is with packages called “wheels” that can include compiled code.
In order to deal with shared library dependencies like libpng, any external shared library dependencies get bundled inside the wheel itself.
For example, let’s look at a Pillow wheel for Linux; a wheel is just a ZIP file so we can use standard ZIP tools:
    $ zipinfo Pillow.whl
    ...
    Pillow.libs/libpng16-213e245f.so.16.37.0
    Pillow.libs/libjpeg-183418da.so.9.4.0
    ...
    PIL/FpxImagePlugin.py
    PIL/PalmImagePlugin.py
    ...
    PIL/_imagingcms.cpython-39-x86_64-linux-gnu.so
    ...
The wheel includes Python code, a compiled Python extension, and third-party shared libraries like libpng and libjpeg.
This can sometimes make packages larger, as multiple copies of third-party shared libraries may be installed, one per wheel.
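Since a wheel is a ZIP file, the same inspection can be done from Python with the standard `zipfile` module. The sketch below assumes the bundled libraries live in a top-level directory ending in `.libs`, as in the Linux listing above; wheels for other platforms may lay things out differently. The function name `bundled_shared_libs` is hypothetical.

```python
import zipfile


def bundled_shared_libs(wheel_path):
    """Return files stored under a top-level '<name>.libs/' directory."""
    libs = []
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            top_level, _, rest = name.partition("/")
            # Keep only real entries inside the .libs directory.
            if top_level.endswith(".libs") and rest:
                libs.append(name)
    return libs


if __name__ == "__main__":
    import sys

    # Usage: python list_libs.py Pillow.whl
    for path in sys.argv[1:]:
        for lib in bundled_shared_libs(path):
            print(lib)
```

Running this over every wheel you install is one way to spot the same library bundled more than once.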
Solution #3: Conda packages
Conda packages take a different approach to third-party shared libraries.
libjpeg and libpng are packaged as additional Conda packages:
    $ conda install -c conda-forge pillow
    ...
    The following NEW packages will be INSTALLED:
    ...
      jpeg    conda-forge/linux-64::jpeg-9d-h36c2ea0_0
    ...
      libpng  conda-forge/linux-64::libpng-1.6.37-h21135ba_2
    ...
      pillow  conda-forge/linux-64::pillow-7.2.0-py38h9776b28_2
      zstd    conda-forge/linux-64::zstd-1.5.0-ha95c52a_0
    ...
libjpeg and libpng can then be depended on by other installed packages.
They’re not wheel-specific; they’re available to any package in the Conda environment.
Conda can do this because it’s not a packaging system only for Python code; it can just as easily package shared libraries or executables.
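One way to see the shared copy is to look in the environment's own library directory. The sketch below assumes the Linux layout, where shared libraries sit in `<prefix>/lib`; the function name `shared_libs_in_env` is hypothetical.

```python
import glob
import os
import sys


def shared_libs_in_env(prefix, lib_name):
    """Return sorted paths in <prefix>/lib whose filename starts with lib_name."""
    lib_dir = glob.escape(os.path.join(prefix, "lib"))
    return sorted(glob.glob(os.path.join(lib_dir, lib_name + "*")))


if __name__ == "__main__":
    # In an activated Conda environment, sys.prefix is the environment root.
    for path in shared_libs_in_env(sys.prefix, "libpng"):
        print(path)
```

Every package in the environment that needs libpng links against these files, rather than carrying its own copy.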
Summary: pip vs Conda
| |pip|Conda|
|---|---|---|
|Installs Python|No|Yes, as package|
|3rd-party shared libraries|Inside the wheel|Yes, as package|
|Executables and tools|No|Yes, as package|
|Python source code|Yes, as package|Yes, as package|
PyPI vs. Conda-Forge
Another fundamental difference between pip and Conda is less about the tools themselves, and more about the package repositories they rely on and how they work. In particular, most Python programs will rely on open source libraries, and these need to be downloaded from somewhere. For these, pip relies on PyPI, whereas Conda supports multiple different “channels” hosted on Anaconda.
The default Conda channel is maintained by Anaconda Inc, the company that created Conda. It tends to have limited package selection and be somewhat less up-to-date, with some potential benefits regarding stability and GPU support. Beyond that I don’t know that much about it.
But there’s also the Conda-Forge community channel, which offers far more packages, tends to be up-to-date, and is where you probably want to get your Conda packages most of the time. You can mix packages from the default channel and Conda-Forge, if you want the default channel’s GPU packages.
Let’s compare PyPI with Conda-Forge.
On PyPI, each package maintainer might compile or build their packages in their own idiosyncratic way, maintaining their own build infrastructure, choosing their own compilation options, and so on.
For example, NumPy can rely on multiple different BLAS libraries for fast linear algebra operations. The maintainers have chosen to build their PyPI packages with OpenBLAS; if you want another option, like Intel’s (maybe?) faster MKL, you’re out of luck unless you’re willing to compile the code yourself.
Conda-Forge is a community project where package maintainers can be different than the original author of the package.
For example, I have commit access to the typeguard Conda-Forge recipe even though I am not a maintainer of the typeguard library.
Instead of custom builds done differently by each package maintainer, Conda-Forge has centralized build systems that recompile libraries, update recipe repositories, and in general automate everything massively. When a new version of Python 3 comes out, for example, a centralized update will happen, and all the individual package maintainers will get PRs adding new packages; on PyPI this is up to individual maintainers to figure out.
Because the packaging infrastructure is centralized, Conda-Forge is able to let you choose which BLAS to use, and it will be used for NumPy and SciPy and whatever other packages you use that rely on BLAS.
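To see which BLAS a particular NumPy build was linked against, NumPy provides `numpy.show_config()`, which prints its build configuration. The sketch below captures that output as a string and returns `None` if NumPy is not installed; the wrapper name `numpy_build_config` is hypothetical, and the output format differs between NumPy versions.

```python
import contextlib
import io


def numpy_build_config():
    """Return NumPy's printed build configuration, or None without NumPy."""
    try:
        import numpy
    except ImportError:
        return None
    buffer = io.StringIO()
    # show_config() prints to stdout, so capture it instead of returning it.
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    config = numpy_build_config()
    print(config if config is not None else "NumPy is not installed.")
```

Comparing the output from a pip-installed NumPy and a Conda-Forge one is a quick way to confirm which BLAS each is using.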
Dealing with PyPI-only packages in Conda
While Conda-Forge has many packages, it doesn’t have all of them; many Python packages can only be found on PyPI. You can deal with the lack of these packages in a number of ways.
Install pip packages in a Conda environment
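Conda environments play the same role as virtualenvs: each has its own Python interpreter and its own installed packages. As such you can just call pip install yourself once the environment is activated.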
If you’re using an environment.yml to install your Conda packages, you can also add pip packages:
    name: myenv
    channels:
      - conda-forge
    dependencies:
      - python=3.9
      - numpy
      - pandas
      - gnuplot
      - pip:
        # Package that is only on PyPI
        - sandu
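When an environment mixes both sources, it can be useful to see which tool installed a given package. A rough sketch: installed Python distributions may carry an `INSTALLER` metadata file, which pip fills in with `pip`. It is an assumption here that Conda-built packages record `conda` in the same file, so treat the result as a hint rather than proof. The helper name `installer_of` is hypothetical.

```python
from importlib import metadata


def installer_of(dist_name):
    """Return the recorded installer for a distribution, or None if unknown."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    text = dist.read_text("INSTALLER")
    return text.strip() if text else None


if __name__ == "__main__":
    # Package names taken from the environment.yml above.
    for name in ["numpy", "pandas", "sandu"]:
        print(name, "->", installer_of(name))
```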
Package it for Conda-Forge yourself
Because Conda-Forge does not require maintainers of the code to do the packaging, anyone can volunteer to add a package to Conda-Forge. That includes you!
For many Python packages it’s a surprisingly easy process, and it’s quite automated, so handling new releases is often as easy as approving an automatically-created PR.
Summary: PyPI vs. Conda-Forge
| |PyPI|Conda-Forge|
|---|---|---|
|Who creates package?|Author of code|Anyone|
|Build infrastructure|Maintained by author|Centralized|
|Open source Python libraries|Essentially all|Many|
|Other open source tools|None|Many|
|Windows/Linux/macOS packages|Usually, but up to maintainer|Almost always|
Additional tooling for Pip and Conda
Here’s a quick summary of some of the additional tooling you might want to use with either one:
| |pip|Conda|
|---|---|---|
|Reproducible builds|pip-tools, pipenv, Poetry|conda-lock|
|Security scanning|Most security scanners|Jake|
|Alternatives|Poetry, pipenv|Mamba; much faster, highly recommended|
To reiterate: if you do use Conda, I highly recommend using Mamba as a replacement. It supports the same command-line options and is much faster.
Which should you use?
So which should you use, pip or Conda? For general Python computing, pip and PyPI are usually fine, and the surrounding tooling tends to be better.
For data science or scientific computing, however, Conda’s ability to package third-party libraries, and the centralized infrastructure provided by Conda-Forge, means setup of complex packages will often be easier. In the end, which works best for you will depend on your situation and requirements; quite possibly both will be fine.