
Databricks Custom Container on GPU Compute with Python 3.9

Writer: Conner Shoop


Databricks has a new feature in public preview that lets users run Databricks Container Services on GPU compute. Some of our users wanted to run their deep learning model on Databricks, but their Python environment was customized enough to warrant trying out this feature.


Skip to the Artifacts Section to see the code used to build and push the final image to Azure Container Registry.


The first step was reproducing the model locally and getting an understanding of what is needed. At a high level, we found we needed Python 3.9, PyTorch with CUDA GPU support, and a few non-standard Python packages (that is, packages that do not ship with the last Databricks Runtime that supports Python 3.9). Next, we determined that when we deliver this to our users, it should be easy to set up, reproduce, and understand. The final product will be a compute policy that we make available to the target users.
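As a sanity check while reproducing the model locally, a one-liner along these lines can confirm the interpreter version and that PyTorch can see a CUDA device. This is a sketch, not part of the original workflow; it assumes a local python3.9 interpreter with torch already installed.


# Local sanity check (hypothetical; assumes python3.9 and torch are installed)
python3.9 -c 'import sys, torch; print(sys.version); print(torch.__version__, torch.cuda.is_available())'
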


How a user will consume the compute policy

Writing the Dockerfile

Databricks provides Databricks runtime images, which we can usually use as a starting point. However, at the time of this writing, there are none available for Python 3.9 with GPU support. The Databricks containers repository contains the code used to build the GPU containers. Digging into the most up-to-date CUDA directory (cuda-11.8), we find the PyTorch Dockerfile, which extends the gpu-venv image. It is the gpu-venv image that sets the Python version and installs the system environment.
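To browse those Dockerfiles yourself, something like the following works. This is a sketch; the repository layout may have changed since this writing.


# Clone the Databricks containers repo and locate the cuda-11.8 image definitions
git clone https://github.com/databricks/containers.git
find containers -type d -name "cuda-11.8"
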


Let us rewrite the gpu-venv Dockerfile to use Python 3.9. First, we update the python_version ARG (the original is shown first, followed by our change).



# Dockerfile
ARG python_version="3.10"

# Dockerfile
ARG python_version="3.9"

Next, we update the Python installation script. In order to install Python 3.9 on Ubuntu 22.04 (determined from the base image that gpu-venv extends), we need to add the deadsnakes PPA. Again, the original is shown first, followed by our change.



# Dockerfile

# Install python 3.10 from ubuntu.
# Install pip via get-pip.py bootstrap script and install versions that match Anaconda distribution.
RUN apt-get update \
 && apt-get install curl software-properties-common -y python${python_version} python${python_version}-dev python${python_version}-distutils \
 && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py \
 && /usr/bin/python${python_version} get-pip.py pip==${pip_version} setuptools==${setuptools_version} wheel==${wheel_version} \
 && rm get-pip.py


# Dockerfile

# Install python 3.9 from deadsnakes.
RUN add-apt-repository ppa:deadsnakes/ppa
# Install pip via get-pip.py bootstrap script and install versions that match Anaconda distribution.
RUN apt-get update \
 && apt-get install -y curl software-properties-common python${python_version} python${python_version}-dev python${python_version}-distutils \
 && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py \
 && /usr/bin/python${python_version} get-pip.py pip==${pip_version} setuptools==${setuptools_version} wheel==${wheel_version} \
 && rm get-pip.py

We found out later that the numpy version pinned in the provided requirements.txt caused an issue with rasterio, one of the packages we install later, so we upgraded the pin in requirements.txt.



# requirements.txt

numpy==1.24.4 # upgrade from 1.23.5 for rasterio

No other changes to the gpu-venv image are needed, so we can add PyTorch support at the end of the Dockerfile, as done in the PyTorch Dockerfile.



# Dockerfile

# install the pytorch versions
RUN /databricks/python3/bin/pip install --no-cache-dir \
	torch==2.0.1 \
	torchvision==0.15.2 \
	&& /databricks/python3/bin/pip cache purge

Lastly, we can install the custom python package requirements needed to run our users’ model.



# Dockerfile

COPY custom_requirements.txt /databricks/.

RUN /databricks/python3/bin/pip install --no-cache-dir -r /databricks/custom_requirements.txt


 


Manually test the Custom Container in Databricks

Now we need to build the image and push it to Azure Container Registry so our Databricks compute can make use of it. At the time of writing, Docker Hub and ACR are the only supported registries. To follow our workflow, reference buildImage.sh and pushImage.sh in the Artifacts Section.
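Assuming the two scripts sit next to the Dockerfile, the flow looks roughly like this. This is a sketch; it assumes the Azure CLI is installed and you have push rights on the registry.


# Authenticate Docker to ACR, then build and push the image
az acr login --name <yourAzureContainerRegistry>
bash buildImage.sh
bash pushImage.sh
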


To test your custom container manually, first create a new cluster and set it to Databricks Standard Runtime 12.2 LTS. This is the last runtime that supports Python 3.9, and using another version will cause issues. Also, make sure to use the Standard runtime, not the ML runtime.



Next, edit the Databricks cluster -> Advanced -> Docker, and reference the container image in ACR. For ACR, you will need to create a token and a token password to serve as the username and password for authentication. Once Use your own Docker container is checked, you can select GPU node types alongside your Databricks Standard Runtime 12.2 LTS.
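One way to create that token is with the Azure CLI. The commands below are a sketch: the token name is a placeholder, and the built-in _repositories_pull scope map grants pull-only access.


# Create a pull-only ACR token for the cluster to authenticate with (hypothetical names)
az acr token create \
  --name <token-name> \
  --registry <yourAzureContainerRegistry> \
  --scope-map _repositories_pull

# Generate (or rotate) a password for the token
az acr token credential generate \
  --name <token-name> \
  --registry <yourAzureContainerRegistry> \
  --password1
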



Advanced options -> Docker -> use your own Docker container

If you plan on accessing Unity Catalog Volumes, make sure to add the spark.databricks.unityCatalog.volumes.enabled config value and set it to true.
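In the Spark config box, that is the following line:


spark.databricks.unityCatalog.volumes.enabled true
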



Advanced options -> Spark -> Spark config

Reference the Azure Databricks Custom Container Documentation for further support.
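For reference, the same test cluster can be created without the UI. The sketch below uses the Clusters REST API with placeholder host, token, and credential values; it is not part of the original walkthrough.


# Create an equivalent test cluster via the Clusters API (values are placeholders)
cat > create-cluster.json <<'EOF'
{
  "cluster_name": "py39-pytorch-gpu-test",
  "spark_version": "12.2.x-scala2.12",
  "node_type_id": "Standard_NC4as_T4_v3",
  "num_workers": 1,
  "autotermination_minutes": 45,
  "docker_image": {
    "url": "<azureContainerRegistryName>.azurecr.io/databricks/pytorch-gpu:py39",
    "basic_auth": { "username": "<token-name>", "password": "<token-password>" }
  },
  "spark_conf": {
    "spark.databricks.unityCatalog.volumes.enabled": "true"
  }
}
EOF

curl -X POST "https://<workspace-url>/api/2.0/clusters/create" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @create-cluster.json
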


One last issue with Databricks Runtime 12.2 LTS is that it does not support the new Git Folders feature. Since we are managing this model via a Git repository, we needed to use the legacy Repos feature to connect our repository to Databricks.



Test your code with the compute you’ve just created. If everything runs as expected, you’re ready for the final step.
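A quick way to confirm the container and GPU are wired up is a notebook cell along these lines. This is a sketch; the interpreter path comes from the Dockerfile's PYSPARK_PYTHON setting.


%sh
# Confirm the GPU is visible and the notebook environment is Python 3.9 with a CUDA-enabled torch
nvidia-smi
/databricks/python3/bin/python -c 'import sys, torch; print(sys.version); print(torch.__version__, torch.cuda.is_available())'
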

Deliver as a Compute Policy

As we said in the beginning, we want to make our custom container environment available through a new compute policy. We manage our workspaces with Terraform, and we can implement this new compute policy with a cluster policy Terraform module we created.




module "dbricks_prod_01_cluster_policy_geospatial_py39_gpu" {
 source  = "app.terraform.io/organizationName/cluster-policy/databricks"
 version = "0.0.9"

 name             = "Py39 Pytorch GPU"
 policy_family_id = "personal-vm"
 account_groups   = ["account-group-1", "account-group-2"]

 overrides = {
   "autotermination_minutes" : {
     "type" : "range",
     "maxValue" : 60,
     "defaultValue" : 45,
     "isOptional" : false
   },
   "docker_image.url" : { "type" : "fixed", "value" : "<azureContainerRegistryName>.azurecr.io/databricks/pytorch-gpu:py39" },
   "docker_image.basic_auth.username" : { "type" : "fixed", "value" : local.dbricks_prod_01.cr_token_name },
   "docker_image.basic_auth.password" : { "type" : "fixed", "value" : local.dbricks_prod_01.cr_token_password },
   "spark_version" : {
     "type" : "allowlist",
     "values" : ["12.2.x-scala2.12"],
     "defaultValue" : "12.2.x-scala2.12"
   },
   "driver_node_type_id" : { "type" : "fixed", "value" : "Standard_NC4as_T4_v3" },
   "node_type_id" : { "type" : "fixed", "value" : "Standard_NC4as_T4_v3" },
   "spark_conf.spark.databricks.unityCatalog.volumes.enabled" : { "type" : "fixed", "value" : "true" },
 }

 providers = {
   databricks = databricks.dbricks_prod_01
 }
}

The module creates a new compute policy that overrides the personal-vm policy family, adding the custom Docker image, Unity Catalog volume access, account group access, and other helpful settings. After this is deployed, all users in the account groups we specify will be able to use the new policy.


If you are a workspace admin, you can also manually create a new compute policy in your Databricks workspace and add your own overrides referencing your custom container.


Compute -> Policies -> Create Policy

Create Policy -> Edit overrides

Specifically, you’ll want these overrides if you are overriding the Personal Compute policy family.



{
 "docker_image.basic_auth.password": { "type": "fixed", "value": "tokenPassword" },
 "docker_image.basic_auth.username": { "type": "fixed", "value": "token-name" },
 "docker_image.url": { "type": "fixed", "value": "azureContainerRegistryName.azurecr.io/databricks/pytorch-gpu:py39" },
 "spark_conf.spark.databricks.unityCatalog.volumes.enabled": { "type": "fixed", "value": "true" }
}

Now your users are ready to use the new Python 3.9 GPU environment via the new compute policy!

Artifacts

Dockerfile



FROM databricksruntime/gpu-base:cuda11.8

ARG python_version="3.9"
ARG pip_version="23.2.1"
ARG setuptools_version="68.0.0"
ARG wheel_version="0.38.4"
ARG virtualenv_version="20.24.2"

ENV TZ=Etc/UTC
ENV DEBIAN_FRONTEND=noninteractive

# Install python 3.9 from deadsnakes.
RUN add-apt-repository ppa:deadsnakes/ppa
# Install pip via get-pip.py bootstrap script and install versions that match Anaconda distribution.
RUN apt-get update \
 && apt-get install -y curl software-properties-common python${python_version} python${python_version}-dev python${python_version}-distutils \
 && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py \
 && /usr/bin/python${python_version} get-pip.py pip==${pip_version} setuptools==${setuptools_version} wheel==${wheel_version} \
 && rm get-pip.py

# virtualenv 20.0.24 introduced a periodic update feature, which attempts to update all
# seeder packages every 14 days. This launches background processes that may interfere
# with user cleanup and may allow users to inadvertently update pip to newer versions
# incompatible with Databricks. Instead, we patch virtualenv to disable periodic updates per
# https://virtualenv.pypa.io/en/latest/user_guide.html#embed-wheels-for-distributions.
RUN /usr/local/bin/pip${python_version} install --no-cache-dir virtualenv==${virtualenv_version} \
 && sed -i -r 's/^(PERIODIC_UPDATE_ON_BY_DEFAULT) = True$/\1 = False/' /usr/local/lib/python${python_version}/dist-packages/virtualenv/seed/embed/base_embed.py \
 && /usr/local/bin/pip${python_version} download pip==${pip_version} --dest \
 /usr/local/lib/python${python_version}/dist-packages/virtualenv_support/

# Create /databricks/python3 environment.
# We install pip and wheel so their executables show up under /databricks/python3/bin.
# We use `--system-site-packages` so python will fallback to system site packages.
# We use `--no-download` so virtualenv will install the bundled pip and wheel.
# Initialize the default environment that Spark and notebooks will use
RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python3 --no-download --no-setuptools

# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect latest DBR LTS: https://docs.databricks.com/en/release-notes/runtime/15.4lts.html#system-environment

COPY requirements.txt /databricks/.

RUN /databricks/python3/bin/pip install -r /databricks/requirements.txt

# Specifies where Spark will look for the python binary
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3

RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python-lsp --no-download --no-setuptools

COPY python-lsp-requirements.txt /databricks/.

RUN /databricks/python-lsp/bin/pip install -r /databricks/python-lsp-requirements.txt

# Use pip cache purge to cleanup the cache safely
RUN /databricks/python3/bin/pip cache purge

# install the pytorch versions
RUN /databricks/python3/bin/pip install --no-cache-dir \
   torch==2.0.1 \
   torchvision==0.15.2 \
   && /databricks/python3/bin/pip cache purge

COPY custom_requirements.txt /databricks/.

RUN /databricks/python3/bin/pip install --no-cache-dir -r /databricks/custom_requirements.txt

buildImage.sh



# build the image
docker build -f Dockerfile -t python ./

# create an alias of the image
export CONTAINER_REGISTRY_SERVER=<yourAzureContainerRegistry>.azurecr.io
export NAMESPACE=databricks
export IMAGE_IDENTIFIER=pytorch-gpu
export VERSION=py39
export IMAGE_TAG=${CONTAINER_REGISTRY_SERVER}/${NAMESPACE}/${IMAGE_IDENTIFIER}:${VERSION}
docker tag python ${IMAGE_TAG}

pushImage.sh



export CONTAINER_REGISTRY_SERVER=<yourAzureContainerRegistry>.azurecr.io
export NAMESPACE=databricks
export IMAGE_IDENTIFIER=pytorch-gpu
export VERSION=py39
export ALIAS=${CONTAINER_REGISTRY_SERVER}/${NAMESPACE}/${IMAGE_IDENTIFIER}:${VERSION}
docker push ${ALIAS}




 
 
 
