Every year, I take two weeks off around Christmas and New Year. It’s my favorite time of the year because I usually don’t go anywhere, but stay at home, so that I can play my favorite video games, fix things up around the house, or start some fun projects that require a large chunk of time that doesn’t fit into a typical weekend.
This year, I decided to play with Machine Learning, or, as many people now call it, AI. There are many angles to take: running an open-weight LLM with Ollama / LM Studio; building some fun AI agent; playing with coding agents like Antigravity / Claude Code / Codex. As for me, I have always been more interested in training machine learning models.
My first experience with machine learning was Andrew Ng’s machine learning class on coursera.org. That was a few years before PyTorch and TensorFlow were introduced, so things were done in GNU Octave. I later implemented some basic training logic in Python using NumPy for a class project. It’s finally time for me to play with PyTorch, and obviously, the best way is in a Google Colab.
The free hosted Colab runtime offers only an NVIDIA T4 GPU and a small amount of RAM (10-12GB). If I want to play with a better GPU and bigger models, I have a few choices:
- Colab Paid Service: it’s super vague about which GPU is available, and it’s a subscription that I have to remember to unsubscribe from
- Run a local runtime: I don’t have a good GPU at home, because they are too expensive nowadays
- Run a local runtime on a GCE VM: let’s try this!
Google Compute Engine (GCE) G4 VM
Google Cloud Platform (GCP) GA’ed the GCE G4 VM in October 2025. It offers an RTX PRO 6000 with 96GB of VRAM, which seems like a perfect choice for LLM work. With spot pricing, it’s about $2.50 per hour – a bit pricey to run 24×7, but I am only going to use it a couple of hours here and there, and that’s still much cheaper than owning any comparable GPU.
Creating the VM is easy from the web console. We keep most things at their defaults; the only changes are the following (an equivalent gcloud command is sketched right after this list):
- Set the provisioning model to Spot – it’s about 50% cheaper this way
- Select the Debian 12 OS image with a 40GB root disk
- Add a local SSD
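For repeatability, here is roughly what the same setup looks like with the gcloud CLI. Treat it as a sketch rather than a tested incantation: the G4 machine-type name and the zone below are my assumptions – check gcloud compute machine-types list and G4 regional availability before running it.
# Sketch only – verify machine type and zone first
gcloud compute instances create llm-dev-0 \
    --zone=us-central1-b \
    --machine-type=g4-standard-48 \
    --provisioning-model=SPOT \
    --instance-termination-action=STOP \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --boot-disk-size=40GB \
    --local-ssd=interface=NVME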
My complaint here is that the UI says it’s better to use a GPU-optimized Debian OS image to avoid manually installing CUDA, but when I tried the Deep Learning on Linux images, they turned out not to be compatible with the RTX PRO 6000¹. Anyway, at least the documentation exists, but the UI could really be better.

Setting up the VM
GPU Driver Installation
It was straightforward to follow the steps at https://docs.cloud.google.com/compute/docs/gpus/install-drivers-gpu to install the GPU driver.
curl -L https://storage.googleapis.com/compute-gpu-installation-us/installer/latest/cuda_installer.pyz --output cuda_installer.pyz
sudo python3 cuda_installer.pyz install_driver --installation-mode=binary --installation-branch=prod
The machine will restart, and we should rerun the cuda_installer to finish the installation. After that, running nvidia-smi should produce some meaningful output.
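Concretely, the post-reboot steps look like this (the second run is the same command as above, as the install guide describes):
sudo python3 cuda_installer.pyz install_driver --installation-mode=binary --installation-branch=prod
# The driver and the GPU should now show up
nvidia-smi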
We also install the nvidia-container-toolkit so that we can use the GPU from inside the Colab Docker container, following https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
sudo apt-get install -y nvidia-container-toolkit
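One step from that guide worth calling out: once Docker is installed (next section), the toolkit still has to be registered as a Docker runtime, otherwise the --gpus=all flag we use later will fail:
# Run this after Docker is installed
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker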
Local SSD Setup
To save cost, I decided to use local SSD instead of persistent storage. The downside, however, is that I need to jump through a few more hoops to make Docker store all of its images there – the Colab Docker image alone can be as big as 55GB.
git clone https://github.com/penguingao/gce_local_ssd_setup.git
cd gce_local_ssd_setup
sudo ./install.sh
sudo init 6
The VM should restart and the SSDs should be correctly mounted in gce_local_ssd_setup/local_ssd_mnt_[0-3].
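A quick df confirms the mounts (the paths are the ones created by the script above):
df -h ~/gce_local_ssd_setup/local_ssd_mnt_*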
Install Docker
First we follow the steps at https://docs.docker.com/engine/install/debian/.
# Uninstall all conflicting packages:
sudo apt remove $(dpkg --get-selections docker.io docker-compose docker-doc podman-docker containerd runc | cut -f1)
# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/debian
Suites: $(. /etc/os-release && echo "$VERSION_CODENAME")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update
# Install
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
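Before pointing Docker at the local SSD, a quick sanity check that the installation works:
sudo docker run --rm hello-world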
Now that Docker is installed, we need to point both Docker and containerd at the local SSD for storage. For Docker, that means setting data-root in /etc/docker/daemon.json (the unquoted heredoc below lets the shell expand ${HOME} before the file is written):
sudo tee /etc/docker/daemon.json <<EOF
{
  "data-root": "${HOME}/gce_local_ssd_setup/local_ssd_mnt_0/docker_root"
}
EOF
# Restart Docker so it picks up the new data-root
sudo systemctl daemon-reload
sudo systemctl restart docker
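docker info can print just the storage root via its Go-template output, which is an easy way to confirm the change took effect:
sudo docker info --format '{{ .DockerRootDir }}'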
This should print the new docker_root directory. Also, run sudo systemctl edit docker and include the following lines in the override:
[Unit]
After=chmod_local_ssd.service
Requires=chmod_local_ssd.service
This makes sure Docker only starts after the permissions on the SSD directories have been set up properly.
Finally, let’s make containerd use a different root as well, by editing /etc/containerd/config.toml to include the following line:
root = '<your-home-dir>/gce_local_ssd_setup/local_ssd_mnt_0/containerd_root'
And similarly, run sudo systemctl edit containerd and include the following lines:
[Unit]
After=chmod_local_ssd.service
Requires=chmod_local_ssd.service
After restarting containerd with sudo systemctl restart containerd, both daemons are now storing their data on the local SSD.
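If you want to double-check containerd as well, it can dump its merged configuration (the grep is just for convenience; the relevant key is the top-level root):
sudo containerd config dump | grep 'root ='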
Start the local Colab runtime
We should follow https://research.google.com/colaboratory/local-runtimes.html.
docker run --gpus=all -p 127.0.0.1:9000:8080 us-docker.pkg.dev/colab-images/public/runtime
This might take a little while, but once it finishes, the terminal prints a URL that includes an access token. We just need to forward that port to our local machine (-N opens no remote shell, -L sets up the local port forward) before visiting https://colab.research.google.com/ to connect to the runtime.
gcloud compute ssh llm-dev-0 -- -NL 9000:localhost:9000
After pasting the URL with the token into Colab’s “Connect to a local runtime” dialog, the runtime is connected and we can start using PyTorch!
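A minimal first cell to confirm that PyTorch actually sees the GPU (the exact device name and reported memory may vary slightly):
import torch

print(torch.__version__)
print(torch.cuda.is_available())      # expect: True
print(torch.cuda.get_device_name(0))  # expect: something like "NVIDIA RTX PRO 6000 ..."
print(torch.cuda.get_device_properties(0).total_memory / 2**30)  # VRAM in GiB, ~96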


1. I only learned about this from reading https://www.reddit.com/r/googlecloud/comments/1gw6pkx/compute_engine_deep_learning_vm_images_still/ ↩︎