Setting up a CUDA environment on an EC2 GPU instance is surprisingly like building a custom PC: the hardware is already racked in a data center, but installing the driver, the toolkit, and the paths that tie them together is still your job.

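Let’s get this thing running. We’ll use a g4dn.xlarge instance for this example, which comes with an NVIDIA T4 GPU. The commands below assume an Ubuntu 20.04 AMI, since they rely on apt and the ubuntu2004 CUDA repository.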

First, SSH into your EC2 instance. You’ll want to update your package lists and install some essentials:

sudo apt-get update
sudo apt-get install -y build-essential dkms linux-headers-$(uname -r)
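
DKMS builds the NVIDIA kernel module against the headers of the kernel that is currently running, so it can be worth confirming they actually landed before moving on. A minimal sketch, assuming the usual Ubuntu layout where the headers are exposed at /lib/modules/<release>/build (the function name have_kernel_headers is only illustrative):

```shell
#!/usr/bin/env bash
# Report whether headers for a given kernel release are installed.
# Assumption: Ubuntu layout, where /lib/modules/<release>/build points at the headers.
have_kernel_headers() {
    local release="${1:-$(uname -r)}"
    [ -e "/lib/modules/${release}/build" ]
}

if have_kernel_headers; then
    echo "kernel headers present for $(uname -r)"
else
    echo "kernel headers missing for $(uname -r)"
fi
```

If this reports missing headers, the driver install in the next step will likely fail to build its module.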

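Now, the crucial part: installing the NVIDIA driver. Whether a driver is already present depends on the AMI: a stock Ubuntu image ships without one, while the AWS Deep Learning AMIs include one. On a stock image, the nvidia-driver-535 package from Ubuntu’s own repositories is a convenient choice that works with the T4 and with CUDA 12.2.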

sudo apt-get install -y nvidia-driver-535

After the driver installation, a reboot is necessary for the kernel modules to load correctly.

sudo reboot

Once your instance is back up, SSH in again and verify the driver installation by running:

nvidia-smi

This command should output detailed information about your NVIDIA GPU, including its name, temperature, and the driver version. If you see this, your driver is good to go.
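
For scripted checks, the full nvidia-smi table is awkward to parse. The sketch below uses nvidia-smi's query mode to print one CSV line per GPU and falls back to a hint when the tool is absent. The function name gpu_summary and the choice of fields are illustrative, not part of the original setup:

```shell
#!/usr/bin/env bash
# Print a one-line summary of each GPU, or return non-zero if nvidia-smi is unavailable.
gpu_summary() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver package is not installed" >&2
        return 1
    fi
    # Query mode: one CSV row per GPU with name, driver version, and total memory.
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
}

gpu_summary || echo "driver check failed; try 'lsmod | grep nvidia' and 'sudo dmesg | grep -i nvidia'"
```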

Next, we need the CUDA Toolkit, which includes the compiler (nvcc), libraries, and development tools necessary for GPU programming. You can download it directly from NVIDIA’s website, but it’s often easier to use their apt repository.

First, add the CUDA repository. The ubuntu2004 segment of the URL must match the Ubuntu release of your AMI:

sudo apt-get install -y software-properties-common
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
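
The repository URL above is tied to Ubuntu 20.04. As a rough sketch, the directory name can be derived from /etc/os-release instead of being typed by hand. This assumes NVIDIA's naming pattern of "ubuntu" plus the version with the dot removed, and that the keyring file name is the same for your release; both are assumptions worth checking against the repository listing. The helper name cuda_repo_slug is hypothetical:

```shell
#!/usr/bin/env bash
# Turn an Ubuntu VERSION_ID such as "22.04" into a repo directory name such as "ubuntu2204".
cuda_repo_slug() {
    printf 'ubuntu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# /etc/os-release defines ID and VERSION_ID on Ubuntu.
if [ -r /etc/os-release ]; then
    . /etc/os-release
    if [ "${ID:-}" = "ubuntu" ]; then
        # Assumption: the keyring package name is identical across Ubuntu releases.
        echo "https://developer.download.nvidia.com/compute/cuda/repos/$(cuda_repo_slug "$VERSION_ID")/x86_64/cuda-keyring_1.1-1_all.deb"
    else
        echo "not Ubuntu (ID=${ID:-unknown}); pick the repo directory by hand"
    fi
fi
```

The printed URL can then be passed to wget in place of the hard-coded one.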

Now, install the CUDA Toolkit itself. We’ll install the toolkit without the bundled driver, as we’ve already installed a compatible one:

sudo apt-get install -y cuda-toolkit-12-2 --allow-downgrades

After installation, you need to configure your environment variables so that nvcc and other CUDA tools are accessible. Add these lines to your ~/.bashrc file:

export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Then, reload your shell configuration:

source ~/.bashrc
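
Appending the export lines by hand is fine once, but a setup script that runs more than once would duplicate them. One possible way to keep the step idempotent is sketched below; append_once is a hypothetical helper, and the demo writes to a scratch file rather than touching ~/.bashrc:

```shell
#!/usr/bin/env bash
# Append a line to a file only if that exact line is not already present.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; point rc_file at ~/.bashrc to use it for real.
rc_file="$(mktemp)"
append_once 'export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}' "$rc_file"
append_once 'export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}' "$rc_file"  # second call is a no-op
cat "$rc_file"
```

The single quotes matter: they keep ${PATH} from being expanded at write time, so the expansion happens each time the shell starts.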

To confirm that the CUDA Toolkit is installed and accessible, check the nvcc version:

nvcc --version

This should display the CUDA compiler version, confirming your setup.
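
If more than one toolkit ends up on the machine, it helps to check that the nvcc on your PATH is the release you meant to install. A small sketch, assuming the "release X.Y," wording that nvcc prints in its version banner (nvcc_release is an illustrative name):

```shell
#!/usr/bin/env bash
# Extract the "release X.Y" number from `nvcc --version` output read on stdin.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

expected="12.2"
if command -v nvcc >/dev/null 2>&1; then
    actual="$(nvcc --version | nvcc_release)"
    if [ "$actual" = "$expected" ]; then
        echo "nvcc reports CUDA $actual"
    else
        echo "nvcc reports CUDA ${actual:-unknown}, expected $expected; check PATH"
    fi
else
    echo "nvcc not on PATH; re-check the export lines in ~/.bashrc"
fi
```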

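Finally, let’s compile and run a sample CUDA program to ensure everything is working end-to-end. Since CUDA 11.6 the sample applications are no longer bundled with the toolkit, so /usr/local/cuda-12.2/samples will not exist; they live in NVIDIA’s cuda-samples repository on GitHub instead. A good one to test is deviceQuery.

Clone the samples at the tag matching your toolkit, navigate to the deviceQuery directory, and build it (sudo is not needed when building in your home directory):

sudo apt-get install -y git
git clone --depth 1 --branch v12.2 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make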

Now, run the compiled executable:

./deviceQuery

If everything is correctly set up, this program will list details about your NVIDIA GPU and conclude with "Result = PASS". This is your definitive confirmation that your CUDA environment is operational on your EC2 GPU instance.
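
To use deviceQuery as a smoke test in a provisioning script rather than reading its output by eye, the final line can be checked mechanically. A minimal sketch, assuming it is run from the directory that holds the built deviceQuery binary (device_query_passed is a hypothetical helper):

```shell
#!/usr/bin/env bash
# Succeed only if deviceQuery-style output on stdin contains "Result = PASS".
device_query_passed() {
    grep -q 'Result = PASS'
}

if [ -x ./deviceQuery ]; then
    if ./deviceQuery | device_query_passed; then
        echo "CUDA smoke test passed"
    else
        echo "CUDA smoke test failed" >&2
    fi
else
    echo "deviceQuery binary not found in $(pwd); build it first"
fi
```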

The next hurdle you’ll likely encounter is managing multiple CUDA versions or optimizing your PyTorch/TensorFlow installation to leverage the newly configured CUDA environment.
