
Optimize Multi-Node LLM Training at Scale with DigitalOcean AI
Introduction
Optimizing large-scale AI workloads requires robust infrastructure, and DigitalOcean has proven to be a powerful solution for training and fine-tuning large language models (LLMs) like MPT-7B and MPT-30B. With the integration of MosaicML LLM Foundry and the use of H100 GPUs across multi-node clusters, DigitalOcean’s platform has demonstrated impressive performance and efficiency at scale. This makes it an ideal choice for AI tasks that demand consistent results and resource optimization. In this article, we explore how DigitalOcean enables efficient, production-ready LLM training without the complexity of traditional infrastructure management.
What is MosaicML LLM Foundry?
MosaicML LLM Foundry is an open-source tool that helps train and fine-tune large language models across multiple compute nodes. It simplifies the process by providing an end-to-end solution, including data preparation, model training, and evaluation, without needing complex setups. This makes it easier for users to work with large-scale models and ensures efficient use of resources, such as GPUs, during training. It’s designed for easy, automated use and supports a variety of large language models.
Finetuning tool: MosaicML LLM Foundry
So, you’re working with AI, right? If you’ve ever tried to train or finetune large language models (LLMs) across multiple machines, you’ll know it can get pretty tricky. Well, here comes MosaicML LLM Foundry, an open-source tool that takes all the stress out of the process. It’s a powerful solution designed to make training advanced AI models easier and more efficient, especially when dealing with large-scale workloads.
What makes MosaicML LLM Foundry stand out? Let’s break it down:
End-to-End Workflow Support
Here’s the thing – building an AI model isn’t just about writing code. There’s data preparation, training, finetuning, inference, and evaluation. It’s a journey from start to finish, and MosaicML LLM Foundry manages it all. It’s a tool built for real-world applications, where all these stages need to flow smoothly. Think of it like the conductor of an orchestra, ensuring each phase of the model’s lifecycle is perfectly in sync, making your AI journey seamless.
Command-Line Interface (CLI)
For some of us, dealing with complex coding is like running a marathon in a maze. But with the command-line interface of the LLM Foundry, you can leave all the headache-inducing, application-specific code behind. No need to be a programming wizard to start training your models. It’s simple, intuitive, and helps you focus on what really matters – your model. You’ll spend less time wrestling with lines of code and more time actually making your AI better.
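To make that concrete, here is a minimal sketch of what kicking off a small pretraining run can look like with the Foundry’s scripts. The repository layout, YAML name, and override keys below follow the llm-foundry examples but may differ between versions, so treat them as illustrative rather than exact:

# Illustrative only: grab the foundry and launch a small single-node pretraining run.
$ git clone https://github.com/mosaicml/llm-foundry.git
$ cd llm-foundry && pip install -e ".[gpu]"
$ cd scripts
# The composer launcher starts one training process per GPU on this node.
$ composer train/train.py train/yamls/pretrain/mpt-125m.yaml \
    variables.data_local=/shared/c4-tokenized \
    max_duration=4800ba \
    save_folder=/shared/checkpoints/mpt-125m

One YAML file holds the model, data, and optimizer settings, and anything in it can be overridden from the command line, which is what keeps the day-to-day workflow so light.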
Shared Disk Multinode Setup
When it comes to distributed workloads, many tools ask you to pull your hair out with complicated setups like SLURM. But here’s the twist: MosaicML LLM Foundry simplifies the process with a shared disk multinode configuration. This feature lets you scale your models across multiple nodes without all the additional setup overhead. No more tangled configurations or managing high-performance computing (HPC) systems—everything’s easier and more straightforward. It’s like setting up a mini cloud environment in minutes.
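For a sense of how simple the multinode story is, here is a sketch of a two-node launch over a shared disk, run from the Foundry’s scripts directory as above. The flag names follow Composer’s launcher, but double-check `composer --help` on your installed version, and note that the IP address and port here are placeholders:

# Node 0 (also acts as the rendezvous master); 8 GPUs per node, 16 ranks total.
$ composer --world_size 16 --node_rank 0 --nproc 8 \
    --master_addr 10.0.0.1 --master_port 29500 \
    train/train.py train/yamls/pretrain/mpt-125m.yaml
# Node 1 runs the identical command with --node_rank 1.

Because the data, code, and checkpoints live on the shared disk, every node sees the same files, so there is no scheduler, no job script, and nothing to synchronize by hand.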
Efficient GPU Utilization
If you’ve ever run into problems where your GPUs aren’t fully utilized, you know how frustrating that can be. Thankfully, MosaicML LLM Foundry includes a model FLOPS utilization (MFU) metric. This ensures that your GPUs are being used to their full potential. You can track exactly how efficiently your computational resources are performing, optimizing the model’s execution and fine-tuning its performance. It’s like having a dashboard for your GPU usage that helps you fine-tune the whole process.
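As a rough mental model (the Foundry’s exact accounting may differ slightly), MFU compares the floating-point work your run actually sustains against the hardware’s theoretical peak:

MFU ≈ (tokens per second × FLOPs per token) / (number of GPUs × peak FLOPs per GPU)
FLOPs per token ≈ 6 × N, where N is the parameter count (a common approximation that ignores attention terms)

An MFU of 0.4, for example, means your GPUs are sustaining roughly 40% of their theoretical peak, which is a respectable figure for large-scale training.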
Existing Benchmarks for Calibration
Want to set realistic expectations? You’ll love this: MosaicML LLM Foundry comes with existing benchmarks that give you an idea of how your models should perform, even before you start training. These benchmarks were tested on different hardware setups, so while they may not be an exact match for your system, they still serve as solid guidelines. They’re like a map, guiding you on your training and finetuning journey.
Support for Real-World Models
Let’s talk about MPT-7B and MPT-30B—these are the heavyweights in the world of large language models. And guess what? MosaicML LLM Foundry supports these models, making it easy to pretrain and finetune them across multiple nodes. Whether you’re diving into state-of-the-art AI applications or looking to scale up your own LLMs, this tool has you covered. It’s built for the big leagues, handling the heavy lifting with ease.
Weights & Biases Integration
If you’re serious about tracking your models’ performance, MosaicML LLM Foundry plays nicely with Weights & Biases, a tool widely used for visualizing machine learning experiments. This integration allows you to track hyperparameters and visualize metrics in real-time. It’s like having an AI coach by your side, making sure your model’s performance is always on point. With Weights & Biases, you get deeper insights into your model’s progress, ensuring you’re always moving forward and improving.
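In practice, turning this on is usually just a small addition to the training YAML plus a one-time login. A minimal sketch, assuming placeholder project and entity names (the exact schema depends on your Foundry version):

# Append a Weights & Biases logger to a training config (placeholder names).
$ cat >> finetune_mpt7b.yaml <<'EOF'
loggers:
  wandb:
    project: mpt-7b-dolly-sft   # placeholder project name
    entity: my-team             # placeholder W&B entity
EOF
$ wandb login   # authenticate once per machine before launching training

After that, every run streams its loss curves, throughput, and hyperparameters to the W&B dashboard.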
All in all, MosaicML LLM Foundry takes the complexity out of training and finetuning large language models. Its combination of easy-to-use features, efficient resource management, and integration with tools like Weights & Biases makes it a top choice for developers and data scientists looking to work on large-scale AI projects. Whether you’re fine-tuning an MPT-7B or MPT-30B model, this tool ensures that your models evolve quickly and efficiently, leaving the headaches behind.
MosaicML LLM Foundry: Research Paper (2024)
Results and Discussion
Let’s dive into the results from our pretraining and finetuning experiments. Imagine trying to train and finetune large language models (LLMs) on cloud servers spread across multiple machines—sounds intense, right? Well, that’s exactly what we did, and it worked like a charm. The results were solid, and the performance was exactly what we expected. What’s even better? You can easily run these kinds of experiments on similar infrastructure, showing how flexible and effective the setup really is.
Full Pretraining
We kicked things off with full pretraining—think of it as laying the groundwork for a giant AI brain. The details of these experiments are captured in Table 1, where we break down the results. Rather than using complex graphs, we decided to keep it simple with a detailed table breakdown. Here’s how the data looked for different model sizes, ranging from the MPT-125M to MPT-1B models:
Model | Training Data | Max Duration (Batches) | Ratio Params / 125M Params | No. Nodes | Actual Runtime (Wallclock) | Throughput (Tokens/s) | Checkpoint Size | Memory per GPU (fraction of 82 GB) | Evaluation Accuracy |
---|---|---|---|---|---|---|---|---|---|
MPT-125M | C4 | 4800 | 1 | 8 | 9m 7.873s | 6,589,902 | 1.5G | ~0.1 | 0.53 |
MPT-350M | C4 | 13400 | 2.8 | 8 | 38m 10.823s | 3,351,644 | 4.0G | ~0.145 | 0.56 |
MPT-760M | C4 | 29000 | 6.08 | 8 | 103m 23.136s | 2,737,276 | 8.6G | ~0.27 | 0.56 |
MPT-1B | C4 | 24800 | 8 | 8 | 208m 24.319s | 2,368,224 | 15G | ~0.33 | 0.58 |
Now, here’s the story:
Inference Verification: For every model, we checked whether the trained models were performing inference correctly. Spoiler alert: they did. And to make sure they were working properly, we ran evaluations using test data that the models hadn’t seen before. Everything passed with flying colors.
Model Conversion: After each round of training, we converted the models to the Hugging Face format—this was a big win. This format is more efficient for inference, shrinking the model’s size without losing any performance. It’s like trimming off the excess weight but keeping all the muscle strength.
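For reference, the Foundry ships a utility for exactly this step. Here is a sketch of how it is typically invoked, with illustrative paths and a checkpoint name that will differ on your system (the flags follow the llm-foundry examples, but verify them against your version):

# Convert a Composer checkpoint into a Hugging Face-format folder (illustrative paths).
$ python inference/convert_composer_to_hf.py \
    --composer_path /shared/checkpoints/mpt-125m/latest-rank0.pt \
    --hf_output_path /shared/hf-models/mpt-125m \
    --output_precision bf16

The size reduction comes largely from dropping optimizer state and saving the weights in a lower precision, which is why inference quality is unaffected.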
Impact of Larger Models: Here’s where things get interesting. As the models got bigger—moving from MPT-125M to MPT-350M and up to MPT-1B—we saw the runtime increase. The bigger the model, the more time it took. Extrapolating that trend, training the much larger MPT-70B on 8 nodes would take roughly two months, but scaling up to 64 nodes would cut that down to about one week. Here’s the key takeaway: that’s just how it works. Bigger models need more computational power. So, more time spent doesn’t mean the system’s inefficient—it just needs more horsepower.
GPU Efficiency: The magic behind all of this is in the GPU utilization. We used a metric called model FLOPS utilization (MFU) to measure how effectively the GPUs were being used. The results? The system was efficient! The increased runtimes for larger models? Completely due to the computational demands of those models, not because the system was slacking off. This shows that the infrastructure is well-optimized to handle big, heavy models without wasting any resources.
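To make that concrete, here is a back-of-the-envelope MFU estimate for the MPT-1B row above. The parameter count (about 1.3 billion), the 6N FLOPs-per-token approximation, and the assumed peak of roughly 1×10^15 BF16 FLOP/s per H100 are our own assumptions for illustration, not figures reported by the runs:

achieved FLOP/s ≈ 2,368,224 tokens/s × 6 × 1.3×10^9 ≈ 1.8×10^16
peak FLOP/s     ≈ 64 GPUs × 1×10^15 ≈ 6.4×10^16
MFU             ≈ 1.8×10^16 / 6.4×10^16 ≈ 0.29

The same arithmetic can be repeated for any row in the table to compare how efficiently each model size uses the cluster.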
Finetuning
Now, after getting our models all trained up, it was time to finetune them. Think of this like taking a car that runs well and fine-tuning it to get that extra mile per gallon or make it purr like a kitten. Here are the results for the finetuning process, where we tested the models on different configurations of nodes:
Model | Finetuning Data | Max Training Duration (Epochs) | No. Nodes | Actual Runtime (Wallclock) | Speedup vs. One Node | Throughput (Tokens/s) | Memory per GPU (GB) | Evaluation Accuracy |
---|---|---|---|---|---|---|---|---|
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 78m 28.121s | – | 7,124 | 24.9 | 0.85 |
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 2 | 29m 24.485s | 2.67x | 13,844 | 19.9 | 0.84 |
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 4 | 18m 21.026s | 4.28x | 28,959 | 17.5 | 0.84 |
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 8 | 13m 35.352s | 5.77x | 50,708 | 9.37 | 0.84 |
MPT-30B-Instruct | kowndinya23/instruct-v3 | 2 | 8 | 125m 12.579s | 3.76x | 52,022 | ~36 | 0.85 |
Let’s walk through these observations:
Speedup with Multiple Nodes: Here’s where the fun happens. When we scaled from 1 node to 8, we saw some serious performance improvements. The speedup was 2.67x with 2 nodes, 4.28x with 4, and 5.77x with 8 nodes. This shows that the system really comes alive when you throw more nodes into the mix. The gains were most noticeable when going from 1 to 4 nodes, but even at 8 nodes, there was a clear increase in speed.
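As a quick sanity check, the speedup column is simply the single-node wallclock divided by the multi-node wallclock, and dividing again by the node count gives a scaling efficiency:

speedup(8 nodes) = 78m 28s / 13m 35s ≈ 5.77x
scaling efficiency = 5.77 / 8 ≈ 72%

So even at 8 nodes, each node still contributes roughly three-quarters of its ideal share of the work, which is solid for a communication-heavy finetuning job.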
Checkpoint Overhead: Of course, when you’re finetuning a model, you’re saving checkpoints after each epoch. It’s like making sure you save your game progress so you don’t have to start from scratch if something goes wrong. This process adds a bit of overhead, but it’s totally normal. It ensures that your work doesn’t get lost, even if something unexpected happens along the way.
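For reference, this behavior is controlled by a couple of checkpointing fields in the training YAML. A minimal sketch with placeholder paths (the field names follow Composer’s conventions, but confirm them against your Foundry version):

# Illustrative checkpointing settings appended to a finetuning config.
$ cat >> finetune_mpt7b.yaml <<'EOF'
save_folder: /shared/checkpoints/mpt-7b-dolly-sft   # placeholder shared-disk path
save_interval: 1ep                                  # write a checkpoint after every epoch
save_num_checkpoints_to_keep: 1                     # cap disk usage on the shared filesystem
EOF

Writing to the shared filesystem is what makes this overhead visible in the wallclock numbers, but it also means any node can resume the run from the latest checkpoint.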
Inference and Accuracy Validation: The last thing we checked was whether the finetuned models were still performing well. Guess what? They were! The accuracy of the finetuned models, when tested on unseen data, was even better than the pretrained smaller models. That’s the power of finetuning—it makes the model more precise, more tailored, and ultimately more reliable.
In short, these results show that finetuning large language models on multinode infrastructure isn’t just doable—it’s super efficient. By scaling up with multiple nodes, you can speed up your training and finetuning without sacrificing performance. That means you can get better results faster, and who doesn’t want that?
Large Language Model Scaling and Optimization (2023)
Full Pretraining
Alright, let’s break down the results from our full pretraining runs. Now, I know what you might be thinking—“Graphs? Tables?” But we’ve skipped the graphs this time because the data is pretty manageable and varied enough to be better displayed in a detailed table. It’s the same data, just more digestible in its own way. So, here’s a closer look at how our models performed:
Model | Training Data | Max Duration (Batches) | Ratio Params / 125M Params | No. Nodes | Actual Runtime (Wallclock) | Throughput (Tokens/s) | Checkpoint Size | Memory per GPU (fraction of 82 GB) | Evaluation Accuracy |
---|---|---|---|---|---|---|---|---|---|
MPT-125M | C4 | 4800 | 1 | 8 | 9m 7.873s | 6,589,902 | 1.5G | ~0.1 | 0.53 |
MPT-350M | C4 | 13400 | 2.8 | 8 | 38m 10.823s | 3,351,644 | 4.0G | ~0.145 | 0.56 |
MPT-760M | C4 | 29000 | 6.08 | 8 | 103m 23.136s | 2,737,276 | 8.6G | ~0.27 | 0.56 |
MPT-1B | C4 | 24800 | 8 | 8 | 208m 24.319s | 2,368,224 | 15G | ~0.33 | 0.58 |
So, what did we learn from these pretraining experiments? Let’s go through the key points:
Inference Verification: You’ve probably heard the saying, “Trust but verify,” right? Well, we took that to heart. For each model, we ran an inference test on unseen test data to make sure the model wasn’t overfitting to the training data. The result? Everything passed with flying colors, meaning the models were ready for prime time, not just stuck in the training phase.
Model Conversion for Efficiency: After all that training, we didn’t just leave the models as is. We converted them into the Hugging Face format—and trust me, this made a big difference. Not only did this reduce the model size (think of it as a clean-up job that makes the model easier to work with), but it didn’t sacrifice any performance. You get the same power, but in a more lightweight package.
Performance of Larger Models: Here’s where the plot thickens. As we scaled up to larger models in the MPT series, like MPT-3B, MPT-7B, MPT-13B, and MPT-30B, we saw the expected trend: more time was required. But that’s not a flaw in the system—it’s just a natural consequence of bigger models needing more computational muscle. The star of the show, the MPT-70B, would take about two months on 8 nodes to finish pretraining. But wait, if we scale up to 64 nodes, we’re talking just one week. That’s some serious horsepower!
GPU Efficiency: Now, this one’s cool. To really get a feel for how efficiently we were using our resources, we looked at something called model FLOPS utilization (MFU). Think of it as the efficiency gauge for GPUs. The results showed that the GPUs were working exactly as they should be—efficiently. More runtime for larger models? Yeah, that’s just because they needed more time to crunch the numbers. It wasn’t because the system was slacking. It was all about the computational demands of these heavyweight models.
Wrapping it up: The pretraining process for large language models, including the MPT series, was a big success. It didn’t matter whether we were working with smaller or larger models; the system held up. The real takeaway here is the infrastructure’s robustness. It can scale effectively, handling both smaller models like MPT-125M and the massive MPT-70B without skipping a beat. The system’s efficiency and performance were maintained throughout, which is pretty much the sweet spot when you’re working with high-performance computing (HPC) environments.
So, in the end, whether you’re training a small model or going full beast mode with something like MPT-30B, this system can handle it. And that, my friend, is the beauty of it all.
For more on model efficiency, refer to the Nature article on Model Efficiency (2020).
Finetuning
Just like in the pretraining phase, we ran some extensive experiments on finetuning, and guess what? The results were just as exciting! We tested the process by training and evaluating large language models (LLMs) across multiple nodes, which let us dig into how efficient and scalable the finetuning process really is. So, let’s dive into the story of how it all unfolded.
In the beginning, the process might have seemed like a typical setup: data flowing through nodes, training models, and testing results. But when you start scaling things up and using multiple nodes, that’s when the magic happens.
So, how did these models perform as they scaled from one node to several? Well, here are the key results, laid out like a story of progress:
Model | Finetuning Data | Max Training Duration (Epochs) | Evaluation Interval (Epochs) | No. Nodes | Actual Runtime (Wallclock) | Speedup vs One Node | Throughput (Tokens/s) | Memory per GPU (GB, out of 82) | Inference & Evaluation OK? | Evaluation Accuracy |
---|---|---|---|---|---|---|---|---|---|---|
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 1 | 78m 28.121s | – | 7,124 | 24.9 | Y | 0.85 |
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 2 | 29m 24.485s | 2.67x | 13,844 | 19.9 | Y | 0.84 |
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 4 | 18m 21.026s | 4.28x | 28,959 | 17.5 | Y | 0.84 |
MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 8 | 13m 35.352s | 5.77x | 50,708 | 9.37 | Y | 0.84 |
MPT-30B-Instruct | kowndinya23/instruct-v3 | 2 | 1 | 8 | 125m 12.579s | 3.76x | 52,022 | ~36 | Y | 0.85 |
Key Moments from the Finetuning Journey
Speedup with Multiple Nodes: Here’s where things get interesting. The whole goal of adding more nodes is to make things faster, right? Well, when we scaled up from one node to two, we saw a speedup of 2.67x. Not bad! But it didn’t stop there. The magic happened as we kept scaling: at four nodes, the speedup hit 4.28x, and by eight nodes, we reached 5.77x. It’s like the more nodes we threw at it, the faster everything ran. But here’s the catch—there’s a point where adding more nodes doesn’t give you the same big boost. You start to see diminishing returns as the system gets closer to its peak efficiency. Still, you’re winning at this point, right?
Overhead for Model Checkpoints: If you’ve been in the game long enough, you know about checkpoints. After each epoch, the model saves its progress, making sure no hard work is lost. That’s standard procedure. It adds a little bit of overhead—think of it like stopping for a quick breather in a race. You might lose a few seconds, but it’s totally necessary to keep things on track. The good news? This slight impact on runtime didn’t hurt the overall finetuning performance.
Inference and Accuracy Validation: Now, let’s talk about the real test—the model’s inference abilities. You don’t just want your model to memorize training data; you want it to generalize to new, unseen data. That’s exactly what we did. We validated the inference results for each of the finetuned models and found that they didn’t just work—they outperformed the pretrained models. As you’d expect, the finetuning process helped these models adapt and perform better when given new data to analyze. With the finetuned models, we saw an improvement in accuracy, especially in the larger models.
Improved Accuracy for Larger Models: This was a given. When you finetune larger models like the MPT-30B, they tend to do even better than smaller models. It’s like a student who’s been through a bunch of training—they just know their stuff better. The larger models reached an accuracy of 0.85, which is solid proof that finetuning made them more efficient at handling the task-specific data.
The Final Takeaway
The finetuning results are proof that scaling your infrastructure, particularly through multinode configurations, can seriously speed up the process of training large language models (LLMs). When you add more nodes, you get faster results—especially when you’re working with larger models like MPT-7B or MPT-30B. And while the speedup per added node tapers off as communication overhead starts to catch up, the overall improvement in performance is undeniable. Plus, it’s clear that finetuning is one of those key processes that helps turn a decent model into a great one.
So, if you’re diving into AI and looking to optimize your model training, remember this: multinode configurations can be your best friend. You get faster, more efficient training, and with models like MPT-7B and MPT-30B, the results speak for themselves.
For more details, check out the full research paper.
Appendices
Datasets for Pretraining and Finetuning
Pretraining Dataset
When it comes to large language models (LLMs), pretraining them from scratch is no small task. To get the models to understand language, we need a huge amount of data. For this, we turned to the C4 dataset provided by MosaicML’s LLM Foundry. Now, C4 is not just any dataset; it’s a cleaned version of the Common Crawl web corpus, which means it’s packed with high-quality, diverse text—perfect for training a solid LLM. We’re talking over 10 billion rows of data here, so you can imagine how much text that is. And, the best part? MosaicML’s LLM Foundry handles the preprocessing, making everything more efficient and saving us a ton of time.
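For anyone who wants to reproduce the data step, the Foundry’s data-prep script streams C4 from Hugging Face, tokenizes it, and writes it out in its StreamingDataset format. A sketch, with flags that follow the llm-foundry examples but should be checked against your version, and an output path that is just a placeholder:

# Illustrative: tokenize and shard the C4 'en' subset into a shared-disk folder.
$ cd llm-foundry/scripts
$ python data_prep/convert_dataset_hf.py \
    --dataset allenai/c4 --data_subset en \
    --out_root /shared/c4-tokenized \
    --splits train_small val_small \
    --concat_tokens 2048 \
    --tokenizer EleutherAI/gpt-neox-20b \
    --eos_text '<|endoftext|>'

The resulting folder is what the training YAML’s local data path points at.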
Finetuning Dataset
Once the LLMs have been pretrained, it’s time to finetune them. But here’s the thing—finetuning datasets are different from the pretraining ones. They’re much more specialized and smaller because they focus on tweaking the model’s behavior for specific tasks. For this, we used two main datasets supported by MosaicML. But of course, we made a few tweaks to adapt for the MPT-30B model. First, we used the Dolly HH-RLHF dataset to finetune the MPT-7B model. It’s a mix of Databricks’ dolly-15k dataset and a filtered subset of Anthropic’s HH-RLHF, which makes it even better. The cool thing here is that this dataset includes a test split (missing in the original dolly set), with 200 samples from Dolly and 4,929 samples from HH-RLHF. The training set itself is huge, with 59,310 samples! For the MPT-30B model, we chose the Instruct-v3 dataset, which is all about instruction-based finetuning. It helps the model get better at understanding and following instructions. A quick side note: when we first checked the version on MosaicML, we found a couple of bugs—nothing major, just some incorrect columns. So, we decided to go with a corrected copy from kowndinya23/instruct-v3, just to make sure everything was in tip-top shape.
Network Speed: NCCL Tests
When working with multinode cloud infrastructure, it’s critical to ensure that the network can handle the massive amount of data being exchanged. To verify this, we ran NCCL tests on our machines. These tests were crucial for confirming that our network could handle the heavy lifting required for large-scale model training. Now, here’s a fun fact: while Caasify’s infrastructure teams have conducted far more extensive testing, we still decided to include some of the results here so customers could get a real feel for typical network speeds in a multinode setup. After all, who doesn’t love a good performance benchmark?
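If you want to run the same check on your own cluster, the benchmark binary comes from NVIDIA’s nccl-tests repository and needs an MPI-enabled build. A sketch, where the MPI and CUDA paths are assumptions that depend on your install:

# Build the NCCL all_reduce benchmark with MPI support (illustrative paths).
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda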
We used this command:
$ mpirun -H hostfile -np 128 -N 8 --allow-run-as-root \
-x NCCL_IB_PCI_RELAXED_ORDERING=1 \
-x NCCL_IB_CUDA_SUPPORT=1 \
-x NCCL_IB_HCA=^mlx5_1,mlx5_2,mlx5_7,mlx5_8 \
-x NCCL_CROSS_NIC=0 \
-x NCCL_IB_GID_INDEX=1 \
$(pwd)/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
And here’s what we found for 16 nodes:
Size (B) | Count (elements) | Type | Redop | Root | Algbw (GB/s) | Time, out-of-place (us) | Time, in-place (us) | Busbw (GB/s) | #Wrong |
---|---|---|---|---|---|---|---|---|---|
8 | 2 | float | sum | -1 | 0.00 | 63.25 | 65.28 | 0.00 | 0 |
16 | 4 | float | sum | -1 | 0.00 | 63.10 | 62.37 | 0.00 | 0 |
This table shows the network bandwidth for the 16-node environment, confirming that the network is more than capable of handling the heavy demands of model pretraining and finetuning.
Hardware and Software
The backbone of this project was built on Caasify’s bare-metal cloud servers. The experiments used 8 nodes, each equipped with eight of the mighty H100 GPUs, all connected via NVLink. That’s 64 GPUs in total, working together to bring models to life. The nodes were linked through a high-speed RDMA over Converged Ethernet (RoCE) network, which allowed data to flow smoothly and quickly between them. To handle all the massive datasets and model checkpoints, the infrastructure also had a shared storage filesystem with multiple terabytes of storage—because, let’s face it, LLMs aren’t small.
The operating system running on these machines was Ubuntu, with MosaicML running inside Docker containers. This setup ensures that the environment is streamlined and easily reproducible, making the whole process much more reliable. Now, here’s where things get interesting: since MosaicML utilizes shared drives and doesn’t need a workload manager like SLURM, it was necessary to replicate the Docker containers across each node. Once that was done, all we had to do was execute a single command on each node to kick off the finetuning process. This made for a smooth and efficient workflow, with 8 out of 16 nodes being used for the experiment.
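To illustrate what that per-node step can look like, here is a sketch of launching the containerized finetune on one node with the shared filesystem mounted in. The image tag, mount points, YAML name, and master address are all placeholders, and the launcher flags should be checked against your Composer version:

# Illustrative per-node launch; the same command runs on every node, with only --node_rank changing.
$ docker run --rm --gpus all --ipc=host --network=host \
    -v /mnt/shared:/workspace \
    mosaicml/llm-foundry:latest \
    composer --world_size 64 --node_rank 0 --nproc 8 \
      --master_addr 10.0.0.1 --master_port 29500 \
      /workspace/llm-foundry/scripts/train/train.py \
      /workspace/configs/finetune_mpt7b.yaml

Because the container image and the shared mount are identical everywhere, node 3 looks exactly like node 0, which is what makes the single-command-per-node workflow possible.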
MosaicML LLM Foundry
Now, let’s talk about the secret weapon behind the scenes: MosaicML LLM Foundry. This open-source tool is designed to streamline the entire process of pretraining and finetuning models in a multinode environment. There are a few standout features that make it super useful:
- End-to-End Workflow Support: Whether you’re prepping data, training your model, finetuning it, running inference, or evaluating it, MosaicML LLM Foundry has you covered. It’s all about making sure you can handle every part of the machine learning lifecycle without missing a beat.
- Command-Line Interface: One of the best parts of this tool is that you don’t need to worry about writing tons of application-specific code. Just fire up the command-line interface (CLI), and you’re good to go—whether it’s for pretraining or finetuning. This simplicity helps developers focus more on their models and less on the code that powers them.
- Shared Disk Multinode Setup: Forget about the complicated setups with schedulers like SLURM. MosaicML LLM Foundry simplifies things with a shared disk setup, meaning you can easily scale across multiple nodes without dealing with all that extra overhead.
- GPU Compute Efficiency: With the model FLOPS utilization (MFU) metric, you can keep track of how efficiently your GPUs are being used. This helps ensure that everything is running at full capacity without wasting resources.
- Support for Real-World Models: MosaicML LLM Foundry is built to handle large language models, particularly the MPT series (like MPT-7B and MPT-30B). It’s ready to scale as needed, whether you’re working on small models or the big ones.
- Integration with Weights & Biases: We all know how important it is to track your progress. That’s why MosaicML LLM Foundry integrates with Weights & Biases, allowing you to visualize your experiments, monitor metrics, and fine-tune your models based on real-time feedback.
With all these features, MosaicML LLM Foundry is a powerful tool for anyone looking to work with large language models like MPT-7B and MPT-30B, making it easier and more efficient to manage the training and finetuning process.
Datasets for Pretraining and Finetuning
Pretraining Dataset
When you’re training large language models (LLMs) from scratch, you need a ton of data to get things moving. It’s like trying to teach someone a language from scratch—they need lots of material to start building a solid base. That’s where the C4 dataset, supported by MosaicML’s LLM Foundry, comes into play. Think of it as a treasure chest filled with text data—over 10 billion rows to be exact. But it’s not just any data. C4 is a cleaned-up version of the Common Crawl web corpus, so it’s packed with high-quality, diverse text from all over the internet.
To get things started, we downloaded and preprocessed the data using LLM Foundry’s end-to-end workflow. This workflow is a real time-saver. It takes care of all the heavy lifting of preparing large-scale data, so we don’t have to worry about that part. By simplifying the process, MosaicML ensures the data is ready to train large models efficiently, which helps these models understand and generate language better.
Finetuning Dataset
Now, finetuning datasets are a bit different. They’re more specialized, smaller, and focus on helping pre-trained models improve and specialize for specific tasks. Think of it like giving someone extra tutoring in a specific subject after they’ve already mastered the basics.
For our finetuning experiments, we used two main datasets, both provided by MosaicML. These datasets were slightly adjusted for specific models, including the MPT-30B model.
First up, we used the Dolly HH-RLHF dataset to finetune the MPT-7B model. This dataset combines Databricks’ dolly-15k dataset with a filtered subset of Anthropic’s HH-RLHF dataset. The cool part here is that this updated version includes a test split that was missing in the original dolly set. So, we have 200 samples randomly picked from Dolly, along with an additional 4,929 samples from HH-RLHF that went through the filtering process. The final training set contains 59,310 samples, split between Dolly (14,814 samples) and HH-RLHF (44,496 samples).
For the MPT-30B model, we turned to the Instruct-v3 dataset. This one’s designed specifically for instruction-based finetuning and consists of prompts and responses. These prompts help the model better understand how to follow instructions, which is key for improving its task-oriented skills. But here’s the twist: the original dataset hosted on MosaicML had a couple of issues—mainly incorrect column arrangements that would have made working with it tricky. So, instead of fixing it ourselves, we used a corrected version hosted on kowndinya23/instruct-v3. This version was ready to go, and since the end-to-end process automatically connects to remote datasets like those on Hugging Face, using the corrected version was a lot more efficient.
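For context, pointing a finetuning run at one of these Hugging Face datasets is typically just a dataloader block in the training YAML. A minimal sketch, with field names that follow the Foundry’s finetuning examples but should be verified against your version:

# Illustrative train_loader section referencing the corrected instruct dataset.
$ cat >> finetune_mpt30b.yaml <<'EOF'
train_loader:
  name: finetuning
  dataset:
    hf_name: kowndinya23/instruct-v3
    split: train
    max_seq_len: 8192        # placeholder; match the model's context length
    shuffle: true
  drop_last: true
  num_workers: 8
EOF

Swapping in mosaicml/dolly_hhrlhf as the hf_name is all it takes to target the MPT-7B runs instead.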
These datasets are the foundation for finetuning pre-trained models, helping them improve at tasks they weren’t originally trained for. By adjusting the datasets for each model, we make sure the finetuning process not only boosts performance but also makes the models more versatile and accurate when used in real-world scenarios.
Network Speed: NCCL Tests
Imagine you’re managing a huge AI project, working across multiple servers. The goal? To make sure everything runs smoothly, especially when you’re dealing with big tasks like model pretraining and finetuning. That’s exactly what we tested using NCCL (NVIDIA Collective Communications Library) tests, which help measure how well the network handles communication between multiple nodes. We weren’t just checking speed; we wanted to make sure the system could handle all the heavy work.
Here’s the thing: while Caasify’s infrastructure team has done a lot of testing already, we wanted to check how the system performed under normal working conditions. So, we ran these tests to give you a real look at typical network performance in multinode setups. Spoiler alert: everything worked just fine!
To get the results, we used the mpirun command (don’t worry, you don’t need to be a command expert to understand this). Here’s how it looked:
mpirun -H hostfile -np 128 -N 8 --allow-run-as-root \
-x NCCL_IB_PCI_RELAXED_ORDERING=1 \
-x NCCL_IB_CUDA_SUPPORT=1 \
-x NCCL_IB_HCA=^mlx5_1,mlx5_2,mlx5_7,mlx5_8 \
-x NCCL_CROSS_NIC=0 \
-x NCCL_IB_GID_INDEX=1 \
$(pwd)/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
And what did we find? The 16 nodes we tested gave us some great data on performance. The network bandwidth could easily handle the demands of pretraining and finetuning models—exactly what you need when working on large-scale AI tasks.
Here’s a peek at some of the results:
Size (B) | Count (elements) | Time, out-of-place (us) | Algbw (GB/s) | Time, in-place (us) |
---|---|---|---|---|
8 | 2 | 63.25 | 0.00 | 63.25 |
16 | 4 | 63.10 | 0.00 | 62.37 |
128 | 32 | 64.08 | 0.00 | 63.23 |
1024 | 256 | 71.55 | 0.01 | 70.99 |
As the message sizes grew, the measured performance stayed solid, showing the network could handle the growing load.
But what does this all mean? Simply put, the network is solid and can easily handle the demands of large model training.
Hardware and Software
Now, where did we run these tests? On Caasify’s bare-metal cloud servers, of course. These servers had 8 nodes, each equipped with H100 GPUs connected via NVLink, giving us a total of 64 GPUs to work with. It’s like having a super-powered computer ready to handle all your heavy lifting.
The servers were connected via RDMA over Converged Ethernet (RoCE), which basically means we’ve got high-speed, low-latency communication between all the nodes. This ensures smooth data transfer, so the model training doesn’t hit any slowdowns. To store all the massive datasets and model checkpoints, the infrastructure also included a shared storage system with multiple terabytes of space—perfect for AI tasks.
The machines were running Ubuntu, with MosaicML running inside Docker containers. This setup made sure that the environment was consistent and easily repeatable, which is important for ensuring the results can be trusted. Since MosaicML uses shared drives, we didn’t need complex tools like SLURM for managing workloads. That saved us a lot of time. However, because of the shared drives, we had to duplicate the Docker containers across all the nodes. Once that was done, we could trigger the finetuning process with a single command on each node, using 8 out of the 16 nodes for the experiment.
MosaicML LLM Foundry
So, what powers all of this? It’s the MosaicML LLM Foundry, an open-source tool that makes it super easy to pretrain and finetune large language models in a multinode setup. Here’s why it’s so great:
- End-to-End Workflow Support: With MosaicML LLM Foundry, you get a full environment for the entire machine learning process—from prepping data to training the model, finetuning it, running tests, and evaluating it. It’s a one-stop solution that saves you from jumping between different tools.
- Command-Line Interface: Want to start pretraining or finetuning? You can do it with just a few commands, without needing to worry about complex, custom code. This lets you focus on optimizing your model, not the setup.
- Shared Disk Multinode Setup: Unlike many tools that need complicated schedulers like SLURM, MosaicML keeps things simple with a shared disk setup. This makes scaling across nodes easy, letting you deploy and train models faster without all the extra overhead.
- GPU Compute Efficiency: With the model FLOPS utilization (MFU) metric, you can check how well your GPUs are being used. This ensures that you’re getting the most out of your hardware, which is crucial for both training and finetuning.
- Existing Benchmarks: MosaicML provides benchmarks that help you see how well your setup is doing. While the benchmarks might not be a perfect match for every machine, they still give you a good idea of how things should perform.
- Support for Large Models: The tool is made for handling large models like the MPT series (MPT-7B, MPT-30B, and others). These models are used in everything from research to actual production.
- Integration with Weights & Biases: If you like to track your progress, MosaicML LLM Foundry integrates with Weights & Biases, so you can visualize your experiments, monitor metrics, and make adjustments as you go.
In short, MosaicML LLM Foundry is a game-changer for anyone working with AI and large language models. It makes the whole process easier, faster, and more efficient.
Hardware and Software
Imagine you’re working on a huge AI project where you need to push the limits of performance. You’ve got a big job ahead—training and fine-tuning large models on Caasify’s bare-metal cloud servers, powered by H100 GPUs. These servers aren’t your average setup; we’ve got 8 H100x8 nodes linked through RDMA over Converged Ethernet (RoCE), which basically means we have fast, low-latency communication between all the nodes. This ensures everything runs smoothly without any communication slowdowns. With a shared Virtual Private Cloud (VPC) and several terabytes of storage, we have all the space we need for large datasets and model checkpoints—the essential data for training and fine-tuning AI. Each node has 8 H100 GPUs, connected through NVLink, giving us a total of 64 GPUs across the cluster. It’s like having a powerful computing engine at your fingertips, all connected for maximum performance.
The machines run Ubuntu, and we use MosaicML inside Docker containers. Docker isn’t just for keeping things isolated; it helps maintain a consistent environment, making sure we can reproduce results without the usual infrastructure headaches. Since MosaicML uses shared drives and doesn’t need complex tools like SLURM for workload management, we duplicated Docker containers across each node. This setup lets us trigger the fine-tuning process with a single command on each node, optimizing how the infrastructure works and giving us control over the resources. By using 8 out of 16 nodes, we could simplify the whole process.
MosaicML LLM Foundry
Now, let’s talk about the powerhouse behind all this: MosaicML LLM Foundry. If you’re working with large language models (LLMs), this tool is like a Swiss Army knife for AI development. Whether you’re pretraining or fine-tuning models, MosaicML LLM Foundry makes everything easier and more efficient, especially when you’re working in a multinode setup. Here’s what makes it stand out. The tool doesn’t just do one thing—it covers everything. It offers full workflow support, which means it handles all stages of the model lifecycle: data preparation, training, fine-tuning, inference, and evaluation. Think of it as the backbone of your AI project, making sure everything works together, like a well-rehearsed team.
And here’s the cool part: it comes with a command-line interface (CLI). This is a huge win for developers because it means you don’t need to write a bunch of complex, application-specific code. No more struggling with complicated scripts or configurations. Just run a few commands to start pretraining or fine-tuning, and the system does the heavy lifting. It’s all about simplicity and efficiency.
But that’s not all. MosaicML LLM Foundry also comes with a shared disk multinode setup, meaning you don’t have to worry about complex schedulers like SLURM. This makes it super easy to scale across nodes quickly and without all the added hassle. It’s about getting results faster, without the extra work.
When it comes to GPU efficiency, the tool tracks something called the model FLOPS utilization (MFU). This is like a real-time performance tracker for your GPUs, making sure they’re being fully used during training and fine-tuning. No wasted resources! MosaicML also gives you existing benchmarks to help you figure out how well your setup is performing. While these benchmarks might not match perfectly across different machines, they still give you a good idea of how your models should be performing and how efficiently your resources are being used. They act as a great starting point to set your expectations.
And the best part? MosaicML LLM Foundry isn’t just for any model—it’s built to handle large models, like those in the MPT series (MPT-7B, MPT-30B, and more). These are real-world models that need a solid environment to run in, and this tool can handle that no problem.
But wait, there’s more. MosaicML LLM Foundry also integrates with Weights & Biases, a powerful platform for tracking and visualizing machine learning experiments. This lets you monitor how your model is doing in real-time, track important metrics, and make adjustments based on the data. It’s like having a dashboard for your AI project.
So, when you put it all together, MosaicML LLM Foundry is your go-to tool for tackling large-scale AI projects. With its easy-to-use features, powerful tools, and simple scaling, it’s perfect for anyone looking to scale up their AI efforts. Whether you’re working on pretraining, fine-tuning, or anything in between, this tool has got your back.
MosaicML LLM Foundry
Imagine you’ve been tasked with training and fine-tuning some of the most advanced large language models (LLMs) out there, like MPT-7B or MPT-30B, and you need to do this across multiple nodes at scale. That’s where MosaicML LLM Foundry steps in—a tool that acts as your reliable guide through this high-tech journey.
Now, let’s picture this: you’ve got a machine learning pipeline that can get pretty complicated, right? It starts with raw data and ends with a fully trained and tested model, ready to be deployed. Normally, you’d need a bunch of different tools to handle each step of the process. But with MosaicML LLM Foundry, everything you need is neatly packed into one tool.
End-to-End Workflow Support is one of the features that really stands out. It takes care of everything, from preparing your data to training, fine-tuning, and evaluating your model. This means you don’t have to juggle different systems for each step. It’s all integrated, just like it would be in a real-world scenario. It’s a bit like a chef having all their ingredients and tools in one spot—super efficient and streamlined.
Here’s the cool part: you don’t have to be a coding expert to make this work. The tool runs through a command-line interface (CLI), so you don’t have to spend time writing complicated code. You just run a few simple commands, and your models start training or fine-tuning. This makes it easy to automate tasks, so you can spend more time optimizing your models rather than dealing with complex configurations. It’s straightforward, and it keeps things moving smoothly.
What makes it even better? The shared disk multinode setup in MosaicML LLM Foundry lets you scale up without the need for complex resource management tools like SLURM. In most high-performance computing (HPC) environments, you’d have to set up something like SLURM to manage your resources. But with MosaicML LLM Foundry, you skip that step, making things easier and reducing the extra workload. Scaling becomes simple, and you can keep your focus on what matters—your model.
Let’s talk about GPU efficiency. When training large models, you want to make sure you’re using every bit of GPU power, right? MosaicML LLM Foundry includes a handy tool that tracks GPU performance using something called model FLOPS utilization (MFU). This tool makes sure your GPUs are being fully used during the training process, so no power goes to waste.
But here’s where it gets even better: MosaicML LLM Foundry also provides benchmarks. These benchmarks help you set realistic expectations and ensure that your model performs well across different setups. While the benchmarks might not match perfectly across every machine, they give you a good foundation for understanding how things are working. It’s like having a race benchmark—so you know how to fine-tune your approach for the best results.
Speaking of fine-tuning for the best results, MosaicML LLM Foundry is designed to handle large language models—the real heavy hitters like MPT-7B and MPT-30B. Whether you’re diving into research or handling production-level AI tasks, this tool can manage models that need a lot of computational power, making it the perfect choice for cutting-edge AI development.
And for those of you who love tracking progress (who doesn’t, right?), MosaicML LLM Foundry integrates smoothly with Weights & Biases, a platform that lets you track machine learning experiments in real-time. This integration helps you keep an eye on performance metrics, track progress, and make smart decisions based on data to continuously improve your models. It’s like having a performance dashboard for your car—except in this case, your “car” is a supercharged AI model, and the dashboard shows you live stats to fine-tune everything.
When you put all of these features together, it’s easy to see why MosaicML LLM Foundry is such a powerful tool. It handles computational resources efficiently, supports every stage of model development, and makes the whole process smoother and easier for users. Whether you’re fine-tuning MPT-7B or tackling complex models like MPT-30B, this tool makes sure you’re always ready for success.
Conclusion
In conclusion, DigitalOcean’s infrastructure, powered by H100 GPUs and integrated with MosaicML LLM Foundry, proves to be a robust solution for scaling AI workloads. The validation of its ability to efficiently train and fine-tune large language models (LLMs) like MPT-7B and MPT-30B across multi-node clusters demonstrates how easily users can leverage the platform for large-scale AI tasks without the complexity of traditional infrastructure. With consistent performance improvements and efficient resource utilization, DigitalOcean provides a powerful and scalable platform for businesses and researchers working with cutting-edge AI models. Looking ahead, as AI demands continue to grow, platforms like DigitalOcean will play a crucial role in enabling the rapid training and deployment of large models, ensuring that scaling AI processes remains accessible and efficient.