Scale LLM Training with MosaicML: Boost AI Workloads on Multi-Node Clusters

MosaicML LLM Foundry running large language models on H100 multi-node clusters for pretraining and finetuning AI workloads.


Introduction

Scaling AI workloads efficiently requires leveraging powerful tools like MosaicML and multi-node clusters. In the world of large language models (LLMs), pretraining and finetuning are essential steps for achieving high performance, and choosing the right infrastructure is crucial. With MosaicML’s LLM Foundry, combined with DigitalOcean’s bare-metal H100 multi-node clusters, organizations can seamlessly scale their AI training. This article explores how this setup enhances resource utilization, accelerates model training, and offers robust scalability for demanding AI tasks, providing a comprehensive solution for both pretraining and finetuning large language models.

What is MosaicML LLM Foundry?

MosaicML LLM Foundry is an open-source framework that helps users train and fine-tune large language models across multiple machines. It simplifies the entire process, including data preparation, model training, testing, and evaluation. Users can run pretraining and finetuning tasks on large-scale models without the need for complex setup or specialized code. The tool supports real-world language models and uses efficient resource management for better performance during training.

Full Pretraining Experiment

Let’s dive into the pretraining journey we took with our models. It’s a bit like running a marathon, except the runners are data, GPUs, and some serious computing power. We didn’t graph the data this time because the results table has only a few rows, but those rows carry the numbers that matter: they tell us how well our system handles model training at different scales.

We ran pretraining experiments on models ranging from the small MPT-125M to the heavyweight MPT-1B. All these models used the C4 dataset, which is a pretty reliable standard for training large language models. Every experiment used 8 nodes to keep things consistent. Now, let me break down the results for you:

| Model | Training Data | Max Duration (Batches) | Params Ratio (vs 125M) | Batches Ratio (vs 125M) | Evaluation Interval (Batches) | No. of Nodes | Actual Runtime (Wallclock) | Actual Runtime (s) | Runtime Ratio (vs 125M) | Throughput (Tokens/s) | Model FLOPS Utilization (MFU) | Memory per GPU (GB, of 82 GB) | Checkpoint Size | Conversion, Inference & Evaluation OK? | Evaluation Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MPT-125M | C4 | 4800 | 1 | 1 | 1000 | 8 | 9m7.873s | 547.9 | 1 | 6,589,902 | ~0.1 | 13.4 | 1.5G | Y | 0.53 |
| MPT-350M | C4 | 13400 | 2.8 | 2.8 | 1000 | 8 | 38m10.823s | 2291 | 4.18 | 3,351,644 | ~0.145 | 8.91 | 4.0G | Y | 0.56 |
| MPT-760M | C4 | 29000 | 6.08 | 6.0 | 2000 | 8 | 103m23.136s | 6203 | 11.32 | 2,737,276 | ~0.27 | 12.5 | 8.6G | Y | 0.56 |
| MPT-1B | C4 | 24800 | 8 | 5.2 | 2000 | 8 | 208m24.319s | 12504 | 22.82 | 2,368,224 | ~0.33 | 16.3 | 15G | Y | 0.58 |

Table 1: Results of full pretraining runs for MosaicML models from 125M to 1B parameters on 8 nodes.

Main Observations for Full Pretraining

Okay, here’s where the fun really starts. When we looked at the results, a few things stood out. First, we verified each model’s performance on unseen test data. This step is important because it confirms the model isn’t just memorizing its training data but can handle new, unseen information, just as it would in the real world. After training, we also converted the models to a format compatible with popular machine learning platforms (think Hugging Face) so they could be served efficiently for inference.
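
As a quick illustration of what that conversion buys you, here is a minimal, hedged sketch of loading an MPT checkpoint with Hugging Face Transformers for inference. The public mosaicml/mpt-7b repository stands in for a locally converted checkpoint directory, the prompt is purely illustrative, and loading a 7B model this way needs a machine with plenty of memory.

# Minimal sketch: run inference on a converted MPT checkpoint via Transformers.
# "mosaicml/mpt-7b" is a stand-in for the output directory of your own conversion.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)  # MPT ships custom model code

inputs = tokenizer("Multi-node training lets us", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))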

Now, when we compared the smaller models to the larger ones, the pattern was clear: the bigger the model, the longer the runtime. The larger configurations we did not pretrain here (MPT-3B, MPT-7B, MPT-13B, MPT-30B, and MPT-70B) are expected to behave the same way, with runtime continuing to grow with model size. Want some numbers? The MPT-70B, the largest model in the series, would need about two months to fully pretrain on 8 nodes. Here’s where it gets interesting: bump that up to 64 nodes instead of 8, and the same training could be done in about a week. Cool, right?
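
The node math behind that claim is simple. The sketch below is a back-of-the-envelope estimate that assumes near-linear scaling from 8 to 64 nodes, which the runs above suggest but which we did not measure at 70B scale.

# Back-of-the-envelope estimate, assuming training time scales inversely with node count.
pretrain_days_on_8_nodes = 60  # "about two months" for MPT-70B on 8 nodes
estimated_days_on_64_nodes = pretrain_days_on_8_nodes * 8 / 64
print(f"~{estimated_days_on_64_nodes:.1f} days on 64 nodes")  # roughly a week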

One key metric we looked at was Model FLOPS Utilization (MFU). You’re probably asking, what’s that? It measures how much of the GPUs’ theoretical compute is actually spent on useful model FLOPS during training. Raw GPU utilization can be misleading, but MFU gives a clearer picture of how effectively the hardware is being used. The results matched expectations: MFU climbs as the models get larger (from roughly 0.1 for MPT-125M to roughly 0.33 for MPT-1B), because bigger models keep the GPUs busier, and nothing pointed to inefficiency in the system itself. Everything ran smoothly, and the resources were used as they should be, which is great news because it shows our infrastructure can handle these models without a problem.
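
To make the metric concrete, here is a minimal sketch of how an MFU estimate can be derived from throughput. The 6 x parameters x tokens/s approximation of training FLOPS, the ~989 TFLOPS dense BF16 peak per H100, and the ~1.3B parameter count are assumptions for illustration; LLM Foundry’s own accounting also counts attention FLOPS, so its reported figures will be somewhat higher than this estimate.

# Minimal MFU sketch: achieved training FLOPS / aggregate peak hardware FLOPS.
H100_PEAK_BF16_FLOPS = 989e12  # assumed dense BF16 peak per H100 SXM, no sparsity

def estimate_mfu(n_params: float, tokens_per_s: float, n_gpus: int,
                 peak_flops_per_gpu: float = H100_PEAK_BF16_FLOPS) -> float:
    achieved = 6.0 * n_params * tokens_per_s  # forward + backward pass approximation
    return achieved / (n_gpus * peak_flops_per_gpu)

# Example with the MPT-1B row of Table 1: 2,368,224 tokens/s on 64 GPUs,
# assuming roughly 1.3e9 parameters. Prints a value in the ballpark of the ~0.33 reported.
print(f"MFU ~ {estimate_mfu(1.3e9, 2_368_224, 64):.2f}")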

So, here’s the bottom line: after running these pretraining experiments, we’re pretty sure our infrastructure can handle demanding AI workloads. Whether it’s pretraining or finetuning, scaling up to large models, or just making sure everything runs smoothly across multiple nodes, everything worked as expected. And that’s exactly what you want to hear when dealing with AI at scale, right?

Make sure to verify the models’ performance using unseen testing data to ensure the model can handle new, unseen information.

AI Training and Scaling Research

Finetuning Experiment

Let’s walk through the finetuning process and the results we got from running a series of experiments. It’s a lot like the pretraining phase, but this time, we ran multiple setups to see how different models performed when finetuned under various configurations. The models we used included the MPT-7B and MPT-30B, trained on different datasets. We kept track of key factors like the training data, duration, throughput, and the number of nodes used. This was crucial because it helped us see how scaling up the number of nodes impacted the speed and overall performance of the finetuning process.

Now, let me break down the results for you from Table 2.

| Model | Finetuning Data | Max Training Duration (Epochs) | Evaluation Interval (Epochs) | No. of Nodes | Actual Runtime (Wallclock) | Actual Runtime (s) | Speedup vs. One Node | Throughput (Tokens/s) | Memory per GPU (GB, of 82 GB) | Inference & Evaluation OK? | Evaluation Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 1 | 78m28.121s | 4708 | 1x (baseline) | 7,124 | 24.9 | Y | 0.85 |
| MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 2 | 29m24.485s | 1764 | 2.67x | 13,844 | 19.9 | Y | 0.84 |
| MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 4 | 18m21.026s | 1101 | 4.28x | 28,959 | 17.5 | Y | 0.84 |
| MPT-7B-Dolly-SFT | mosaicml/dolly_hhrlhf | 2 | 1 | 8 | 13m35.352s | 815 | 5.77x | 50,708 | 9.37 | Y | 0.84 |
| MPT-30B-Instruct | kowndinya23/instruct-v3 | 2 | 1 | 8 | 125m12.579s | 7513 | 3.76x | 52,022 | ~36 | Y | 0.85 |

Table 2: Results of full finetuning runs for MosaicML MPT-7B and MPT-30B models for 1-8 nodes.

Main Observations for Finetuning

Here’s the real takeaway from all of this: we wanted to see how adding more nodes affects the training process. Ideally, you’d expect performance to improve as you add nodes. When we ran the MPT-7B-Dolly-SFT model on two nodes, we saw a 2.67x speedup compared to using just one node. With four nodes, the speedup jumped to 4.28x, and with eight nodes we reached 5.77x. Pretty good, right? It’s like having more people helping out: more hands on deck, faster results. But notice that the gain per added node starts to taper off when you move from four to eight nodes. The improvement is still clear, just not as dramatic as before.

This scaling effect happens because of parallel processing. Basically, we’re splitting up the workload into smaller parts and letting the system handle more data at the same time, which speeds things up. It’s like having a team working on different parts of a project instead of just one person doing everything.
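
To make the arithmetic explicit, here is a minimal sketch that reproduces the speedup figures above straight from the runtimes in Table 2; the percentage it prints is simply the speedup divided by the ideal N-times speedup.

# Speedup and scaling efficiency for MPT-7B-Dolly-SFT, using runtimes (s) from Table 2.
runtimes = {1: 4708, 2: 1764, 4: 1101, 8: 815}

baseline = runtimes[1]
for nodes, seconds in runtimes.items():
    speedup = baseline / seconds
    # Values above 100% reflect the slightly superlinear 2- and 4-node results in the table.
    print(f"{nodes} node(s): {speedup:.2f}x speedup ({speedup / nodes:.0%} of ideal)")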

However, there’s a bit of a catch: with all that parallelism, there’s some overhead. Specifically, saving model checkpoints after each training epoch adds a slight delay. But that’s totally fine—it’s an essential step to ensure the model’s progress is saved and can be picked up again if needed.

As we ran these experiments, we also kept an eye on how well the models performed on unseen data. I mean, it’s one thing to do well with the data you trained on, but you really want to make sure the models work just as well with new, unseen data. That’s where model accuracy comes into play. And here’s the good news: the finetuned models did great. Their evaluation accuracy on new data was even higher than that of the pretrained models, which is exactly what you want to see. This shows that finetuning worked as expected, helping the model adapt and get better.

In conclusion, the results of our finetuning experiments confirm that adding more nodes makes the training process faster, though the speedup becomes less noticeable after a certain point. The models consistently achieved high accuracy, with solid performance across throughput and GPU usage. This makes the finetuning process efficient and scalable, especially when working with large language models like the MPT series. So whether you’re finetuning a model for a specific task or scaling up to handle more complex AI workloads, the setup we used proves that the system can handle it all.

MosaicML Finetuning Guide

Appendices

Datasets for Pretraining and Finetuning

Pretraining Dataset

When you’re working with large language models, pretraining is no easy task. It requires a large dataset—something that can help the model learn the basics of language. Enter the C4 dataset, a go-to set that drives the pretraining process. This dataset comes from over 10 billion rows of data, pulled from the Common Crawl web corpus. But here’s the catch: it’s been cleaned up. All the unnecessary, irrelevant bits have been removed, so only high-quality text is left for training. We download and prep this data as part of the LLM Foundry workflow, getting it ready for training.
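
If you want to eyeball the raw text before kicking off the Foundry data-prep step, a minimal sketch like the one below works. It assumes the public allenai/c4 dataset on the Hugging Face Hub and streams it so nothing close to the full corpus is downloaded; this is only a peek at the source data, not the conversion LLM Foundry performs.

# Minimal sketch: stream a few C4 samples for inspection.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)  # streaming avoids a huge download

for i, sample in enumerate(c4):
    print(sample["text"][:200])  # each record carries the cleaned web text plus URL and timestamp
    if i == 2:
        break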

Finetuning Dataset

Once pretraining is done, we move on to finetuning, which uses more specialized datasets designed for specific tasks. These datasets are smaller but no less important because they help the model get better at specific tasks. For our finetuning process, we used a couple of datasets provided by MosaicML, with a few tweaks for the 30B model.

MPT-7B-Dolly-SFT

For the MPT-7B model, we used the Dolly HH-RLHF dataset, which combines Databricks’ dolly-15k dataset with a subset of Anthropic’s HH-RLHF. This combined dataset even includes a dedicated test split, which wasn’t in the original Dolly dataset. The test split contains 200 randomly chosen samples from Dolly and 4,929 filtered test samples from HH-RLHF. In total, the training set has 59,310 samples: 14,814 from Dolly and 44,496 from HH-RLHF.
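
A minimal sketch like the following can sanity-check those split sizes, assuming the public mosaicml/dolly_hhrlhf dataset on the Hugging Face Hub.

# Minimal sketch: confirm the train/test split sizes of the combined Dolly + HH-RLHF set.
from datasets import load_dataset

ds = load_dataset("mosaicml/dolly_hhrlhf")
print({split: len(ds[split]) for split in ds})  # expect roughly 59,310 train and 5,129 test samples
print(ds["train"][0])  # prompt/response-style records used for supervised finetuning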

MPT-30B-Instruct

For the bigger MPT-30B model, we used the Instruct-v3 dataset, which consists of prompts and responses made for instruction-based finetuning. At first, there was an issue with the dataset’s formatting—some columns were out of order. To fix this, we grabbed a corrected version of the dataset from a different source. This was quicker than fixing it ourselves, especially since MosaicML automatically pulls datasets from sources like Hugging Face.

Network Speed: NCCL Tests

We ran NCCL tests on the hardware to check how the network was holding up, especially in terms of bandwidth, which is super important for multi-node training. These tests helped us make sure the network could handle the massive amount of data transfer happening between nodes. While other teams have run more thorough tests elsewhere, we wanted to share our results because they offer a good snapshot of what you can expect with typical networking speeds. This is helpful if you’re running multinode workloads and need to know what kind of bandwidth to expect.

To run the tests, we used this command:


# All-reduce benchmark across 16 nodes with 8 GPUs per node (128 ranks in total)
$ mpirun \
  --hostfile hostfile \
  -np 128 \
  -N 8 \
  --allow-run-as-root \
  -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
  -x NCCL_IB_CUDA_SUPPORT=1 \
  -x NCCL_IB_HCA=^mlx5_1,mlx5_2,mlx5_7,mlx5_8 \
  -x NCCL_CROSS_NIC=0 \
  -x NCCL_IB_GID_INDEX=1 \
  $(pwd)/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Here’s how the results for the NCCL tests ran across 16 nodes:

| Size (B) | Count (Elements) | Type | Redop | Root | Out-of-Place Time (µs) | Out-of-Place Algbw (GB/s) | Out-of-Place Busbw (GB/s) | Out-of-Place #Wrong | In-Place Time (µs) | In-Place Algbw (GB/s) | In-Place Busbw (GB/s) | In-Place #Wrong |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 2 | float | sum | -1 | 63.25 | 0.00 | 0.00 | 0 | 65.28 | 0.00 | 0.00 | 0 |
| 16 | 4 | float | sum | -1 | 63.10 | 0.00 | 0.00 | 0 | 62.37 | 0.00 | 0.00 | 0 |
| 32 | 8 | float | sum | -1 | 62.90 | 0.00 | 0.00 | 0 | 63.54 | 0.00 | 0.00 | 0 |
| 64 | 16 | float | sum | -1 | 63.23 | 0.00 | 0.00 | 0 | 63.40 | 0.00 | 0.00 | 0 |

Table: NCCL Test Results for Machines Used in This Report, for 16 Nodes

Across the full sweep of message sizes (the benchmark runs from 8 B up to 8 GB), the measured bandwidth was more than enough to handle the pretraining and finetuning traffic across multiple nodes.
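
For context on reading those numbers: nccl-tests reports algbw (bytes moved divided by time) and busbw, which rescales algbw by the communication pattern so it can be compared against the hardware link speed. For all_reduce the factor is 2(n-1)/n over n ranks; the sketch below applies that formula, with the algbw value chosen purely for illustration.

# Minimal sketch: convert all_reduce algorithm bandwidth to bus bandwidth, as nccl-tests does.
def allreduce_busbw(algbw_gbps: float, n_ranks: int) -> float:
    return algbw_gbps * 2 * (n_ranks - 1) / n_ranks

# Example: 16 nodes x 8 GPUs = 128 ranks; the 100 GB/s algbw here is illustrative only.
print(f"{allreduce_busbw(100.0, 128):.1f} GB/s bus bandwidth")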

Figure: NCCL bus bandwidth test results for 1 to 8 nodes.

Hardware and Software Configuration

For these tests, we used cloud-based bare-metal machines, each packed with high-performance hardware. Every machine had eight H100 GPUs connected via NVLink, giving us a total of 64 GPUs across all the nodes. The nodes were connected through an RDMA over Converged Ethernet (RoCE) network, which allowed fast data transfer between the machines.

We used Ubuntu as the operating system on these nodes, and the MosaicML framework was deployed inside Docker containers. Because the nodes shared a drive, we didn’t need SLURM (a scheduler for running jobs across multiple nodes); instead, we copied the Docker container to each node manually. Once that was done, we could launch finetuning tasks with a single command, which made the process much easier to manage.

In our experiments, we only used 8 out of the available 16 nodes. This setup worked perfectly and let us optimize performance while keeping everything running smoothly.

MosaicML LLM Foundry

MosaicML LLM Foundry is an open-source tool that makes it easier to pretrain and finetune large language models (LLMs) in a multi-node environment. Here’s what makes it so great:

  • End-to-End Support: It covers everything, from data preparation to training, finetuning, inference, and evaluation. This full process makes it perfect for real-world applications, where everything needs to run seamlessly from one step to the next.
  • Command-Line Interface: The tool offers a simple command-line interface (CLI) that lets you launch pretraining and finetuning tasks without having to write complex, app-specific code. This makes things a lot more accessible, even for people who don’t have a lot of specialized coding knowledge.
  • Shared Disk Multinode Setup: Instead of using complicated tools like SLURM, which add extra configuration steps, MosaicML LLM Foundry uses a shared disk setup. This keeps things simple so you can focus more on training and less on managing the infrastructure.
  • GPU Efficiency: The tool tracks how well the GPUs are being used through a metric called Model FLOPS Utilization (MFU). This helps ensure the GPUs are being used efficiently, so you get better performance and a clearer view of how the training is going.
  • Benchmark Support: MosaicML provides benchmarks, though they are based on different machines. While they may not be a direct one-to-one match, they’re still useful for setting expectations on how well models should perform.
  • Real-World Model Support: The tool works with large models like the ones in the MPT series, which are commonly used for solving complex language tasks. This makes the tool great for tackling real-world AI workloads.
  • Integration with Weights & Biases: For those of you who like to track experiments and visualize results, MosaicML LLM Foundry integrates smoothly with Weights & Biases. This gives you the tools to log and track key metrics, making it easier to monitor the training process and adjust as needed; a brief logging sketch follows this list.
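
To give a feel for what that integration logs, here is a minimal sketch of the underlying wandb API. In LLM Foundry the logger is configured in the training YAML rather than written by hand, and the project name and metric values below are purely illustrative.

# Minimal sketch of Weights & Biases metric logging (offline mode so it runs without an account).
import wandb

run = wandb.init(project="mpt-pretraining", mode="offline",
                 config={"model": "mpt-125m", "nodes": 8})
for step in range(3):
    run.log({"loss": 2.5 - 0.1 * step, "throughput_tokens_per_s": 6_500_000})  # illustrative values
run.finish()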

MosaicML LLM Foundry: Enabling Efficient Pretraining and Finetuning of Large Language Models


MosaicML LLM Foundry Overview

Conclusion

In conclusion, scaling large language model (LLM) training with MosaicML’s LLM Foundry on bare-metal H100 multi-node clusters provides an efficient and scalable solution for demanding AI workloads. As demonstrated on DigitalOcean’s infrastructure, both pretraining from scratch and finetuning ran at high performance, with resources used effectively across multiple nodes. The tests confirm that the platform scales reliably with minimal operational tuning, making it an ideal choice for large-scale AI tasks. Looking ahead, as AI workloads continue to grow, leveraging robust infrastructure like MosaicML on multi-node clusters will become even more crucial for organizations aiming to accelerate their AI model training.


Alireza Pourmahdavi

I’m Alireza Pourmahdavi, a founder, CEO, and builder with a background that combines deep technical expertise with practical business leadership. I’ve launched and scaled companies like Caasify and AutoVM, focusing on cloud services, automation, and hosting infrastructure. I hold VMware certifications, including VCAP-DCV and VMware NSX. My work involves constructing multi-tenant cloud platforms on VMware, optimizing network virtualization through NSX, and integrating these systems into platforms using custom APIs and automation tools. I’m also skilled in Linux system administration, infrastructure security, and performance tuning. On the business side, I lead financial planning, strategy, budgeting, and team leadership while also driving marketing efforts, from positioning and go-to-market planning to customer acquisition and B2B growth.
