Introduction
Building a Puppeteer web scraper with Docker and App Platform lets developers automate data extraction while keeping their applications scalable and flexible. Whether you're comparing race results or searching public domain books, this setup provides a solid foundation for web scraping tasks. In this article, we'll build a web application that scrapes data with Puppeteer running inside a Docker container, deployed on App Platform. Along the way, we'll follow best practices such as rate limiting and clear bot identification so the scraper stays performant and reliable.
What is Project Gutenberg Book Search?
This is a web application that allows users to search for and access books from the public domain collection on Project Gutenberg. It scrapes the site for book details and presents the information in an organized manner, with various download options. The tool follows responsible web scraping practices, such as rate limiting and clear bot identification, ensuring it respects the website’s terms of service and works efficiently.
Race Time Insights Tool
As an ultra marathon enthusiast, I’ve had my fair share of challenges. One of the toughest questions I often find myself asking is: how do I estimate my finish time for a race that I’ve never attempted before? It’s a question that’s bothered me for quite some time, and naturally, I turned to my coach for some insight. His suggestion was simple yet brilliant—look at runners who have completed both a race I’ve done and the race I’m targeting. By finding patterns in their performance across both events, I could get a better idea of my own potential finish times.
The idea sounded good in theory, but here’s the thing: manually going through race results from multiple sources would take forever. It would be a huge pain to gather all that data and then make meaningful comparisons. That’s when I decided to build something that could automate the whole process—something that would save me (and other runners) a lot of time and energy. And so, Race Time Insights was born.
This tool automatically compares race results by finding athletes who’ve participated in both races. All you have to do is input the URLs of two races, and the application scrapes race results from platforms like UltraSignup and Pacific Multisports. It then shows how other athletes performed across both events, giving you valuable insights.
Building this tool was a huge eye-opener for me—it really made me appreciate how powerful Caasify’s App Platform is. I was able to use Puppeteer with headless Chrome in Docker containers to focus on solving the problem for runners, while App Platform took care of all the behind-the-scenes infrastructure. The result? A tool that’s scalable, efficient, and helps the running community make better, data-driven decisions about their race goals.
But after finishing Race Time Insights, I thought: why not share what I learned with other developers? I wanted to create a guide on how they could use the same technologies—Puppeteer, Docker containers, and Caasify App Platform—to build their own tools. The challenge? When you work with external data, you’ve got to be mindful of things like rate limiting and sticking to terms of service.
That’s when I turned to Project Gutenberg. It’s a treasure chest of public domain books, and because its terms of service are super clear, it was the perfect example for demonstrating these technologies. In this post, I’ll show you how to build a book search application using Puppeteer inside a Docker container, deployed on App Platform, while following best practices for external data access.
Project Gutenberg Book Search
I’ve built and shared a web application that scrapes book information from Project Gutenberg responsibly. The app lets you search through thousands of public domain books, view detailed info about each one, and download them in different formats. What’s really exciting about this project is that it shows how you can do web scraping the right way—respecting the source data, following best practices, and still providing tons of value to users.
Being a Good Digital Citizen
When you build a web scraper, there’s a right way to do it and a wrong way. You need to respect both the technical and legal boundaries. Project Gutenberg is a perfect example of doing it right because:
- It has clear terms of service
- It provides robots.txt guidelines
- Its content is fully in the public domain
- It encourages more accessibility to its resources
When building our scraper, we followed several best practices to make sure we were doing things the right way:
Rate Limiting
For this demo, I set up a simple rate limiter that makes sure there’s at least one second between requests:
// A naive rate limiting implementation
const rateLimiter = {
  lastRequest: 0,
  minDelay: 1000, // 1 second between requests
  async wait() {
    const now = Date.now();
    const timeToWait = Math.max(0, this.lastRequest + this.minDelay - now);
    if (timeToWait > 0) {
      await new Promise(resolve => setTimeout(resolve, timeToWait));
    }
    this.lastRequest = Date.now();
  }
};
This approach is simplified just for demonstration. It assumes the app runs in a single instance and stores state in memory, which wouldn’t be ideal for larger-scale use. If I wanted to scale this, I’d probably use Redis for distributed rate limiting or set up a queue-based system for better performance. We use this rate limiter before every request to Project Gutenberg:
async searchBooks(query, page = 1) {
  await this.initialize();
  await rateLimiter.wait(); // Enforce rate limit
  // ... rest of search logic
}

async getBookDetails(bookUrl) {
  await this.initialize();
  await rateLimiter.wait(); // Enforce rate limit
  // ... rest of details logic
}
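The in-memory limiter above only coordinates requests because the demo runs as a single instance. If the scraper were scaled out, a shared store such as Redis could enforce the same one-request-per-second budget across all instances. Here is a minimal sketch, assuming the ioredis package and a REDIS_URL environment variable (neither is part of the sample app):

const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

// Distributed rate limiter: only one instance may claim the slot per interval.
async function waitForSlot(key = 'gutenberg:rate-limit', minDelayMs = 1000) {
  // SET ... PX <ttl> NX succeeds only if no request has claimed the slot
  // within the last minDelayMs, across every running instance.
  while (await redis.set(key, Date.now(), 'PX', minDelayMs, 'NX') !== 'OK') {
    await new Promise(resolve => setTimeout(resolve, 100)); // brief back-off, then retry
  }
}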
Clear Bot Identification
It’s important to let website administrators know who is accessing their site and why. This kind of transparency helps build trust and avoids issues later on. With a custom User-Agent, we can clearly identify our bot:
await browserPage.setUserAgent('GutenbergScraper/1.0 (Educational Project)');
This helps administrators monitor and analyze bot traffic separately from human users, and it could even result in better support for legitimate scrapers.
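Beyond the User-Agent string, extra request headers can advertise a way to reach the scraper's operator. This is an optional addition, not part of the sample app, and the values below are placeholders:

// Optional: include a contact address so site administrators can reach us.
// The header values here are placeholders.
await browserPage.setExtraHTTPHeaders({
  'From': 'maintainer@example.com'
});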
Efficient Resource Management
Running Chrome in a headless environment can use a lot of memory, especially when running multiple instances. To prevent memory leaks and ensure the app runs smoothly, we make sure to properly close each browser page once we’re done with it:
try {
  // ... scraping logic
} finally {
  await browserPage.close(); // Free up memory and system resources
}
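Put together, the pattern looks roughly like the helper below. This is a hypothetical function for illustration, not the app's actual code:

// Open a page, do the work, and always release it in finally,
// even if navigation or extraction throws.
async function scrapePageTitle(browser, url) {
  const browserPage = await browser.newPage();
  try {
    await browserPage.goto(url, { waitUntil: 'domcontentloaded' });
    return await browserPage.title();
  } finally {
    await browserPage.close(); // Free up memory and system resources
  }
}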
By following these practices, we make sure our scraper is effective and respectful of the resources it accesses. This is especially important when working with valuable public resources like Project Gutenberg.
Web Scraping in the Cloud
The application relies on modern cloud architecture and containerization through Caasify’s App Platform. This approach strikes the perfect balance between making development easier and keeping the app reliable in production.
The Power of App Platform
Caasify’s App Platform makes deployment a breeze by handling all the usual heavy lifting:
- Web server configuration
- SSL certificate management
- Security updates
- Load balancing
- Resource monitoring
With App Platform handling the infrastructure, we can focus on just the application code.
Headless Chrome in a Container
The core of our scraping functionality is Puppeteer, which lets us control Chrome programmatically. Here’s how we set up and use Puppeteer in our app:
const puppeteer = require('puppeteer');

class BookService {
  constructor() {
    this.baseUrl = 'https://www.gutenberg.org';
    this.browser = null;
  }

  async initialize() {
    if (!this.browser) {
      // Add environment info logging for debugging
      console.log('Environment details:', {
        PUPPETEER_EXECUTABLE_PATH: process.env.PUPPETEER_EXECUTABLE_PATH,
        CHROME_PATH: process.env.CHROME_PATH,
        NODE_ENV: process.env.NODE_ENV
      });

      const options = {
        headless: 'new',
        args: [
          '--no-sandbox',
          '--disable-setuid-sandbox',
          '--disable-dev-shm-usage',
          '--disable-gpu',
          '--disable-extensions',
          '--disable-software-rasterizer',
          '--window-size=1280,800',
          '--user-agent=GutenbergScraper/1.0 (+https://github.com/wadewegner/doappplat-puppeteer-sample) Chromium/120.0.0.0'
        ],
        executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || '/usr/bin/chromium-browser',
        defaultViewport: { width: 1280, height: 800 }
      };

      this.browser = await puppeteer.launch(options);
    }
  }
}
This setup lets us:
- Run Chrome in headless mode (no GUI needed)
- Execute JavaScript in the context of web pages
- Safely manage browser resources
- Work reliably in a containerized environment
The setup also includes some key configurations for running in a containerized environment:
- Proper Chrome Arguments: Important flags like --no-sandbox and --disable-dev-shm-usage for working in containers.
- Environment-aware Path: It uses the right Chrome binary path from environment variables.
- Resource Management: It adjusts viewport sizes and disables unnecessary features.
- Professional Bot Identity: It uses a clear user agent and HTTP headers to identify the scraper.
- Error Handling: It makes sure to clean up properly to avoid memory leaks.
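To see how the initialized browser is put to work, here is a rough illustration of a scrape. The function name, URL shape, and selectors are assumptions made for this example and may not match the app's real code or Project Gutenberg's current markup:

// Hypothetical sketch: search Project Gutenberg and read result titles.
async function scrapeSearchResults(browser, query) {
  const browserPage = await browser.newPage();
  try {
    const url = `https://www.gutenberg.org/ebooks/search/?query=${encodeURIComponent(query)}`;
    await browserPage.goto(url, { waitUntil: 'domcontentloaded' });

    // page.evaluate runs inside the page's context, so DOM APIs are available.
    return await browserPage.evaluate(() =>
      Array.from(document.querySelectorAll('li.booklink')).map(el => ({
        title: el.querySelector('.title')?.textContent.trim() ?? '',
        link: el.querySelector('a')?.getAttribute('href') ?? ''
      }))
    );
  } finally {
    await browserPage.close();
  }
}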
While Puppeteer makes controlling Chrome a breeze, running it in a container requires careful setup to ensure all the necessary dependencies and configurations are in place. Let’s dive into how we set this up in our Docker environment.
Docker: Ensuring Consistent Environments
One of the hardest things about deploying web scrapers is making sure they work the same in both development and production. Your scraper might run perfectly on your local machine, but then fail in the cloud because of missing dependencies or different system configurations. This is where Docker comes in.
Docker helps by packaging everything the application needs—from Node.js to Chrome—into one container that runs the same way on any machine. This guarantees that the scraper behaves the same whether you’re running it locally or on Caasify’s Cloud.
Here’s how we set up our Docker environment:
FROM node:18-alpine

# Install Chromium and dependencies
RUN apk add --no-cache \
    chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-freefont \
    dumb-init

# Set environment variables
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
    PUPPETEER_DISABLE_DEV_SHM_USAGE=true
The Alpine-based image keeps our container lightweight while including all the necessary dependencies. When you run this container—whether on your laptop or in Caasify’s Cloud—you get the exact same environment with the correct versions and configurations needed for running headless Chrome.
Development to Deployment
Now, let’s walk through getting this project up and running.
Local Development
First, fork the example repository to your GitHub account. This gives you your own copy to work with and deploy from. Then clone your fork locally:
# Clone your fork
git clone https://github.com/YOUR-USERNAME/doappplat-puppeteer-sample.git
cd doappplat-puppeteer-sample
Then, build and run with Docker:
# Build and run with Docker
docker build -t gutenberg-scraper .
docker run -p 8080:8080 gutenberg-scraper
Understanding the Code
The application is structured around three main components:
- Book Service: Handles web scraping and data extraction
- Express Server: Manages routes and renders templates (a rough wiring sketch follows this list)
- Frontend Views: Clean, responsive UI using Bootstrap
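To make that structure concrete, the Express layer might wire these pieces together roughly as follows. The route, view name, module path, and template engine are assumptions for illustration, not necessarily what the repository uses:

const express = require('express');
const BookService = require('./services/bookService'); // assumed path

const app = express();
const bookService = new BookService();
app.set('view engine', 'ejs'); // assumed template engine

// Search route: scrape results, then render them with a Bootstrap-based view.
app.get('/search', async (req, res, next) => {
  try {
    const books = await bookService.searchBooks(req.query.q, Number(req.query.page) || 1);
    res.render('search', { books, query: req.query.q });
  } catch (err) {
    next(err); // let the Express error handler report scraping failures
  }
});

app.listen(process.env.PORT || 8080, () => console.log('Gutenberg scraper listening'));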
Deployment to Caasify
Now that you have your fork of the repository, deploying to Caasify’s Cloud is easy:
- Create a new Cloud application
- Connect to your forked repo
- On the Resources screen, delete the second, auto-detected resource (the one not built from the Dockerfile); the platform generates it automatically and it isn't needed
- Deploy by clicking Create Resources
The application will be automatically built and deployed, with App Platform handling all the infrastructure details.
Conclusion
Building a Puppeteer web scraper with Docker and App Platform offers a powerful, scalable solution for modern web scraping needs. Whether you're estimating ultra marathon race times or searching public domain books on Project Gutenberg, this setup delivers efficiency and flexibility while following best practices like rate limiting and clear bot identification. By packaging the scraper in a Docker container and deploying it on App Platform, developers can build reliable, cloud-based scraping tools without managing infrastructure themselves. As web scraping continues to evolve, combining Puppeteer, Docker, and App Platform will remain a practical way to streamline data collection and automation workflows.