
Process Files 4.0 — Batch File Processing for Digital Archives

How I built a modular Bash toolkit to convert messy archive collections into clean, web-ready access copies.


Every digital archive eventually hits the same problem: you have a directory full of original files — TIFFs from scanners, Word documents, spreadsheets, audio recordings — and you need web-friendly versions of all of them. Thumbnails, preview images, zoomable tilesets, PDFs, compressed audio. Doing this by hand doesn't scale.

Process Files is the tool I built to handle this. Point it at a source directory, tell it which formats to process, and it handles the rest.

The Architecture

The core idea is simple: a main orchestrator script (start.sh) that discovers and runs format-specific modules. Each module is a self-contained Bash script responsible for one file type.

# Run all modules against a source directory
./start.sh /path/to/collection

# Or pick specific formats with watermarking enabled
./start.sh -m tif,pdf -w /path/to/collection

The orchestrator parses CLI flags, sets up working and output directories, then loops through either all modules or just the ones you requested:

# If no modules specified, run them all
if [ -z "$MODULES" ]; then
    for module in "$MODULES_DIR"*.sh; do
        bash "$module" "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR" "$NOTILES" "$WATERMARK"
    done
fi

# Otherwise, run only the requested modules
for i in $(echo "$MODULES" | sed "s/,/ /g"); do
    if [ -f "$MODULES_DIR$i.sh" ]; then
        bash "$MODULES_DIR$i.sh" "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR" "$NOTILES" "$WATERMARK"
    fi
done

Directory Structure Preservation

One of the trickier parts was preserving nested folder structures from the source. Archives aren't flat — they have subdirectories for series, boxes, folders. The output needs to mirror that hierarchy.

Each module walks the source directory, strips the base path to extract relative paths, and tracks unique directories and files in text manifests:

# Note: the unquoted $(find ...) relies on word splitting, so this
# assumes filenames contain no spaces.
for file in $(find "$SOURCE" -type f -iname "$FILE_PATTERN" \
        ! -path "$PROCESSED_DIR*" ! -path "$WORKING_DIR*"); do

    raw=${file#"$SOURCE"}
    filename=${raw#"/"}
    directory=${raw%/*}
    directory=${directory#"/"}

    if [ -n "$directory" ]; then
        if ! grep -Fxq "$directory" "$MODULE_WORKING_DIRECTORY/directories.txt"; then
            echo "$directory" >> "$MODULE_WORKING_DIRECTORY/directories.txt"
        fi
    fi

    if ! grep -Fxq "$filename" "$MODULE_WORKING_DIRECTORY/files.txt"; then
        echo "$filename" >> "$MODULE_WORKING_DIRECTORY/files.txt"
    fi
done

These manifests drive the rest of the pipeline — directory creation, file conversion, and output placement all read from them.
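As a sketch of that manifest-driven step, here's roughly how directory creation works. The variable names mirror the orchestrator's, but the demo paths and manifest entries below are placeholders:

```shell
# Hypothetical sketch of the manifest-driven mkdir step; the demo
# defaults stand in for values the orchestrator normally provides.
PROCESSED_DIR="${PROCESSED_DIR:-/tmp/pf_demo/processed}"
MODULE_WORKING_DIRECTORY="${MODULE_WORKING_DIRECTORY:-/tmp/pf_demo/working}"

mkdir -p "$MODULE_WORKING_DIRECTORY"
printf '%s\n' "series1/box2" "series1/box3" > "$MODULE_WORKING_DIRECTORY/directories.txt"

# Mirror each manifest entry under every output target (thumbnails,
# previews, tilesets) so conversions can write straight into place.
while read -r directory; do
    for target in 100 500 zoom; do
        mkdir -p "$PROCESSED_DIR/$target/$directory"
    done
done < "$MODULE_WORKING_DIRECTORY/directories.txt"
```

Because every output tree mirrors the same manifest, a converted file's destination is always just `$PROCESSED_DIR/<target>/<relative path>`.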

The Image Pipeline

The TIFF module is the most involved. For each source TIFF, it:

  1. Converts to JPEG (with optional watermark)
  2. Generates a 100x100 thumbnail
  3. Generates a 500px preview
  4. Creates a Zoomify tileset for deep-zoom viewing

while read -r line; do
    no_ext="${line%.*}"

    if [ "$WATERMARK" == "true" ]; then
        apply_watermark "$SOURCE/$line" "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg"
    else
        convert "$SOURCE/$line" "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg"
    fi

    convert -resize 100x100 "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg" \
        "$PROCESSED_DIR/100/$no_ext.jpg"
    convert -resize 500x500 "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg" \
        "$PROCESSED_DIR/500/$no_ext.jpg"

    if [ "$NOTILES" == "false" ]; then
        vips dzsave "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg[autorotate=true]" \
            "$PROCESSED_DIR/zoom/$no_ext" --layout zoomify --vips-progress
    fi
done < "$MODULE_WORKING_DIRECTORY/files.txt"

The tileset generation uses libvips with the Zoomify layout, which produces a tile pyramid that can be loaded by any deep-zoom viewer. The autorotate=true flag handles images with EXIF rotation metadata so they display correctly regardless of how the scanner saved them.
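For reference, a Zoomify tileset produced by dzsave looks roughly like this on disk (sketched here for a hypothetical scan001 image; group and tile counts depend on image size):

```
zoom/scan001/
├── ImageProperties.xml    # image dimensions, tile size, tile count
└── TileGroup0/            # up to 256 tiles per group
    ├── 0-0-0.jpg          # named zoomlevel-column-row
    ├── 1-0-0.jpg
    └── ...
```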

Adaptive Watermarking

The watermark system picks between two watermark sizes based on image dimensions. Larger images get a smaller repeating watermark tile so it doesn't dominate the image, while smaller images get a proportionally larger one:

apply_watermark() {
    local infile="$1"
    local outfile="$2"
    local basew baseh wmfile

    basew=$(identify -format '%w' "$infile")
    baseh=$(identify -format '%h' "$infile")

    if [ "$baseh" -gt 1024 ] || [ "$basew" -gt 1024 ]; then
        wmfile="$SCRIPT_DIR/watermark_small.png"
    else
        wmfile="$SCRIPT_DIR/watermark.png"
    fi

    composite -tile "$wmfile" "$infile" "$outfile"
}

ImageMagick's composite -tile repeats the watermark PNG across the entire image surface — simple and effective.
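If you need a watermark tile to test with, ImageMagick can generate a translucent one from text. This is a hypothetical helper, not part of the toolkit; the text, size, and opacity are placeholders:

```shell
# Hypothetical: build a translucent text watermark tile for testing.
# Guarded so the sketch is a no-op when ImageMagick isn't installed.
if command -v convert >/dev/null; then
    convert -size 300x120 xc:none -gravity center \
        -pointsize 28 -fill 'rgba(120,120,120,0.35)' \
        -annotate 0 'ARCHIVE COPY' watermark.png
fi
```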

Document and Audio Conversion

The other modules follow the same pattern but with different tools. Word documents go through unoconv to produce PDFs, HTML files get converted via Pandoc and LaTeX, and WAV files are compressed to 320 kbps MP3 with LAME:

# Strip the original extension so output names don't double up
# (e.g. report.docx.pdf)
no_ext="${line%.*}"

# Word to PDF
unoconv -f pdf -o "$PROCESSED_DIR/pdf/$no_ext.pdf" "$SOURCE/$line"

# WAV to MP3
lame -b 320 -h "$SOURCE/$line" "$PROCESSED_DIR/audio/$no_ext.mp3"

What I'd Do Differently

This tool has been through four major versions. If I were starting fresh, I'd probably reach for a language with better error handling and parallelism — Bash gets the job done but swallowing errors silently is its default behavior, and there's no easy way to process files concurrently across CPU cores.
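That said, one way to regain per-file parallelism without leaving Bash is to fan the per-file work out with `xargs -P`. This is a hypothetical sketch, not code from the toolkit: `files.txt` stands in for a module's real manifest, and `convert_one` for the real conversion step:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: parallelize per-file work with xargs -P.
printf '%s\n' "a.tif" "b.tif" "c.tif" > files.txt

convert_one() {
    no_ext="${1%.*}"
    # a real module would call convert/vips here
    echo "converted $1 -> $no_ext.jpg"
}
export -f convert_one   # make the function visible to child bash processes

# -P 4: up to four jobs at once; -I {} substitutes one manifest line per job.
xargs -P 4 -I {} bash -c 'convert_one "$@"' _ {} < files.txt
```

The trade-off is that output from concurrent jobs interleaves, so per-file logging needs more care than in the sequential loop.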

That said, the dependency on external tools like ImageMagick, VIPS, and unoconv means Bash is actually a reasonable glue language here. The real work happens in those tools — the script is just orchestration.

The module system has held up well. Adding a new format means dropping a new .sh file into the modules directory. The orchestrator picks it up automatically. Sometimes the simplest architecture is the right one.
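To make that module contract concrete, here's a hypothetical skeleton for a new format (`modules/png.sh` is an invented name; the five positional arguments are the ones the orchestrator passes, with demo defaults added only so the sketch runs standalone):

```shell
#!/bin/bash
# modules/png.sh -- hypothetical skeleton for a new format module.
# The orchestrator invokes every module with the same five arguments.
SOURCE="${1:-demo/source}"        # directory of original files
WORKING_DIR="${2:-demo/working}"  # scratch space for intermediates
PROCESSED_DIR="${3:-demo/out}"    # final output root
NOTILES="${4:-false}"             # "true" skips tileset generation
WATERMARK="${5:-false}"           # "true" applies the watermark

FILE_PATTERN="*.png"
mkdir -p "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR"

find "$SOURCE" -type f -iname "$FILE_PATTERN" \
        ! -path "$PROCESSED_DIR*" ! -path "$WORKING_DIR*" |
while read -r file; do
    echo "would process: $file"   # a real module converts here
done
```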