
Process Files 4.0 — Batch File Processing for Digital Archives

How I built a modular Bash toolkit to convert messy archive collections into clean, web-ready access copies.


Every digital archive eventually hits the same problem: you have a directory full of original files — TIFFs from scanners, Word documents, spreadsheets, audio recordings — and you need web-friendly versions of all of them. Thumbnails, preview images, zoomable tilesets, PDFs, compressed audio. Doing this by hand doesn't scale.

Process Files is the tool I built to handle this. Point it at a source directory, tell it which formats to process, and it handles the rest.

The Architecture

The core idea is simple: a main orchestrator script (start.sh) that discovers and runs format-specific modules. Each module is a self-contained Bash script responsible for one file type.

# Run all modules against a source directory
./start.sh /path/to/collection

# Or pick specific formats with watermarking enabled
./start.sh -m tif,pdf -w /path/to/collection

The orchestrator parses CLI flags, sets up working and output directories, then loops through either all modules or just the ones you requested:

# If no modules specified, run them all
if [ -z "$MODULES" ]; then
    for module in "$MODULES_DIR"*.sh; do
        bash "$module" "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR" "$NOTILES" "$WATERMARK"
    done
fi

# Otherwise, run only the requested modules
for i in $(echo "$MODULES" | sed "s/,/ /g"); do
    if [ -f "$MODULES_DIR$i.sh" ]; then
        bash "$MODULES_DIR$i.sh" "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR" "$NOTILES" "$WATERMARK"
    fi
done

Directory Structure Preservation

One of the trickier parts was preserving nested folder structures from the source. Archives aren't flat — they have subdirectories for series, boxes, folders. The output needs to mirror that hierarchy.

Each module walks the source directory, strips the base path to extract relative paths, and tracks unique directories and files in text manifests:

# Note: the unquoted $(find ...) relies on word splitting, so this
# assumes filenames contain no spaces.
for file in $(find "$SOURCE" -type f -iname "$FILE_PATTERN" \
        ! -path "$PROCESSED_DIR*" ! -path "$WORKING_DIR*"); do

    raw=${file#"$SOURCE"}
    filename=${raw#"/"}
    directory=${raw%/*}
    directory=${directory#"/"}

    if [ -n "$directory" ]; then
        if ! grep -Fxq "$directory" "$MODULE_WORKING_DIRECTORY/directories.txt"; then
            echo "$directory" >> "$MODULE_WORKING_DIRECTORY/directories.txt"
        fi
    fi

    if ! grep -Fxq "$filename" "$MODULE_WORKING_DIRECTORY/files.txt"; then
        echo "$filename" >> "$MODULE_WORKING_DIRECTORY/files.txt"
    fi
done

These manifests drive the rest of the pipeline — directory creation, file conversion, and output placement all read from them.
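As a sketch of that manifest-driven step, here's roughly how directory creation works. The variable names mirror the orchestrator's, but the demo paths and manifest entries below are placeholders:

```shell
# Hypothetical sketch of the manifest-driven mkdir step; the demo
# defaults stand in for values the orchestrator normally provides.
PROCESSED_DIR="${PROCESSED_DIR:-/tmp/pf_demo/processed}"
MODULE_WORKING_DIRECTORY="${MODULE_WORKING_DIRECTORY:-/tmp/pf_demo/working}"

mkdir -p "$MODULE_WORKING_DIRECTORY"
printf '%s\n' "series1/box2" "series1/box3" > "$MODULE_WORKING_DIRECTORY/directories.txt"

# Mirror each manifest entry under every output target (thumbnails,
# previews, tilesets) so conversions can write straight into place.
while read -r directory; do
    for target in 100 500 zoom; do
        mkdir -p "$PROCESSED_DIR/$target/$directory"
    done
done < "$MODULE_WORKING_DIRECTORY/directories.txt"
```

Because every output tree mirrors the same manifest, a converted file's destination is always just `$PROCESSED_DIR/<target>/<relative path>`.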

The Image Pipeline

The TIFF module is the most involved. For each source TIFF, it:

  1. Converts to JPEG (with optional watermark)
  2. Generates a 100x100 thumbnail
  3. Generates a 500px preview
  4. Creates a Zoomify tileset for deep-zoom viewing

while read -r line; do
    no_ext="${line%.*}"

    if [ "$WATERMARK" == "true" ]; then
        apply_watermark "$SOURCE/$line" "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg"
    else
        convert "$SOURCE/$line" "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg"
    fi

    convert -resize 100x100 "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg" \
        "$PROCESSED_DIR/100/$no_ext.jpg"
    convert -resize 500x500 "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg" \
        "$PROCESSED_DIR/500/$no_ext.jpg"

    if [ "$NOTILES" == "false" ]; then
        vips dzsave "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg[autorotate=true]" \
            "$PROCESSED_DIR/zoom/$no_ext" --layout zoomify --vips-progress
    fi
done < "$MODULE_WORKING_DIRECTORY/files.txt"

The tileset generation uses libvips with the Zoomify layout, which produces a tile pyramid that can be loaded by any deep-zoom viewer. The autorotate=true flag handles images with EXIF rotation metadata so they display correctly regardless of how the scanner saved them.
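For reference, a Zoomify tileset produced by dzsave looks roughly like this on disk (sketched here for a hypothetical scan001 image; group and tile counts depend on image size):

```
zoom/scan001/
├── ImageProperties.xml    # image dimensions, tile size, tile count
└── TileGroup0/            # up to 256 tiles per group
    ├── 0-0-0.jpg          # named zoomlevel-column-row
    ├── 1-0-0.jpg
    └── ...
```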

Adaptive Watermarking

The watermark system picks between two watermark sizes based on image dimensions. Larger images get a smaller repeating watermark tile so it doesn't dominate the image, while smaller images get a proportionally larger one:

apply_watermark() {
    local infile="$1"
    local outfile="$2"
    local basew baseh wmfile

    basew=$(identify -format '%w' "$infile")
    baseh=$(identify -format '%h' "$infile")

    if [ "$baseh" -gt 1024 ] || [ "$basew" -gt 1024 ]; then
        wmfile="$SCRIPT_DIR/watermark_small.png"
    else
        wmfile="$SCRIPT_DIR/watermark.png"
    fi

    composite -tile "$wmfile" "$infile" "$outfile"
}

ImageMagick's composite -tile repeats the watermark PNG across the entire image surface — simple and effective.
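If you need a watermark tile to test with, ImageMagick can generate a translucent one from text. This is a hypothetical helper, not part of the toolkit; the text, size, and opacity are placeholders:

```shell
# Hypothetical: build a translucent text watermark tile for testing.
# Guarded so the sketch is a no-op when ImageMagick isn't installed.
if command -v convert >/dev/null; then
    convert -size 300x120 xc:none -gravity center \
        -pointsize 28 -fill 'rgba(120,120,120,0.35)' \
        -annotate 0 'ARCHIVE COPY' watermark.png
fi
```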

Document and Audio Conversion

The other modules follow the same pattern but with different tools. Word documents go through unoconv to produce PDFs, HTML files get converted via Pandoc and LaTeX, and WAV files are compressed to 320 kbps MP3 with LAME:

# Strip the original extension so output names don't double up
# (e.g. report.docx.pdf)
no_ext="${line%.*}"

# Word to PDF
unoconv -f pdf -o "$PROCESSED_DIR/pdf/$no_ext.pdf" "$SOURCE/$line"

# WAV to MP3
lame -b 320 -h "$SOURCE/$line" "$PROCESSED_DIR/audio/$no_ext.mp3"

What I'd Do Differently

This tool has been through four major versions. If I were starting fresh, I'd probably reach for a language with better error handling and parallelism — Bash gets the job done but swallowing errors silently is its default behavior, and there's no easy way to process files concurrently across CPU cores.
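That said, one way to regain per-file parallelism without leaving Bash is to fan the per-file work out with `xargs -P`. This is a hypothetical sketch, not code from the toolkit: `files.txt` stands in for a module's real manifest, and `convert_one` for the real conversion step:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: parallelize per-file work with xargs -P.
printf '%s\n' "a.tif" "b.tif" "c.tif" > files.txt

convert_one() {
    no_ext="${1%.*}"
    # a real module would call convert/vips here
    echo "converted $1 -> $no_ext.jpg"
}
export -f convert_one   # make the function visible to child bash processes

# -P 4: up to four jobs at once; -I {} substitutes one manifest line per job.
xargs -P 4 -I {} bash -c 'convert_one "$@"' _ {} < files.txt
```

The trade-off is that output from concurrent jobs interleaves, so per-file logging needs more care than in the sequential loop.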

That said, the dependency on external tools like ImageMagick, VIPS, and unoconv means Bash is actually a reasonable glue language here. The real work happens in those tools — the script is just orchestration.

The module system has held up well. Adding a new format means dropping a new .sh file into the modules directory. The orchestrator picks it up automatically. Sometimes the simplest architecture is the right one.
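To make that module contract concrete, here's a hypothetical skeleton for a new format (`modules/png.sh` is an invented name; the five positional arguments are the ones the orchestrator passes, with demo defaults added only so the sketch runs standalone):

```shell
#!/bin/bash
# modules/png.sh -- hypothetical skeleton for a new format module.
# The orchestrator invokes every module with the same five arguments.
SOURCE="${1:-demo/source}"        # directory of original files
WORKING_DIR="${2:-demo/working}"  # scratch space for intermediates
PROCESSED_DIR="${3:-demo/out}"    # final output root
NOTILES="${4:-false}"             # "true" skips tileset generation
WATERMARK="${5:-false}"           # "true" applies the watermark

FILE_PATTERN="*.png"
mkdir -p "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR"

find "$SOURCE" -type f -iname "$FILE_PATTERN" \
        ! -path "$PROCESSED_DIR*" ! -path "$WORKING_DIR*" |
while read -r file; do
    echo "would process: $file"   # a real module converts here
done
```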