Process Files 4.0 — Batch File Processing for Digital Archives
How I built a modular Bash toolkit to convert messy archive collections into clean, web-ready access copies.
Every digital archive eventually hits the same problem: you have a directory full of original files — TIFFs from scanners, Word documents, spreadsheets, audio recordings — and you need web-friendly versions of all of them. Thumbnails, preview images, zoomable tilesets, PDFs, compressed audio. Doing this by hand doesn't scale.
Process Files is the tool I built to handle this. Point it at a source directory, tell it which formats to process, and it handles the rest.
The Architecture
The core idea is simple: a main orchestrator script (start.sh) that discovers and runs format-specific modules. Each module is a self-contained Bash script responsible for one file type.
```bash
# Run all modules against a source directory
./start.sh /path/to/collection

# Or pick specific formats with watermarking enabled
./start.sh -m tif,pdf -w /path/to/collection
```
The orchestrator parses CLI flags, sets up working and output directories, then loops through either all modules or just the ones you requested:
```bash
# If no modules specified, run them all
if [ -z "$MODULES" ]; then
  for module in "$MODULES_DIR"*.sh; do
    bash "$module" "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR" "$NOTILES" "$WATERMARK"
  done
else
  # Otherwise, run only the requested modules
  for i in $(echo "$MODULES" | sed "s/,/ /g"); do
    if [ -f "$MODULES_DIR$i.sh" ]; then
      bash "$MODULES_DIR$i.sh" "$SOURCE" "$WORKING_DIR" "$PROCESSED_DIR" "$NOTILES" "$WATERMARK"
    fi
  done
fi
```
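The flag parsing itself can be sketched with `getopts`. This is an illustration, not the real `start.sh`: the `-n` letter for disabling tiles and the variable defaults are assumptions based on the usage shown above.

```bash
# Sketch of start.sh flag parsing (simulated command line below;
# the -n flag letter and defaults are assumptions)
set -- -m tif,pdf -w /path/to/collection

MODULES=""
WATERMARK="false"
NOTILES="false"

while getopts "m:wn" opt; do
  case "$opt" in
    m) MODULES="$OPTARG" ;;   # comma-separated module list
    w) WATERMARK="true" ;;    # enable watermarking
    n) NOTILES="true" ;;      # skip Zoomify tileset generation
  esac
done
shift $((OPTIND - 1))
SOURCE="$1"                   # remaining argument: source directory
```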
Directory Structure Preservation
One of the trickier parts was preserving nested folder structures from the source. Archives aren't flat — they have subdirectories for series, boxes, folders. The output needs to mirror that hierarchy.
Each module walks the source directory, strips the base path to extract relative paths, and tracks unique directories and files in text manifests:
```bash
for file in $(find "$SOURCE" -type f -iname "$FILE_PATTERN" \
    ! -path "$PROCESSED_DIR*" ! -path "$WORKING_DIR*"); do

  raw=${file#"$SOURCE"}
  filename=${raw#"/"}
  directory=${raw%/*}
  directory=${directory#"/"}

  if [ -n "$directory" ]; then
    if ! grep -Fxq "$directory" "$MODULE_WORKING_DIRECTORY/directories.txt"; then
      echo "$directory" >> "$MODULE_WORKING_DIRECTORY/directories.txt"
    fi
  fi

  if ! grep -Fxq "$filename" "$MODULE_WORKING_DIRECTORY/files.txt"; then
    echo "$filename" >> "$MODULE_WORKING_DIRECTORY/files.txt"
  fi
done
```
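The parameter expansions are easier to follow with a concrete path plugged in (the values here are illustrative):

```bash
SOURCE="/archive/collection"
file="/archive/collection/series1/box2/scan01.tif"

raw=${file#"$SOURCE"}        # strip base path  -> /series1/box2/scan01.tif
filename=${raw#"/"}          # drop leading /   -> series1/box2/scan01.tif
directory=${raw%/*}          # drop last part   -> /series1/box2
directory=${directory#"/"}   # drop leading /   -> series1/box2

echo "$filename"
echo "$directory"
```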
These manifests drive the rest of the pipeline — directory creation, file conversion, and output placement all read from them.
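For instance, recreating the hierarchy under the output root is a straight read of the directories manifest. A minimal sketch, using temp paths in place of the real `$PROCESSED_DIR` and manifest:

```bash
# Mirror every manifest directory under the output root
PROCESSED_DIR=$(mktemp -d)   # stand-in for the real output directory
manifest=$(mktemp)           # stand-in for directories.txt
printf '%s\n' "series1/box2" "series1/box3" > "$manifest"

while read -r dir; do
  mkdir -p "$PROCESSED_DIR/$dir"
done < "$manifest"
```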
The Image Pipeline
The TIFF module is the most involved. For each source TIFF, it:
- Converts to JPEG (with optional watermark)
- Generates a 100x100 thumbnail
- Generates a 500px preview
- Creates a Zoomify tileset for deep-zoom viewing
```bash
while read -r line; do
  no_ext="${line%.*}"

  if [ "$WATERMARK" == "true" ]; then
    apply_watermark "$SOURCE/$line" "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg"
  else
    convert "$SOURCE/$line" "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg"
  fi

  convert -resize 100x100 "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg" \
    "$PROCESSED_DIR/100/$no_ext.jpg"
  convert -resize 500x500 "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg" \
    "$PROCESSED_DIR/500/$no_ext.jpg"

  if [ "$NOTILES" == "false" ]; then
    vips dzsave "$MODULE_WORKING_DIRECTORY/jpeg/$no_ext.jpg[autorotate=true]" \
      "$PROCESSED_DIR/zoom/$no_ext" --layout zoomify --vips-progress
  fi
done < "$MODULE_WORKING_DIRECTORY/files.txt"
```
The tileset generation uses libvips with the Zoomify layout, which produces a tile pyramid that can be loaded by any deep-zoom viewer. The autorotate=true flag handles images with EXIF rotation metadata so they display correctly regardless of how the scanner saved them.
Adaptive Watermarking
The watermark system picks between two watermark sizes based on image dimensions. Larger images get a smaller repeating watermark tile so it doesn't dominate the image, while smaller images get a proportionally larger one:
```bash
apply_watermark() {
  local infile="$1"
  local outfile="$2"
  local basew baseh wmfile

  basew=$(identify -format '%w' "$infile")
  baseh=$(identify -format '%h' "$infile")

  if [ "$baseh" -gt 1024 ] || [ "$basew" -gt 1024 ]; then
    wmfile="$SCRIPT_DIR/watermark_small.png"
  else
    wmfile="$SCRIPT_DIR/watermark.png"
  fi

  composite -tile "$wmfile" "$infile" "$outfile"
}
```
ImageMagick's composite -tile repeats the watermark PNG across the entire image surface — simple and effective.
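The size threshold is easy to exercise in isolation. Here `pick_watermark` is a hypothetical helper that mirrors the branch above without needing ImageMagick installed:

```bash
# Mirror of the dimension check in apply_watermark, pure Bash
pick_watermark() {
  local w=$1 h=$2
  if [ "$h" -gt 1024 ] || [ "$w" -gt 1024 ]; then
    echo "watermark_small.png"   # big image: small repeating tile
  else
    echo "watermark.png"         # small image: proportionally larger tile
  fi
}

pick_watermark 4000 3000
pick_watermark 800 600
```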
Document and Audio Conversion
The other modules follow the same pattern but with different tools. Word documents go through unoconv to produce PDFs, HTML files get converted via Pandoc and LaTeX, and WAV files are compressed to 320kbps MP3 with LAME:
```bash
# Word to PDF
unoconv -f pdf -o "$PROCESSED_DIR/pdf/$line.pdf" "$SOURCE/$line"

# WAV to MP3
lame -b 320 -h "$SOURCE/$line" "$PROCESSED_DIR/audio/$line.mp3"
```
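With this many external converters, one preflight step worth having is a dependency check before any module runs. A sketch (the tool list just matches the converters mentioned in this post):

```bash
# Warn early if any external converter is missing from PATH
missing=0
for tool in convert identify vips unoconv pandoc lame; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "missing dependency: $tool" >&2
    missing=1
  fi
done
# a real script would `exit "$missing"` here
```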
What I'd Do Differently
This tool has been through four major versions. If I were starting fresh, I'd probably reach for a language with better error handling and parallelism. Bash gets the job done, but its default behavior is to swallow errors silently, and there's no easy way to spread file processing across CPU cores.
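Both gaps can at least be narrowed from within Bash: `set -euo pipefail` turns silent failures into hard stops, and `xargs -P` fans work out across cores. A sketch, with `echo` standing in for the real per-file conversion step:

```bash
# Stop on errors, unset variables, and failed pipeline stages
set -euo pipefail

# Fan a manifest out across CPU cores; echo stands in for the
# per-file conversion work a module would do
manifest=$(mktemp)
printf '%s\n' scan01.tif scan02.tif scan03.tif > "$manifest"
xargs -P "$(nproc)" -n 1 echo converting < "$manifest" | sort
```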
That said, the dependency on external tools like ImageMagick, VIPS, and unoconv means Bash is actually a reasonable glue language here. The real work happens in those tools — the script is just orchestration.
The module system has held up well. Adding a new format means dropping a new .sh file into the modules directory. The orchestrator picks it up automatically. Sometimes the simplest architecture is the right one.