Automating DICOM Repository Mirrors on GitHub: Fork Sync, Snapshot Ingestion, and the Pipefail Trap

Tatsuhiko Arai (新井 竜彦)

· 13 min read

The Starting Point

When you're working with the DICOM standard, you end up accumulating a collection of related libraries and tools. modalware keeps the following repositories as forks for reference and development:

Repository                     What it is
modalware/dicom-validator-ts   TypeScript DICOM validator
modalware/cornerstone3D        Web-based medical image viewer
modalware/dcmjs                DICOM for the browser
modalware/dwv                  DICOM Web Viewer
modalware/dcmtk                The classic DICOM toolkit
modalware/dicomweb-client      DICOMweb client library
modalware/dicom-validator      DICOM conformance validator
modalware/pydicom              Python DICOM library
modalware/pynetdicom           Python DICOM networking
modalware/dicomParser          Lightweight DICOM parser
modalware/DVTk                 DICOM Validation Toolkit

Left unattended, forks drift from their upstreams. Comparing PRs or reading the latest source becomes increasingly awkward. The manual alternative — opening each repo, clicking "Sync fork," repeating — doesn't scale.

The solution: a dedicated management repository, modalware/fork-sync, that handles everything through GitHub Actions. The forks themselves stay untouched.


Part 1: Syncing Real Forks

gh repo sync --force

GitHub CLI ships a gh repo sync command that aligns a fork with its upstream in one shot. The --force flag matters here: without it, the command refuses to overwrite if the fork has any divergent commits. Since these repos are meant to be pure mirrors, force is the right default.

gh repo sync modalware/dicom-validator-ts --force

The workflow

Eleven repositories, one workflow, matrix strategy. fail-fast: false ensures that if one sync fails (say, a transient network issue), the other ten still run.

# .github/workflows/sync-forks.yml
name: Sync Upstream Forks

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at UTC 02:00 (JST 11:00)
  workflow_dispatch:       # Manual trigger available

jobs:
  sync:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repo:
          - dicom-validator-ts
          - cornerstone3D
          - dcmjs
          - dwv
          - dcmtk
          - dicomweb-client
          - dicom-validator
          - pydicom
          - pynetdicom
          - dicomParser
          - DVTk
      fail-fast: false
      max-parallel: 4
    name: Sync ${{ matrix.repo }}
    steps:
      - name: Sync fork with upstream
        run: gh repo sync modalware/${{ matrix.repo }} --force
        env:
          GH_TOKEN: ${{ secrets.SYNC_TOKEN }}

GITHUB_TOKEN — the default token a workflow gets automatically — only has write access to its own repository. Writing to other repos in the organization requires a Personal Access Token, stored here as SYNC_TOKEN.

The Workflows scope you'll miss if you don't look for it

The first test run failed on most repositories with:

Upstream commits contain workflow changes, which require the `workflow` scope
or permission to merge.

A Fine-grained PAT with only Contents: Write isn't enough. When upstream commits touch .github/workflows/ files, GitHub requires an additional Workflows permission. This separation is intentional: workflow files define what code runs in CI, so GitHub draws a deliberate permission boundary around them. Easy to miss because most token documentation doesn't highlight it.


Part 2: The Repo That Wasn't a Fork

One of the twelve repositories — dicom3tools — turned out not to be a GitHub fork at all. It was created by manually extracting an upstream tarball and pushing the contents. gh repo sync confirmed this immediately:

can't determine source repository for modalware/dicom3tools because repository is not fork

What dicom3tools is

dicom3tools is a suite of DICOM utilities maintained by David Clunie, who has been involved in shaping the DICOM standard itself for decades. There is no official GitHub repository. Instead, snapshots are published on his site as .tar.bz2 files with timestamps baked into the filename:

dicom3tools_1.00.snapshot.20250525134203.tar.bz2
dicom3tools_1.00.snapshot.20250526102624.tar.bz2
dicom3tools_1.00.snapshot.20260320044638.tar.bz2

modalware/dicom3tools started from one of these snapshots. The goal is to keep it current as new snapshots appear — each one becoming its own commit tagged 1.00.snapshot.YYYYMMDDHHMMSS.
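Deriving the tag from a filename needs no external tools; bash parameter expansion is enough. A quick sketch, using one of the real filenames above:

```shell
#!/usr/bin/env bash
# Derive the git tag and timestamp from a snapshot filename
# using plain bash parameter expansion.
snap="dicom3tools_1.00.snapshot.20250525134203.tar.bz2"

tag=${snap#dicom3tools_}   # strip the leading "dicom3tools_" prefix
tag=${tag%.tar.bz2}        # strip the trailing ".tar.bz2" suffix
ts=${tag##*.}              # the 14-digit timestamp is the last dot-field

echo "$tag"  # 1.00.snapshot.20250525134203
echo "$ts"   # 20250525134203
```

The workflow below uses sed for the same job; either form works, and the expansions avoid spawning a subprocess per filename.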

The snapshot workflow

The approach: scrape the index page for available snapshot filenames, compare against the latest git tag in the repository, download and apply any newer ones in chronological order.

# .github/workflows/sync-dicom3tools.yml
name: Sync dicom3tools Snapshots

on:
  schedule:
    - cron: '0 3 1 * *'  # Monthly on the 1st at UTC 03:00 (JST 12:00)
  workflow_dispatch:

jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout dicom3tools
        uses: actions/checkout@v4
        with:
          repository: modalware/dicom3tools
          token: ${{ secrets.SYNC_TOKEN }}
          fetch-depth: 0

      - name: Find and apply new snapshots
        shell: bash
        run: |
          set -euo pipefail

          BASE_URL="https://dclunie.com/dicom3tools/workinprogress"

          # Get the timestamp of the latest tag already in the repo
          LATEST_TS=$(git tag --sort=version:refname | grep -oE '[0-9]{14}' | tail -1 || echo "")

          # Scrape available snapshots from the index page
          SNAPSHOTS=$(curl -sf "$BASE_URL/index.html" \
            | grep -oE 'dicom3tools_1\.00\.snapshot\.[0-9]{14}\.tar\.bz2' \
            | sort -u)

          # Keep only snapshots newer than the latest tag
          NEW_SNAPSHOTS=""
          while IFS= read -r snap; do
            TS=$(echo "$snap" | grep -oE '[0-9]{14}')
            if [ -z "$LATEST_TS" ] || [[ "$TS" > "$LATEST_TS" ]]; then
              NEW_SNAPSHOTS+="$snap"$'\n'
            fi
          done <<< "$SNAPSHOTS"
          # "|| true" guards the empty case: with pipefail active, grep
          # exiting 1 on no matching lines would otherwise kill the script
          NEW_SNAPSHOTS=$(printf '%s' "$NEW_SNAPSHOTS" | grep -v '^$' | sort || true)

          [ -z "$NEW_SNAPSHOTS" ] && echo "No new snapshots." && exit 0

          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"

          while IFS= read -r snap; do
            [ -z "$snap" ] && continue
            TAG=$(echo "$snap" | sed 's/^dicom3tools_//; s/\.tar\.bz2$//')

            curl -fL --retry 3 --retry-delay 10 --max-time 120 \
              -o "$snap" "$BASE_URL/$snap"

            # Temporarily disable pipefail — see note below
            set +o pipefail
            TOP_DIR=$(tar tf "$snap" 2>/dev/null | head -1 | sed 's|/.*||')
            set -o pipefail

            git rm -rf --quiet . || true
            tar xf "$snap"
            rm -f "$snap"

            if [ -n "$TOP_DIR" ] && [ -d "$TOP_DIR" ]; then
              shopt -s dotglob nullglob
              mv "$TOP_DIR"/* . 2>/dev/null || true
              rmdir "$TOP_DIR" 2>/dev/null || true
              shopt -u dotglob nullglob
            fi

            git add -A
            git commit -m "snapshot: $TAG"
            git tag "$TAG"
          done <<< "$NEW_SNAPSHOTS"

          git push origin HEAD --tags

The first run caught up ten snapshots in one go. Subsequent monthly runs will pick up only what's newer than the latest tag, so reruns are safe.
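That rerun safety rests on bash's string comparison: because the timestamps are fixed-width YYYYMMDDHHMMSS, lexicographic order is chronological order, so `[[ "$TS" > "$LATEST_TS" ]]` is a valid "is newer" test. A sketch of the filter, using the snapshot timestamps listed earlier:

```shell
#!/usr/bin/env bash
# Fixed-width timestamps compare correctly as strings, so the
# [[ "$TS" > "$LATEST_TS" ]] test is a chronological comparison.
LATEST_TS="20250526102624"   # latest tag already in the repo

for TS in 20250525134203 20250526102624 20260320044638; do
  if [[ "$TS" > "$LATEST_TS" ]]; then
    echo "apply $TS"
  else
    echo "skip $TS"
  fi
done
# skip 20250525134203
# skip 20250526102624
# apply 20260320044638
```

Note the strict greater-than: a snapshot equal to the latest tag is skipped, which is what makes reruns idempotent.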

The pipefail trap

Three consecutive runs failed with exit code 2. The download was succeeding — curl's progress bar confirmed the full 1 MB received — but the very next step fell over:

tar: stdout: write error
##[error]Process completed with exit code 2.

The culprit was this line:

TOP_DIR=$(tar tf "$snap" | head -1 | sed 's|/.*||')

head -1 reads one line and exits, closing the read end of the pipe. tar is still writing the rest of the file listing to that pipe, but the write end is now broken. GNU tar reports the failed write as `tar: stdout: write error` and exits with status 2, its generic fatal-error code (a process actually killed by SIGPIPE would instead report 141, i.e. 128 + 13). With set -o pipefail active, a non-zero exit from any stage of a pipeline fails the whole pipeline. The command substitution inherits pipefail, so the failure propagates up and kills the script.
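The failure mode is easy to reproduce without tar. Any producer that writes more than the pipe buffer holds (roughly 64 KB on Linux) dies the moment head closes its end of the pipe:

```shell
#!/usr/bin/env bash
set -o pipefail

# seq writes several megabytes; head exits after one line, so seq's
# next write hits a closed pipe and seq is killed by SIGPIPE.
# pipefail surfaces the non-zero status of that left-hand stage.
seq 1 1000000 | head -1
echo "pipeline exit status: $?"   # typically 141 (128 + SIGPIPE)
```

Without pipefail, the same pipeline would report head's exit status (0) and the broken pipe would go unnoticed.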

The fix is to bracket the offending pipeline with a temporary pipefail suspension:

set +o pipefail
TOP_DIR=$(tar tf "$snap" 2>/dev/null | head -1 | sed 's|/.*||')
set -o pipefail

This pattern applies to any pipeline intentionally truncated early — head -N, tail -N, awk 'NR==1{exit}'. If pipefail is on, pipe truncation will silently kill your script unless you guard around it.


Setting Up the Personal Access Token

Both workflows need write access to repositories in the modalware organization. The right tool is a Fine-grained Personal Access Token.

Token configuration

Setting             Value
Token name          modalware-fork-sync (or anything descriptive)
Resource owner      modalware — select the organization, not your personal account
Repository access   Only select repositories → pick all 12 target repos
Contents            Read and write
Workflows           Read and write

Registering the secret

gh secret set SYNC_TOKEN --repo modalware/fork-sync --body "github_pat_..."

Or add it manually via the repository's Settings → Secrets and variables → Actions.


Cost

GitHub Actions pricing in brief:

Condition              Cost
Public repositories    Free, unlimited
Private repositories   2,000 minutes/month free; $0.008/minute beyond that

fork-sync is a public repository, so all Actions runs are free. In practice, syncing all eleven forks takes around 16 seconds. The monthly dicom3tools workflow adds a handful of seconds on months where new snapshots appear. Total compute cost: zero.


The Short Version

  1. Put the automation in a dedicated management repo; leave the target repos untouched.
  2. gh repo sync owner/repo --force handles any genuine GitHub fork in one step.
  3. A Fine-grained PAT needs both Contents: Write and Workflows: Write — the workflow scope is a separate permission boundary, and upstreams frequently have .github/ changes.
  4. Repos created from manually pushed tarballs aren't forks. For those, scrape the upstream source, compare timestamps against git tags, download and commit in order.
  5. set -o pipefail + head -1 = a broken pipe = a dead script (surfacing here as tar's fatal-error exit code 2). Wrap any pipeline you intentionally truncate with set +o pipefail / set -o pipefail.

The full source is at modalware/fork-sync.

About Tatsuhiko Arai (新井 竜彦)

Embedded software engineer (Qt, C/C++, Python). Medical imaging (DICOM) contractor. AWS All Certifications Engineer – Japan (2024–2025).

Copyright © 2026 Tatsuhiko Arai. All rights reserved.