Bank Statement OCR API

Upload a PDF bank statement, register a webhook, and receive structured per-transaction JSON when extraction finishes — dates, amounts, payees, running balances, plus reconciliation metadata.

  • PAT auth on the /api/v1/agent/* endpoints
  • Webhook-driven (no polling)
  • Multi-account statements, multi-page PDFs, scanned documents
  • Reconciliation flag per period (start/end balance match)

Prerequisites

  1. Generate a Personal Access Token under Account → API in the DocuClipper web app. Tokens look like dcp_AbCd…. Store it as the PAT env var.
  2. An HTTPS endpoint to receive webhooks. The runnable examples below use webhook.site as a throwaway receiver for testing — swap it for your own URL in production.

End-to-end example

The flow: register a webhook → upload via a presigned S3 URL → create the job → receive the bank_statement.extraction.completed event → verify the HMAC signature → clean up. The bash version below skips the signature check.

bash
#!/usr/bin/env bash
set -euo pipefail
PAT="${PAT:?Set PAT to your dcp_… token}"
BASE="https://www.docuclipper.com"
PDF="statement.pdf"

# 1. Throwaway public receiver for the demo. In production, point at your own HTTPS endpoint.
TOK=$(curl -s -X POST https://webhook.site/token -d '{}' -H 'Content-Type: application/json' | jq -r .uuid)
RECEIVER="https://webhook.site/$TOK"

cleanup() {
  [ -n "${WEBHOOK_ID:-}" ] && curl -s -X DELETE -H "Authorization: Bearer $PAT" "$BASE/api/v1/agent/webhooks/$WEBHOOK_ID" >/dev/null
  curl -s -X DELETE "https://webhook.site/token/$TOK" >/dev/null
}
trap cleanup EXIT

# 2. Register webhook
WEBHOOK_ID=$(curl -s -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
  -d "{\"url\":\"$RECEIVER\",\"events\":[\"bank_statement.extraction.completed\",\"document.extraction.failed\"]}" \
  "$BASE/api/v1/agent/webhooks" | jq -r .id)

# 3. Get presigned upload URL + PUT the file to S3
PRESIGN=$(curl -s -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
  -d "{\"filename\":\"$PDF\",\"mimetype\":\"application/pdf\"}" \
  "$BASE/api/v1/agent/documents/upload-url")
S3_URL=$(echo "$PRESIGN" | jq -r .url)
DOC_ID=$(echo "$PRESIGN" | jq -r .document.id)
curl -s -o /dev/null -X PUT -H "Content-Type: application/pdf" --data-binary "@$PDF" "$S3_URL"

# 4. Create job
JOB_ID=$(curl -s -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
  -d "{\"documents\":[$DOC_ID],\"jobType\":\"ExtractData\"}" \
  "$BASE/api/v1/agent/jobs" | jq -r .jobId)
echo "job $JOB_ID — waiting for bank_statement.extraction.completed…"

# 5. Wait for the webhook. NOTE: this bash flow skips HMAC verification —
#    see the Python or Node.js tab for production code.
DEADLINE=$(( $(date +%s) + 600 ))
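while [ "$(date +%s)" -lt "$DEADLINE" ]; do
  # --arg always passes $jid as a string, so compare job.id as a string too.
  # first(...) keeps only the first match and yields nothing if there is none.
  PAYLOAD=$(curl -s "https://webhook.site/token/$TOK/requests" | jq -r --arg jid "$JOB_ID" --arg ev "bank_statement.extraction.completed" '
    first(.data[]? | select(.headers["x-docuclipper-event"][0]==$ev) | select(.content|fromjson|(.job.id|tostring)==$jid) | .content)')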
  if [ -n "$PAYLOAD" ]; then echo "$PAYLOAD" | jq .data; exit 0; fi
  sleep 2
done
echo "timed out" >&2; exit 1
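
The bash flow above skips signature verification. As a minimal sketch of what that check could look like in Python, assuming the signature is a hex-encoded HMAC-SHA256 of the raw request body keyed with a webhook secret: this page does not state the algorithm, the encoding, the header that carries the signature, or where the secret comes from, so treat all of those as assumptions. verify_signature and the secret value are hypothetical names for illustration.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, received_sig: str, secret: str) -> bool:
    """Return True if received_sig matches the HMAC of raw_body.

    ASSUMPTION: hex-encoded HMAC-SHA256 over the exact raw request bytes.
    The real scheme and header name are not documented on this page.
    """
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    candidate = received_sig.strip().lower()
    # compare_digest takes constant time, so it does not leak how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode("ascii"), candidate.encode("utf-8"))


secret = "example-webhook-secret"  # hypothetical value
body = b'{"event":"bank_statement.extraction.completed"}'
good_sig = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()

print(verify_signature(body, good_sig, secret))         # prints: True
print(verify_signature(body + b" ", good_sig, secret))  # prints: False
```

Verify against the raw bytes as received, before any JSON parsing: re-serializing the body can change whitespace or key order and break the comparison.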

Webhook payload

The data field from a successful run on a real statement. The top level is keyed by documentId, then by account number.

json
{
  "2666982": {
    "2915192377": {
      "bankMode": {
        "transactions": [
          {
            "memo": "DDA RET eBayCommer 2535 North First Street San Jose CA 000000804502",
            "amount": 24.21,
            "date": "20221219000000",
            "name": "DDA RET eBayCommer North First",
            "payee": "DDA RET eBayCommer North First",
            "checkNumber": "",
            "balance": null,
            "category": "Uncategorized",
            "fitId": "6ef15fd676aab6a7",
            "descriptionLines": ["DDA RET eBayCommer 2535 North First Street San Jose CA 000000804502"],
            "pageNumber": 1
          }
          // … 136 more transactions
        ],
        "totalCredits": "6952.46",
        "totalDebits": "-2696.46",
        "numCredits": 1,
        "numDebits": 136
      },
      "metadata": [
        { "name": "endBalance",   "value": "4281.32",   "period": "0" },
        { "name": "endDate",      "value": "2023-01-16", "period": "0" },
        { "name": "isReconciled", "value": "1",         "period": "0" },
        { "name": "startBalance", "value": "2748.19",   "period": "0" },
        { "name": "startDate",    "value": "2022-12-19", "period": "0" }
      ]
    },
    "metadata": []
  }
}
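
To show how the nesting reads in practice, here is a minimal Python sketch that walks the data object and yields every transaction with its document ID and account number. It assumes only the shape shown in the sample above; iter_transactions is an illustrative helper name, not part of the API. Note that each document also carries a "metadata" list next to its account keys, so the walker skips any value that is not an account object.

```python
from typing import Any, Iterator


def iter_transactions(data: dict[str, Any]) -> Iterator[tuple[str, str, dict[str, Any]]]:
    """Yield (document_id, account_number, transaction) for each transaction."""
    for document_id, accounts in data.items():
        for account_number, account in accounts.items():
            # The document-level "metadata" key holds a list, not an account,
            # so only dict values that contain "bankMode" are treated as accounts.
            if not isinstance(account, dict) or "bankMode" not in account:
                continue
            for txn in account["bankMode"].get("transactions", []):
                yield document_id, account_number, txn


# Trimmed copy of the sample payload above.
sample = {
    "2666982": {
        "2915192377": {
            "bankMode": {
                "transactions": [
                    {"amount": 24.21, "date": "20221219000000", "fitId": "6ef15fd676aab6a7"}
                ]
            },
            "metadata": [],
        },
        "metadata": [],
    }
}

for doc_id, account, txn in iter_transactions(sample):
    print(doc_id, account, txn["date"], txn["amount"])
# prints: 2666982 2915192377 20221219000000 24.21
```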

Field reference

| Field | Type | Description |
| --- | --- | --- |
| [documentId][account].bankMode.transactions[] | array | Per-account transaction list |
| transactions[].date | string | Transaction date, YYYYMMDDHHMMSS string |
| transactions[].amount | number | Signed amount (negative = debit, positive = credit) |
| transactions[].memo | string | Raw line as printed on the statement |
| transactions[].payee | string | Normalized payee / counterparty |
| transactions[].checkNumber | string | Check number when applicable |
| transactions[].balance | number or null | Running balance if printed on the line |
| transactions[].fitId | string | Stable per-line fingerprint, safe to use for dedupe |
| [documentId][account].bankMode.totalCredits | string | Sum of credits across the period |
| [documentId][account].bankMode.totalDebits | string | Sum of debits across the period |
| [documentId][account].metadata[] | array | startBalance / endBalance / startDate / endDate / isReconciled, by sub-period |
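
Several of these fields need a small conversion before use: dates arrive as YYYYMMDDHHMMSS strings, totals arrive as strings, and fitId can drive deduplication. The sketch below shows one way to handle each in Python using only the standard library. parse_txn_date and dedupe_by_fit_id are illustrative helper names, and the duplicate in the example is made up to demonstrate the dedupe step.

```python
from datetime import datetime
from decimal import Decimal


def parse_txn_date(value: str) -> datetime:
    """Parse a YYYYMMDDHHMMSS string such as "20221219000000"."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


def dedupe_by_fit_id(transactions: list[dict]) -> list[dict]:
    """Keep the first transaction seen for each fitId, preserving order."""
    seen: set[str] = set()
    unique = []
    for txn in transactions:
        if txn["fitId"] not in seen:
            seen.add(txn["fitId"])
            unique.append(txn)
    return unique


txns = [
    {"fitId": "6ef15fd676aab6a7", "date": "20221219000000", "amount": 24.21},
    # Hypothetical repeat of the same line, e.g. from a redelivered webhook.
    {"fitId": "6ef15fd676aab6a7", "date": "20221219000000", "amount": 24.21},
]
unique = dedupe_by_fit_id(txns)
print(len(unique), parse_txn_date(unique[0]["date"]).date())  # prints: 1 2022-12-19

# Totals are strings; Decimal keeps the cents exact where float may not.
print(Decimal("6952.46") + Decimal("-2696.46"))  # prints: 4256.00
```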

Notes & gotchas

  • The webhook payload is the data. You usually don't need to call GET /agent/jobs/:id/data afterwards — the full extraction lives in the webhook body. The data endpoint is a fallback for missed deliveries.
  • Multi-account statements are returned as multiple keys under one documentId, one per account number.
  • Reconciliation. metadata.isReconciled = "1" means the start balance, the transactions, and the end balance add up for that period. Use it as a downstream confidence signal.
  • Failure events. Always include document.extraction.failed in your webhook subscription so you don't silently miss errors.
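
The reconciliation note can be turned into a simple check. This is a minimal Python sketch, assuming the metadata array has the name/value/period shape shown in the sample payload; metadata_by_period and unreconciled_periods are illustrative helper names, not part of the API.

```python
from collections import defaultdict


def metadata_by_period(metadata: list[dict]) -> dict[str, dict[str, str]]:
    """Group {"name", "value", "period"} entries into {period: {name: value}}."""
    periods: dict[str, dict[str, str]] = defaultdict(dict)
    for entry in metadata:
        periods[entry["period"]][entry["name"]] = entry["value"]
    return dict(periods)


def unreconciled_periods(metadata: list[dict]) -> list[str]:
    """Return the periods whose isReconciled value is anything other than "1"."""
    grouped = metadata_by_period(metadata)
    return [period for period, fields in grouped.items() if fields.get("isReconciled") != "1"]


# Account-level metadata copied from the sample payload.
metadata = [
    {"name": "endBalance", "value": "4281.32", "period": "0"},
    {"name": "endDate", "value": "2023-01-16", "period": "0"},
    {"name": "isReconciled", "value": "1", "period": "0"},
    {"name": "startBalance", "value": "2748.19", "period": "0"},
    {"name": "startDate", "value": "2022-12-19", "period": "0"},
]

print(metadata_by_period(metadata)["0"]["endDate"])  # prints: 2023-01-16
print(unreconciled_periods(metadata))                # prints: []
```

A non-empty result from unreconciled_periods is a reasonable trigger for manual review of that account before the transactions flow downstream.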