Bank Statement OCR API
Upload a PDF bank statement, register a webhook, and receive structured per-transaction JSON when extraction finishes — dates, amounts, payees, running balances, plus reconciliation metadata.
- PAT auth on the /api/v1/agent/* endpoints
- Webhook-driven (no polling)
- Multi-account statements, multi-page PDFs, scanned documents
- Reconciliation flag per period (start/end balance match)
Prerequisites
- Generate a Personal Access Token under Account → API in the DocuClipper web app. Tokens look like dcp_AbCd…. Store it in the PAT environment variable.
- An HTTPS endpoint to receive webhooks. The runnable examples below use webhook.site as a throwaway receiver for testing; swap it for your own URL in production.
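Whatever endpoint you use, its handler should check the webhook's HMAC signature before trusting the body. This page does not spell out the signing scheme, so the following is only a minimal sketch under stated assumptions: HMAC-SHA256, hex-encoded, computed over the raw request body with a shared secret. The function name `verify_signature` is hypothetical, and where the signature and the secret come from (header name, dashboard setting) is deliberately left out; check both against the actual webhook documentation.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, signature_hex: str, secret: str) -> bool:
    """Return True if signature_hex matches HMAC-SHA256(secret, raw_body).

    ASSUMPTION: SHA-256, hex encoding, and signing of the raw body are
    guesses for illustration; the real scheme may differ.
    """
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = signature_hex.strip().lower()
    # compare_digest runs in constant time, so a mismatch does not leak
    # how many leading characters were correct.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

Verify against the raw bytes as received, before any JSON parsing or re-serialization, since even a whitespace change alters the digest.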
End-to-end example
Registers a webhook → uploads via presigned S3 → creates the job → receives the bank_statement.extraction.completed event → verifies the HMAC signature → cleans up. The bash version below skips the signature check.
bash
#!/usr/bin/env bash
set -euo pipefail
PAT="${PAT:?Set PAT to your dcp_… token}"
BASE="https://www.docuclipper.com"
PDF="statement.pdf"
# 1. Throwaway public receiver for the demo. In production, point at your own HTTPS endpoint.
TOK=$(curl -s -X POST https://webhook.site/token -d '{}' -H 'Content-Type: application/json' | jq -r .uuid)
RECEIVER="https://webhook.site/$TOK"
cleanup() {
[ -n "${WEBHOOK_ID:-}" ] && curl -s -X DELETE -H "Authorization: Bearer $PAT" "$BASE/api/v1/agent/webhooks/$WEBHOOK_ID" >/dev/null
curl -s -X DELETE "https://webhook.site/token/$TOK" >/dev/null
}
trap cleanup EXIT
# 2. Register webhook
WEBHOOK_ID=$(curl -s -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
-d "{\"url\":\"$RECEIVER\",\"events\":[\"bank_statement.extraction.completed\",\"document.extraction.failed\"]}" \
"$BASE/api/v1/agent/webhooks" | jq -r .id)
# 3. Get presigned upload URL + PUT the file to S3
PRESIGN=$(curl -s -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
-d "{\"filename\":\"$PDF\",\"mimetype\":\"application/pdf\"}" \
"$BASE/api/v1/agent/documents/upload-url")
S3_URL=$(echo "$PRESIGN" | jq -r .url)
DOC_ID=$(echo "$PRESIGN" | jq -r .document.id)
curl -s -o /dev/null -X PUT -H "Content-Type: application/pdf" --data-binary "@$PDF" "$S3_URL"
# 4. Create job
JOB_ID=$(curl -s -X POST -H "Authorization: Bearer $PAT" -H "Content-Type: application/json" \
-d "{\"documents\":[$DOC_ID],\"jobType\":\"ExtractData\"}" \
"$BASE/api/v1/agent/jobs" | jq -r .jobId)
echo "job $JOB_ID — waiting for bank_statement.extraction.completed…"
# 5. Wait for the webhook. NOTE: this bash flow skips HMAC verification —
# see the Python or Node.js tab for production code.
DEADLINE=$(( $(date +%s) + 600 ))
while [ "$(date +%s)" -lt "$DEADLINE" ]; do
# --arg passes strings, so job.id is compared as a string; fromjson? skips non-JSON bodies.
PAYLOAD=$(curl -s "https://webhook.site/token/$TOK/requests" | jq -r --arg jid "$JOB_ID" --arg ev "bank_statement.extraction.completed" '
first(.data[]? | select(.headers["x-docuclipper-event"][0]==$ev) | select((.content|fromjson?|.job.id|tostring)==$jid) | .content)')
if [ -n "$PAYLOAD" ]; then echo "$PAYLOAD" | jq .data; exit 0; fi
sleep 2
done
echo "timed out" >&2; exit 1
Webhook payload
The data field of the event payload from a successful run on a real statement. The top level is keyed by documentId, then by account number.
json
{
"2666982": {
"2915192377": {
"bankMode": {
"transactions": [
{
"memo": "DDA RET eBayCommer 2535 North First Street San Jose CA 000000804502",
"amount": 24.21,
"date": "20221219000000",
"name": "DDA RET eBayCommer North First",
"payee": "DDA RET eBayCommer North First",
"checkNumber": "",
"balance": null,
"category": "Uncategorized",
"fitId": "6ef15fd676aab6a7",
"descriptionLines": ["DDA RET eBayCommer 2535 North First Street San Jose CA 000000804502"],
"pageNumber": 1
}
// … 136 more transactions
],
"totalCredits": "6952.46",
"totalDebits": "-2696.46",
"numCredits": 1,
"numDebits": 136
},
"metadata": [
{ "name": "endBalance", "value": "4281.32", "period": "0" },
{ "name": "endDate", "value": "2023-01-16", "period": "0" },
{ "name": "isReconciled", "value": "1", "period": "0" },
{ "name": "startBalance", "value": "2748.19", "period": "0" },
{ "name": "startDate", "value": "2022-12-19", "period": "0" }
]
},
"metadata": []
}
}
Field reference
| Field | Type | Description |
|---|---|---|
| [documentId][account].bankMode.transactions[] | array | Per-account transaction list |
| transactions[].date | string | Transaction date as a YYYYMMDDHHMMSS string, e.g. 20221219000000 |
| transactions[].amount | number | Signed amount (negative = debit, positive = credit) |
| transactions[].memo | string | Raw line as printed on the statement |
| transactions[].payee | string | Normalized payee / counterparty |
| transactions[].checkNumber | string | Check number when applicable |
| transactions[].balance | number or null | Running balance if printed on the line |
| transactions[].fitId | string | Stable per-line fingerprint, safe to use for dedupe |
| [documentId][account].bankMode.totalCredits | string | Sum of credits across the period |
| [documentId][account].bankMode.totalDebits | string | Sum of debits across the period |
| [documentId][account].metadata[] | array | startBalance / endBalance / startDate / endDate / isReconciled, by sub-period |
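To make the nesting concrete, here is a minimal Python sketch that flattens the data field into one record per transaction and drops lines already seen. The function names `flatten_transactions` and `drop_seen` and the output keys are hypothetical; keying the dedupe on (account, fitId) is an assumption about how far fitId uniqueness extends. The sample mirrors the payload above, trimmed to one transaction.

```python
from datetime import datetime
from decimal import Decimal


def flatten_transactions(data: dict) -> list[dict]:
    """Flatten {documentId: {account: {...}}} into one record per transaction."""
    rows = []
    for document_id, accounts in data.items():
        for account, body in accounts.items():
            # The sample payload has a document-level "metadata" list next to
            # the account keys; skip anything without a bankMode object.
            if not isinstance(body, dict) or "bankMode" not in body:
                continue
            for tx in body["bankMode"].get("transactions", []):
                rows.append({
                    "document_id": document_id,
                    "account": account,
                    "fit_id": tx["fitId"],
                    "date": datetime.strptime(tx["date"], "%Y%m%d%H%M%S").date(),
                    # Go through str() so Decimal sees 24.21, not the float's
                    # full binary expansion.
                    "amount": Decimal(str(tx["amount"])),
                    "payee": tx.get("payee", ""),
                    "balance": tx.get("balance"),
                })
    return rows


def drop_seen(rows: list[dict], seen: set) -> list[dict]:
    """Keep rows whose (account, fitId) is not in `seen`; record the new ones."""
    fresh = []
    for row in rows:
        key = (row["account"], row["fit_id"])
        if key not in seen:
            seen.add(key)
            fresh.append(row)
    return fresh


sample = {
    "2666982": {
        "2915192377": {
            "bankMode": {"transactions": [{
                "amount": 24.21,
                "date": "20221219000000",
                "payee": "DDA RET eBayCommer North First",
                "balance": None,
                "fitId": "6ef15fd676aab6a7",
            }]},
            "metadata": [],
        },
        "metadata": [],
    }
}

rows = flatten_transactions(sample)
print(rows[0]["date"], rows[0]["amount"])  # 2022-12-19 24.21
```

Persisting the `seen` set between deliveries (for example in a database unique index) would make webhook retries idempotent.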
Notes & gotchas
- The webhook payload is the data. You usually don't need to call GET /agent/jobs/:id/data afterwards; the full extraction lives in the webhook body. The data endpoint is a fallback for missed deliveries.
- Multi-account statements are returned as multiple keys under one documentId, one per account number.
- Reconciliation. A metadata entry named isReconciled with value "1" means the start balance plus the period's transactions matches the end balance. Use it as a downstream confidence signal.
- Failure events. Always include document.extraction.failed in your webhook subscription so you don't silently miss errors.
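The multi-account and reconciliation notes can be combined into one check. The sketch below groups each account's metadata entries by period and lists the periods whose isReconciled flag is not "1". The helper names `periods_by_account` and `unreconciled` are hypothetical, and the second account in the example ("ACCT-B") is made up purely to show a failing case.

```python
from collections import defaultdict


def periods_by_account(data: dict) -> dict:
    """Map (documentId, account) to {period: {name: value}} from metadata[]."""
    out = {}
    for document_id, accounts in data.items():
        for account, body in accounts.items():
            if not isinstance(body, dict):
                continue  # document-level "metadata" list, not an account
            periods = defaultdict(dict)
            for entry in body.get("metadata", []):
                periods[entry["period"]][entry["name"]] = entry["value"]
            out[(document_id, account)] = dict(periods)
    return out


def unreconciled(data: dict) -> list[tuple]:
    """Return (documentId, account, period) triples not flagged as reconciled."""
    return [
        (doc, acct, period)
        for (doc, acct), periods in periods_by_account(data).items()
        for period, fields in periods.items()
        if fields.get("isReconciled") != "1"  # values are strings in the payload
    ]


example = {
    "2666982": {
        "2915192377": {"metadata": [
            {"name": "startBalance", "value": "2748.19", "period": "0"},
            {"name": "endBalance", "value": "4281.32", "period": "0"},
            {"name": "isReconciled", "value": "1", "period": "0"},
        ]},
        "ACCT-B": {"metadata": [  # invented account for illustration
            {"name": "isReconciled", "value": "0", "period": "0"},
        ]},
        "metadata": [],
    }
}

print(unreconciled(example))  # [('2666982', 'ACCT-B', '0')]
```

Routing anything returned by `unreconciled` to manual review, rather than rejecting it outright, keeps the flag a confidence signal as the note suggests.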