How to Build a Payment Processing System: A Production Engineering Blueprint

Alex Kugell

April 28, 2026

1. Payment Intent and Idempotency: Preventing Double Charges Before They Happen

Every production payment processing system starts with a payment intent. This is a record created in the database before any PSP call gets made.

The payment intent is what separates a production system from a gateway wrapper. Skipping it opens the door to double charges, one of the most expensive failure modes we have encountered in payment engineering.

Why intent-first matters

In a gateway wrapper, the flow is that the user clicks Pay, the application calls PSP, PSP responds, and the application stores the result.

The problem appears when the PSP call succeeds, but the response gets lost. Usually, we see this happening because of an issue like a network drop, an application crash, or a basic timeout.

In any of these cases, the application doesn't know whether the payment occurred. When the user retries, a second PSP call goes out, and a duplicate charge occurs.

In an intent-first system, the flow needs to be slightly different to prevent this. The user clicks Pay, the application creates a payment intent record, the application calls PSP with the intent's idempotency key, PSP responds, and finally the application updates the intent record.

If the response gets lost and the user retries, the application retrieves the existing intent, calls the PSP with the same idempotency key, and the PSP returns the result of the first and only processing attempt.
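The intent-first flow above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `FakePSP`, `open_db`, `pay`, and the `payment_intents` schema are hypothetical names, an in-memory SQLite database stands in for the real intent store, and the fake PSP simply deduplicates by idempotency key the way the text describes.

```python
import sqlite3
import uuid


class FakePSP:
    """Hypothetical stand-in for a PSP that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_minor):
        # A known key returns the stored result instead of charging again.
        if idempotency_key not in self._results:
            self.charges_made += 1
            self._results[idempotency_key] = {
                "status": "AUTHORIZED",
                "psp_reference": f"psp_{self.charges_made}",
            }
        return self._results[idempotency_key]


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE payment_intents ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount_minor INTEGER NOT NULL,"
        " status TEXT NOT NULL,"
        " psp_reference TEXT)"
    )
    return db


def pay(db, psp, idempotency_key, amount_minor, lose_response=False):
    # 1. Record the intent before any PSP call; a retry reuses the same row.
    db.execute(
        "INSERT OR IGNORE INTO payment_intents VALUES (?, ?, 'PENDING', NULL)",
        (idempotency_key, amount_minor),
    )
    db.commit()
    # 2. Call the PSP with the intent's idempotency key.
    result = psp.charge(idempotency_key, amount_minor)
    if lose_response:
        # Simulates a network drop after the PSP has processed the charge.
        raise TimeoutError("response lost")
    # 3. Update the intent with the outcome.
    db.execute(
        "UPDATE payment_intents SET status = ?, psp_reference = ?"
        " WHERE idempotency_key = ?",
        (result["status"], result["psp_reference"], idempotency_key),
    )
    db.commit()
    return result
```

Calling `pay` once with `lose_response=True` and then again with the same key leaves `psp.charges_made` at 1: the retry finds the existing intent and the PSP returns the result of the first attempt.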

Three-layer idempotency enforcement

Production-grade idempotency needs three independent enforcement points to work reliably. Here is what that looks like:

API gateway layer: deduplicates incoming requests by client-generated idempotency key (UUID, 24-hour TTL). It returns the cached response for a known key without forwarding to the payment service.

Payment service layer: checks for an existing payment intent with the same idempotency key before submitting to the PSP. If the intent already exists, it returns its current state without creating a new one.

PSP layer: the PSP itself deduplicates by the idempotency key passed in the API call. Even if layers one and two fail, the PSP won't charge twice for the same key.
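As a rough sketch of the first of these layers, the gateway can keep a key-to-response cache with a TTL. The `IdempotencyCache` class and `handle_request` function below are hypothetical, and an in-memory dictionary stands in for whatever shared store the gateway would really use. The sketch also ignores two requests racing with the same key, which a real gateway has to handle.

```python
import time


class IdempotencyCache:
    """In-memory stand-in for the gateway's idempotency key store."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}  # key -> (expires_at, cached_response)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            # Expired keys are dropped and treated as unseen.
            del self._entries[key]
            return None
        return response

    def put(self, key, response):
        self._entries[key] = (self._clock() + self._ttl, response)


def handle_request(cache, idempotency_key, forward):
    """Return the cached response for a known key, else forward once."""
    cached = cache.get(idempotency_key)
    if cached is not None:
        return cached
    response = forward()  # call into the payment service
    cache.put(idempotency_key, response)
    return response
```

The payment service and the PSP still perform their own checks, so a cache miss here (for example after a restart) does not by itself cause a second charge.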

The write-ahead log gap

Three-layer idempotency prevents most duplicate charges, but one failure mode remains.

In this case, the PSP processes the charge, but the database write fails before it commits.

On retry, the intent doesn't exist in the database, so the system creates a new intent and submits a second PSP call with a different key, which ultimately leads to a second, duplicate charge.

The most common production solution that our engineers implement here is a write-ahead log or outbox record, which is created before the PSP call and deleted only after the database write succeeds.

On startup or after a crash, unresolved write-ahead records trigger a PSP status query ("what happened to this payment?") rather than a new charge attempt.

For this to work, idempotency keys must be generated by the client, not the server, so that the same key travels through every retry.
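The recovery path can be sketched as follows. The table names (`psp_wal`, `payment_intents`), the function names, and the `query_psp_status` callback are all assumptions made for illustration; SQLite stands in for the production database.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE psp_wal ("
        " idempotency_key TEXT PRIMARY KEY, amount_minor INTEGER NOT NULL)"
    )
    db.execute(
        "CREATE TABLE payment_intents ("
        " idempotency_key TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return db


def begin_charge(db, key, amount_minor):
    # Written and committed BEFORE the PSP call goes out.
    db.execute("INSERT OR IGNORE INTO psp_wal VALUES (?, ?)", (key, amount_minor))
    db.commit()


def complete_charge(db, key, status):
    # The intent write and the write-ahead delete commit together,
    # so the record only disappears once the outcome is stored.
    with db:
        db.execute(
            "INSERT OR REPLACE INTO payment_intents VALUES (?, ?)", (key, status)
        )
        db.execute("DELETE FROM psp_wal WHERE idempotency_key = ?", (key,))


def recover(db, query_psp_status):
    """On startup, resolve leftover records by asking the PSP, never by re-charging."""
    resolved = []
    rows = db.execute("SELECT idempotency_key FROM psp_wal").fetchall()
    for (key,) in rows:
        status = query_psp_status(key)
        complete_charge(db, key, status)
        resolved.append((key, status))
    return resolved
```

If the process dies between `begin_charge` and `complete_charge`, the leftover row is what tells `recover` to issue a status query for that key.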

2. PSP Integration: Abstraction, Retry Strategy, and Multi-PSP Routing

Every payment service provider exposes a different API surface, response format, error taxonomy, and behavior under failure.

Stripe returns a 402 Payment Required with a structured error object. Adyen returns 200 OK with a resultCode field. Braintree uses its own object hierarchy.

If you build your payment processing logic directly against any one of these, you contaminate the business logic with provider-specific parsing. Adding a second PSP later then means reworking your core processing logic.

The production solution we often end up implementing here is a PSP adapter layer that translates provider-specific responses into a canonical internal domain model.

Each PSP gets its own adapter responsible for constructing the provider-specific API request from the canonical payment intent, translating the provider response into a canonical result (status: AUTHORIZED / DECLINED_SOFT / DECLINED_HARD / ERROR, psp_reference, raw_response), and mapping provider error codes to canonical decline classifications.

This means your orchestration layer interacts only with the canonical model. Adding a new PSP means writing a new adapter, without touching your core payment processing logic at all.
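A minimal sketch of such an adapter layer is shown below. The two adapters model the two response styles mentioned above (a 402 with an error object, and a 200 with a result code), but the field names and decline codes are simplified assumptions, not the real providers' schemas. Treating every unrecognized decline code as hard is also a choice made for this sketch, so that unknown codes are never retried automatically.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol


class Status(Enum):
    AUTHORIZED = "AUTHORIZED"
    DECLINED_SOFT = "DECLINED_SOFT"
    DECLINED_HARD = "DECLINED_HARD"
    ERROR = "ERROR"


@dataclass
class CanonicalResult:
    status: Status
    psp_reference: Optional[str]
    raw_response: dict


class PSPAdapter(Protocol):
    def to_canonical(self, http_status: int, body: dict) -> CanonicalResult: ...


class StatusCodeStyleAdapter:
    """Provider that signals declines with HTTP 402 and an error object."""

    SOFT_CODES = {"insufficient_funds", "issuer_unavailable"}  # illustrative

    def to_canonical(self, http_status, body):
        if http_status == 200:
            return CanonicalResult(Status.AUTHORIZED, body.get("id"), body)
        if http_status == 402:
            code = body.get("error", {}).get("decline_code")
            soft = code in self.SOFT_CODES
            status = Status.DECLINED_SOFT if soft else Status.DECLINED_HARD
            return CanonicalResult(status, None, body)
        return CanonicalResult(Status.ERROR, None, body)


class ResultCodeStyleAdapter:
    """Provider that always returns 200 and reports the outcome in the body."""

    SOFT_REASONS = {"insufficient_funds", "issuer_unavailable"}  # illustrative

    def to_canonical(self, http_status, body):
        if http_status != 200:
            return CanonicalResult(Status.ERROR, None, body)
        reference = body.get("pspReference")
        if body.get("resultCode") == "Authorised":
            return CanonicalResult(Status.AUTHORIZED, reference, body)
        soft = body.get("refusalReason") in self.SOFT_REASONS
        status = Status.DECLINED_SOFT if soft else Status.DECLINED_HARD
        return CanonicalResult(status, reference, body)
```

The orchestration layer only ever sees `CanonicalResult`, so it has no way to branch on provider-specific fields.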

Soft declines versus hard declines

The right retry strategy for a failed payment depends on why it failed.

Soft declines can generally be resolved. Examples include insufficient funds at the moment of the attempt, an issuer requesting 3DS authentication, or temporary issuer unavailability.

Hard declines, however, won't resolve no matter how often you retry. Common examples include a stolen card, an invalid card number, or a closed account.

You should under no circumstances retry hard declines automatically. Repeated attempts on a stolen card trigger fraud alerts and can get a merchant account flagged or even terminated.

The adapter layer maps each PSP's error codes to soft or hard declines. That classification drives the retry logic in the orchestration layer.
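A retry loop driven by that classification might look like the sketch below. The function name, the attempt limit, and the exponential backoff are assumptions for illustration. Soft declines that need customer action, such as a 3DS challenge, would be handed back to the user rather than retried in a loop like this.

```python
import time


def charge_with_retry(attempt_charge, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry only soft declines; return (final_status, attempts_used).

    attempt_charge is a hypothetical callable that returns a canonical
    status string such as "AUTHORIZED" or "DECLINED_SOFT".
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        status = attempt_charge()
        if status != "DECLINED_SOFT":
            # Authorized, hard decline, or error: never retried here.
            return status, attempt
        if attempt < max_attempts:
            # Exponential backoff between soft-decline retries.
            sleep(base_delay * 2 ** (attempt - 1))
    return status, max_attempts
```

A hard decline returns after exactly one attempt, which is the behavior the text calls for.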

Circuit breaking and multi-PSP routing

When a PSP starts showing elevated error rates (5xx responses, timeouts, or connection failures), retrying with the same provider often makes things worse.

A circuit breaker monitors PSP health so that, after N consecutive failures within a window, you can route traffic to a secondary PSP for a cooldown period before re-testing the primary.
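A stripped-down breaker is sketched below. The class and parameter names are hypothetical, and the sketch simplifies in two ways: it counts consecutive failures without a time window, and after the cooldown it lets all traffic back to the primary rather than a single probe request.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the cooldown can be tested
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_primary(self):
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, let traffic re-test the primary.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self):
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self):
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self._opened_at = self._clock()


def choose_psp(breaker):
    return "primary" if breaker.allow_primary() else "secondary"
```

If the primary fails again right after the cooldown, `record_failure` re-opens the circuit immediately because the failure count was never reset.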

Multi-PSP routing also enables authorization rate optimization.

You can automatically route specific BIN ranges, currencies, or transaction types to the PSP that historically delivers the best authorization rate for those characteristics, giving it the best chance of success.

Your approval rates will likely increase by only a percentage point or two, but across hundreds of thousands of transactions, this can make a massive difference.
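One simple way to express this kind of routing is a lookup over historical authorization counts. The `auth_stats` layout, the `pick_psp` name, and the minimum sample size below are assumptions made for the sketch; real routing would also consider cost, breaker state, and recency of the data.

```python
def pick_psp(auth_stats, bin_prefix, currency, default="primary", min_samples=100):
    """Pick the PSP with the best historical authorization rate.

    auth_stats maps (psp, bin_prefix, currency) -> (approved, attempted).
    Segments with fewer than min_samples attempts are ignored as too noisy.
    """
    rates = {}
    for (psp, stat_bin, stat_currency), (approved, attempted) in auth_stats.items():
        if stat_bin == bin_prefix and stat_currency == currency and attempted >= min_samples:
            rates[psp] = approved / attempted
    if not rates:
        # No reliable history for this segment: fall back to the default PSP.
        return default
    return max(rates, key=rates.get)
```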

3. The Payment State Machine: Explicit Transitions, No Implicit States

A payment doesn't have just two states, success and failure. Instead, it passes through a sequence of states, each representing a distinct financial and operational condition, with specific actions required and specific transitions permitted at each step.

The production payment state model:

Each transition carries an explicit and named event, so there are no implicit state changes.

Each state has defined permitted transitions (what states can follow, under what conditions), prohibited transitions (a SETTLED payment cannot transition directly to AUTHORIZED; a REFUNDED payment cannot be re-CAPTURED), and required side effects (a CAPTURED payment must trigger a ledger posting; a REFUND_INITIATED must reduce the available refund amount).
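An explicit transition table is one way to enforce this. The state names below come from this article, but the exact set of states, the event names, and the transition graph are assumptions for illustration; your own model will differ.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    AUTHORIZING = "AUTHORIZING"
    AUTHORIZED = "AUTHORIZED"
    DECLINED = "DECLINED"
    CAPTURED = "CAPTURED"
    SETTLED = "SETTLED"
    REFUND_INITIATED = "REFUND_INITIATED"
    REFUNDED = "REFUNDED"


# (current state, named event) -> next state. Anything absent is prohibited.
TRANSITIONS = {
    (State.PENDING, "authorization_requested"): State.AUTHORIZING,
    (State.AUTHORIZING, "authorization_succeeded"): State.AUTHORIZED,
    (State.AUTHORIZING, "authorization_declined"): State.DECLINED,
    (State.AUTHORIZED, "capture_succeeded"): State.CAPTURED,
    (State.CAPTURED, "settlement_confirmed"): State.SETTLED,
    (State.CAPTURED, "refund_requested"): State.REFUND_INITIATED,
    (State.SETTLED, "refund_requested"): State.REFUND_INITIATED,
    (State.REFUND_INITIATED, "refund_confirmed"): State.REFUNDED,
}


class IllegalTransition(Exception):
    pass


def apply_event(state, event):
    """Return the next state, or raise if the transition is not permitted."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise IllegalTransition(f"{state.value} does not accept {event!r}") from None
```

Because every permitted move is a row in the table, a prohibited transition such as re-capturing a REFUNDED payment fails loudly instead of silently changing state.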

The stuck payment problem

One of the most common production failure modes that we see here happens when a payment reaches AUTHORIZING, the PSP call fires, the response gets lost, and the payment stays stuck in AUTHORIZING indefinitely.

Meanwhile, the PSP has processed the authorization and is waiting for capture. The payment is live at the PSP, but this is entirely invisible to the application.

To fix this, we create a background worker that periodically scans for payments stuck in transitional states beyond a timeout threshold.

For each stuck payment that the worker finds, it queries the PSP status API directly, and the PSP's answer drives the state transition in place of the original response that never arrived.

This pull-based reconciliation treats the PSP as the source of truth for the payment state. It produces eventual consistency even after application crashes, network failures, and lost responses, without requiring the user to re-attempt.
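One pass of such a worker can be sketched as below. The `Payment` record, the `query_psp_status` callback, the five-minute timeout, and the PSP status strings are all hypothetical; a real worker would read from the database and apply the change through the state machine rather than mutating objects directly.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    payment_id: str
    state: str
    updated_at: float  # seconds, on the same clock as `now`


# Hypothetical mapping from PSP status strings to internal states.
PSP_TO_STATE = {"authorized": "AUTHORIZED", "declined": "DECLINED"}


def reconcile_stuck(payments, query_psp_status, now, timeout_seconds=300.0):
    """Resolve payments stuck in AUTHORIZING by asking the PSP what happened."""
    resolved = []
    for payment in payments:
        stuck = (
            payment.state == "AUTHORIZING"
            and now - payment.updated_at >= timeout_seconds
        )
        if not stuck:
            continue
        new_state = PSP_TO_STATE.get(query_psp_status(payment.payment_id))
        if new_state is None:
            # Unknown or still-pending status: leave it for the next scan.
            continue
        payment.state = new_state
        payment.updated_at = now
        resolved.append(payment.payment_id)
    return resolved
```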

4. The Ledger: Financial Accuracy Under Every Payment Event

Every payment event that changes the financial state of the system requires a corresponding ledger entry. Authorization places a hold on funds.

Capture reduces the hold and increases the ledger balance. Settlement moves funds from pending to settled. Refund and chargeback each generate reversal entries.

To do this correctly, in a way that is going to hold up under regulatory scrutiny, there are two foundational requirements:

Double-entry for every event: Each payment event produces a balanced journal entry. Authorization creates a debit to the customer's available balance and a credit to a holds account.

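Amounts as strings across all API contracts: The payment system's internal representation of money should use NUMERIC or DECIMAL types in the database. In API contracts, amounts must travel as strings in minor currency units (for example, "1999" for $19.99), so that they are never parsed as floating-point numbers along the way.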
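Both requirements can be illustrated together. In the sketch below, the account names, the `Line` record, and the helper functions are hypothetical, and a plain list stands in for the ledger table. The authorization entry follows the debit and credit described above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    account: str
    debit: Decimal = Decimal("0")
    credit: Decimal = Decimal("0")


def parse_minor_units(amount, exponent=2):
    """Convert a minor-unit string such as "1999" to an exact Decimal (19.99)."""
    if not (amount.isascii() and amount.isdigit()):
        raise ValueError(f"not a minor-unit amount: {amount!r}")
    # scaleb shifts the decimal exponent without any float arithmetic.
    return Decimal(amount).scaleb(-exponent)


def post_entry(ledger, lines):
    """Append a journal entry only if debits equal credits."""
    total_debit = sum((line.debit for line in lines), Decimal("0"))
    total_credit = sum((line.credit for line in lines), Decimal("0"))
    if total_debit != total_credit or total_debit == 0:
        raise ValueError("unbalanced journal entry")
    ledger.extend(lines)


def authorization_entry(amount):
    value = parse_minor_units(amount)
    return [
        Line("customer_available", debit=value),
        Line("holds", credit=value),
    ]
```

In a real system the balance check and the inserts would run inside one database transaction, so that a partial entry can never be committed.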

5. Webhook Processing: At-Least-Once Delivery, Idempotent Handlers

PSPs communicate payment events through webhooks: HTTP POST requests sent to a configured endpoint when the payment state changes.

Stripe retries webhook delivery for up to 72 hours on failure. Adyen, on the other hand, retries for 24 hours.

What this means for you in practice is that the same webhook event can arrive at your endpoint multiple times. A handler that assumes each event arrives only once produces duplicate refunds, duplicate ledger postings, and duplicate customer notifications at scale.

Three requirements of a production webhook handler

HMAC signature validation before any processing: Every major PSP signs webhook payloads with an HMAC using a secret known only to the PSP and your system. You will need to validate the signature before deserialization.

Idempotent handler with event ID deduplication: Before executing any state transition or ledger posting, the handler checks whether the event ID is already present in a processed events table. If it is, the handler returns HTTP 200 without re-processing, which stops the PSP from retrying. This only works if the event ID stays the same across redeliveries, so deduplicate on the PSP's event ID rather than an ID you generate on receipt.

Async processing with immediate acknowledgement: The webhook endpoint returns HTTP 200 after signature validation and event ID check, then processes the event asynchronously. Processing synchronously risks hitting the PSP's timeout window, which causes the PSP to consider the delivery failed and retry, even though processing may have completed successfully.
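The three requirements fit into a short handler. This sketch assumes a hex-encoded HMAC-SHA256 over the raw request body and an `id` field in the JSON payload; real PSPs each define their own signing scheme and header format, so check the provider's documentation. The `processed_ids` set and `queue` list stand in for a processed-events table and a durable work queue, which in production would be written in one transaction.

```python
import hashlib
import hmac
import json


def verify_signature(secret, payload, signature_hex):
    """Constant-time check of a hex HMAC-SHA256 over the raw payload bytes."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def handle_webhook(secret, payload, signature_hex, processed_ids, queue):
    """Return the HTTP status code to send back to the PSP."""
    # 1. Validate the signature before deserializing anything.
    if not verify_signature(secret, payload, signature_hex):
        return 400
    event = json.loads(payload)
    # 2. Deduplicate on the PSP's event ID.
    if event["id"] in processed_ids:
        return 200  # already seen: acknowledge without re-processing
    processed_ids.add(event["id"])
    # 3. Acknowledge now; a separate worker drains the queue.
    queue.append(event)
    return 200
```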

The outbox pattern for exactly-once downstream effects

Processing a webhook event typically triggers downstream effects like a state machine transition, ledger entry, customer notification, or even just an analytics update.

If any one of these fails after others succeed, the system lands in an inconsistent state.

The outbox pattern handles this.

When processing the webhook, you need to write the state transition, the ledger entry, and the outbox events in a single atomic database transaction. A separate outbox processor publishes the events to downstream consumers.

If the outbox write fails, the entire transaction rolls back, and the webhook gets re-delivered and re-processed cleanly. If the outbox write succeeds but publishing fails, the outbox processor retries publishing without needing to re-process the original webhook.
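The two halves of the pattern can be sketched with SQLite standing in for the production database. The schema, topic name, and function names are assumptions for illustration, and `publish` is a hypothetical callback that would hand the event to the message broker.

```python
import json
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE ledger_entries ("
        " payment_id TEXT NOT NULL, account TEXT NOT NULL, amount_minor INTEGER NOT NULL)"
    )
    db.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT NOT NULL,"
        " payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def apply_capture(db, payment_id, amount_minor):
    # State change, ledger entry, and outbox event commit or roll back together.
    with db:
        db.execute(
            "UPDATE payments SET state = 'CAPTURED' WHERE payment_id = ?", (payment_id,)
        )
        db.execute(
            "INSERT INTO ledger_entries VALUES (?, 'captured_funds', ?)",
            (payment_id, amount_minor),
        )
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("payment.captured", json.dumps({"payment_id": payment_id})),
        )


def publish_pending(db, publish):
    """Publish unpublished outbox rows in order; return how many were sent."""
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        publish(topic, payload)  # if this raises, the row stays pending
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Because a row is marked published only after `publish` returns, a crash between the two steps leads to a repeat publish, not a lost event. Downstream consumers therefore still need to tolerate duplicates.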

PCI DSS Scope

Now that you understand the basic layers of a payment processing system, how can you make sure that those layers are all compliant?

Every component that stores, processes, or transmits cardholder data (the PAN, CVV, or full card expiration) falls within PCI DSS scope. That scope determines the cost and complexity of PCI certification.

The engineering goal is to minimize PCI DSS scope by never touching raw cardholder data inside your own systems.

Instead, you can use PSP-hosted payment pages or JavaScript libraries like Stripe Elements or Adyen Web Drop-in for card data collection.

With these, card data travels directly from the browser to the PSP's servers, so your application never touches the raw PAN or CVV.

On top of that, we recommend that you use tokenization for all subsequent operations, like capture, refunds, and even recurring billing. The token carries no value outside the PSP relationship, so it can be stored safely in your database.

For card-on-file use cases like the kind used in subscription payments, network tokens issued by Visa and Mastercard follow the card across re-issues and expiry updates.

Authorization rates on recurring payments improve because the token stays valid when the underlying card gets replaced. PCI scope stays minimal because the raw card number never enters your infrastructure.

This architectural choice determines whether your PCI validation runs as SAQ A or as SAQ D, the latter of which requires a full security assessment across the entire application stack.

Settlement, Reconciliation, and the Payout Flow

Payment processing doesn't end at capture.

Captured funds need to settle through the acquiring bank to the merchant's bank account, and this settlement process introduces timing gaps, multi-party data flows, and reconciliation requirements distinct from real-time transaction processing.

Card network transactions typically settle T+1 or T+2. In other words, funds that were captured on Monday arrive in the merchant account on Tuesday or Wednesday.

ACH and bank transfer settlements follow NACHA's same-day or standard windows. Instant payment rails, many of which use ISO 20022 messaging, settle in seconds.

What your production payment system needs to do is model settlement timing explicitly. A captured payment isn't a settled payment, and your ledger needs to track the difference between those two states.

Discrepancies need to surface automatically as well. Letting them accumulate until a month-end manual review means revenue leakage and fraud signals compound undetected for weeks.

Automated reconciliation running daily, immediately after each settlement window closes, catches these discrepancies the same day they appear. The finance team reviews the exceptions that the pipeline surfaces; they don't run the pipeline themselves.
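The matching step at the heart of that pipeline can be very small. The sketch below assumes both sides have already been reduced to a mapping from PSP reference to amount in minor units; the function name and the exception labels are hypothetical, and real settlement files also carry fees, currencies, and dates that need matching.

```python
def reconcile(ledger_captures, settlement_rows):
    """Compare captured amounts against the settlement file.

    Both arguments map psp_reference -> amount in minor units.
    Returns a list of (reference, kind, captured, settled) exceptions.
    """
    exceptions = []
    for reference, captured in ledger_captures.items():
        settled = settlement_rows.get(reference)
        if settled is None:
            exceptions.append((reference, "missing_from_settlement", captured, None))
        elif settled != captured:
            exceptions.append((reference, "amount_mismatch", captured, settled))
    for reference, settled in settlement_rows.items():
        if reference not in ledger_captures:
            # Money settled that the ledger has no capture for.
            exceptions.append((reference, "unknown_settlement", None, settled))
    return exceptions
```

An empty result means the day reconciled cleanly; anything else goes to the finance team as an exception.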

The Engineering Stack

The five layers above map to specific architectural components. This represents what most production-grade systems converge on, not the only valid approach.

Payment intent database: PostgreSQL is generally the best option. ACID transactions for the intent-first write pattern, row-level locking for concurrent authorization checks, and NUMERIC(19,4) for amount storage. For very high-throughput platforms, CockroachDB or Google Spanner provides distributed transactions without sacrificing transactional semantics.

Idempotency key store: Redis at the API gateway deduplication layer for fast lookup with TTL-based expiry. The payment service layer checks the PostgreSQL intent table directly.

PSP adapters: a single payment service with pluggable adapter modules handles three to five providers without the operational overhead of separate services per PSP. The adapter interface defines the clean abstraction boundary.

State machine persistence: a Postgres payment_state column with a CHECK constraint limiting it to valid states. State transitions run as stored procedures or application-level validators enforcing the permitted transition graph.

Event bus: Apache Kafka or Confluent Cloud for the outbox-to-downstream pipeline, with at-least-once delivery from the outbox. In a fintech microservices architecture, Kafka sits at the boundary between the payment service and downstream consumers.

Webhook handler: a dedicated microservice with its own database for processed event IDs, isolated from the payment service to keep failure domains separate.

Reconciliation pipeline: Apache Airflow or Dagster for scheduled settlement file ingestion, matching logic, and exception reporting, running daily after each settlement window closes.

What Could Go Wrong

Layer	Failure Mode Without It	How the Layer Prevents It
Payment intent + idempotency	Network timeout causes a retry, and the user is charged twice.	Three-layer idempotency + write-ahead log prevents duplicate PSP submission.
PSP abstraction	PSP API change breaks payment logic. Hard decline is treated as soft decline.	Adapter layer isolates PSP-specific parsing. Canonical decline classification drives the correct retry.
Payment state machine	PSP confirms authorization; database shows PENDING; payment stuck indefinitely.	Pull-based reconciliation resolves stuck states against the PSP source of truth.
Ledger (double-entry)	Balance drift: account shows $500, sum of entries shows $497.23.	Atomic double-entry is enforced at the DB transaction layer. Unbalanced entries are rejected before they can cause drift.
Webhook idempotency	Stripe retries the webhook 3x over 72 hours. The refund is processed three times.	Event ID dedup table. An idempotent handler returns 200 without re-processing.
PCI scope minimization	Internal systems store raw PAN. PCI DSS audit scope expands to the entire application stack.	PSP-hosted tokenization where raw card data never enters internal systems.
Settlement reconciliation	Captured transaction settles for the wrong amount, and the discrepancy is undetected for 30 days.	Daily automated matching against the PSP settlement file; exceptions surface the same day.

Build vs. Integrate: What to Build and What to Buy

Not every layer needs to be built from scratch. Building what you could buy costs time and money that you could spend on other parts of your financial application.

Modern PSPs and infrastructure platforms handle significant portions of this stack well.

Obviously, it depends on your specific requirements, but we recommend that you let the PSP own PCI compliance for card data capture (use hosted payment pages or JS libraries), card network connectivity, fraud scoring (Stripe Radar, Adyen RevenueProtect), chargeback dispute management, currency conversion, and international payment method support.

You can build other aspects yourself, like payment intents and idempotency (PSPs don't provide this layer for your internal state), PSP abstraction and adapter patterns (necessary for multi-PSP portability), payment state machines with your business-specific transitions, double-entry ledger integration, webhook processing with an outbox for downstream events, and the settlement reconciliation pipeline.

Payment orchestration platforms also work really well for teams that want pre-built multi-PSP routing (Spreedly, Primer, Corefy) or hosted reconciliation infrastructure without building the entire stack internally.

Final Thoughts

The failures covered above rarely surface in code review. Instead, we find that they surface in production, often weeks after deployment, when the right combination of network failure, concurrent requests, and PSP retry behavior finally produces the incident.

Engineers who build payment systems correctly have encountered these failure modes before.

At Trio, we place pre-vetted engineers who have built production payment processing systems across ACH, card networks, FedNow, and open banking rails.

