Understanding TCC in distributed transactions
June 4, 2023 · 910 words · 5 min
A distributed transaction is a database transaction in which two or more network hosts are involved.
As we know, network and hosts may unreachable due to some reasons, like power failure, hardware malfunction, etc.
In this article, I’ll share how to implement distributed transaction with TCC.
Problem
Imaging you are developing an e-commercial application. Typically, there are many services, including OrderService, StockService, ProductService. If customer buys some product and create an order, we need to create an order and reduct the stock of these products.
The following sequence diagram shows that if everything goes ok, it will create an order and ACID is ensured.
sequenceDiagram Customer->>+OrderService: Create an order OrderService->>+Order Database: Start transaction Order Database-->>-OrderService: Transaction started OrderService->>+Order Database: Store order Order Database-->>-OrderService: Order stored OrderService->>+StockService: Reduct stock of product StockService-->>-OrderService: Operation succeed OrderService->>+Order Database: Commit transaction Order Database-->>-OrderService: Transaction committed OrderService-->>-Customer: Order created
If database or downstream service fails, we can do a simple rollback to database and OrderService, that seems good. But network and hosts maybe unreachable, our rollback maybe lost. So what will happen? Nobody knows, maybe there is no order created, but stock still reduced, this is a terrible thing!
The following sequence diagram shows the transaction rollback but the stock reduced, which causes a inconsistency and breaks ACID.
sequenceDiagram Customer->>+OrderService: Create an order OrderService->>+Order Database: Start transaction Order Database-->>-OrderService: Transaction started OrderService->>+Order Database: Store order Order Database-->>-OrderService: Order stored OrderService->>+StockService: Reduct stock of product StockService-->>-OrderService: Operation succeed OrderService-->>OrderService: Internal operation failed OrderService-x StockService: Rollback OrderService->>+Order Database: Rollback transaction Order Database-->>-OrderService: Transaction rollbacked OrderService-->>-Customer: Order creation failed
Solution
TCC represents Try-Confirm-Cancel, any resource in distributed transaction should provide three stage:
- Try: Try to allocate resource for future use
- Confirm: All operation at stage Try are succeed, commit the distributed transaction
- Cancel: Something wrong, rollback
Take the e-commerce order just now as an example, the following sequence diagram shows the whole process in TCC.
sequenceDiagram Customer->>+OrderService: Create an order OrderService->>+Order Database: Try: Store orderOrder Database-->>-OrderService: Order stored alt Successful path OrderService->>+StockService: Try: Reserve some stock StockService-->>-OrderService: Stock reserved OrderService->>+StockService: Confirm: Mark stock reduced StockService-->>-OrderService: Stock reduced OrderService->>+Order Database: Confirm: Modify order status Order Database-->>-OrderService: Order modified else Failure path OrderService->>+StockService: Try: Reserve some stock StockService-->>-OrderService: Stock reserved OrderService-xStockService: Confirm: Mark stock reduced OrderService->>+StockService: Cancel: Restore stock StockService-->>-OrderService: Stock restored OrderService->>+Order Database: Cancel: Modify order status Order Database-->>-OrderService: Order modified end OrderService-->>-Customer: Order created or failed
If everything goes ok, our Try-Confirm will execute successfully, database is consistent, if something fails, our Try-Cancel will handle this situation, database is also consisten. But as mentioned above, network and hosts maybe unreachable, so our commit or rollback will lost.
General Problems
I’ll show some general problems and associated solution with TCC.
Order creation with EXECUTING state fails
Local transaction will rollback, make sense!
Call StockService to reverse some stocks fails
Local transaction will rollback, and a rollback also send to StockService, StockService can do rollback or do nothing
Call StockService to reverse some stocks successfully, but reply of StockService lost
This will cause Try to timeout, local transaction will rollback and a following rollback will send to StockService, StockService can do rollback.
To solve some temporary problem, we can do retry. There are two points we need to considerate when do retry.
Idempotency
Idempotency means that no matter how many requests are made, the result is the same as the first successful request.
For example, we want to reserve 10 units of a product, no matter how many times we call Try to StockService, finally only 10 units are reserved.
For every order, we create a unique id as idempotent key, and send to StockService, so if StockService processed current request, it will simply return the previous result.
Retry delay
There are many delay strategies:
- Immediately retry. Once previous request failed, do a retry immediately, this is not a good choice which may cause downstream service overload
- Fixed rate retry. Retry with a fixed rate like 10ms.
- Incremental retry. Every retry will increase delay sequentially, for example, sleep 10ms, 20ms, 30ms
- Expotential backoff retry. Every retry will increase delay expotentially, for example, sleep 10ms, 20ms, 40ms
Special Problems
Due to network issue, there are some special problems.
Empty Rollback
Try did not execute, but Cancel executed.
The following steps show how empty rollback occurs.
- Make a Try to downstream (packet lost)
- Transaction rollback, make a Cancel to downstream
- Downstream receives Cancel
The Cancel operaiton of downstream will fail because of no cancellable record found, but if we responses fail to upstream, upstream will do a retry which is a useless operation. So in Empty Rollback situation, we need to response success to upstream and downstream service do nothing to avoid network and compute resource waste.
Hanging transaction
Handing transaction means because of network issues, Cancel is executed before Try, if not handled properly, may result in inconsistencies.
The following steps show how Haning transaction occurs.
- Make a Try to downstream(congested network)
- Transaction rollback due to timeout of Try, make a Cancel to downstream
- Downstream receives Cancel, do nothing but respond success
- Downstream receives Congested Try, reserve resource
To solve this problem, downstream service need to record that current transaction was handled(by record transactionId) before reply Cancel call to upstream. Once Try received, check whether the current transaction has been handled.
Wrap Up
TCC is a good solution to implement distributed transaction in which there also are some problems. Idempotency and proper Retry make TCC better.