A Better Architecture for 12306.cn09 Jan 2014
Tens of millions Chinese are buying railway tickets from 12306.cn everyday. The site works well on regular days, but it faces serious performance problem every year with the arrival of Chinese Spring Festival, when hundreds of millions Chinese are buying tickets to return to their hometown.
This is a complex problem. The root cause of the problem is that there’re not enough tickets, so that people rush to buy the tickets. Only sufficient supply can fundementally solve the problem and dissipate the discontent of people. Another problem is, the tickets, which is a valuable resource, can be distributed in a lot of different ways, more efficient/just or not.
The two problems above are out of the scope of a technical discussion. In this article I want to propose a solution which can improve the responsiveness of the system, so that users will never get stuck or have to wait in a long queue when the system is under heavy load.
What does 12306 do
12306 has three major functionalities, listed as follows:
- Query available tickets
- Buy a ticket
- View order history and optionally cancel an order
User can query available tickets by date, time span of day, train number, train type, seat type, departure station and destination station.
If a ticket is available, user can buy a ticket and pay online. If the ticket is not paied within a time interval, the order will be automatically cancelled and the ticket will be availabe for sale again.
A peculiar trait of ticket is that it’s a splittable resource. For example, if there’s a ticket avaible from station A to station Z, user can buy part of it. Suppose the segment D-Q is sold, the remaining segments A-C and R-Z are split as new tickets, which can be furthur splitted.
When a hard-seat ticket with the segment D-Q of a travel of train is sold, the remaining hard-seat tickets of the travel of train should be decreased if the start station and end station S1-S2 of the query intersects with the segment D-Q.
User can view their order history, and optionally cancel a finished order, which is a rare behavior.
What problems 12306 face
We already know that the problem 12306.cn faces is that under heavy load, their website is unresponsive when buying a ticket.
As it’s a performance problem, we’ve to locate the bottle necks.
It’s easy to see that, the problem has nothing to do with computing capability. Technically, it’s very easy to scale computing nodes without state – just add more machines behind load balancers or job queues. Front-end optimization and static assets serving optimization can also be easily implemented. The problem is with the performance of storage. This analysis can be confirmed by how 12306 improved their infrastructure in recent years.
VMware vFabric® SQLFire is an in-memory distributed SQL database.
On the website, it doesn’t reveal enough details of the distributed system. For example, it claims to be “30x faster than Oracle 11g”, what’re the details for read speed and write speed respectively? As we know, due to the limit of CAP thereom, it’s impossible to linearly scale both read and write speed of the same data while maintaining strong consistency(due to
R + W > N and
W > N/2). It seems to me that in order to achieve strong consistency, vFabric SQLFire can only improve read efficiency, write efficiency will be much slower than a single in-memory SQL database due to the cost of maintaining consistency between nodes.
Before adopting vFabrick SQLFire, 12306 was using Sybase ASE, which formed a bottle neck of whole system. Though vFabrick SQLFire is said to make the system about 75 times faster, I strongly doubt that it improves both read and write speed. The news only mentioned improvements about querying tickets and managing orders, nothing is said about order processing speed.
Write speed must be worse than before. In case of Sybase ASE, when an order is created, only one machine needs to be updated. But now with vFabrick SQLFire, each write has to update several machines with complex protocols in order to maintain consistency. So I think that order processing speed should be at least several times slower if there’re 20 nodes(counter-balance by the change from disk storage to memory storage).
In this regard, we can say 12306 has chosen the wrong technology, which only solves part of the problem. They did it wrong because they didn’t understand well their own problems.
What’s the problem? Abstractly, the problem is that the storage system can’t be read and written fast enough. vFabrick SQLFire makes read better, but write is worse. A true solution is to make both read and write fast enough.
Regarding current infrastructure of 12306, the concrete cause of why write is slow is as follows:
Each write has to update several different storage nodes through complex protocol in order to maintain consistency
How to solve the problems
The problem is there and is fairely true. It seems helpless to solve these problems if we are lost in the details of the technology. We should start from requirements, and think abstractly what are the possibilities.
Let’s work backward with the problems by asking following questions.
Q: Why each write has to update every storage nodes synchronously?
A: If we don’t update every storage nodes synchronously, we can’t maintain consistency.
Q: If the tickets of a single train are stored on a single node with replication, different nodes store tickets of different set of trains, is there still the need to update every node synchronously?
A: Yes, that way write will be much faster and selling the same ticket twice can also be avoided. But we still have to update remaining tickets of affected stations synchronously in different storage nodes for querying, or we can’t maintain consistency.
Q: What’s the problem if we can’t maintain consistency?
A: User will see incorrect remaining tickets in querying.
Q: If we update remaining tickets information asynchronously to all querying nodes, will the data eventually be correct?
A: Yes, but it’s not real-time.
Q: How long do you think it will take for the data to eventually get correct?
A: About 10 seconds, or 1 minute.
Q: Imagine if you’re a buyer, can you accept that the querying of remaining tickets only reflect the state of 10 seconds ago?
A: That seems to be acceptable.
Q: So we’ve arrived at a better solution, right?
A: Eh, it seems so.
The dialog above suggests a solution. I explain the solution in details below.
The whole storage system is composed of three independent distributed systems:
- Tickets Sales System
- Tickets Querying System
- Order Management System
Tickets Sales System
Tickets Sales System has to be highly available and strictly consistent. We notice that the tickets of different trains are completely independent, and the total number of trains is relatively stable, and is limited. So we can distribute sets of train into different nodes, each node is solely reponsible for sales of the tickets of the assigned set of trains. Each node is replicated to achieve better availability.
A ticket resource can be represented as the tuple
(train_num, carriage_num, seat_num, origin_date, segment, seat_type, state). The fields are explained below:
train_num: unique identifier of a train, like bus number.
carriage_num: the carriage number of the train.
seat_num: the number of seat within the carriage.
origin_date: train departure date at the start station of train route.
segment: the valid interval of stations for this ticket.
seat_type: hard-seat, soft-seat, hard-sleep, soft-sleep, etc.
state: sold, reserved, available.
When tickets are first released, each ticket is inserted into the system with segment being the departure station and destination station of the train route.
If a ticket is sold with the same start station and end station as the segment, then the ticket is simply marked as sold. If only part of the segment is sold, the system first split the ticket into 2 or 3 tickets, delete current ticket and mark the corresponding new ticket as sold. For example, if a ticket is valid for stations from A to Z, but a user only buys F-Q, then the ticket is split into three tickets A-E, F-Q and R-Z with F-Q marked as sold, and the original ticket A-Z is deleted.
Updating of Tickets Querying System can be done periodically, say 1 or 5 seconds a time. Remaining tickets information can be represented as a set of following tuples:
train_num, start_station, dest_station, departure_date, [(seat_type, remaining), …]
If a train route has 20 stations, and tickets can be sold for the next 10 days, and a node is responsible for 20 trains, every Tickets Sales System node should keep about 20*10*C(20, 2) = 38000 such tuples. When a ticket is sold, there’re at most C(20, 2) = 190 tuples to be updated, which can be done quickly.
When updating the Tickets Querying System, only the change set of tuples is sent. At the peak time when tickets are released, the maximum of the change set is reached, which should be close to the full set of tuples as most tickets are of long distance. The change set is near to 0 when the sales stabilise. Generating the change set from the full set can be done very quickly in O(1).
The failure in updating the Tickets Querying System can be ignored, as later updates will ensure correct state.
When a ticket is sold, Orders Management System is updated asynchronously. In case of failure in updating Orders Management System(which is rare), the system allows manual handling the error in order to restore correct state.
When processing an order, only one node has to be updated, no change needs to be propagated to other nodes. If there’re N nodes, N order request can be processed in parallel, the processing will be N*M times faster(no synchronous propagation to M querying servers).
Tickets Querying System
Tickets Querying System is solely for querying. As there’re limited number of trains and stations, so the database is relatively small in size. That means inserts, updates and reads can be done quickly. Also notice that update and read to the system is frequent, while insert is rare. Inserts only occur when new tickets of trains are released every day.
Tickets Querying System can be implemented as a group of relational/SQL databases, used solely for querying tickets. Each node holds the same data and all nodes are equivalent.
There’s one critical table in the database, the schema is as follows:
train_num, start_station, dest_station, departure_date, [(seat_type, remaining), …]
The data of Tickets Querying System come from Tickets Sales System periodically. When handling updates, if the corresponding record exists, it’s updated accordingly. Otherwise, a new record is inserted.
The quality of the Tickets Querying System depends on the inconsistency window with Tickets Sales System:
- < 1 seconds - excellent
- < 5 seconds - good
- < 10 seconds - acceptable
- otherwise - problematic
If there’re N nodes, read speed will be N times faster compared to a single node. The read pressure to each node is 1/N compared to the case of a single node. Write pressure is also greatly reduced due to periodical updates from Tickets Sales System in contrast to update on selling of each ticket.
Suppose there’re 3000 trains everyday, and every train has 20 stop stations on average, and tickets can be sold for the next 10 days, so there’re 3000*10*C(20,2) = 5,700,000 records in database in total. In an update period, suppose 10% of the records are changed, then 5,700,000*0.1 = 570,000 records have to be updated. If the update period is 10 seconds, it means each querying node has to update 57,000 rows per second on average, this is an easy job for in-memory databases.
Note that 10% change ratio is an overestimate, in reality it should be one or two orders lower(as if people can buy tickets within 10 days, usually the tickets of the first 9 days are alreay sold out or stabilized before new tickets are released), so querying node can handle the write load without problem.
The analysis above also suggests possible solutions to alleviate write load of Tickets Sales System node when it becomes a problem. For example, release tickets of different trains at different time, or release tickets in several batches at several different time point.
Orders Management System
One important characteristic of the Orders Management System is that the order data of each user is completely independent. Another important characteristic is that write ratio is very low compared to read, though we don’t utilize this characteristic.
Orders Management System can be implemented as an eventually-consistent key-value distributed database, which is very similar to Dynamo. Such system is highly available, scalable and fast, with only a very small window of inconsistency, which is impossible to happen when a normal person is buying tickets normally.
The bottle neck of 12306 is with storage, concretely the efficiency of writes. Regarding storage, how to store data depends on the characteristics of the data and how the data will be used. The solution in this article is based on the independence of the data of different trains.
The solution also assumes that Tickets Querying System can tolerate a small window of inconsistency with the Tickets Sales System. It implies that the query result of remaining tickets only reflects the state a short while ago. If the window of inconsistency is small enough(<5 seconds), the solution is completely acceptable to end users.
Liu Xiaoya has contributed the idea of udpating Tickets Querying System periodically, it’s really a good idea.