One of the main products of Buzzvil is Honeyscreen, a lockscreen rewards app. Honeyscreen users earn points by participating in advertisement campaigns on our lockscreen or offerwall, or simply by unlocking the device, and those points can be used to purchase goods from the store. Since this so-called "points system" is the core of the product, it is very important that we keep accurate records of how many points each user has deposited and spent.

Originally we ran the points system on MySQL (RDS). However, as the service grew, the number of points-related requests to the database increased exponentially, putting a very heavy load on it. Since that kind of load can cause trouble, we looked for an alternative that could handle the situation and found DynamoDB, a NoSQL database provided by AWS. DynamoDB's biggest advantage is that it is very easy to scale, so we now run the points system on DynamoDB to manage it more flexibly. In this article, I want to talk about how we designed the functions and table structures in DynamoDB to suit the points system's needs.

The original MySQL database had two kinds of tables. Simplified versions of their structures are as follows:
"Amount" is the table where changes to a user's points are recorded; in other words, it holds the history of the user's deposits and withdrawals. When a user clicks the "details" button in the app, the app sends a request to this table.
| id | user_id | account_type | amount | date |
| --- | --- | --- | --- | --- |
| … | … | … | … | … |
| 1012 | 12345 | withdraw | 1000 | 2016-02-28 12:00:01 |
| 1013 | 39485 | impression | 10 | 2016-02-28 12:00:02 |
| 1014 | 20938 | action | 200 | 2016-02-28 12:00:02 |
| … | … | … | … | … |
"Sum" is the table where the current total of a user's points is saved. When the app shows the total points on the main page, it sends a request to this table.
| id | user_id | sum |
| --- | --- | --- |
| … | … | … |
| 256 | 12345 | 1200 |
| 257 | 39485 | 560 |
| 258 | 20938 | 3700 |
| … | … | … |
There are two tables, but the data in them must stay mutually consistent, so transactions against these tables need to be atomic. That is, when a user deposits points, an INSERT into the "Amount" table and an UPDATE to the "Sum" table must be sent as a single transaction; then, even if one of the queries fails, both tables remain exactly as they were before the transaction started. It would be a serious problem if a record were added to the "Amount" table without the corresponding update to the "Sum" table, or vice versa, and atomic transactions prevent exactly that. The sketch below illustrates the idea.
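For illustration, here is roughly what such a deposit transaction looks like in Python. This is a minimal sketch, not our production code: it assumes the PyMySQL client and the simplified schemas above, and the function name is made up for this example.

```python
import pymysql

def deposit_points(conn: pymysql.connections.Connection,
                   user_id: int, account_type: str, amount: int) -> None:
    """Record a points change and update the total as one atomic unit."""
    try:
        with conn.cursor() as cur:
            # Both statements run inside a single transaction...
            cur.execute(
                "INSERT INTO amount (user_id, account_type, amount, date) "
                "VALUES (%s, %s, %s, NOW())",
                (user_id, account_type, amount),
            )
            cur.execute(
                "UPDATE `sum` SET `sum` = `sum` + %s WHERE user_id = %s",
                (amount, user_id),
            )
        conn.commit()  # ...and become visible only together.
    except Exception:
        conn.rollback()  # If either statement fails, neither takes effect.
        raise
```

If the process dies between the two statements, the rollback (or the server, on disconnect) discards the partial work, so the two tables never diverge.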
To carry the existing example over, we first designed the tables in DynamoDB as follows:

| Table Name | amount |
| --- | --- |
| Partition Key | user_id |
| Sort Key | date |
| Attributes | amount, account_type |

| Table Name | sum |
| --- | --- |
| Partition Key | user_id |
| Attributes | sum |
However, we soon realized that this design causes many problems. The most critical one is that DynamoDB does not support atomic transactions, so the tables cannot guarantee each other's consistency: there can be a situation where an item is inserted into the "Amount" table but the corresponding item in the "Sum" table is never updated because the operation failed in the middle. Because of this, it became clear that we needed to merge the two tables into one. Our first attempt was as follows:
| Table Name | amount_sum |
| --- | --- |
| Partition Key | user_id |
| Sort Key | date |
| Attributes | amount, account_type, sum |
In contrast to the two original tables, a single item in this table holds both the amount of a change and the resulting sum of the user's points. Depositing points then works as follows. First, the client fetches the user's most recent item to get the current sum; to do so, it queries with the "Limit" option set to 1 and "ScanIndexForward" set to false, which scans the partition in reverse chronological order, since items within a partition are ordered by their sort key, date. Second, it computes a new sum by adding the amount it is about to deposit. Last, it inserts a new item with those values into the table.

In short, the client first reads from the table and then inserts a new item. Since reading does not change the data, inserting the new item is the only operation that changes the table's state. With two tables there were two such operations, an insert (to "Amount") and an update (to "Sum"); now there is only one, so the transaction is atomic.

Some problems remain, however. The first is overwriting. Because the table's partition key and sort key are user_id and date, concurrent requests for one user can produce items with the same key, and the item inserted by one request overwrites the other. This is easily solved with DynamoDB's conditional write feature: if the insert is allowed only when no item with the same user_id and date combination already exists, overwriting is prevented. If such a combination does exist, the client retries after increasing date by one second (or even one millisecond) until the operation succeeds. This introduces a minor inaccuracy in timestamps, but it guarantees that every item is inserted and none is overwritten. (For more on conditional writes, see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Expressions.SpecifyingConditions.html and https://java.awsblog.com/post/Tx3RRJX73ZNOVL/Using-Improved-Conditional-Writes-in-DynamoDB.)

The second problem concerns the consistency of the data itself. Say two concurrent requests each deposit 100 points to a user whose up-to-date sum is 1000. Following the process above, both requests read the sum 1000 from the table at almost the same time, and each computes a new sum of 1100 by adding the 100 points it is about to deposit. Both then insert items containing a sum of 1100. As a result, even though two requests deposited 100 points each, the user ends up with a sum that increased by only 100. This lost update happens only when requests run concurrently, not serially, so it is a problem of transaction isolation. It can be solved by giving each item a "version" attribute, as in the table below; a sketch of the resulting insert loop follows it.
| Table Name | amount_sum_version |
| --- | --- |
| Partition Key | user_id |
| Sort Key | version |
| Attributes | date, amount, account_type, sum |
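Before walking through the concurrency example again, here is a minimal sketch of the deposit flow against this table, assuming the boto3 Python SDK; the function name, retry limit, and date format are illustrative. It reads the user's most recent item (Limit of 1, ScanIndexForward set to false), bumps the version, and inserts with a conditional write so a concurrent writer cannot slip in unnoticed.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

# Assumes a table named "amount_sum_version" with partition key
# user_id (Number) and sort key version (Number), as defined above.
table = boto3.resource("dynamodb").Table("amount_sum_version")

def deposit(user_id: int, account_type: str, amount: int,
            max_retries: int = 10) -> None:
    for _ in range(max_retries):
        # 1. Read the most recent item: items in a partition are ordered
        #    by the sort key (version), so Limit=1 with
        #    ScanIndexForward=False returns the highest version.
        resp = table.query(
            KeyConditionExpression=Key("user_id").eq(user_id),
            ScanIndexForward=False,
            Limit=1,
        )
        items = resp["Items"]
        current_sum = items[0]["sum"] if items else 0
        next_version = items[0]["version"] + 1 if items else 0

        try:
            # 2. Conditional write: succeed only if no item with this
            #    (user_id, version) combination exists yet.
            table.put_item(
                Item={
                    "user_id": user_id,
                    "version": next_version,
                    "date": time.strftime("%Y-%m-%d %H:%M:%S"),
                    "account_type": account_type,
                    "amount": amount,
                    "sum": current_sum + amount,
                },
                ConditionExpression="attribute_not_exists(user_id)",
            )
            return  # success
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            # A concurrent writer took this version; re-read and retry.
    raise RuntimeError("deposit: exceeded retry limit")
```

The condition may look odd, but attribute_not_exists(user_id) is evaluated against the item addressed by the full key (user_id plus version), so it fails exactly when that combination already exists.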
This table uses version, instead of date, as its sort key. Version can be of any type as long as it plays this role; in this example, let's say it is a Number starting from 0. Now let's revisit the earlier example with this table. Two requests, each depositing 100 points, are sent at the same time. Both read from the table simultaneously and get a sum of 1000 and a version of, say, 21. Both then try to insert an item with a sum of 1100 and a version of 22. However, because the insert uses the conditional write feature, there cannot be more than one item with the same user_id and version combination, so one of the requests fails. To retry, the failed request reads from the table again, this time getting a sum of 1100 and a version of 22, and inserts a new item with a sum of 1200 and a version of 23. If it fails again, it repeats the process until it succeeds (or reaches a retry limit set by the programmer). This solves the isolation problem while still maintaining the chronological order of items, since a higher version is always created after a lower one.

Now all the problems with creating a new item are solved, but one issue remains with reading the table's data. In the structure described above, it is very inefficient to read the overall data in chronological order, because the items are scattered across millions of user_id partitions. The only options would be to scan the whole table and sort the results by time, or to iterate over every user_id partition and collect the data within the time range. To read a timeline of data more efficiently, we added one more attribute and a Global Secondary Index (GSI), as below.
| Table Name | amount_sum_version_scatter |
| --- | --- |
| Partition Key | user_id |
| Sort Key | version |
| Attributes | date, amount, account_type, sum, scatter |

| Index Name | scatter-date-index |
| --- | --- |
| Partition Key | scatter |
| Sort Key | date |
The simplest approach would be to make an arbitrary attribute the partition key of the GSI and date its sort key, and then assign the same value to that attribute on every item so that all data lands in one partition. However, this causes the so-called "hot partition" problem: the database throttles when every item is written to the same partition. To prevent this, each item's scatter attribute is assigned a random value within a certain range, which distributes items across several partitions and significantly reduces the chance of throttling. With the table set up this way, we can query the timeline by iterating through the "scatter" partitions. This is much more efficient than querying every user_id partition, because the number of scatter partitions (100 or fewer) is far smaller than the number of users and is completely under the programmer's control. (See https://blogs.aws.amazon.com/bigdata/post/Tx3KPZDXIBJEQ4B/Scaling-Writes-on-Amazon-DynamoDB-Tables-with-Global-Secondary-Indexes.) A sketch of this pattern follows.
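To make this concrete, here is a hedged sketch, again assuming boto3; the SCATTER_COUNT constant and the function names are illustrative (the index name scatter-date-index comes from the table above), and pagination via LastEvaluatedKey is omitted for brevity. Each write draws a random scatter value; a timeline read queries the GSI once per scatter value and merges the per-partition results by date.

```python
import heapq
import random
import boto3
from boto3.dynamodb.conditions import Key

SCATTER_COUNT = 100  # number of scatter partitions (illustrative)
table = boto3.resource("dynamodb").Table("amount_sum_version_scatter")

def scatter_value() -> int:
    # Assigned to every new item so writes spread across partitions
    # instead of hammering a single "hot" one.
    return random.randint(0, SCATTER_COUNT - 1)

def query_timeline(start_date: str, end_date: str) -> list:
    """Return all items in [start_date, end_date] in chronological order."""
    partitions = []
    for s in range(SCATTER_COUNT):
        resp = table.query(
            IndexName="scatter-date-index",
            KeyConditionExpression=(
                Key("scatter").eq(s) & Key("date").between(start_date, end_date)
            ),
        )
        partitions.append(resp["Items"])  # each list is already sorted by date
    # A k-way merge of the per-partition lists keeps the overall order by date.
    return list(heapq.merge(*partitions, key=lambda item: item["date"]))
```

The read cost scales with SCATTER_COUNT rather than with the number of users, which is what makes this design practical for timeline queries.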
So far, this was an introduction to how we use DynamoDB to manage Honeyscreen's points system. Since there is already plenty of well-written material on DynamoDB itself, this article dealt only with the problems that came up while implementing the points system. DynamoDB has its own advantages and disadvantages, but it is constantly improving and overcoming its weaknesses with the help of AWS and other developers; improved conditional writes, for example, were introduced in 2014. You can check the official AWS blog to learn about new services and features. At Buzzvil, we utilize a variety of services provided by AWS, and we will keep you posted.