Optimize your design to scale to millions of users on AWS

Deploying on the AWS cloud makes it easy to configure auto scaling for traffic spikes. But as your apps scale to millions of users or transactions, with more AWS resources, more services and more internal teams, the setup can become complex to manage, slower and expensive. Hence it is very important to focus on AWS design and architecture from the start, to achieve smooth scaling and optimization at scale.

Here are important points to consider from day 1 in your architecture:

1. The first best practice when starting a design on AWS is to ask a couple of questions and record the answers.

For example: What are the scaling factors for the different workloads (or services), i.e. I/O and latency? How does ‘A’ communicate with ‘B’, and what does the dependency matrix look like? What are the regulatory constraints? How much growth in users or transactions do you expect? Who can access what? For each failure scenario, what is possible versus what is probable? How much data can you afford to recreate or lose in a disaster (i.e. RPO, the Recovery Point Objective)? How quickly must you recover (i.e. RTO, the Recovery Time Objective)?

2. Design a loosely coupled network architecture that can scale, and understand your security and compliance requirements beforehand; adding them later increases complexity.

Plan the network design properly, e.g. VPCs and subnets, and how you want to secure or regulate the application. For example, if you want to provide internet access to your apps in a very controlled way, you may want to set up a separate VPC with an internet gateway attached and a set of proxies forwarding requests to the other VPCs.

Another example is isolation of environments, i.e. different VPCs for Dev, Prod and so on. In general, these are the models:

a. Single VPC with resource sharing: a flat architecture for workloads with a lot of interdependencies. You can share the VPC through AWS Resource Access Manager.

b. Multiple VPCs in a single account: isolates workloads (e.g. microservices), environments etc.

c. Multiple VPCs across multiple accounts: for multiple business units, multiple teams etc. You need sound knowledge to manage this model, as complexity arises from networking between VPCs and accounts and from the growing IP address management.

3. If you are using Lambda inside a VPC and scaling with an increasing number of records, give the functions subnets with a large CIDR range so that you don’t run out of IP addresses as you scale. Of course, this matters mainly at large scale and under spikes. Likewise, for other services make sure you understand the soft and hard limits (service quotas) to avoid getting stuck later.
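To make the CIDR sizing concrete, here is a minimal sketch that counts how many addresses a subnet actually leaves for your resources. It relies on the fact that AWS reserves five addresses in every subnet (the first four and the last one). The CIDR ranges and the required address count are illustrative assumptions, and how many addresses your functions really consume depends on how Lambda allocates network interfaces for your configuration.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Return how many addresses remain assignable in a subnet of this size."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def subnets_can_host(cidrs, required_ips: int) -> bool:
    """Check whether the given subnets together offer enough addresses."""
    return sum(usable_ips(cidr) for cidr in cidrs) >= required_ips


# Example ranges only; substitute the subnets you are planning.
for cidr in ("10.0.0.0/28", "10.0.0.0/24", "10.0.0.0/20"):
    print(cidr, usable_ips(cidr))

# Hypothetical requirement: 1000 addresses spread over two /24 subnets.
print(subnets_can_host(["10.0.1.0/24", "10.0.2.0/24"], 1000))
```

A check like this can run during planning, before any subnet is created, since resizing a subnet later means recreating it.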

4. Prepare a tagging strategy before you start building on AWS. You will find it most useful when you scale to millions of transactions and multiple services and need to measure metrics across the platform. For example:

a. Tags for cost allocation (AWS Cost Explorer and the Cost and Usage Report can break down AWS costs by tag),

b. Tags for automation (service-specific tags are often used to filter resources during infrastructure automation activities),

c. Tags for access control (IAM policies can include conditions that limit access to a specific environment, i.e. development, test or production).

Another example: tag S3 objects so that lifecycle rules can perform automated actions on a subset of your data. For example, transition all objects tagged “user : retail” to S3 Glacier.
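The tag-based transition above could look roughly like the sketch below, which only builds the lifecycle configuration as a plain dictionary in the shape that the S3 `put_bucket_lifecycle_configuration` API expects. The rule ID, the 30-day delay and the bucket name in the comment are assumptions made for illustration; they are not values from any real setup.

```python
def glacier_rule_for_tag(tag_key: str, tag_value: str, after_days: int = 30) -> dict:
    """Build one lifecycle rule that moves objects with a given tag to S3 Glacier.

    after_days is an assumed delay; pick it from your own access patterns.
    """
    return {
        "ID": f"archive-{tag_key}-{tag_value}",  # hypothetical rule name
        "Status": "Enabled",
        # Only objects carrying this tag are affected by the rule.
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        "Transitions": [{"Days": after_days, "StorageClass": "GLACIER"}],
    }


lifecycle_configuration = {"Rules": [glacier_rule_for_tag("user", "retail")]}
print(lifecycle_configuration)

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

Keeping the rule builder separate from the API call makes it easy to review or unit test the rules before they touch a bucket.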

5. Most of the time, scaling is based on CloudWatch metrics such as CPU > X% or memory > Y%. It is often better to scale on request count or user concurrency, e.g. how many requests each instance is serving, and decide from that how to handle concurrency. If you expect really high traffic, say 1 million requests per minute, you may want to predefine the capacity (number of CPUs) needed and a pre-warming strategy, as auto scaling to the desired capacity takes several minutes (< 5 min) to spin up new instances, and that delay should not conflict with the expected performance during rising or peak load.
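As a rough illustration of the pre-warming arithmetic, the sketch below converts an expected request rate into an instance count. The per-instance throughput of 500 requests per second and the 30% headroom are made-up assumptions; measure your own numbers with a load test before relying on anything like this.

```python
import math


def instances_needed(requests_per_minute: float,
                     rps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Estimate how many instances to pre-warm for an expected peak.

    rps_per_instance is the sustained requests per second one instance can
    serve (an assumption to be replaced by load-test results). headroom adds
    spare capacity on top of the expected peak.
    """
    peak_rps = requests_per_minute / 60
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


# 1 million requests per minute is about 16,667 requests per second.
print(instances_needed(1_000_000, rps_per_instance=500))
```

With these assumed numbers the estimate is 44 instances, which would become the capacity to have running before the peak starts rather than a target for auto scaling to reach during it.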

6. There are multiple services to connect AWS with your on-premises data center, e.g. Direct Connect and Site-to-Site VPN. Identify your exact need and analyze the performance and cost trade-offs. For example, Direct Connect has a higher per-hour port cost but lower data transfer out charges, very low latency, private connectivity and very high bandwidth, i.e. 100 Gbps and above. VPN, in contrast, offers up to 1.25 Gbps per tunnel, runs over the internet and is fast to set up.

7. If you are using Kinesis Data Streams, increasing the number of shards alone may not give you performance at scale. Batch size matters a lot. For example, 10 shards with a batch size of 1 won’t perform better as load increases. Also, Kinesis does not have a default Dead Letter Queue (DLQ) implementation like RabbitMQ, so you may plan to implement one using SQS or another service. This is very important: if any one record in a batch fails to process, the failed record is moved to the DLQ instead of blocking the entire batch. Later, the team can analyze and fix all the failed messages.
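The DLQ idea can be sketched independently of any AWS SDK. In the example below a plain Python list stands in for the dead letter queue (in a real system this might be an SQS queue), and `parse_amount` is a hypothetical handler invented for the example.

```python
from typing import Any, Callable, Iterable, List, Tuple


def process_batch(records: Iterable[Any],
                  handler: Callable[[Any], Any],
                  dead_letters: List[Tuple[Any, str]]) -> int:
    """Process every record; park failures instead of failing the whole batch.

    Returns the number of records processed successfully. Each failed record
    is appended to dead_letters together with a description of the error.
    """
    processed = 0
    for record in records:
        try:
            handler(record)
            processed += 1
        except Exception as error:
            # One bad record must not block the rest of the batch.
            dead_letters.append((record, repr(error)))
    return processed


def parse_amount(record: dict) -> float:
    """Hypothetical handler: fails on records with a malformed amount."""
    return float(record["amount"])


dlq: List[Tuple[Any, str]] = []  # in-memory stand-in for a real queue
batch = [{"amount": "10.5"}, {"amount": "oops"}, {"amount": "3"}]
done = process_batch(batch, parse_amount, dlq)
print(done, len(dlq))
```

Storing the error next to the record gives the team enough context to analyze and replay failed messages later.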

8. If you are using S3 buckets for storage (Standard storage class by default), add lifecycle policies from day one and choose the proper storage class, e.g. S3 Glacier, according to data access frequency and archive requirements. By doing so, you can often save 20% or more of the storage cost. (Note: lifecycle rules move objects to a different storage class or expire them based on age. Storage classes include Standard, Standard-IA (Infrequent Access), One Zone-IA, Glacier and Glacier Deep Archive.)

9. While you prepare the infrastructure to scale, also design the application so that individual services can be turned on or off. For example, during a spike of millions of users on your platform, you may want to save resources by turning off non-critical services. Make sure the core purpose of your application is not impacted. For example, on Netflix the primary reason to be there is to watch smooth, uninterrupted video; recommendations and listings are secondary.
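One simple way to picture this on/off control is a feature registry that marks each service as critical or not and sheds the non-critical ones above a load threshold. The feature names, the load figures and the threshold below are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Feature:
    name: str
    critical: bool  # critical features stay on even under heavy load


# Hypothetical feature set, loosely following the video streaming example.
FEATURES = (
    Feature("video_playback", critical=True),
    Feature("recommendations", critical=False),
    Feature("listing", critical=False),
)


def enabled_features(current_load: float,
                     shed_threshold: float,
                     features: Sequence[Feature] = FEATURES) -> List[str]:
    """Return the features to keep on for the current load.

    Below the threshold everything runs; at or above it only critical
    features remain enabled.
    """
    if current_load < shed_threshold:
        return [feature.name for feature in features]
    return [feature.name for feature in features if feature.critical]


print(enabled_features(current_load=40_000, shed_threshold=100_000))
print(enabled_features(current_load=250_000, shed_threshold=100_000))
```

In practice the flags would usually live in a configuration store so they can be flipped without a deployment; the decision logic stays the same.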

10. Consider EC2 Spot Fleets for different workloads. They can help a lot with cost and let you diversify instance types. (Note that Spot Instances can be reclaimed at short notice, so choose the right set of workloads for them.)

11. Multi-AZ (Availability Zone) deployment is important for a high-availability design. At the same time, plan carefully how you spread your workloads, as data transfer across AZs is charged, while transfer within an AZ is generally free.

12. Monitor S3 security settings to protect your data. For example, use Amazon Macie to continuously scan for PII and find where that information lives in your buckets, and make sure strict conditions are defined for your buckets; one such condition could be to require encryption. Another tool is IAM Access Analyzer, which identifies whether your resources grant public access.
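As one possible form of the encryption condition, the sketch below builds a bucket policy that denies uploads arriving without a server-side encryption header. The bucket name is a placeholder, and whether you need such a policy at all depends on your bucket's default encryption settings, so treat it as a starting point rather than a recommended configuration.

```python
import json


def deny_unencrypted_uploads_policy(bucket_name: str) -> dict:
    """Build a bucket policy that rejects PutObject requests which do not
    specify server-side encryption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # "Null": "true" matches requests where the header is absent.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


# "example-bucket" is a hypothetical name used only for this sketch.
policy = deny_unencrypted_uploads_policy("example-bucket")
print(json.dumps(policy, indent=2))
```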

There are many other practices across the different AWS services, and it is good to follow them so that your architecture scales with optimal performance, reliability, operational ease, security and cost.

Happy scaling!

References:

1. Tagging Best Practices: aws-tagging-best-practices.pdf (awsstatic.com)

2. AWS re:Invent sessions
