AWS Console Overview
Fixing Prod Infra issues
Key aim: ensure users regain access to our application in the shortest possible time.
- Don't try to fix the root cause of the issue during the incident.
- Always alleviate the issue so the service gets back online/becomes usable.
Two main services we use:
- EC2 (Elastic Compute Cloud)
- RDS (Relational Database Service)
EC2
Check backend instance
Check backend CPU Utilization graph in AWS
- set the graph to Maximum (don't look at Average, because you need the CPU reading when it's at its peak, at specific times)
- look for any abnormal spikes
- check the date and time to judge, with your intuition, whether it's higher-than-usual peak-hour traffic vs an infra issue
  - higher than usual: could be a marketing campaign that we didn't know of
  - infra issue: you cannot reach the service; 503 errors from the API or frontend web servers
    - in Laravel, the web server is Nginx (the PHP equivalent of Puma), hosted in its own Docker container
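The spike check above can be sketched as a quick filter over per-period Maximum readings (the AWS CLI invocation in the comment, the instance ID, and the 80% threshold are assumptions, not values from this runbook):

```shell
# In prod you would pull the readings with something like (hypothetical instance ID):
#   aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
#     --metric-name CPUUtilization --statistics Maximum --period 300 ...
# Demo: given per-period Maximum readings, flag abnormal spikes (> 80%).
printf '2024-01-01T10:00 34.2\n2024-01-01T10:05 91.7\n2024-01-01T10:10 40.1\n' \
  | awk '$2 > 80 {print "spike:", $1, $2"%"}'
```

The same threshold filter works on any "timestamp value" pairs you export from the console.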
Eyeball backend memory usage
- "eyeball" is to look at a reading (i.e. htop) to get a feeling about the current situation into instance cause
htopreports current memory usage - When you run
htop, only start eyeballing after letting it render and stabilise after a few seconds - Record the max reading you can mentally track over a period of time into a Google Sheets
- The reading will keep changing every second, so you mentally track which reading was the highest in a period of X seconds, however long you can.
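If mentally tracking the peak is error-prone, the same sampling can be scripted (a sketch, assuming a Linux host with `/proc/meminfo`; 5 one-second samples here):

```shell
# Sample used memory once a second and print the peak,
# instead of eyeballing htop. Used = MemTotal - MemAvailable.
max=0
for _ in 1 2 3 4 5; do
  used=$(awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2} END {print t-a}' /proc/meminfo)
  if [ "$used" -gt "$max" ]; then max=$used; fi
  sleep 1
done
echo "peak used: $((max / 1024)) MiB"
```

The printed peak is the number you would record into the Google Sheet.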
Check frontend instance
- ssh into frontend instance
- check nginx logs
  - `sudo docker logs nginx --tail 1000`: not really useful, because it only shows the requests, not errors
- when in doubt, redeploy all containers
  - restart nginx with `sudo docker restart nginx`
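If you do want errors out of those access logs, grepping for 5xx status codes is a quick filter (a sketch; the log lines below are fabricated and assume the default nginx combined log format):

```shell
# In prod: sudo docker logs nginx --tail 1000 2>&1 | grep -E '" 5[0-9]{2} '
# Demo on two fabricated access-log lines: only the 503 should survive.
printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /api/jobs HTTP/1.1" 503 0' \
  '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 512' \
  | grep -E '" 5[0-9]{2} '
```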
RDS
Check sessions
High session count can mean:
- too many connections open from laravel to the database
- too many php threads running, typically when scaling/performance tuning for a flash event with more than usual traffic.
- queries are taking a lot of time to return/complete, causing connections to remain open.
- Refer to AWS AAS definitions
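To see the raw connection count directly from MySQL (a sketch; the `mysql` invocation in the comment assumes a client with access to the RDS endpoint, and `$RDS_HOST` is a placeholder):

```shell
# In prod: mysql -h "$RDS_HOST" -e "SHOW STATUS LIKE 'Threads_connected'"
# Demo: parse the two-column output that statement returns.
printf 'Variable_name\tValue\nThreads_connected\t142\n' \
  | awk -F'\t' '$1 == "Threads_connected" {print $2}'
```

Compare that number against the session graph in the AWS Console to confirm they agree.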
Check CPU Load
When looking at the graph in the AWS Console, remember to set it to report the maximum CPU load.
High CPU Load:
- Queries are compute intensive (e.g. N+1 problems, large joins)
- Too many processes are running at the same time
`SHOW PROCESSLIST` in the `mysql` console
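When reading `SHOW PROCESSLIST` output, the `Time` column (seconds the statement has been running) is what flags slow queries. A sketch of that triage on fabricated tab-separated output (the 30-second threshold is an assumption):

```shell
# PROCESSLIST columns include Id, User, Time (seconds), State, Info.
# Demo: filter fabricated output for queries running longer than 30s.
printf '12\tapp\t2\texecuting\tSELECT 1\n13\tapp\t95\tSending data\tSELECT * FROM notifications\n' \
  | awk -F'\t' '$3 > 30 {print "id=" $1, "time=" $3 "s", $5}'
# prints: id=13 time=95s SELECT * FROM notifications
```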
Check the top SQL
AWS Console will show you the top 10 SQL statements in AAS units.
- Over a 5-second interval, an AAS reading of 2 means that, on average, 2 sessions were actively executing queries at any given moment during that interval.
- You can select the interval you want to see in the AWS web console.
- You generally want to quickly eyeball the 1s, 5s, and 60s intervals to get a good feel for the counts at the different intervals.
Average Active Sessions (AAS):
- Average number of sessions actively executing queries over a period of time.
- Our database has 8 CPUs, so the maximum number of queries that can execute concurrently (on CPU) on our RDS instance is 8; a sustained AAS above 8 means queries are waiting.
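The AAS arithmetic above reduces to: total active-session seconds divided by the length of the window. A worked example (the session times are made up):

```shell
# Three sessions active for 5s, 5s, and 0s over a 10-second window.
awk 'BEGIN { printf "AAS = %.1f\n", (5 + 5 + 0) / 10 }'
# prints: AAS = 1.0
```

So an AAS of 1.0 means that, on average, one session was busy at any instant in that window, regardless of how many distinct queries ran.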
Database Connections:
- number of connections open from the application to the database
- e.g. Rails can configure the maximum number of connections it opens to the database.
Look at the highest query to determine which part of the codebase it relates to.
- For example, if the top query is `select * from notifications...`, we know it has something to do with notifications in the codebase, greatly reducing our scope of investigation
Trace what endpoint is calling it, and determine if you can totally not call the endpoint in the frontend.
- if you can, you simply comment out the call in the frontend and deploy.
- this will eliminate the problem query from being executed in the db, freeing up db cpu, memory, and session resources.
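Tracing the query back to the code is usually a `grep` over the repo. A sketch against a made-up Laravel file (the path, controller name, and query text are all hypothetical; in prod you would grep at the real repo root):

```shell
# Fabricate a tiny codebase containing the offending query, then locate it.
mkdir -p /tmp/demo-app/app/Http
cat > /tmp/demo-app/app/Http/NotificationController.php <<'EOF'
<?php
$rows = DB::select('select * from notifications where user_id = ?', [$id]);
EOF
grep -rln 'from notifications' /tmp/demo-app
```

The file(s) it prints are where to start looking for which endpoint issues the query.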
HAProxy
Switch the instance inside the `gig_api` backend by commenting/uncommenting the server lines.
[!note] Important
- Only `api-2` and `api-3` should be used interchangeably
- `api-1`'s instance size is one step smaller than `api-2` and `api-3`
- `api-2` and `api-3` are the same instance size
- deployment settings for the `php-fcm` service are already tuned for the api-2/api-3 instance size
- because we use bash scripts, the tuning configurations are hardcoded
Example
The following is the setting in HAProxy before the change.
```
backend gig_api
    balance roundrobin
    option forwardfor # store client ip address in X-Forwarded-For
    http-response set-header X-Robot-Tag "noindex, nofollow, noarchive, nositelinkssearchbox, nosnippet"
    # server "jodgig.prod.api" 10.0.155.72:8000 check verify none weight 50 # 2 cores
    # server "jodgig.prod.api-2" 10.0.152.51:8000 check verify none # 8 cores
    server "jodgig.prod.api-3" 10.0.148.235:8000 check verify none # 8 cores
```
The change is a simple comment/uncomment change.
For example, if the api-3 instance is unreachable (cannot SSH in) or its Instance State in the AWS console is NOT Running:
- comment out `api-3` and uncomment `api-2` to get:
```diff
 backend gig_api
     balance roundrobin
     option forwardfor # store client ip address in X-Forwarded-For
     http-response set-header X-Robot-Tag "noindex, nofollow, noarchive, nositelinkssearchbox, nosnippet"
     # server "jodgig.prod.api" 10.0.155.72:8000 check verify none weight 50 # 2 cores
-    # server "jodgig.prod.api-2" 10.0.152.51:8000 check verify none # 8 cores
+    server "jodgig.prod.api-2" 10.0.152.51:8000 check verify none # 8 cores
-    server "jodgig.prod.api-3" 10.0.148.235:8000 check verify none # 8 cores
+    # server "jodgig.prod.api-3" 10.0.148.235:8000 check verify none # 8 cores
```
Check the HAProxy stats page to make sure the instance is green in colour.
- green means HAProxy can connect to the applications (which run inside Docker containers; Docker runs on the host)