AWS Console Overview
Fixing Prod Infra issues
Key aim: ensure users regain access to our application in the shortest possible time.
- Don't try to fix the root cause of the issue during the incident.
- Always alleviate the issue so the service gets back online/becomes usable.
Two main services we use:
- EC2 (Elastic Compute Cloud)
- RDS (Relational Database Service)
EC2
Check backend instance
Check backend CPU Utilization graph in AWS
- set the graph to Maximum (don't look at Average, because you need the CPU reading when it's at its peak, at specific times)
- look for any abnormal spikes
- check the date and time to judge, with your intuition, whether it's higher-than-usual peak-hour traffic vs an infra issue
  - higher than usual: could be a marketing campaign that we didn't know of
  - infra issue: you cannot reach the service; 503 errors from the API or frontend web servers
    - in Laravel, the web server is Nginx (the PHP equivalent of Puma), hosted in its own Docker container
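The spike check above can be sketched as a quick filter over per-period Maximum readings (the AWS CLI invocation in the comment, the instance ID, and the 80% threshold are assumptions, not values from this runbook):

```shell
# In prod you would pull the readings with something like (hypothetical instance ID):
#   aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
#     --metric-name CPUUtilization --statistics Maximum --period 300 ...
# Demo: given per-period Maximum readings, flag abnormal spikes (> 80%).
printf '2024-01-01T10:00 34.2\n2024-01-01T10:05 91.7\n2024-01-01T10:10 40.1\n' \
  | awk '$2 > 80 {print "spike:", $1, $2"%"}'
```

The same threshold filter works on any "timestamp value" pairs you export from the console.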
Eyeball backend memory usage
- "eyeball" is to look at a reading (i.e. htop) to get a feeling about the current situation into instance cause
htopreports current memory usage - When you run
htop, only start eyeballing after letting it render and stabilise after a few seconds - Record the max reading you can mentally track over a period of time into a Google Sheets
- The reading will keep changing every second, so you mentally track which reading was the highest in a period of X seconds, however long you can.
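If mentally tracking the peak is error-prone, the same sampling can be scripted (a sketch, assuming a Linux host with `/proc/meminfo`; 5 one-second samples here):

```shell
# Sample used memory once a second and print the peak,
# instead of eyeballing htop. Used = MemTotal - MemAvailable.
max=0
for _ in 1 2 3 4 5; do
  used=$(awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2} END {print t-a}' /proc/meminfo)
  if [ "$used" -gt "$max" ]; then max=$used; fi
  sleep 1
done
echo "peak used: $((max / 1024)) MiB"
```

The printed peak is the number you would record into the Google Sheet.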
Check frontend instance
- ssh into frontend instance
- check nginx logs
  - `sudo docker logs nginx --tail 1000`: not really useful, because it only shows the requests, not errors
- when in doubt, redeploy all containers
  - restart nginx with `sudo docker restart nginx`
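If you do want errors out of those access logs, grepping for 5xx status codes is a quick filter (a sketch; the log lines below are fabricated and assume the default nginx combined log format):

```shell
# In prod: sudo docker logs nginx --tail 1000 2>&1 | grep -E '" 5[0-9]{2} '
# Demo on two fabricated access-log lines: only the 503 should survive.
printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /api/jobs HTTP/1.1" 503 0' \
  '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET / HTTP/1.1" 200 512' \
  | grep -E '" 5[0-9]{2} '
```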
RDS
Check sessions
High session count can mean:
- too many connections open from laravel to the database
- too many php threads running, typically when scaling/performance tuning for a flash event with more than usual traffic.
- queries are taking a lot of time to return/complete, causing connections to remain open.
- Refer to AWS AAS definitions
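To see the raw connection count directly from MySQL (a sketch; the `mysql` invocation in the comment assumes a client with access to the RDS endpoint, and `$RDS_HOST` is a placeholder):

```shell
# In prod: mysql -h "$RDS_HOST" -e "SHOW STATUS LIKE 'Threads_connected'"
# Demo: parse the two-column output that statement returns.
printf 'Variable_name\tValue\nThreads_connected\t142\n' \
  | awk -F'\t' '$1 == "Threads_connected" {print $2}'
```

Compare that number against the session graph in the AWS Console to confirm they agree.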
Check CPU Load
When looking at the graph in the AWS Console, remember to set it to report the maximum CPU load.
High CPU Load:
- Queries are compute intensive (e.g. N+1 problems, large joins)
- Too many processes are running at the same time
`SHOW PROCESSLIST` in the `mysql` console
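When reading `SHOW PROCESSLIST` output, the `Time` column (seconds the statement has been running) is what flags slow queries. A sketch of that triage on fabricated tab-separated output (the 30-second threshold is an assumption):

```shell
# PROCESSLIST columns include Id, User, Time (seconds), State, Info.
# Demo: filter fabricated output for queries running longer than 30s.
printf '12\tapp\t2\texecuting\tSELECT 1\n13\tapp\t95\tSending data\tSELECT * FROM notifications\n' \
  | awk -F'\t' '$3 > 30 {print "id=" $1, "time=" $3 "s", $5}'
# prints: id=13 time=95s SELECT * FROM notifications
```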
Check the top SQL
AWS Console will show you the top 10 SQL statements in AAS units.
- Over a 5-second interval, an AAS reading of 2 means that, on average, 2 sessions were actively executing queries at any given moment during that interval.
- You can select the interval you want to see in the AWS web console.
- You generally want to quickly eyeball the 1s, 5s, and 60s intervals to get a good feel for the counts at the different intervals.
Average Active Sessions (AAS):
- Average number of sessions actively executing queries over a period of time.
- Our database has 8 CPUs, so the maximum number of queries that can execute concurrently (on CPU) on our RDS instance is 8; a sustained AAS above 8 means queries are waiting.
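The AAS arithmetic above reduces to: total active-session seconds divided by the length of the window. A worked example (the session times are made up):

```shell
# Three sessions active for 5s, 5s, and 0s over a 10-second window.
awk 'BEGIN { printf "AAS = %.1f\n", (5 + 5 + 0) / 10 }'
# prints: AAS = 1.0
```

So an AAS of 1.0 means that, on average, one session was busy at any instant in that window, regardless of how many distinct queries ran.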
Database Connections:
- number of connections open from the application to the database
- e.g. Rails can configure the maximum number of connections it opens to the database.
Look at the highest query to determine which part of the codebase it relates to.
- For example, if the top query is `select * from notifications...`, we know it has something to do with notifications in the codebase, greatly reducing our scope of investigation
Trace what endpoint is calling it, and determine if you can totally not call the endpoint in the frontend.
- if you can, you simply comment out the call in the frontend and deploy.
- this will eliminate the problem query from being executed in the db, freeing up db cpu, memory, and session resources.
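Tracing the query back to the code is usually a `grep` over the repo. A sketch against a made-up Laravel file (the path, controller name, and query text are all hypothetical; in prod you would grep at the real repo root):

```shell
# Fabricate a tiny codebase containing the offending query, then locate it.
mkdir -p /tmp/demo-app/app/Http
cat > /tmp/demo-app/app/Http/NotificationController.php <<'EOF'
<?php
$rows = DB::select('select * from notifications where user_id = ?', [$id]);
EOF
grep -rln 'from notifications' /tmp/demo-app
```

The file(s) it prints are where to start looking for which endpoint issues the query.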
HAProxy
Switch the instance inside the `gig_api` backend by commenting/uncommenting the server lines.
[!note] Important
- Only `api-2` and `api-3` should be used interchangeably
- `api-1`'s instance size is one step smaller than `api-2` and `api-3`
- `api-2` and `api-3` are the same instance size
- deployment settings for the `php-fcm` service are already tuned for the api-2/api-3 instance size
- because we use bash scripts, the tuning configurations are hardcoded
Example
The following is the setting in HAProxy before the change.
```
backend gig_api
    balance roundrobin
    option forwardfor # store client ip address in X-Forwarded-For
    http-response set-header X-Robot-Tag "noindex, nofollow, noarchive, nositelinkssearchbox, nosnippet"
    # server "jodgig.prod.api" 10.0.155.72:8000 check verify none weight 50 # 2 cores
    # server "jodgig.prod.api-2" 10.0.152.51:8000 check verify none # 8 cores
    server "jodgig.prod.api-3" 10.0.148.235:8000 check verify none # 8 cores
```
The change is a simple comment/uncomment change.
For example, if the api-3 instance is unreachable (cannot SSH in) or its Instance State in the AWS console is NOT Running:
- comment out `api-3` and uncomment `api-2` to get:
```diff
 backend gig_api
     balance roundrobin
     option forwardfor # store client ip address in X-Forwarded-For
     http-response set-header X-Robot-Tag "noindex, nofollow, noarchive, nositelinkssearchbox, nosnippet"
     # server "jodgig.prod.api" 10.0.155.72:8000 check verify none weight 50 # 2 cores
-    # server "jodgig.prod.api-2" 10.0.152.51:8000 check verify none # 8 cores
+    server "jodgig.prod.api-2" 10.0.152.51:8000 check verify none # 8 cores
-    server "jodgig.prod.api-3" 10.0.148.235:8000 check verify none # 8 cores
+    # server "jodgig.prod.api-3" 10.0.148.235:8000 check verify none # 8 cores
```
Check the HAProxy stats page to make sure the instance is green in colour.
- green means HAProxy can connect to the applications (which run inside Docker containers; Docker runs on the host)