The Toughest SRE Interview Questions I Faced — You Won’t Believe!
1. Is a five nines SLO good or bad?
Expected Answer: Explain that a five nines SLO (99.999% availability) is extremely demanding and is typically reserved for critical services where downtime must be kept near zero. Also discuss the challenges and trade-offs involved, including the operational overhead and cost of maintaining such high availability.
Important Points to Mention:
- Five nines means that the service can only be down for approximately 5.26 minutes per year.
- While it’s ideal for critical services like payment systems, maintaining five nines can be costly and complex.
- The higher the availability target, the more expensive and difficult it becomes to manage failures, requiring robust automation, redundancy, and rapid recovery mechanisms.
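The 5.26-minute figure follows directly from the SLO percentage. Here is a minimal sketch of that arithmetic, assuming an average year of 365.25 days; the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days (an assumption)


def allowed_downtime_minutes(slo_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in a period for a given availability SLO."""
    return period_minutes * (1 - slo_percent / 100)


# Each extra nine cuts the allowance by a factor of ten.
for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes/year")
```

Seeing the allowance shrink tenfold per nine is a quick way to show an interviewer why each step up costs so much more.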
Example You Can Give:
For a global financial service handling millions of transactions, five nines may be necessary to avoid losses and maintain trust. However, for internal tools or less critical applications, it might be overkill, driving unnecessary complexity and cost.
Hedge Your Answer:
While a five nines SLO is impressive, it’s important to assess whether the business or user impact justifies the operational effort. In some cases, a slightly lower SLO (e.g., 99.9%) might provide sufficient reliability without the overhead.
2. Why is Configuration as Code important?
Expected Answer: Discuss the benefits of Configuration as Code (CaC), such as version control, repeatability, consistency, and automation. Emphasize how storing configurations in code repositories enables quick recovery, easy audits, and collaboration across teams.
Important Points to Mention:
- Configuration as Code ensures consistency across environments by versioning configurations.
- It supports automation, enabling faster and more reliable deployments.
- Changes are trackable, providing transparency and auditability.
- It reduces human error by removing the need for manual configuration.
Example You Can Give:
In one project, we used Terraform for managing infrastructure as code. This allowed us to consistently replicate our production environment in staging, leading to fewer issues during deployment and quicker recovery from configuration drift.
Hedge Your Answer:
While Configuration as Code offers many advantages, it requires disciplined management. Poorly written or reviewed configuration code can lead to significant downtime, so proper validation and testing processes are critical.
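One way to picture the validation step is a small check that runs in CI before any configuration is applied. The sketch below is only an illustration: the keys `replicas` and `environment` and the allowed environment names are hypothetical, not taken from any particular tool.

```python
ALLOWED_ENVIRONMENTS = {"staging", "production"}  # hypothetical environment names


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    errors = []
    replicas = config.get("replicas")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(replicas, int) or isinstance(replicas, bool) or replicas < 1:
        errors.append("replicas must be a positive integer")
    if config.get("environment") not in ALLOWED_ENVIRONMENTS:
        errors.append("environment must be one of: "
                      + ", ".join(sorted(ALLOWED_ENVIRONMENTS)))
    return errors


# A bad change is caught in review instead of in production.
print(validate_config({"replicas": 0, "environment": "prod"}))
```

Returning every problem at once, rather than stopping at the first, keeps the review loop short.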
3. Should I automate everything or just some things?
Expected Answer: Explain that automation is crucial for repetitive tasks, error-prone processes, and scaling, but that not everything should be automated. Discuss when to automate (e.g., frequent tasks) and when manual intervention is still necessary (e.g., one-off incidents or creative problem-solving).
Important Points to Mention:
- Automate tasks that are repetitive, time-consuming, or prone to error.
- Avoid automating complex tasks that require human judgment or creativity.
- Consider the cost and effort required to automate — if it takes more time to automate than to perform the task manually, it might not be worth it.
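The cost-versus-effort point can be made concrete with a rough break-even estimate. This is a back-of-the-envelope sketch with hypothetical parameter names; it ignores maintenance cost and the value of fewer errors.

```python
import math
from typing import Optional


def automation_break_even_runs(build_hours: float,
                               manual_hours_per_run: float,
                               automated_hours_per_run: float = 0.0) -> Optional[int]:
    """Return how many runs it takes for automation to pay back its build time.

    Returns None when automation saves no time per run.
    """
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return None
    return math.ceil(build_hours / saved_per_run)


# 10 hours to build, 30 minutes saved per run: pays off after 20 runs.
print(automation_break_even_runs(build_hours=10, manual_hours_per_run=0.5))
```

If a task runs only a handful of times a year, a break-even of 20 runs is a reasonable argument for leaving it manual.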
Example You Can Give:
We automated our database backups and server provisioning process, which saved us significant time. However, we chose not to automate incident escalations, as human judgment is required to assess the severity of issues.
Hedge Your Answer:
While automation is powerful, it’s essential to maintain flexibility for tasks that require human oversight. Over-automation can lead to brittle systems where manual intervention becomes difficult in emergencies.
4. Can you explain the CAP theorem?
Expected Answer: Explain that the CAP theorem states that a distributed system can guarantee only two of three properties at once: Consistency, Availability, and Partition Tolerance. In practice, network partitions cannot be ruled out, so the real choice is between consistency and availability while a partition lasts. Describe each term and discuss the trade-offs in the context of real-world systems.
Important Points to Mention:
- Consistency: All nodes in the system return the same data at any given time.
- Availability: Every request to a working node receives a non-error response, but not necessarily the most up-to-date data.
- Partition Tolerance: The system continues to operate even if there’s a network partition between nodes.
- In real-world systems, trade-offs must be made depending on the use case, as no distributed system can guarantee all three while a network partition is in progress.
Example You Can Give:
In a system like Cassandra, which is designed for high availability and partition tolerance, the trade-off is eventual consistency. This means the system can tolerate network issues and remain available, but not all nodes may immediately reflect the latest data.
Hedge Your Answer:
The CAP theorem is a simplified model of these trade-offs. Many modern systems offer ‘soft guarantees’, such as eventual or tunable consistency, that get closer to all three properties in practice and work well in large-scale distributed systems.
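Cassandra-style systems let clients tune how many replicas must acknowledge a read or a write. A common rule of thumb is that a read is guaranteed to see the latest acknowledged write when the read and write replica counts overlap, that is, when R + W > N. The sketch below only illustrates that rule; the function and parameter names are hypothetical and it is not a model of any real database.

```python
def quorum_overlaps(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """Return True when every read set must share a replica with every write set."""
    return read_acks + write_acks > n_replicas


# N=3 with quorum reads and writes (2 + 2 > 3): reads see the latest write.
print(quorum_overlaps(n_replicas=3, write_acks=2, read_acks=2))

# N=3 with single-replica reads and writes (1 + 1 <= 3): faster and more
# available, but a read may hit a replica that has not seen the write yet.
print(quorum_overlaps(n_replicas=3, write_acks=1, read_acks=1))
```

Lower acknowledgement counts keep the system responsive during a partition, at the price of possibly stale reads.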
5. Give me a non-technical explanation for immutability.
Expected Answer: Explain immutability in a simple, relatable way: once something is created, it cannot be changed. Use an analogy, such as carving words in stone versus writing on a whiteboard that can be wiped and rewritten.
Important Points to Mention:
- Immutability means once something is created, it cannot be altered.
- In systems, this prevents side effects and simplifies debugging, as you always know the state hasn’t changed unexpectedly.
- It’s often used in database records, logs, or infrastructure (e.g., immutable servers).
Example You Can Give:
Think of immutability like writing a letter and mailing it. Once the letter is sent, you can’t change its content. If you want to update the message, you’ll need to write a new letter.
Hedge Your Answer:
While immutability offers clarity and predictability, it may introduce overhead since every change requires creating a new version. This can lead to higher storage costs and complexity in systems that frequently update data.
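If the interviewer asks for a technical follow-up, the letter analogy maps onto code fairly directly. Here is a minimal sketch using a frozen dataclass; `ServerImage` and its fields are hypothetical names chosen for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ServerImage:
    """An immutable record: fields cannot be reassigned after creation."""
    name: str
    version: str


v1 = ServerImage(name="web", version="1.0")

# "Changing" the image means creating a new one; the original is untouched.
v2 = replace(v1, version="1.1")

try:
    v1.version = "2.0"  # in-place edits are rejected
except FrozenInstanceError:
    print("cannot modify an existing image; build a new one instead")

print(v1, v2)
```

Both versions exist side by side, which is the storage overhead mentioned above and also what makes rollback simple.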
6. Explain an SLO, SLI, and error budget.
Expected Answer: Explain that an SLI (Service Level Indicator) is the metric used to measure performance, while an SLO (Service Level Objective) is a specific target for that metric. The error budget defines how much unreliability is acceptable, balancing innovation and reliability.
Important Points to Mention:
- SLI: A measurable value (e.g., uptime, response time) that reflects the service’s performance.
- SLO: A target or threshold (e.g., 99.9% availability) that the service is expected to meet.
- Error Budget: The allowable downtime or failure time that fits within the SLO. If the error budget is exhausted, no further risky changes should be made until reliability is restored.
Example You Can Give:
For our web application, we set an SLO of 99.9% availability, with an SLI measuring the actual uptime. Our error budget allowed for 0.1% downtime per month, meaning about 43 minutes of downtime was acceptable. Once we exceeded that, we froze all risky changes until the system stabilized.
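The 43-minute figure comes from applying the 0.1% allowance to a 30-day month. Below is a minimal sketch of that calculation and of the freeze decision, assuming a 30-day window; the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime the SLO allows within the window."""
    return window_days * MINUTES_PER_DAY * (1 - slo_percent / 100)


def should_freeze_risky_changes(slo_percent: float,
                                downtime_minutes: float,
                                window_days: int = 30) -> bool:
    """Return True once observed downtime has used up the error budget."""
    return downtime_minutes >= error_budget_minutes(slo_percent, window_days)


print(f"{error_budget_minutes(99.9):.1f} minutes of budget per 30 days")
print(should_freeze_risky_changes(99.9, downtime_minutes=50))
```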
Hedge Your Answer:
While SLOs and error budgets help maintain service reliability, setting overly aggressive SLOs can slow down innovation and make error budgets harder to manage. It’s important to find a balance between reliability and the ability to deploy new features.
7. Tell me how you would set an SLO, SLI, and error budget for X service.
Expected Answer: Describe setting SLOs based on business impact, user expectations, and historical performance. Choose SLIs that reflect the most critical aspects of service reliability, and align the error budget with the organization’s tolerance for risk and downtime.
Important Points to Mention:
- Understand the critical metrics for the service (e.g., availability, response time).
- Set an SLO that meets customer expectations while balancing operational capacity.
- Calculate the error budget based on the SLO and use it to inform decision-making on risk and change velocity.
Example You Can Give:
For an e-commerce website, the SLI could be uptime, with an SLO of 99.95% availability. Measured around the clock, the error budget allows for approximately 22 minutes of downtime in a 30-day month (less if only peak hours count toward the SLO), and once that is consumed, we pause new feature releases to focus on stability.
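Uptime is one choice of SLI; many teams instead count successful requests. The sketch below shows that alternative, a request-based availability SLI and the share of error budget consumed, under the assumption that "good" and "total" request counts are available from monitoring. All names here are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return availability as a percentage of requests served successfully."""
    if total_requests == 0:
        return 100.0  # no traffic means nothing failed
    return 100 * good_requests / total_requests


def budget_consumed_fraction(slo_percent: float,
                             good_requests: int,
                             total_requests: int) -> float:
    """Return the fraction of the error budget used (1.0 means exhausted)."""
    allowed_bad = total_requests * (1 - slo_percent / 100)
    actual_bad = total_requests - good_requests
    if allowed_bad == 0:
        return float("inf") if actual_bad > 0 else 0.0
    return actual_bad / allowed_bad


# 250 failed requests out of 1,000,000 against a 99.95% SLO:
# about half of the budget is gone.
print(availability_sli(999_750, 1_000_000))
print(round(budget_consumed_fraction(99.95, 999_750, 1_000_000), 2))
```

A request-based SLI weights busy periods more heavily than quiet ones, which often matches user impact better than wall-clock uptime.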
Hedge Your Answer:
Setting an SLO requires ongoing iteration, as business priorities and user expectations evolve. Initially, the target might be too aggressive or lenient, so regular review and adjustment are key to maintaining a realistic and effective SLO.
8. LeetCode or coding assessment
Expected Answer: Discuss your experience with coding assessments, focusing on problem-solving skills such as data structures, algorithms, and system design. Mention how coding challenges reflect real-world problem-solving, especially in troubleshooting and optimization.
Important Points to Mention:
- LeetCode-style questions assess algorithmic thinking, not just memorization.
- Coding assessments focus on solving problems under constraints, which is a core SRE skill when responding to incidents.
- Experience with questions related to arrays, sorting, and optimization (but typically not as much with graphs or trees in SRE interviews).
Example You Can Give:
I’ve completed numerous coding assessments focused on sorting and searching algorithms, which have helped me in day-to-day SRE tasks like optimizing log search operations or troubleshooting slow queries.
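As an illustration of how a searching algorithm shows up in log work, here is a small sketch that counts events in a time window with binary search instead of a full scan. It assumes the timestamps are already sorted, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_events_in_window(sorted_timestamps: list[int], start: int, end: int) -> int:
    """Count timestamps in the inclusive range [start, end] using binary search.

    Runs in O(log n) instead of scanning every entry.
    """
    first = bisect_left(sorted_timestamps, start)   # first index with value >= start
    last = bisect_right(sorted_timestamps, end)     # first index with value > end
    return last - first


timestamps = [1, 3, 3, 7, 10]
print(count_events_in_window(timestamps, 3, 7))
```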
9. Trivial questions like “what is the angle of 3:15 on a clock?”
Expected Answer: Demonstrate your ability to think through unexpected or non-technical questions by breaking the problem down logically. Walk through the calculation step by step, showing how you approach non-standard problems under pressure.
Important Points to Mention:
- The clock face is a full circle of 360 degrees, covering 12 hours.
- Each hour represents 30 degrees (360 / 12).
- The minute hand moves 6 degrees per minute (360 / 60).
- At 3:15, the hour hand would be slightly ahead of the 3 (by 7.5 degrees, since the hour hand moves 0.5 degrees per minute), while the minute hand would be on the 90-degree mark.
Example You Can Give:
The hour hand at 3 is positioned at 90 degrees, and since it’s 15 minutes past, it has moved 7.5 degrees further (0.5 degrees per minute), putting it at 97.5 degrees. The minute hand is at 15 × 6 = 90 degrees. So, the angle between the hour and minute hands is 97.5 - 90 = 7.5 degrees.
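The same reasoning works for any time if both hands are measured clockwise from 12. A minimal sketch follows, with an illustrative function name; for 3:15 the hour hand is at 97.5 degrees and the minute hand at 90 degrees, so the result is 7.5 degrees.

```python
def clock_angle(hour: int, minute: int) -> float:
    """Return the smaller angle in degrees between the hour and minute hands."""
    # Hour hand: 30 degrees per hour plus 0.5 degrees per elapsed minute.
    hour_angle = (hour % 12) * 30 + minute * 0.5
    # Minute hand: 6 degrees per minute.
    minute_angle = minute * 6
    diff = abs(hour_angle - minute_angle)
    # The hands form two angles that sum to 360; report the smaller one.
    return min(diff, 360 - diff)


print(clock_angle(3, 15))  # 7.5
```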
Hedge Your Answer:
Although this problem isn’t related to SRE work, it’s important to stay calm and approach such questions methodically. Thinking clearly under pressure and breaking down the problem is key, even when faced with unfamiliar or trivial questions.
10. Behavioral: What makes you an SRE, and why?
Expected Answer: Focus on your skills in reliability engineering, passion for problem-solving, and experience with systems and automation. Explain how your skills and mindset align with SRE principles like service availability, automation, and incident response.
Important Points to Mention:
- Experience with incident management, monitoring, and maintaining service reliability.
- Passion for optimizing systems and reducing toil through automation.
- Desire to balance innovation with stability and ensure smooth operations.
Example You Can Give:
I enjoy solving complex system issues and building automation to reduce manual work. I thrive in high-pressure situations, where maintaining service reliability and handling incidents is crucial. SRE work fits my skills in system design, monitoring, and continuous improvement.
Hedge Your Answer:
Being an SRE isn’t just about technical skills, but also about mindset. It’s about continuously improving systems, being proactive in preventing failures, and collaborating with teams to build more resilient architectures.
11. What do you do in your spare time — like projects or attending events related to SRE?
Expected Answer: Talk about any personal projects, open-source contributions, or participation in tech communities related to reliability engineering. Also mention attending meetups, conferences, or workshops to stay up to date with the latest trends in the SRE space.
Important Points to Mention:
- Personal projects related to automation, monitoring, or infrastructure.
- Participation in open-source projects or contributions to the SRE community.
- Attending conferences, webinars, or reading technical blogs to stay informed about new tools and techniques in the industry.
Example You Can Give:
In my spare time, I work on open-source projects related to Kubernetes monitoring and observability. I also attend local DevOps and SRE meetups to discuss best practices with peers and stay updated on the latest tools in the reliability space.
Hedge Your Answer:
While I focus on building my technical skills during my free time, it’s also important to balance work with other interests. Maintaining a healthy work-life balance helps ensure I stay engaged and avoid burnout.