Site Reliability Engineer

Lisbon (Hybrid)

Engineering

Shield is a global startup, with offices in TLV, NYC, LDN, and LIS. 

We’re rapidly growing and looking for another important piece of the puzzle. 

Is it you? 

We are looking for a Site Reliability Engineer to join our team!

Let’s get down to business: 

What you’ll do:

  • Design and maintain scalable, reliable AWS infrastructure.
  • Monitor health, performance, security, and capacity of production environments.
  • Develop and manage monitoring, alerting, and logging systems for proactive issue resolution.
  • Review and refine existing alerts, collaborating with developers to automate responses and enable self-healing systems.
  • Develop and maintain monitoring dashboards that provide clear and actionable insights into application reliability, system performance, and capacity utilization.
  • Conduct capacity planning and performance tuning to optimize system performance and resource utilization.
  • Fine-tune autoscaling and efficiency tools (e.g., Karpenter, KEDA, HPA) based on workload patterns.
  • Automate repetitive tasks and processes to streamline operations and improve efficiency.
  • Manage routine operational activities, including log reviews, system checks, and verification of automated processes.
  • Participate in incident response and resolution, including rapid troubleshooting, root cause analysis, and contributing to post-mortem reviews.
  • Maintain and improve incident response procedures and runbooks to ensure efficient and effective handling of incidents.
  • Continuously evaluate and adopt new technologies and methodologies to enhance infrastructure and operations.
  • Continuously monitor resource utilization and cloud expenditure within all production environments.
  • Implement operational cost optimization measures, such as rightsizing resources based on utilization data and terminating orphaned/unused resources.
  • Monitor and manage capacity utilization and cloud service quotas in production environments to ensure availability and performance.
  • Identify and remediate (or escalate) configuration drift from security and compliance baselines.

Experience and Skills:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • 4+ years of experience as a site reliability or platform engineer, preferably in a fast-scaling environment.
  • Hands-on experience with Terraform and Terragrunt.
  • Extensive knowledge of Kubernetes and containerization technologies.
  • Hands-on experience with the Prometheus stack.
  • Ability to design and develop code using Python or Go.
  • Strong inclination toward automating manual tasks and processes to improve operational efficiency.
  • Excellent troubleshooting abilities with a methodical approach to diagnosing and resolving issues.
  • In-depth knowledge of cloud services, particularly AWS, including best practices in security and compliance.
  • Strong communication skills to collaborate effectively with both technical and non-technical stakeholders.


Oh hey, you made it all the way here! 

So, in case you were wondering, Shield is how compliance teams in financial services can finally read between the lines to see what their employee communications are really saying.  

Our platform analyzes digital interactions to fight financial crime and mitigate toxic workplace environments.

Shield is a post-Series B startup ($35M) with some of the largest financial organizations in the world as investors and customers.

Shielders listen more intently. Pay closer attention to the details. Make the extra effort. Care. It’s what we do at Shield every day. And not just for our customers, but for everyone we work with. It’s all about creating a world where people understand and trust each other. 

Shield is set to do good in the world. We help protect market integrity and people’s financial assets.
