As a Site Reliability Engineer, you will operate and maintain LANDI Global's infrastructure. Your main responsibilities will be to:
- Build, operate, and maintain our platform infrastructure across various environments
- Collaborate with the R&D team to ensure availability, reliability, and scalability of our platforms
- Implement and configure monitoring and alerting systems, together with resolution procedures, to provide visibility and ensure timely remediation of platform failures
- Implement and maintain Disaster Recovery plans to ensure business continuity
- Analyse and present performance and cost optimization opportunities for the platforms
- Design and implement automated testing, continuous integration, and continuous delivery frameworks and processes for deployment efficiency
- Manage change management and incident reporting processes to anticipate and respond to incidents in conformance with platform SLAs
- Provide operational support for the platforms and help resolve production issues as an escalation point for the team
- Participate in the 24/7 on-call rotation
- Support deployment of environments as new clients are onboarded
EXPERIENCE
- At least 5 years of experience in a similar capacity
- Excellent oral and written communication skills in English
PREFERRED SKILLS
Candidates should ideally have experience in some of the following technologies:
- Experience in various cloud technologies (e.g. AWS, Azure)
- Experience in distributed Linux/Unix operating systems
- Experience in high-level programming or scripting languages
- Experience in monitoring tools (e.g. Prometheus, Grafana, Zabbix)
- Experience in configuration management tools (e.g. Ansible, Chef, Puppet)
- Experience in SQL databases (e.g. Postgres, MySQL)
- Experience in load balancing and reverse proxies (e.g. Nginx)
- Experience in CI/CD tools (e.g. Jenkins, GitLab)
- Experience in containerization (e.g. Docker, Kubernetes)