We're looking for an experienced Service Continuity Specialist to maintain and enhance the production reliability of the client's platform as its scope, volume, and user base continue to expand throughout the program.
Your role:
- Work with the Platform Engineering team to continuously automate tasks related to production infrastructure, deployment pipelines, and system stability to improve operational efficiency.
- Maintain a strong relationship with end-users, ensuring the right level of feedback is collected. Provide clear communication on system status and remediation plans. A high level of customer-facing communication is expected in this role.
- Identify and address potential system bottlenecks and failure points before they escalate into incidents.
- Develop, maintain, and upgrade tools to ensure optimal observability across the platform, including user interfaces, analytics, and reporting systems.
- Contribute to infrastructure capacity planning and to the implementation of our Disaster Recovery strategy.
- Establish and track KPIs and SLAs.
Technical Skills:
- Prior experience in a site reliability engineering (SRE) role, preferably within a commodity trading or similar organization. Candidates should be able to demonstrate expertise in SRE concepts, including, but not limited to, practical experience maintaining SLAs, SLIs, and SLOs and automating operational processes.
- At least 5 years of experience maintaining decentralized or microservices-based systems in a production environment.
- In-depth understanding of microservices-based systems, including designing, deploying and managing distributed, scalable services.
- Proficiency in the .NET ecosystem, with experience in:
a) Event-driven architecture and data processing, using frameworks such as Azure Event Hubs or Apache Kafka;
b) RESTful APIs and gRPC;
c) GraphQL.
- Experience with relational and document-based databases.
- Experience with cloud PaaS and IaaS offerings (Microsoft Azure preferred), and with maintaining containerized microservice architectures using technologies such as Docker and Kubernetes.