COMPANY DESCRIPTION
NTUC Enterprise Co-operative Limited is the holding entity and single largest shareholder of the NTUC group of Social Enterprises. We aim to create a greater social force to do good by harnessing the capabilities of the social enterprises to meet pressing social needs in areas like health and eldercare, childcare, daily essentials, cooked food, and financial services. Serving over two million customers, NTUC Enterprise wants to enable and empower all in Singapore to live better and more meaningful lives.
The NTUC Enterprise Centre of Excellence for Data, Digitalisation and Technology leads the transformation of the NTUC Social Enterprises by leveraging digital technologies to become more nimble, adaptable, and innovative in today's digital age. The NTUC Enterprise Centre of Excellence for Data, Digitalisation and Technology has been registered as NTUC Enterprise Nexus, a wholly owned subsidiary of NTUC Enterprise.
DESIGNATION: Reliability Engineer (AWS/GCP) (1-year contract)
RESPONSIBILITIES
NTUC Enterprise Nexus Co-operative Limited is currently hiring a Reliability Engineer to join the Digital Product Development organization.
The team combines software and system engineering to architect and run large-scale, distributed, and fault-tolerant systems. The team's primary goal is to sustainably achieve product reliability through software engineering practices, architecture patterns, culture building, process standardization, automation frameworks, education, and sharing. The team practices industry reliability frameworks such as Service Level Objectives (SLOs) and Service Level Indicators (SLIs), release engineering, Infrastructure as Code (IaC), and operations automation. The team will empower our product developers throughout the Product Development Life Cycle to ensure product reliability. This includes, but is not limited to, building self-serve tools and processes and an infrastructure foundation that allows the product team to consistently deliver highly reliable systems.
The ideal Reliability Engineer candidate is either a software engineer with a good DevOps mindset or a highly skilled system administrator with knowledge of programming and operations automation. You like to solve complex problems with simplicity in mind, are willing to work around the clock to ensure system reliability, and enjoy collaborating with other teams to adopt reliability disciplines and frameworks.
As a Reliability Engineer, you will have the opportunity to manage the complex challenges of the Social Enterprise System that are unique to NTUC Enterprise Nexus Co-operative Limited, while using your expertise in coding, algorithms, complexity analysis, and large-scale system design.
You will be reporting to the Architecture & Reliability Lead.
Work with product developers to ensure that the software delivery pipeline is as reliable as possible.
Drive practices that ensure the reliability of the product.
Collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability.
Responsible for availability, latency, performance, efficiency, monitoring, emergency response, and system capacity planning.
Improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, system capacity planning, and post-mortems.
Maintain services once they are launched by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Document tribal knowledge.
Advocate for Reliability Engineering practices.
Empower product developers to manage their applications with governance and control.
QUALIFICATIONS
Experience in analyzing and troubleshooting systems.
Understanding of infrastructure monitoring, logging, alerting, release management, and configuration management.
Understanding of networking (e.g. TCP/IP, routing, network topology, load balancers, DNS, NTP).
Experience in at least one of the following: Python, Java, Go, Perl, Ruby, or shell scripting.
Experience with public cloud platforms (AWS and/or GCP).
Experience maintaining Internet-facing production-grade applications.
Experience with software deployment and/or orchestration technologies, e.g., Puppet, Chef, Salt, Ansible, Docker, Kubernetes, Terraform.
Experience in CI/CD (e.g., JIRA, Git, Jenkins, Nexus, ...).
Experience in standard IT security practices (e.g., encryption, certificates, key management).
Excellent communication and problem-solving skills, with strong attention to detail.
Flexibility to work outside business hours, which may include weekends and/or holidays.
Self-starter who is able to identify and perform tasks with minimal supervision.