Site Reliability Engineer - Machine Learning Systems

HireIO, Inc.

Exp: 3-5 Years

Full time

Singapore

Job Description

Responsibilities

Responsible for ensuring that our ML systems run efficiently for large-model deployment, training, evaluation, and inference
Responsible for the stability of offline tasks and services in multi-data-center, multi-region, and multi-cloud scenarios
Responsible for resource management, planning, cost, and budget, covering both computing and storage resources
Responsible for global system disaster recovery, cluster machine governance, the stability of business services, and improvements in resource utilisation and operational efficiency
Build software tools, products, and systems to monitor and manage the ML infrastructure and services efficiently
Be part of the global team roster that provides system and business on-call support

Requirements

Qualifications

Bachelor's degree or above in Computer Science, Computer Engineering, or a related field;
Strong proficiency in at least one programming language, such as Go, Python, or Shell, in a Linux environment;
Strong hands-on experience with Kubernetes and containers, and 3 years of relevant operations and maintenance experience.

Preferred Qualifications

Excellent logical analysis skills and the ability to abstract and decompose business logic sensibly, along with a strong sense of responsibility, good learning and communication abilities, self-motivation, and good team spirit;
Good documentation principles and habits, with the ability to write and update workflow and technical documentation on time as required;
Experience in the operation and maintenance of large-scale distributed ML systems;
Experience in the operation and maintenance of GPU servers.

More Info

Industry: Other

Job Type: Permanent Job

Skills Required

GPU servers

Shell

Containers

Kubernetes

Python

Date Posted: 08/11/2024

Job ID: 99652683

About Company

HireIO, Inc.

Job Source: www.linkedin.com

Last Updated: 08-11-2024 07:26:11 PM
