Family Group: Administration
The System Administrator in the AI/ML setting is responsible for maintaining, managing, and optimizing the IT infrastructure that supports AI and ML applications. This includes ensuring system reliability, security, and performance, as well as supporting data scientists and engineers in their workflow.
Infrastructure Management:
Provision, configure, and maintain servers, storage, and networking equipment.
Ensure high availability and disaster recovery of AI/ML infrastructure.
Implement and manage virtualized and containerized environments (e.g., OpenStack, Docker, Kubernetes).
System Monitoring and Maintenance:
Monitor system performance and maintain uptime and availability.
Apply regular system updates and patches, and maintain security configurations.
Troubleshoot hardware and software issues in a timely manner.
Automation and Scripting:
Develop and maintain scripts to automate repetitive tasks and improve system efficiency.
Implement Infrastructure as Code (IaC) practices.
Security Management:
Implement and maintain security best practices, including firewall management, intrusion detection/prevention, and vulnerability assessments.
Ensure compliance with data privacy regulations and organizational security policies.
Collaboration and Support:
Work closely with data scientists, engineers, and other stakeholders to support their technical needs.
Provide technical support for AI/ML tools and platforms such as TIBCO, Bitbucket, AWS SageMaker, Databricks, and Informatica IDMC.
Participate in the development and maintenance of CI/CD pipelines.
Resource Optimization:
Monitor and optimize resource utilization to ensure cost-effective operations.
Manage and maintain resources and services in the local data center (DC) and in AWS.