Optiver
Shanghai, China
Key Responsibilities
Build the compute platform and machine learning libraries for large-scale machine learning and simulation workloads
Focus on compute platform stability and efficiency on both CPU and GPU clusters, making the platform observable and scalable
Use cluster monitoring and profiling tools to identify bottlenecks and optimize both infrastructure and software systems
Troubleshoot and resolve issues related to OS, storage, network, and GPUs
Challenges You Will Tackle: Design, build and improve our compute platform for model training and simulation on petabyte-scale data, across a wide range of machine learning models, by leveraging our existing research infrastructure.
Requirements:
Solid experience running production machine learning infrastructure at large scale
Experience in designing, deploying, profiling and troubleshooting in Linux-based computing environments
Proficiency in containerization, parallel computing and distributed training...