Job description:
Team Intro: Within the Seed-Infra-Training team, this sub-team is responsible for ByteDance's large-model training platform. We internally support ByteDance's foundational large-model training and generative AI business, covering pre-training and post-training of language models, multi-modal understanding, video generation, and more. We have built a multi-tenant, multi-cloud heterogeneous GPU computing platform for our customers, providing a suite of stable, efficient, observable, and diagnosable framework and system platform components that help scale large-model training to the 10,000-GPU ("Wanka") level and beyond.
Candidate requirements:
Qualifications:
- Currently enrolled in a BS/MS program; solid grasp of distributed and parallel computing principles and awareness of recent advances in computing, storage, networking, and hardware technologies
- Familiarity with orchestration frameworks such as Kubernetes, Kubeflow, or Volcano
- Proficient in at least one deep learning training or serving stack (e.g., PyTorch, Megatron, DeepSpeed, vLLM)
- Experience with at least one major machine learning framework
Preferred Qualifications:
- Knowledge of fault tolerance and system reliability
- Experience with large-scale training and LLM systems
- Background in AIOps and resource scheduling
- Publications at top systems conferences such as OSDI/SOSP/NSDI/ATC/EuroSys/SysML
| Source: | Company website |
| Published: | 27 Nov 2025 (checked 14 Dec 2025) |
| Offer type: | Internship |
| Sector: | Internet / New Media |
| Languages: | English |