Research Scientist - AI Infrastructure PhD 2026 San Jose, CA


Job Details

Research Scientist - AI Infrastructure PhD 2026

ByteDance, San Jose, CA

Days Posted: 19 days

Experience Level: Not Specified

Employment Type: Not Specified

Pay Range: Not Specified

We are looking for talented individuals to join our team in 2026. As a graduate, you will get opportunities to pursue bold ideas, tackle complex challenges, and unlock limitless growth. Launch your career where inspiration is infinite at ByteDance.

Successful candidates must be able to commit to an onboarding date by end of year 2026. Please state your availability and graduation date clearly in your resume.

On the AI Infra Team, you'll be immersed in the robust and scalable infrastructure that powers our cutting-edge artificial intelligence (AI) and machine learning (ML) initiatives. You will work closely with our AI/ML researchers, data scientists, and software engineers to create an efficient, high-performance environment for training, inference, and data processing. Your expertise will be critical in enabling the next generation of AI-driven products and services.

Responsibilities

The ideal candidate should be an expert in at least one of the following fields to define and design the next-gen AI Infrastructure:

Infrastructure Design & Architecture
- Lead end-to-end design of scalable, reliable AI infrastructure (AI accelerators, compute clusters, storage, networking) for training and serving large ML workloads.
- Define and implement service-oriented, containerized architectures (Kubernetes, VM frameworks, unikernels) optimized for ML performance and security.

Performance Optimization
- Profile and optimize every layer of the ML stack: ML compiler, GPU/TPU scheduling, NCCL/RDMA networking, data preprocessing, and training/inference frameworks.
- Develop low-overhead telemetry and benchmarking frameworks to identify and eliminate bottlenecks in distributed training and serving.

Distributed Systems & Scalability
- Build and operate large-scale deployment and orchestration systems that auto-scale across multiple data centers (on-premises and cloud).
- Champion fault-tolerance, high availability, and cost-efficiency through smart resource management and workload placement.

Data Pipeline & Workflow Engineering
- Architect and implement robust ETL and data ingestion pipelines (Spark/Beam/Dask/Flume) tailored for petabyte-scale ML datasets.
- Integrate experiment management and workflow orchestration tools (Airflow, Kubeflow, Metaflow) to streamline research-to-production.

Collaboration & Mentorship
- Partner with ML researchers to translate prototype requirements into production-grade systems.
- Mentor and coach engineers on best practices in performance tuning, systems design, and reliability engineering.
