(at Grand Canyon, Arizona, US)

Xikai Meng   孟西恺

I'm currently a Master of Engineering student in Computer Engineering at the University of Washington, Seattle (Sep 2025 – Dec 2026). Before that, I received my B.S. in Applied Physics from the Honors School of Harbin Institute of Technology (2020 – 2024).

I have ~2 years of ML Systems / AI Infra experience across AMD, SenseTime and Baichuan AI, with contributions to open-source LLM inference stacks (SGLang, Quark, OpenPPL).

Research Interests: LLM Inference, Speculative Decoding, Quantization, Heterogeneous Computing
Job Interests: ML Systems Engineer, LLM Infra / Inference, AI Compiler & Serving

+1 (206) 797-8906
mengnoah78952@gmail.com  |  xikaim@uw.edu (UW)

GitHub (NoahM)
LinkedIn
CV (PDF)

News


Education

University of Washington, Seattle, US (Sep 2025 – Dec 2026)

M.Eng. in Computer Engineering
  • Focus: ML Systems, LLM Inference, GPU Computing
  • Coursework (planned / in progress): Parallel Computing, Advanced Computer Architecture, Deep Learning Systems

Harbin Institute of Technology, Harbin, China (Sep 2020 – Jun 2024)

B.S. in Applied Physics, Honors School of HIT (minor in Information Engineering)
  • Overall GPA: 3.52/4.0
  • Prize in the Mathematical Contest in Modeling
  • Gold Medal in Tsinghua X-institute AI Summer Camp Competition (ranked 1/24 across all groups)
  • People's Scholarship (ranked 11/71 among students in the department)
  • IELTS 7.0 (all sub-scores ≥ 6.5)

UC San Diego, San Diego, US (Sep 2022 – Mar 2023)

Visiting Student, Department of Computer Science & Engineering
  • Core CS coursework:
    CSE12 Data Structures & OOD (A), CSE120 Operating Systems, CSE151 Deep Learning, CSE123 Computer Networks
  • Other courses:
    PHYS140 Thermodynamics & Statistical Physics, PHYS161 General Relativity and Black Holes (A), PHIL The Meaning of Life
  • Exchange life [pdf]

Professional Experience

Advanced Micro Devices (AMD), Beijing, China (Sep 2024 – Sep 2025)

Machine Learning Engineer, Model Compression Team
  • Developed neural-network quantization algorithms (AWQ, QuaRot, DuQuant) on AMD Ryzen AI; researched low-bit attention for inference acceleration.
  • Optimized the ONNX path of AMD's Quark Model Optimizer via W8A8 auto-search to improve accuracy/throughput trade-offs.
  • Initiated and led SD-HC, a speculative-decoding system co-scheduled across CPU / NPU / GPU on AIPC platforms.

SenseTime, Beijing, China (Jun 2024 – Sep 2024)

HPC / AI System Research Intern, LLM Team
  • Tuned the performance of OpenPPL, SenseTime's in-house inference framework.
  • Optimized CUDA operators to accelerate edge-device LLM deployment; explored additional serving optimization techniques.
  • Authored internal write-ups analyzing DeepSeek architectural choices and compute characteristics.

Baichuan AI, Beijing, China (Sep 2023 – Mar 2024)

AI Infra Intern, LLM Team
  • Helped design the inference platform architecture on top of Text Generation Inference, FasterTransformer, and vLLM.
  • Worked on speculative decoding with tree attention; our method (Clover) outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, exceeding Medusa by up to 37% and 57%, respectively.
  • Participated in model inference cost assessment and built statistical prediction models.

Publications

Selected Projects

Eagle-3 Quantization on SGLang — with University of Waterloo, Aug 2025 – Present

  • Reproduced and optimized the Eagle-3 speculative-decoding framework with quantized draft / target models, analyzing compute complexity and KV-cache footprint across Prefill and Decode stages.
  • Benchmarked INT8 and 4-bit AWQ / NF4 / GPTQ; identified AWQ on both base and draft models as the best accuracy / memory trade-off.

SD-HC: Heterogeneous Functional Pipelining for Speculative LLM Decoding on AI PCs — AMD, Feb 2025 – Aug 2025  |  code (soon)

SD-HC system diagram

  • Goal: fast and accurate edge LLM inference on AI PCs. Built a functional pipeline across iGPU + NPU + CPU on AMD Ryzen AI 9 HX370.
  • Functional placement: FP16 target on NPU (large shared memory, longer context), INT4 quantized draft on iGPU (concurrent draft execution), rejection sampling on CPU (less compute / data transfer, reduces NPU idle time).
  • Two contributions: (1) heterogeneous speculative sampling with quantization-error mitigation; (2) dual-pipeline architecture with error-handling mechanisms.
  • Results: time to first token (TTFT) down to ~49% of NPU-FP16 and ~22% of NPU-INT4; throughput ~7.6× over NPU-FP16 and ~2.3× over iGPU-INT4 at comparable accuracy; scales super-linearly with prompt length and sustains >5 tok/s at 2048 input tokens.

Open-source contributions to LLM inference stacks

  • SGLang — Eagle-3 quantization integration & benchmarks.
  • AMD Quark Model Optimizer — W8A8 auto-search on ONNX.
  • SenseTime OpenPPL — CUDA-operator tuning for edge-device LLM serving.

Systems & algorithms practice projects

  • C++ thread pool, mini-Redis, JSON parser & query engine.
  • GNN-based recommendation system research [slides].

Blog

Skills

Computer Science Learning

Self-studied a CS-major curriculum alongside my Physics degree: Stanford CS106L (C++), Harvard CS50, UCSD CSE120 (OS), MIT 6.S081, CMU 15-445 (DB), MIT 6.824 (Distributed), Stanford CS144 (Networks), UCB CS61B, and Andrew Ng's Deep Learning Specialization.

Miscellaneous