Xikai Meng's Personal Website

News

[2025/09] Started my M.Eng. in Computer Engineering at the University of Washington, Seattle.

[2025/08] Started research collaboration with University of Waterloo on Eagle-3 + Quantization within the SGLang framework.

[2025/02] Started the SD-HC project at AMD: speculative decoding on heterogeneous CPU/NPU/GPU AI PC platforms.

[2024/09] Joined AMD as a Machine Learning Engineer in the Model Compression Team (Beijing).

[2024/06] Graduated from the Honors School of Harbin Institute of Technology with a B.S. in Applied Physics.

[2024/06] Joined SenseTime as an HPC/AI System Research Intern in the LLM Team.

[2023/09] Joined Baichuan AI as an AI Infra Intern; contributed to the Clover speculative-decoding work.

Education

University of Washington, Seattle, US (Sep 2025 – Dec 2026)

M.Eng. in Computer Engineering

Focus: ML Systems, LLM Inference, GPU Computing
Coursework (planned / in progress): Parallel Computing, Advanced Computer Architecture, Deep Learning Systems

Harbin Institute of Technology, Harbin, China (Sep 2020 – Jun 2024)

B.S. in Applied Physics, Honors School of HIT (minor in Information Engineering)

Overall GPA: 3.52/4.0
Prize in the Mathematical Contest in Modeling (MCM)
Gold Medal in the Tsinghua X-institute AI Summer Camp Competition (1/24 across all groups)
People's Scholarship (ranked 11/71 among students in the department)
IELTS 7.0 (all sub-scores ≥ 6.5)

UC San Diego, San Diego, US (Sep 2022 – Mar 2023)

Visiting Student, Department of Computer Science & Engineering

Core CS coursework:
CSE12 Data Structures & OOD (A), CSE120 Operating Systems, CSE151 Deep Learning, CSE123 Computer Networks
Other courses:
PHYS140 Thermodynamics & Statistical Physics, PHYS161 General Relativity and Black Holes (A), PHIL The Meaning of Life
Exchange life [pdf]

Professional Experience

Advanced Micro Devices (AMD), Beijing, China (Sep 2024 – Sep 2025)

Machine Learning Engineer, Model Compression Team

Developed neural-network quantization algorithms (AWQ, QuaRot, DuQuant) on AMD Ryzen AI; researched low-bit attention for inference acceleration.
Optimized the ONNX path of AMD's Quark Model Optimizer via W8A8 auto-search to improve accuracy/throughput trade-offs.
Initiated and led SD-HC, a speculative-decoding system co-scheduled across CPU / NPU / GPU on AI PC platforms.

SenseTime, Beijing, China (Jun 2024 – Sep 2024)

HPC / AI System Research Intern, LLM Team

Tuned the performance of OpenPPL, SenseTime's in-house inference framework.
Optimized CUDA operators to accelerate edge-device LLM deployment; explored additional serving optimization techniques.
Authored internal write-ups analyzing DeepSeek architectural choices and compute characteristics.

Baichuan AI, Beijing, China (Sep 2023 – Mar 2024)

AI Infra Intern, LLM Team

Helped design the inference platform architecture on top of Text Generation Inference, FasterTransformer, and vLLM.
Worked on speculative decoding with tree attention; our method (Clover) outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and exceeds Medusa by up to 37% and 57%, respectively.
Participated in model inference cost assessment and built statistical prediction models.

Publications

Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge.
Baichuan AI authors incl. X. Meng et al., 2024. [paper]

Eagle-3 Quantization for Speculative Decoding. In submission, 2025.

SD-HC: Heterogeneous Functional Pipelining for Speculative LLM Decoding on AI PCs.
X. Meng, S. Fu, Z. Li, W. Wang, C. Li, S. Tiwari, P. Zheng. In submission, 2025. [code (coming soon)]

Y. Li, S. Yuan, X. Meng, Y. Wang, J. Li, "Security Research of Intelligent Image Recognition System," Tsinghua Summer Conference on Communications and Networking, 2021.

Selected Projects

Eagle-3 Quantization on SGLang — with University of Waterloo, Aug 2025 – Present

Reproduced and optimized the Eagle-3 speculative-decoding framework with quantized draft / target models, analyzing compute complexity and KV-cache footprint across Prefill and Decode stages.
Benchmarked INT8 and 4-bit (AWQ / NF4 / GPTQ) quantization; identified AWQ on both target and draft models as the best accuracy / memory trade-off.

SD-HC: Heterogeneous Functional Pipelining for Speculative LLM Decoding on AI PCs — AMD, Feb 2025 – Aug 2025 | code (soon)

[Figure: SD-HC system diagram]

Goal: fast and accurate edge LLM inference on AI PCs. Built a functional pipeline across iGPU + NPU + CPU on AMD Ryzen AI 9 HX370.
Functional placement: FP16 target on NPU (large shared memory, longer context), INT4 quantized draft on iGPU (concurrent draft execution), rejection sampling on CPU (less compute / data transfer, reduces NPU idle time).
Two contributions: (1) heterogeneous speculative sampling with quantization-error mitigation; (2) dual-pipeline architecture with error-handling mechanisms.
Results: TTFT down to ~49% of NPU-FP16 and ~22% of NPU-INT4; throughput ~7.6× over NPU-FP16 and ~2.3× over iGPU-INT4 at comparable accuracy; scales super-linearly with prompt length, sustains >5 tok/s at 2048 input tokens.
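For readers unfamiliar with the rejection-sampling step that SD-HC places on the CPU, here is a minimal sketch of the standard speculative-sampling accept/reject rule. It is an illustration under stated assumptions, not the SD-HC implementation: the function name `speculative_accept` and the dict-based probability inputs are hypothetical, and the quantization-error mitigation mentioned above is not modeled.

```python
import random


def speculative_accept(token, draft_probs, target_probs, rng=random):
    """One accept/reject step of standard speculative sampling.

    token        -- the token proposed by the draft model
    draft_probs  -- dict token -> probability under the draft model (hypothetical input)
    target_probs -- dict token -> probability under the target model (hypothetical input)
    Returns (accepted, final_token).
    """
    q = draft_probs.get(token, 0.0)
    p = target_probs.get(token, 0.0)

    # Accept the drafted token with probability min(1, p / q).
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return True, token

    # Rejected: resample from the residual distribution max(0, p - q), renormalized.
    residual = {t: max(0.0, pt - draft_probs.get(t, 0.0))
                for t, pt in target_probs.items()}
    total = sum(residual.values())
    if total <= 0.0:
        # Degenerate case (distributions identical): fall back to the target itself.
        residual = dict(target_probs)
        total = sum(residual.values())

    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]


# Usage: the draft proposes "a"; the target distribution decides whether it is kept.
accepted, tok = speculative_accept("a", {"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4})
```

This step only needs the two probability vectors for the drafted positions, so it involves little compute and little data movement, which is consistent with the stated rationale for running it on the CPU.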

Open-source contributions to LLM inference stacks

SGLang — Eagle-3 quantization integration & benchmarks.
AMD Quark Model Optimizer — W8A8 auto-search on ONNX.
SenseTime OpenPPL — CUDA-operator tuning for edge-device LLM serving.

Systems & algorithms practice projects

C++ thread pool, mini-Redis, JSON parser & query engine.
GNN-based recommendation system research [slides].

Skills

Languages: Python, C/C++, CUDA, Verilog, Bash, SQL, Java, JavaScript, MATLAB, R, HTML/CSS

ML Frameworks: PyTorch, TensorFlow, ONNX

LLM Infra: SGLang, vLLM, TGI, FasterTransformer, MLC-LLM, OpenPPL, Quark

Tools: Git, LaTeX, Flask, Bootstrap

Spoken Languages: Mandarin (native), English (IELTS 7.0, proficient)

Miscellaneous

I enjoy reading widely and am good at summarizing.

I have a strong, sustained passion for technology, especially LLM systems.

I also love traveling, camping, and being close to nature.

Friends (random order)

THU: Guanxing Lu
CMU: Hanshi Sun

Xikai Meng 孟西恺