619 Commits

Author SHA1 Message Date
Elham
9843e332da [CPU][Perf] Add fast vectorized exp impl from Arm Optimized Routines (#30068)
Signed-off-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal>
Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal>
2025-12-05 13:09:20 +00:00
Zhang Xiangze
13ea39bc09 [CPU]Parallelize over tokens in int4 moe (#29600)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
2025-12-02 06:21:39 +00:00
Hendrik Holtmann
c0dfc89485 SM120 / NVFP4: add device guard and runtime SM dispatch to cutlass_scaled_fp4_mm (#29711)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-12-01 17:24:18 -08:00
Jinzhen Lin
1656ad3704 [Kernel][Quantization] add w4a8 support for marlin kernel (#24722)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-11-29 07:19:33 -08:00
Li, Jiang
e2f56c309d [CPU] Update torch 2.9.1 for CPU backend (#29664)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-28 13:37:54 +00:00
Jinzhen Lin
a67dec7cba [Bugfix] fix IMA issue in certain cases of the moe marlin kernel (#28619)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-26 19:02:21 -08:00
Pleaplusone
d9d342d214 [Performance][MLA][ROCm] Remove redundant D2D copy in deepseek (#27457)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-26 12:45:28 +08:00
Michael Goin
e502098643 [Kernel] Add NVFP4 MoE CUTLASS support for SM120 (#29242)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-11-25 06:59:07 -08:00
Pleaplusone
77e10c9cab [Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's perf for extremely long sequence (#28029)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-24 19:05:46 -07:00
R3hankhan
4de87866a8 [CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x (#28926)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2025-11-24 12:08:09 +00:00
Fadi Arafeh
730bd35378 [perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON (#29193)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-22 09:04:36 -08:00
Jane (Yuan) Xu
e6309acdba Simplify from_blob usage in get_cuda_view_from_cpu_tensor (#29027)
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-22 10:35:32 +00:00
skaraban3807
f1805db1a6 [Perf] These changes enhance the NUMA functionality of vllm for systems with more than one NUMA nodes per socket (#25559)
Signed-off-by: Siddappa Karabannavar <siddappa.karabannavar@amd.com>
2025-11-21 14:13:52 +00:00
zhrrr
a982f5b5ea [kernel][perf] support uncontiguous input for rms_norm kernel (#28103)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-20 19:39:09 -08:00
Pleaplusone
06c20c9904 [ROCm] Add AMD GPU support on Deepseek v3.2 and SparseMLA (#26670)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-20 02:54:01 -08:00
Vensen
fb8851f254 [Bugfix][cache_kernels]: Fix OOB in cache_kernels.cu (#28760)
Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
2025-11-20 02:52:02 -08:00
Boyuan Feng
a903d59ffa cleanup at::Tag::needs_fixed_stride_order (#28974)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-20 02:51:36 -08:00
j20120307
bbc6c2f1e5 [CI/Build] Fix broken build on Apple M1 (#28999)
Signed-off-by: Kan Zhu <j20120307@gmail.com>
2025-11-19 11:07:22 +00:00
ihb2032
8151609583 refactor(cpu_types_scalar.hpp): Unify scalar loop implementations using unroll_loop (#28847)
Signed-off-by: ihb2032 <1355790728@qq.com>
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
2025-11-19 11:05:44 +00:00
Li, Jiang
20852c8f4c [CPU] Refactor CPU WNA16 (#28826)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-19 10:32:00 +08:00
tiehexue
e42bd8c2e3 Cast return value to int64_t for cache size (#28814)
Signed-off-by: tiehexue <tiehexue@hotmail.com>
2025-11-17 16:02:32 +00:00
Li, Jiang
577bb34fff [CPU][Bugfix] Fix _to_list in CPU model runner (#28824)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-17 07:47:24 +00:00
Jane (Yuan) Xu
74b5267d3a Use narrow over indexing in hadacore_transform to prep for ABI stable (#28756)
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-15 01:10:15 -08:00
Sage Moore
8977ffb5e6 [ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.cu (#28682)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-11-14 11:06:01 -08:00
czhu-cohere
cdd7025961 [kernel] Improve FP8 PTPC on Hopper for larger shapes (#28692)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-11-14 09:59:11 -08:00
Michael Goin
622e6106a9 [CPU][Bugfix] Fix Apple Silicon M1 compilation failure (#28681)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 09:49:55 +08:00
Varun Sundar Rabindranath
fe1cd7704d [Performance][B200] silu_mul_quant: pack scales in int32 (#28358)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-13 10:16:55 -08:00
Jane (Yuan) Xu
06c4873d95 Rewrite C++ meta funcs to Python (#28595)
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-14 00:52:50 +08:00
Akash kaothalkar
86d15bfd8d [Hardware][PowerPC] Fix fp16 compilation error for Power in cpu attention backend and bump oneDNN version (#28535)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
2025-11-13 13:32:21 +00:00
usberkeley
4ab34f6ef1 Add NUMA node validation for CPU thread binding (#28555)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-11-13 07:03:52 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
4ca5cd5740 [Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#12695)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
2025-11-12 15:24:12 -08:00
TJian
edb59a9470 [ROCm] [Bugfix] Fix fused_qknorm_rope_kernel rocm compatibility (#28500)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-12 05:01:14 -08:00
Li, Jiang
7f829be7d3 [CPU] Refactor CPU attention backend (#27954)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-12 09:43:06 +08:00
Xin Yang
6c3c0f8235 [Kernel] Optimize rms_norm kernel (#27931)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-11-11 18:02:23 +00:00
zhrrr
68c09efc37 [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
2025-11-11 12:00:31 -05:00
Michael Goin
f9a4087182 Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (#28431)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 11:46:04 -05:00
Varun Sundar Rabindranath
b039bfda8f [Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-10 09:21:52 -08:00
ElizaWszola
171133f929 [Bugfix] Fix test fused quant layernorm tests (#27865)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-08 14:31:33 -08:00
Michael Goin
0852527647 [Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (#28124)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-07 18:20:55 -08:00
Pavani Majety
72b1c2ae2c [Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-11-07 04:18:39 -08:00
courage17340
981cadb35c [Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty (#28181)
Signed-off-by: courage17340 <courage17340@163.com>
2025-11-06 17:52:13 +08:00
lyrisz
97e3dda84b [Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (#27284)
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Co-authored-by: Faqin Zhong <zhofaqin@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-04 07:49:25 -08:00
xiangze-arm
f32cbc9a0c [CPU]Improve dynamic 4bit moe performance (#27240)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
2025-11-04 06:33:23 +00:00
gnovack
294c805f1d Early exit for MoE LoRA kernels (#27131)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-03 20:22:17 +08:00
Asaf Joseph Gardin
00b31a36a2 [V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377)
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-11-02 04:16:23 -08:00
Fadi Arafeh
2080b05099 [cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (#27472)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-10-24 15:57:48 +00:00
Xiangyu Li
5cc6bddb6e [Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (#26092) 2025-10-23 23:26:13 -04:00
gnovack
8e4ca4d14e Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' (#27311)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-22 12:23:57 +00:00
Lain
09a7e6f617 [Deepseek v3.2] Remove extra logics in indexer (#26465)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Lain <siyuanf@nvidia.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 23:34:03 +00:00
Daniel Cámpora
80e9452984 [Deepseek v3.2] Optimize top_k_per_row (#26763)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 08:30:07 +00:00