Elham
|
9843e332da
|
[CPU][Perf] Add fast vectorized exp impl from Arm Optimized Routines (#30068)
Signed-off-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal>
Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com>
Co-authored-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal>
|
2025-12-05 13:09:20 +00:00 |
|
Zhang Xiangze
|
13ea39bc09
|
[CPU]Parallelize over tokens in int4 moe (#29600)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
|
2025-12-02 06:21:39 +00:00 |
|
Hendrik Holtmann
|
c0dfc89485
|
SM120 / NVFP4: add device guard and runtime SM dispatch to cutlass_scaled_fp4_mm (#29711)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
|
2025-12-01 17:24:18 -08:00 |
|
Jinzhen Lin
|
1656ad3704
|
[Kernel][Quantization] add w4a8 support for marlin kernel (#24722)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
|
2025-11-29 07:19:33 -08:00 |
|
Li, Jiang
|
e2f56c309d
|
[CPU] Update torch 2.9.1 for CPU backend (#29664)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-11-28 13:37:54 +00:00 |
|
Jinzhen Lin
|
a67dec7cba
|
[Bugfix] fix IMA issue in certain cases of the moe marlin kernel (#28619)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-11-26 19:02:21 -08:00 |
|
Pleaplusone
|
d9d342d214
|
[Performance][MLA][ROCm] Remove redundant D2D copy in deepseek (#27457)
Signed-off-by: ganyi <ygan@amd.com>
|
2025-11-26 12:45:28 +08:00 |
|
Michael Goin
|
e502098643
|
[Kernel] Add NVFP4 MoE CUTLASS support for SM120 (#29242)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
|
2025-11-25 06:59:07 -08:00 |
|
Pleaplusone
|
77e10c9cab
|
[Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's perf for extremely long sequence (#28029)
Signed-off-by: ganyi <ygan@amd.com>
|
2025-11-24 19:05:46 -07:00 |
|
R3hankhan
|
4de87866a8
|
[CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x (#28926)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
|
2025-11-24 12:08:09 +00:00 |
|
Fadi Arafeh
|
730bd35378
|
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON (#29193)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
|
2025-11-22 09:04:36 -08:00 |
|
Jane (Yuan) Xu
|
e6309acdba
|
Simplify from_blob usage in get_cuda_view_from_cpu_tensor (#29027)
Signed-off-by: Jane Xu <janeyx@meta.com>
|
2025-11-22 10:35:32 +00:00 |
|
skaraban3807
|
f1805db1a6
|
[Perf] These changes enhance the NUMA functionality of vllm for systems with more than one NUMA nodes per socket (#25559)
Signed-off-by: Siddappa Karabannavar <siddappa.karabannavar@amd.com>
|
2025-11-21 14:13:52 +00:00 |
|
zhrrr
|
a982f5b5ea
|
[kernel][perf] support uncontiguous input for rms_norm kernel (#28103)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-11-20 19:39:09 -08:00 |
|
Pleaplusone
|
06c20c9904
|
[ROCm] Add AMD GPU support on Deepseek v3.2 and SparseMLA (#26670)
Signed-off-by: ganyi <ygan@amd.com>
|
2025-11-20 02:54:01 -08:00 |
|
Vensen
|
fb8851f254
|
[Bugfix][cache_kernels]: Fix OOB in cache_kernels.cu (#28760)
Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
|
2025-11-20 02:52:02 -08:00 |
|
Boyuan Feng
|
a903d59ffa
|
cleanup at::Tag::needs_fixed_stride_order (#28974)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2025-11-20 02:51:36 -08:00 |
|
j20120307
|
bbc6c2f1e5
|
[CI/Build] Fix broken build on Apple M1 (#28999)
Signed-off-by: Kan Zhu <j20120307@gmail.com>
|
2025-11-19 11:07:22 +00:00 |
|
ihb2032
|
8151609583
|
refactor(cpu_types_scalar.hpp): Unify scalar loop implementations using unroll_loop (#28847)
Signed-off-by: ihb2032 <1355790728@qq.com>
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
|
2025-11-19 11:05:44 +00:00 |
|
Li, Jiang
|
20852c8f4c
|
[CPU] Refactor CPU WNA16 (#28826)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-11-19 10:32:00 +08:00 |
|
tiehexue
|
e42bd8c2e3
|
Cast return value to int64_t for cache size (#28814)
Signed-off-by: tiehexue <tiehexue@hotmail.com>
|
2025-11-17 16:02:32 +00:00 |
|
Li, Jiang
|
577bb34fff
|
[CPU][Bugfix] Fix _to_list in CPU model runner (#28824)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-11-17 07:47:24 +00:00 |
|
Jane (Yuan) Xu
|
74b5267d3a
|
Use narrow over indexing in hadacore_transform to prep for ABI stable (#28756)
Signed-off-by: Jane Xu <janeyx@meta.com>
|
2025-11-15 01:10:15 -08:00 |
|
Sage Moore
|
8977ffb5e6
|
[ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.cu (#28682)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
|
2025-11-14 11:06:01 -08:00 |
|
czhu-cohere
|
cdd7025961
|
[kernel] Improve FP8 PTPC on Hopper for larger shapes (#28692)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
|
2025-11-14 09:59:11 -08:00 |
|
Michael Goin
|
622e6106a9
|
[CPU][Bugfix] Fix Apple Silicon M1 compilation failure (#28681)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-11-14 09:49:55 +08:00 |
|
Varun Sundar Rabindranath
|
fe1cd7704d
|
[Performance][B200] silu_mul_quant: pack scales in int32 (#28358)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
|
2025-11-13 10:16:55 -08:00 |
|
Jane (Yuan) Xu
|
06c4873d95
|
Rewrite C++ meta funcs to Python (#28595)
Signed-off-by: Jane Xu <janeyx@meta.com>
|
2025-11-14 00:52:50 +08:00 |
|
Akash kaothalkar
|
86d15bfd8d
|
[Hardware][PowerPC] Fix fp16 compilation error for Power in cpu attention backend and bump oneDNN version (#28535)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
|
2025-11-13 13:32:21 +00:00 |
|
usberkeley
|
4ab34f6ef1
|
Add NUMA node validation for CPU thread binding (#28555)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
|
2025-11-13 07:03:52 +00:00 |
|
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
|
4ca5cd5740
|
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#12695)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
|
2025-11-12 15:24:12 -08:00 |
|
TJian
|
edb59a9470
|
[ROCm] [Bugfix] Fix fused_qknorm_rope_kernel rocm compatibility (#28500)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2025-11-12 05:01:14 -08:00 |
|
Li, Jiang
|
7f829be7d3
|
[CPU] Refactor CPU attention backend (#27954)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2025-11-12 09:43:06 +08:00 |
|
Xin Yang
|
6c3c0f8235
|
[Kernel] Optimize rms_norm kernel (#27931)
Signed-off-by: Xin Yang <xyangx@amazon.com>
|
2025-11-11 18:02:23 +00:00 |
|
zhrrr
|
68c09efc37
|
[Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
|
2025-11-11 12:00:31 -05:00 |
|
Michael Goin
|
f9a4087182
|
Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (#28431)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-11-11 11:46:04 -05:00 |
|
Varun Sundar Rabindranath
|
b039bfda8f
|
[Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
|
2025-11-10 09:21:52 -08:00 |
|
ElizaWszola
|
171133f929
|
[Bugfix] Fix test fused quant layernorm tests (#27865)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-11-08 14:31:33 -08:00 |
|
Michael Goin
|
0852527647
|
[Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (#28124)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-11-07 18:20:55 -08:00 |
|
Pavani Majety
|
72b1c2ae2c
|
[Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
|
2025-11-07 04:18:39 -08:00 |
|
courage17340
|
981cadb35c
|
[Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty (#28181)
Signed-off-by: courage17340 <courage17340@163.com>
|
2025-11-06 17:52:13 +08:00 |
|
lyrisz
|
97e3dda84b
|
[Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (#27284)
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Co-authored-by: Faqin Zhong <zhofaqin@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-11-04 07:49:25 -08:00 |
|
xiangze-arm
|
f32cbc9a0c
|
[CPU]Improve dynamic 4bit moe performance (#27240)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
|
2025-11-04 06:33:23 +00:00 |
|
gnovack
|
294c805f1d
|
Early exit for MoE LoRA kernels (#27131)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-11-03 20:22:17 +08:00 |
|
Asaf Joseph Gardin
|
00b31a36a2
|
[V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377)
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
|
2025-11-02 04:16:23 -08:00 |
|
Fadi Arafeh
|
2080b05099
|
[cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (#27472)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
|
2025-10-24 15:57:48 +00:00 |
|
Xiangyu Li
|
5cc6bddb6e
|
[Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (#26092)
|
2025-10-23 23:26:13 -04:00 |
|
gnovack
|
8e4ca4d14e
|
Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' (#27311)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-10-22 12:23:57 +00:00 |
|
Lain
|
09a7e6f617
|
[Deepseek v3.2] Remove extra logics in indexer (#26465)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Lain <siyuanf@nvidia.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
|
2025-10-21 23:34:03 +00:00 |
|
Daniel Cámpora
|
80e9452984
|
[Deepseek v3.2] Optimize top_k_per_row (#26763)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
|
2025-10-21 08:30:07 +00:00 |
|