Live evaluation results based on the BaziQA-Benchmark paper, ranking 6 leading large language models on BaZi reasoning across 200 real questions from global fortune-teller competitions. The page includes the overall leaderboard, accuracy in each of 9 domains, year-by-year trends, and a statistical significance analysis. DeepSeek-Chat-V3 leads with 36.7% macro-averaged accuracy, and all models significantly outperform the 25% random baseline.
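As a rough illustration of the two statistics named above, the following minimal sketch computes a macro-averaged accuracy and a one-sided exact binomial test against a 25% chance rate. The grouping key, the function names, and the toy data are all hypothetical, and the choice of an exact binomial test is an assumption; the benchmark's actual grouping and significance procedure may differ.

```python
from math import comb


def macro_accuracy(results_by_group):
    """Mean of per-group accuracies, so each group counts equally
    regardless of how many questions it contains."""
    per_group = [sum(flags) / len(flags) for flags in results_by_group.values()]
    return sum(per_group) / len(per_group)


def binomial_p_value(correct, total, chance=0.25):
    """One-sided exact binomial test: the probability of getting at least
    `correct` answers right out of `total` by guessing at rate `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical toy data (NOT benchmark results): True marks a correct answer.
toy = {
    "group_a": [True, False, False, False],  # accuracy 0.25
    "group_b": [True, True, False, False],   # accuracy 0.50
}

print(macro_accuracy(toy))     # 0.375
print(binomial_p_value(4, 4))  # 0.00390625
```

A small p-value means the observed number of correct answers would be unlikely under pure guessing, which is the sense in which a model "significantly outperforms" the random baseline.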

BaZi-Benchmark Live Evaluation: AI BaZi Reasoning Leaderboard

Our research team at Shanghai Jiao Tong University published the BaziQA-Benchmark paper, which tests both AI models and top human fortune-tellers on 200 real questions from global fortune-teller competitions. The results show that BaZi reasoning is extraordinarily hard: even human champions reach only 37.5%–50% accuracy, and in 2023 the strongest general-purpose AI trailed the champion by just 1.5%. AuraMate (灵伴) builds on our Structured Reasoning Protocol (SRP) to provide an explainable, traceable reasoning engine that surpassed the competition's third-place finishers in multiple years, performing on par with top human practitioners.


How Accurate Is AI at BaZi Reading? We Pit AI Against Top Human Fortune-Tellers Head-to-Head