
HumanEval

A benchmark of 164 hand-written Python programming problems, introduced by OpenAI in "Evaluating Large Language Models Trained on Code" (2021). Each problem gives a function signature and docstring, plus hidden unit tests; the score is the percentage of problems for which the model's generated code passes all tests (commonly reported as pass@1). It is the de facto benchmark for code-generation quality. DeepSeek V4 (92.1) and DeepSeek Coder V3 (89.4) currently top open models.
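The scoring loop can be sketched in a few lines: execute each generated candidate, run the problem's unit tests against it, and report the fraction that pass. The two problems below are made-up illustrations, not real HumanEval items, and a real harness would sandbox execution with timeouts.

```python
# Minimal sketch of HumanEval-style scoring. The problems and model
# "candidates" here are invented examples, not actual benchmark items.

problems = [
    {
        "candidate": "def add(a, b):\n    return a + b\n",       # correct output
        "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        "candidate": "def is_even(n):\n    return n % 2 == 1\n",  # buggy output
        "test": "assert is_even(4)\nassert not is_even(7)\n",
    },
]

def passes(candidate: str, test: str) -> bool:
    """Run the candidate, then its unit tests; True only if nothing raises."""
    env: dict = {}
    try:
        exec(candidate, env)  # a real harness sandboxes this with a timeout
        exec(test, env)
        return True
    except Exception:
        return False

score = 100 * sum(passes(p["candidate"], p["test"]) for p in problems) / len(problems)
print(f"pass rate: {score:.1f}%")  # one of the two candidates passes -> 50.0%
```

Benchmarks like pass@k generalize this by sampling k candidates per problem and counting a problem as solved if any sample passes.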

Related models