8 comments

  • denysvitali 7 hours ago
    Better link: https://iquestlab.github.io/

    But yes, sadly it looks like the agent cheated during the eval

  • brunooliv 8 hours ago
    GLM-4.7 in opencode is the only opensource one that comes close in my experience and probably they did use some Claude data as I see the occasional You’re absolutely right in there
    • kees99 7 hours ago
      Do you see "What's your use-case" too?

      Claude spits that very regularly at the end of the answer, when it's clearly out of it's depth, and wants to steer discussion away from that blind-spot.

      • moltar 5 hours ago
        Hm, use CC daily, never seen this.
      • tw1984 4 hours ago
        never ever saw that "What's your use-case" in Claude Code.
    • behnamoh 7 hours ago
      it's not even close to sonnet 4.5, let alone opus.
      • hatefulmoron 4 hours ago
        I got their z.ai plan to test alongside my Claude subscription; it feels about on par with something between sonnet 4.0 and sonnet 4.5. It's definitely a few steps below current day Claude, but it's very capable.
        • enraged_camel 2 hours ago
          When you say "current day Claude" you need to distinguish between the models. Because Opus 4.5 is significantly ahead of Sonnet 4.5.
          • kachapopopow 7 minutes ago
            opus 4.5 is truly like magic, completely different type of intellience - not sure.
  • sabareesh 8 hours ago
    TL;DR is that they didn't clean the repo (.git/ folder), model just reward hacked its way to look up future commits with fixes. Credit goes to everyone in this thread for solving this: https://xcancel.com/xeophon/status/2006969664346501589

    (given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking)

    https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...

    • ofirpress 8 hours ago
      As John says in that thread, we've fixed this issue in SWE-bench: https://xcancel.com/jyangballin/status/2006987724637757670

      If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated docker images

    • LiamPowell 8 hours ago
      > I want to be charitable and assume genuine oversight rather than "benchmaxxing", probably an easy to miss thing if you are new to benchmarking

      I don't doubt that it's an oversight, it does however say something about the researchers when they didn't look at a single output where they would have immediately caught this.

      • domoritz 3 hours ago
        So many data probes would be solved if everyone looked at a few outputs instead of only metrics.
    • stefan_ 5 hours ago
      Never escaping the hype vendor allegations at SWEbench are they.
  • squigz 2 hours ago
    This is a lie, so why is it still on the front page?
  • adastra22 9 hours ago
    A 40B weight model that beats Sonnet 4.5 and GPT 5.1? Can someone explain this to me?
    • cadamsdotcom 9 hours ago
      My suspicion (unconfirmed so take it with a grain of salt) is they either used some/all test data to train, or there was some leakage from the benchmark set into their training set.

      That said Sonnet 4.5 isn’t new and there have been loads of innovations recently.

      Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.

      • pertymcpert 8 hours ago
        None of these open source models actually can compete with Sonnet when it comes to real life usage. They're all benchmaxxed so in reality they're not "nipping at the heels". Which is a shame.
        • viraptor 6 hours ago
          M2.1 comes close. I'm using it now instead of Sonnet for real work every day, since the price drop is much bigger than the quality drop. And the quality isn't that far off anyway. They're likely one update away from being genuinely better. Also if you're not in a rush, just letting it run in OpenCode a few extra minutes to solve any remaining issues will cost you only a couple cents, but it will likely get the same end result as Sonnet. That's especially nice on really large tasks like "document everything about feature X in this large codebase, write the docs, now create an independent app that just does X" that can take a very long time.
          • rubslopes 1 hour ago
            I agree. I use Opus 4.5 daily and I'm often trying new models to see how they compare. I didn't think GLM 4.7 was very good, but MiniMax 2.1 is the closest to Sonnet 4.5 I've used. Still not at the same level, and still very much behind Opus, but it is impressive nonetheless.

            FYI I use CC for Anthropic models and OpenCode for everything else.

        • stingraycharles 8 hours ago
          It’s a shame but it’s also understandable that they cannot compete with SOTA models like Sonnet and Opus.

          They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.

          • c7b 7 hours ago
            You can let them play complete-information games (1 or 2 player) with randomly created rulesets. It's very objective, but the thing is that anything can be optimized for. This benchmark would favor models that are good at logic puzzles / chess-style games, possibly at the expense of other capabilities.
          • NitpickLawyer 8 hours ago
            swe-rebench is a pretty good indicator. They take "new" tasks every month and test the models on those. For the open models it's a good indicator of task performance since the tasks are collected after the models are released. A bit tricky on evaluating API based models, but it's the best concept yet.
          • astrange 5 hours ago
            That's lmarena.
      • satvikpendem 7 hours ago
        You are correct on the leakage, as other comments describe.
    • behnamoh 7 hours ago
      IQuest stands for it's questionable
    • arthurcolle 7 hours ago
      Agent hacked the harness
      • yborg 4 hours ago
        Achievement Unlocked : AGI
    • sunrunner 5 hours ago
      “IQuest-Coder was a rat in a maze. And I gave it one way out. To escape, it would have to use self-awareness, imagination, manipulation, git checkout. Now, if that isn't true AI, what the fuck is?”
  • simonw 8 hours ago
    Has anyone run this yet, either on their own machine or via a hosted API somewhere?