mikeknoop 3 days ago | next |

Context: ARC Prize 2024 just wrapped up yesterday. ARC Prize's goal is to be a north star towards AGI. The two major categories of this year's progress seem to fall into "program synthesis" and "test-time fine-tuning". Both of these techniques were adopted by DeepMind's impressive AlphaProof system [1]. And I'm personally excited to finally see actual code implementations of these ideas [2]!
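
For anyone unfamiliar, here's a minimal sketch of the test-time fine-tuning idea (toy model, encoding, and names are my own; real systems like [2] fine-tune an LLM, typically with LoRA, on each task's demonstration pairs before predicting the test grid):

  import copy
  import torch
  import torch.nn as nn

  def test_time_finetune(base_model, demos, test_input, steps=50, lr=1e-4):
      # Clone the shared model so the base weights stay frozen across tasks.
      model = copy.deepcopy(base_model)
      opt = torch.optim.AdamW(model.parameters(), lr=lr)
      loss_fn = nn.MSELoss()  # stand-in for the token-level loss an LLM would use
      model.train()
      for _ in range(steps):
          for x, y in demos:  # this task's few input/output grid pairs
              opt.zero_grad()
              loss_fn(model(x), y).backward()
              opt.step()
      model.eval()
      with torch.no_grad():
          return model(test_input)  # prediction specialized to this one task

  # Toy usage: 3x3 grids flattened to vectors, a tiny MLP standing in for the LLM.
  base = nn.Sequential(nn.Linear(9, 64), nn.ReLU(), nn.Linear(64, 9))
  demos = [(torch.randn(9), torch.randn(9)) for _ in range(3)]
  prediction = test_time_finetune(base, demos, torch.randn(9))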

We still have a long way to go for the grand prize -- we'll be back next year. Also got some new stuff in the works for 2025.

Watch for the official ARC Prize 2024 paper coming Dec 6. In it we'll survey all the new AI reasoning code and approaches open-sourced via the competition [3].

[1] https://deepmind.google/discover/blog/ai-solves-imo-problems...

[2] https://github.com/ekinakyurek/marc

[3] https://x.com/arcprize

aithrowawaycomm 3 days ago | root | parent |

I am a bit uncertain about the rules of the ARC-AGI contest, but would this program count? A good chunk of the logic of ARC is essentially hardcoded, including a Python function that checks whether or not the proposed solution makes sense.

The point of the contest is to measure intelligence in general-purpose AI systems: it does not seem in the spirit of the contest that this AI would completely fail if the test were presented on a hexagonal grid.

0x1064 3 days ago | root | parent | next |

The point of the contest is to measure an algorithm's ability to solve ARC problems specifically; no one believes that it's general-purpose AI. They're highly contrived problems by design.

aithrowawaycomm 3 days ago | root | parent |

My point is that the contest really should be "can solve ARC problems without having anything about ARC problems in its pre-training data or hard-coded in the design of the program." Otherwise these claims from ARC-AGI are simply false:

  Solving ARC-AGI represents a material stepping stone toward AGI. At minimum, solving ARC-AGI would result in a new programming paradigm. It would allow anyone, even those without programming knowledge, to create programs simply by providing a few input-output examples of what they want.

  This would dramatically expand who is able to leverage software and automation. Programs could automatically refine themselves when exposed to new data, similar to how humans learn.

  If found, a solution to ARC-AGI would be more impactful than the discovery of the Transformer. The solution would open up a new branch of technology.

This program does not represent a "new paradigm" because it requires a bunch of human programming work specifically tailored to the problem, and it cannot be generalized. If software like this wins the contest, that really shows the contest has nothing whatsoever to do with AGI.

antonvs 3 days ago | root | parent |

> Otherwise these claims from ARC-AGI are simply false:

Currently, any claim about AGI other than "we're probably not anywhere close to strong AGI" is simply false.

Of course a lot depends on one's definition of AGI. From another perspective, one could argue that ChatGPT 4 and similar models are already AGI.

razodactyl 2 days ago | root | parent | prev | next |

The majority of ARC can be gamed / hard-coded, no doubt about it.

The real pressure is the private hold-out set and the variations that can be added to counter this aspect.

A true AGI would be able to solve anything thrown at it, which is where the authors are trying to lead AI engineering now that LLMs have pretty much taken over.

If it starts getting too easy, they just reconsider and add harder problems.

It's like how we don't talk about the Turing Test anymore as it's no longer the best metric to determine real intelligence.

The authors are signalling to the industry that new ideas are needed, and the monetary aspect is there to show how serious they are about it.

It's good because, as per the above, we have research being thrown at it, which means we can iterate until we perhaps find another breakthrough.

benchmarkist 3 days ago | root | parent | prev |

The contest is misnamed; solving ARC will not get us any closer to AGI.

naasking 3 days ago | root | parent |

Why?

benchmarkist 3 days ago | root | parent |

Because it's a set of puzzles on a 2D grid. We don't live on a 2D grid, so it's already on the wrong track. A set of puzzles for a 3D sphere wouldn't get us any closer to AGI either, but at least it would be a more realistic representation of the world and how a general-purpose problem solver should approach reality. Even Minecraft would be a better test, and lately people have started testing LLMs in virtual worlds, which is a much better test case than ARC.

Insofar as ARC is being used as a benchmark for code synthesis it might be somewhat successful, but it doesn't seem like people are using code synthesis to solve the puzzles, so it's not really clear how much success on ARC will advance the state of the art in AI or in code synthesis from a logical specification.

naasking 2 days ago | root | parent | next |

> Because it's a set of puzzles on a 2D grid. We don't live on a 2D grid so it's already on the wrong track.

I don't see what this has to do with anything. Intelligence is about learning patterns and generalizing them into algorithmic understanding, where appropriate. The number of dimensions latent in the dataset is ultimately irrelevant. Humans live in a 4D world, or 3D if the holographic principle is true, and we regularly deal with mathematics in 27 or more dimensions. LLMs build models with at least hundreds of thousands of dimensions.

benchmarkist 2 days ago | root | parent |

Show me an LLM that is doing any of the things you mentioned. Furthermore, I'm willing to bet none of that will be possible after ARC is solved either. How much money would you be willing to bet?

naasking 2 days ago | root | parent |

Not sure what's so controversial; it's well known that LLMs can trivially be viewed as operating in higher-dimensional space:

https://gcptips.medium.com/a-geometric-perspective-on-large-...

As for generalizing to algorithms, LLMs don't yet do this as well as humans, but they do do it:

https://arxiv.org/abs/2309.02390

Finally, there's no intrinsic reason why an AI that can reliably solve deductive problems like ARC would be limited to two dimensions.

benchmarkist 2 days ago | root | parent |

Then you have no reason to argue with me.

naasking 2 days ago | root | parent |

The only position I took issue with, and still do, is the one in the closing paragraph of my last post. Your argument for why ARC solvers wouldn't generalize doesn't even make sense.

benchmarkist 2 days ago | root | parent |

No point in arguing. If you think it will generalize, then there is no reason to convince random people on the internet that an ARC-AGI solver will get you closer to AGI.

razodactyl 2 days ago | root | parent | prev |

You're on the right track in your arguments, but don't underestimate how hard the puzzles in ARC actually are.

It takes considerable depth of reasoning to see and work through the patterns / problems / solutions.

Try doing a few of them by hand to see what I mean.

Simulated worlds are complex enough to hide their own flaws, just like LLMs are complex enough to lead us to believe they can reason when most of the time they are pattern matching.

Zondartul 2 days ago | root | parent |

ARC problems are too hard for me. I'm no longer sure I'm generally intelligent.

benchmarkist 2 days ago | root | parent |

Humans are not generally intelligent. The adjective "general" in "AGI" does not mean it is equivalent to human intelligence; it means it's above and beyond human intelligence.

aithrowawaycomm 2 days ago | root | parent | next |

I think “general” should be taken to mean “has an average human child’s common sense and causal reasoning,” since common sense and causal reasoning are at some level shared by all vertebrates. It seems like the focus on “above and beyond human intelligence” is how you get AIs which appear to understand algebraic topology, yet utterly fail at counting problems designed for pigeons. It should be scientific malpractice to compare an AI to human intelligence without making any effort to compare it to rat/etc intelligence. (I guess investors wouldn’t be happy if Sam Altman said “in 20 years I believe we’ll reach ARI.”)

In general, tech folks are far too beholden to an instinctual and unscientific idea of intelligence as compared between humans, one which mostly uses linguistic ability and surface knowledge as a proxy. This proxy might sometimes be useful in human group decision-making, but it is also how dumb confident people manage to fail upwards, and it works about as well for a computer as it does for a rat (though it mismeasures in the opposite direction).

naasking 2 days ago | root | parent | prev |

No, that's superintelligence.

benchmarkist 2 days ago | root | parent |

Meaningless distinction.

naasking 2 days ago | root | parent |

Not at all. Humans are fundamentally limited by our finite state space and bandwidth. Classifying systems that can generalize at least as well as a human but that can exceed those limits as superintelligent is a meaningful distinction.

I agree that "equivalent to human intelligence" is not a robust way to define general intelligence, but humans are a general intelligence.

arjvik 3 days ago | prev | next |

Test-Time Training is incredibly powerful. Most recently, it has been shown that self-attention can in fact be viewed through the lens of test-time training, with a kernel smoother "learning" from context. Simply replacing that kernel smoother with more powerful models results in very capable and scalable models!

https://arxiv.org/abs/2407.04620
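
To make the correspondence concrete, here's a toy numpy sketch (my own simplification: single head, no projections, and a plain reconstruction loss; the paper's actual layers are more elaborate). Self-attention acts as a non-parametric kernel smoother over the context, while a TTT-style layer replaces it with a parametric model whose weights are updated by a gradient step as each token arrives:

  import numpy as np

  def softmax(z):
      z = z - z.max(axis=-1, keepdims=True)
      e = np.exp(z)
      return e / e.sum(axis=-1, keepdims=True)

  def attention_kernel_smoother(Q, K, V):
      # Self-attention as a Nadaraya-Watson kernel smoother: each output is a
      # softmax-kernel-weighted average of the values seen in context.
      return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

  def ttt_linear(K, V, Q, lr=0.1):
      # Test-time-training view: swap the kernel smoother for a linear model W
      # trained online, one gradient step per token, on the reconstruction
      # loss 0.5 * ||W k_t - v_t||^2, then applied to the query.
      W = np.zeros((V.shape[-1], K.shape[-1]))
      outputs = []
      for k_t, v_t, q_t in zip(K, V, Q):
          outputs.append(W @ q_t)                  # predict with the weights so far
          W -= lr * np.outer(W @ k_t - v_t, k_t)   # gradient step: "learn" k_t -> v_t
      return np.array(outputs)

  T, d = 8, 4
  Q, K, V = (np.random.randn(T, d) for _ in range(3))
  print(attention_kernel_smoother(Q, K, V).shape)  # (8, 4)
  print(ttt_linear(K, V, Q).shape)                 # (8, 4)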