What is the Erdős problem LLM experiment on Hacker News?

It's a Show HN experiment asking whether a large language model can help tackle a problem in the style of Paul Erdős's mathematics. The interest comes from putting LLMs on a hard, research-flavored challenge instead of another routine benchmark set. That's more revealing, even if the results are still preliminary.

Can LLMs solve advanced math research problems today?

Not reliably on their own, no. They can suggest ideas, draft proof outlines, or point to avenues worth testing, but mathematicians still have to verify every claim with care. Research-level math demands exactness, and current language models still lose that thread too often.

Why are Erdős problems a useful test for language models?

They're useful because they often demand deep reasoning, clever constructions, and unfamiliar proof techniques. That mix exposes model weaknesses fast, in a way a standard school-math dataset won't. A model can sound convincing yet still fail badly when novelty rises.

How do researchers verify LLM-generated mathematical arguments?

They verify them by checking each logical step, searching for counterexamples, and sometimes translating claims into proof assistants. Human review stays central because one subtle mistake can collapse the whole argument. Formal tools can make the process tighter, but they don't replace expert judgment.

What is the likely future for language models in mathematical discovery?

The likeliest future looks collaborative rather than fully autonomous. LLMs may assist with conjecture generation, literature synthesis, and proof formalization, while humans or symbolic systems handle strict validation. We'd say that split fits the evidence much better than claims of solo machine discovery.

Erdős problem solved with LLMs? What the experiment shows

⚡ Quick Answer

The Erdos problem LLM experiment is interesting because it tests whether language models can assist with hard mathematical reasoning, not just explain known results. But one experiment does not prove LLMs can independently solve advanced math research problems; it shows they may be useful collaborators in tightly guided settings.

If an Erdős problem fell to an LLM, we'd be looking at a history-book headline. That's why a Show HN post about yet another run at an Erdős-style problem with language models drew eyes fast. But math doesn't care about confidence. It asks for proof, structure, and no hand-waving at all. That's the real test.

What the Hacker News Erdős problem LLM experiment is really testing

The Hacker News Erdős problem LLM experiment is really asking whether large language models can add anything to mathematical discovery when the pressure rises. In practice, the question is tighter: can a model produce conjectures, proof sketches, counterexamples, or search directions that still look useful after an expert tears through them? That's a consequential bar. Paul Erdős became famous not only for sheer output, but for problems that reward deep combinatorial insight, so an Erdős-style challenge makes a smart stress test for LLMs. The distinction matters because language models often do well on textbook-flavored math, then stumble when they have to build a fresh proof. Work from OpenAI, Google DeepMind, and independent research groups like Epoch AI repeatedly suggests that benchmark wins in math don't transfer cleanly to open-ended research work. So the Show HN post matters less as a victory lap and more as a probe into where current systems bend, and where they simply snap.

Can LLMs solve advanced math research problems on their own

No, LLMs probably can't solve advanced math research problems on their own in a dependable way right now. They can produce arguments that look elegant on the page, but advanced mathematics punishes even tiny logical holes, and models still create those holes far too often. That's the whole issue. Google DeepMind's AlphaGeometry and its related formal reasoning work make clear that performance improves when systems combine symbolic search with learned components, instead of leaning on language generation by itself. That should cool the hype around pure-chatbot discovery claims. A mathematician reading an LLM proof sketch has to check every step, test edge cases, and often rebuild the argument from scratch anyway. So the model may save time, or it may just move the work around. We'd argue the current generation looks more like an imaginative but shaky research assistant than an autonomous theorem prover.

Why LLMs on Erdős problems expose the gap between fluency and proof

LLMs on Erdős problems expose the gap between fluency and proof, because mathematical writing can sound persuasive long before it becomes correct. Language models are tuned to continue patterns, not to keep an airtight deductive state intact across long chains of reasoning. And that design choice shows up fast in combinatorics and number theory. Terence Tao has written and spoken about using language models for exploratory assistance while warning that verification still matters, which lines up with what many working mathematicians think. Here's the thing: a plausible lemma isn't a proved lemma. An AI experiment built around an Erdős problem may offer a fresh angle or shrink literature-review work, but the standard at the end stays exactly the same. Math is brutally democratic. Either the argument holds, or it doesn't.
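To make the "plausible isn't proved" point concrete, here is a minimal Python sketch of a brute-force counterexample search. It is not taken from the Show HN experiment; the helper names (`is_prime`, `first_counterexample`, `euler_claim`) are our own illustrative choices. The claim under test is the classic one that n² + n + 41 is always prime, which holds for every n from 0 to 39 and then breaks.

```python
from typing import Callable, Optional


def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def first_counterexample(claim: Callable[[int], bool], limit: int) -> Optional[int]:
    """Return the smallest n in [0, limit) where the claim fails, or None."""
    for n in range(limit):
        if not claim(n):
            return n
    return None


def euler_claim(n: int) -> bool:
    """Hypothetical 'plausible lemma': n^2 + n + 41 is prime for every n >= 0."""
    return is_prime(n * n + n + 41)


# Looks airtight if you only check the first 40 cases...
print(first_counterexample(euler_claim, 40))   # None
# ...and fails one step later, since 40^2 + 40 + 41 = 1681 = 41 * 41.
print(first_counterexample(euler_claim, 100))  # 40
```

A search like this can only refute a claim, never prove it. Forty passing cases say nothing about the forty-first, which is the same trap a fluent but unverified proof sketch sets.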

How language models for mathematical discovery may still become useful

Language models for mathematical discovery may still turn out useful as idea generators, literature guides, and formalization aides. That's the lane with real traction. It's a smaller claim, yes, but also a more believable one. Projects tied to Lean, Isabelle, and other proof assistants already hint at a future where models suggest steps while formal systems check validity. We'd argue that's far more credible than raw chat output standing alone. Google DeepMind's AlphaProof work and ongoing theorem-proving efforts point the same way: hybrid systems tend to outperform freeform text-only methods on hard reasoning tasks. So if you're asking whether the Show HN experiment matters, we'd say yes, but not because it proves full autonomy. It matters because it points to a collaborative workflow where LLMs widen the search space and humans or formal tools close it. That's worth watching.
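For a sense of what "formal systems check validity" means in practice, here is a tiny Lean 4 sketch. It is purely illustrative and not drawn from the experiment or from any of the projects named above; the theorem name `add_comm_sketch` is our own. The point is only that Lean accepts a statement when the supplied proof actually checks, no matter how confident the prose around it sounds.

```lean
-- A concrete arithmetic fact, verified by evaluation rather than by trust.
example : 40 * 40 + 40 + 41 = 41 * 41 := by decide

-- A general statement, proved by citing a lemma from Lean's core library.
theorem add_comm_sketch (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

If a model suggested a false step instead, say `40 * 40 + 40 + 41 = 1680`, the same `by decide` would be rejected. That rejection is the part of the workflow that raw chat output can't supply on its own.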

Key Statistics

The MATH benchmark paper reported that earlier large language models often scored below expert-human levels on competition-style mathematics, even when they sounded confident. That gap helps explain why open-ended research math remains a much steeper challenge than polished benchmark prompts.

Google DeepMind reported in 2024 that AlphaGeometry solved a majority of International Mathematical Olympiad geometry problems from a curated benchmark set. The result matters because it came from a hybrid reasoning system, not from freeform language generation alone.

OpenAI and other labs have shown strong gains on GSM8K and similar math benchmarks, where top systems now exceed 90% under some prompting setups. Those numbers are impressive, but they mostly reflect structured problem solving rather than original mathematical discovery.

Formal proof ecosystems such as Lean’s mathlib have grown to tens of thousands of theorems contributed by a global community. That scale points to a realistic path for AI in mathematics: models that work alongside formal systems and human experts, not apart from them.

Key Takeaways

✓ The Hacker News Erdős problem LLM experiment offers more signal than spectacle.
✓ LLMs on Erdős problems can assist reasoning, but they still need heavy verification.
✓ Advanced math research exposes the limits of fluent text generation very quickly.
✓ Mathematical discovery needs proof, not persuasion, and models often blur the two.
✓ The real value likely sits in collaboration workflows, not solo AI breakthroughs.
