#ai-benchmarking

#generative-ai
from TechCrunch
2 months ago
Artificial intelligence

A high schooler built a website that lets you challenge AI models to a Minecraft build-off | TechCrunch

#ai-research
Artificial intelligence
from TechCrunch
3 months ago

People are using Super Mario to benchmark AI now | TechCrunch

Researchers find Super Mario Bros. more challenging for AI than Pokémon, revealing limitations of reasoning models in real-time gameplay.
Artificial intelligence
fromTechCrunch
1 month ago

AI benchmarking platform Chatbot Arena forms a new company | TechCrunch

Chatbot Arena is establishing a company to elevate its AI benchmarking capabilities while ensuring impartiality in its evaluations.
from techcrunch.com
1 month ago

Debates over AI benchmarking have reached Pokemon

Last week, a post on X claimed that Google's Gemini model had surpassed Anthropic's Claude model in Pokémon, stirring controversy over AI benchmarks and how they are implemented.
Artificial intelligence
#artificial-intelligence
Artificial intelligence
from TechCrunch
3 months ago

Anthropic used Pokemon to benchmark its newest AI model | TechCrunch

Anthropic's Claude 3.7 Sonnet successfully demonstrated advanced AI capabilities by playing Pokémon Red, showcasing improved reasoning skills over previous versions.
from TechCrunch
3 months ago

These researchers used NPR Sunday Puzzle questions to benchmark AI 'reasoning' models | TechCrunch

The challenges posed by the Sunday Puzzle are beneficial for AI benchmarking, as they require insight and reasoning beyond mere rote memory.
Artificial intelligence
from TechCrunch
3 months ago

Perplexity launches its own freemium 'deep research' product | TechCrunch

Perplexity has become the latest AI company to release an in-depth research tool called Deep Research, which aims to provide expert-level answers with real citations.
Miscellaneous