OpenAI introduces GeneBench-Pro to test AI research judgment

June 30, 2026 1:07 PM EDT

Investing.com -- OpenAI released GeneBench-Pro on Tuesday, a benchmark designed to test whether artificial intelligence models can make the judgment calls required in computational biology research.

The benchmark includes 129 problems across genomics, quantitative biology, and translational medicine. Each problem provides models with a dataset, experimental context, and a target question. Models must explore the data, choose an analytical approach, and provide a final answer.

OpenAI sent 82 of the 129 questions to external domain experts, including graduate students, postdoctoral researchers, industry scientists, and professors. Reviewers assessed each problem's realism and whether the target answer was identifiable.

Alexander Strudwick Young, assistant professor in human genetics at UCLA, said the problems would have been challenging for a graduate student to complete without feedback from an experienced supervisor.

Each problem is built synthetically, with OpenAI controlling the full data-generation process. This allows the company to grade correctness against known targets and ensure that reasonable differences in analytical choices still produce accepted results.

OpenAI's GPT-5.6 Sol achieved a pass rate of 28.7% at the highest reasoning level and 31.5% with Pro mode enabled. At the lowest reasoning level, the same model achieved a single-digit pass rate. For comparison, GPT-5 scored below 5% when OpenAI began building the original GeneBench.

Competitor models at best matched the performance of the corresponding GPT model at the time of release. Opus 4.8 achieved 16.0%, followed by Gemini 3.5 Flash at 8.1%, GLM 5.2 at 4.6%, Gemini 3.1 Pro at 3.1%, DeepSeek V4 Pro at 2.4%, and Grok 4.3 at 1.5%.

Reviewers estimated that a typical GeneBench-Pro problem would take a human expert around 20 to 40 hours to complete. At $200 per hour, that puts the human labor cost of a single problem in the thousands of dollars. Current inference costs are only several dollars per problem.
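The labor-cost estimate above can be checked with a short calculation. This is a minimal sketch that uses only the figures quoted in the article (20 to 40 expert hours per problem at $200 per hour); the function name `labor_cost_range` is illustrative and does not come from OpenAI or the benchmark.

```python
# Figures quoted in the article: reviewers' estimate of expert time per problem
# and the assumed hourly rate.
HOURS_LOW, HOURS_HIGH = 20, 40
HOURLY_RATE_USD = 200


def labor_cost_range(hours_low, hours_high, rate):
    """Return the (low, high) labor cost in dollars for a range of hours."""
    return hours_low * rate, hours_high * rate


low, high = labor_cost_range(HOURS_LOW, HOURS_HIGH, HOURLY_RATE_USD)
print(f"Estimated human labor cost per problem: ${low:,} to ${high:,}")
```

Under these figures a single problem comes to $4,000 to $8,000 in expert time, which is consistent with the article's "thousands of dollars" and roughly three orders of magnitude above an inference cost of several dollars.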

OpenAI is open-sourcing 10 representative questions on Hugging Face and will provide a 50-question subset to Artificial Analysis for independent benchmarking.
