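Content provided by Kieran Gilmurray. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Kieran Gilmurray or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://ko.player.fm/legal.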

The $3 Trillion Question: Can AI Match Human Experts?

14:36
 

What happens when AI attempts the same complex work as human experts with 14 years of experience? The answer might reshape our understanding of the economic future.

TL;DR:

  • GDPval tests AI on complex, multimodal tasks requiring handling of CAD designs, spreadsheets, and presentations
  • Tasks are created from actual professional work products that take humans an average of 7 hours to complete
  • Claude Opus performed best with 47.6% of its deliverables rated as good as or better than human experts
  • AI shows potential to make workflows 40% faster and 63% cheaper when paired with human oversight
  • 3% of AI failures were classified as "catastrophic," including incorrect medical diagnoses and suggestions of financial fraud
  • Simple prompt improvements like asking models to self-check their work significantly reduced formatting errors
  • Current models still struggle with ambiguity and tasks requiring tacit knowledge or complex human interaction

GDPval represents a fundamental shift in how we evaluate artificial intelligence. Rather than abstract academic metrics, this new benchmark from OpenAI measures how well frontier AI models handle real-world economic tasks across nine major sectors worth $3 trillion annually.

The methodology is ruthlessly practical: AI models must complete complex assignments that typically take human experts seven hours, handling everything from CAD designs to financial spreadsheets while synthesizing information from up to 38 reference documents.

The results are both promising and sobering. Claude Opus led the evaluation with 47.6% of its outputs rated equal to or better than work from professionals at organizations like Apple, Goldman Sachs, and Boeing. When integrated into realistic workflows with human oversight, these models demonstrated potential to make knowledge work 40% faster and 63% cheaper.

Yet failures remain significant: 3% of them were classified as "catastrophic," including incorrect medical diagnoses and recommendations of financial fraud.

Perhaps most valuable is GDPval's illumination of where AI currently excels (document formatting, data analysis) and where it falters (following complex instructions, handling ambiguity).

This economic lens offers businesses and policymakers unprecedented clarity about AI's near-term impact on knowledge work, while highlighting that the highest-value human skills—tacit knowledge, real-time collaboration, and complex communication—remain beyond current AI capabilities.

How quickly will that gap close? That's the trillion-dollar question worth pondering.

Listen to an audio version of this report, created using Google NotebookLM.

Link to research: GDPval.pdf

Support the show

Contact my team and me to get business results, not excuses.
☎️ https://calendly.com/kierangilmurray/results-not-excuses
✉️ [email protected]
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
🦉 X / Twitter: https://twitter.com/KieranGilmurray
📽 YouTube: https://www.youtube.com/@KieranGilmurray
📕 Want to learn more about agentic AI? Then read my new book on Agentic AI and the Future of Work: https://tinyurl.com/MyBooksOnAmazonUK

Chapters

1. Introducing the GDPval Benchmark (00:00:00)

2. Methodology and Task Complexity (00:01:36)

3. Evaluation Process and AI Grading (00:04:55)

4. Model Performance and Economic Impact (00:06:28)

5. Limitations and Catastrophic Failures (00:10:05)

6. Future Improvements and Big Picture (00:12:56)
