What this article will cover:
- The key updates from the announcement
- The practical implications of GPT-4o – especially across different industries
- What we find exciting
What this article won’t cover:
- Technical details or deep-dives
- Detailed Benchmarks
- Speculation
Introduction:
OpenAI just announced their latest model, GPT-4o, in their spring update. This was clearly the gpt2-chatbot that had been doing the rounds of the internet over the past few weeks. The oversimplified summary is that ChatGPT now has native multimodality built in. Their blog does a very good job of covering the key points, and Sam Altman’s blog post adds important focus to the announcement. We’ll try to break down the key parts:
- ChatGPT can now see and hear. It had multimodal capabilities earlier too, but those came from connecting various moving parts under the hood to make it seem multimodal. That meant it was slower and would lose important context in the pipeline. With the ‘Omni’ update, ChatGPT can do all of this natively, without the help of other tools. This also means it’s faster.
- They’ve accomplished this by training the LLM on audio and image data in addition to the text data it was previously trained on, so everything is processed by the same neural network.
- The latest update also makes the LLM faster and cheaper to use. That’s super important for real-life use cases where response time (latency) matters. (A minimal API sketch follows this list.)
- ChatGPT now sounds much more human-like and emotive in its responses. (It’s still a bot, but more like ‘her’.) There are definitely some creepy intonations, but this is a BIG step forward. (Again, super important for real-world use cases.)
- And last but not least, IT’S FREE! For limited use, of course. But it’s free for people to try, so please go ahead and do so! (Note: this is still being rolled out.)
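To make the “native multimodality” point concrete, here’s a minimal sketch of what a single text-plus-image request to GPT-4o looks like with the openai Python SDK. The prompt and image URL are our own illustrative assumptions, not code from OpenAI’s announcement.

```python
# Minimal sketch: one GPT-4o call that mixes text and an image.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text and image go in the same message, handled by one model.
                {"type": "text", "text": "What move did I just play?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/rock-paper-scissors.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The point of the sketch is simply that there’s no separate vision pipeline to wire up: the image rides along in the same request as the text.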
Here’s a quick preview of its ‘Vision’ capabilities: a game of Rock, Paper, Scissors. The emotion is really off the scale!
While OpenAI is saying this is their best model yet, folks who have been testing it are finding that on specific hard coding tasks it does a little worse than GPT-4T. The thread below gets into the specifics if you’re interested:
Early Results From Our First Eval Of GPT-4o – Hard Coding and Reasoning
- GPT-4o: successful tasks – 79/96; coding tasks – 52/65
- GPT-4: successful tasks – 90/96; coding tasks – 60/65
The model is way faster, but it's unclear why it's doing much worse on hard tasks 😢…
— Bindu Reddy (@bindureddy) May 13, 2024
But aside from that, the final verdict, which we currently believe to be true, is that GPT-4o is the best language model out there and does best on knowledge-first tasks. For complex tasks, GPT-4T or Claude Opus are still the better models. Of course, since we like open source, Llama 3 isn’t that far behind, and given that it’s technically free, it deserves a mention here.
GPT-4o – Best Model For Knowledge Use-Cases, Not For Agentic Ones.
It's good at retrieval and language understanding; it's the best language model out there.
However, it has issues similar to those in the April version of the OAI model. It goes into loops, becomes lazy, and…
— Bindu Reddy (@bindureddy) May 14, 2024
Real-World Use Cases (as showcased by the OpenAI team):
- Real-time language translation: you can use this like a (near) universal language translator. (See the rough sketch after this list.)
- Doing homework in a much more natural way. Since GPT-4o can see and hear, it’s easier to point it at a piece of homework and have it explain the subject to you.
- Accessibility guide – eyes for the blind (this is amazing!)
- It can be your meeting assistant – this is killer because it really puts companies like Limitless and other work-first AI use cases at risk
- There are a few other use cases that the OpenAI team talks about in their blog post – like generating 3D text, turning photos into caricatures, generating images, and keeping characters consistent across different poses.
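As a rough text-only stand-in for the live translation demo (the demo itself used voice, which this sketch doesn’t reproduce), here’s the prompting pattern with the same SDK. The language pair, system prompt wording, and helper function are our own assumptions.

```python
# Rough sketch of a two-way interpreter loop using GPT-4o over text.
# Assumes the openai Python SDK (v1.x) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

history = [
    {
        "role": "system",
        "content": (
            "You are a live interpreter. When you receive English, reply only "
            "with the Italian translation; when you receive Italian, reply only "
            "with the English translation."
        ),
    }
]

def translate(utterance: str) -> str:
    """Append one utterance to the running conversation and return its translation."""
    history.append({"role": "user", "content": utterance})
    reply = client.chat.completions.create(model="gpt-4o", messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text

print(translate("Hey, how has your week been going?"))
```

Keeping the whole exchange in one conversation history is what lets the model carry context across turns, which is the part that patched-together translation pipelines tended to lose.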
Looking at how this is likely to impact different industries:
Customer Service and Support: We already knew that customer support was going to move to AI. Many companies and start-ups have built their solutions as add-ons to foundational models to provide AI-based customer support. With the GPT-4o update, the foundational model itself will now be able to do most of these tasks. The scope of customer support will also move beyond voice to video. Having trouble with your IKEA furniture? No problem, have the IKEA-GPT-4o bot take a look. You can build alongside the bot. It’s also much more human-like, so hopefully irate customers won’t be as annoyed that they’ve been palmed off to an incompetent bot. (Competence yet to be proved…)
Education & Coaching: Imagine having an AI tutor who can see how you’re behaving, understand your emotions, decode your responses and see where you’re going wrong. All in real-time. Now this is a much more palatable AI teacher than what we’ve had so far. Again, all this in a foundational model. Right out of the box. Homework will be a lot easier. But beyond that, a competent Multimodal AI can be a great asset even for vocational training or other skills that require a tutor to give feedback based on physical progress. While OpenAI showed off some interview prep use cases, those could be achieved earlier via patching a few different tools together. The real genius is in the ability of the AI to understand emotion and tailor their responses to a very nuanced degree. Just imagine a sweet and loving singing teacher, rather than the generally grumpy irate Russian teacher that we’re used to. (Or maybe that’s just us😅 )
Gaming and Entertainment: While we hate to admit this, this is just the beginning for AI waifus. Beyond that though, having realistic AI within games that can respond to sight and sound will add more depth to the games themselves. Of course, having your personal AI entertainer can also be a thing. We’re hoping someone will build a cool AI entertainer for our two-year-old to enjoy.
Conclusion:
Of course, this is just the start. Sam Altman has been saying that they like to iterate fast and ship quickly. We also know that Google I/O is about to kick off any time now, and OpenAI wanted to make sure they were first to strike with the best assistant on the market.
What are you most excited to see in this update and how do you think this can be used in your industry? Let us know!