The Hype, the Hubris, and the Hard Lessons of AI Development

By Annie Cushing
Founder & Senior AI Strategist at Annielytics

Annie tackles the promises and pitfalls of AI with a mix of enthusiasm and tough love. She highlights powerful, real-world applications – like cancer detection and endangered language preservation – while calling out the lack of strategy and transparency in AI development, especially in leaderboard manipulation and inflated model costs. 

Is FOMO hitting you hard after missing SEO Week 2025? It's not too late to attend in 2026.

SEO Week 2025 set the bar with four themed days, top-tier speakers, and an unforgettable experience. For 2026, expect even more: more amazing after parties, more activations like AI photo booths, barista-crafted coffee, relaxing massages, and of course, the industry’s best speakers. Don’t miss out. Spots fill fast.

ABOUT Annie Cushing

Annie is a long-time analyst turned data scientist who is passionate about AI innovation, strategy, and app development. She has created a suite of free tools for other practitioners to benefit from and writes about hot topics around AI.

OVERVIEW

In her return to the conference stage, Annie delivers a sharp, engaging talk that explores both the transformative potential and the very real pitfalls of artificial intelligence. She kicks things off with a powerful reminder of AI’s capacity for good, from cancer detection breakthroughs at Harvard Medical School to drastically reduced infant mortality in Malawi and the preservation of endangered Indigenous languages. Annie shares her personal investment as a cancer survivor and uses compelling global examples to show how AI, when applied responsibly, can be life-saving, innovative, and empowering.

But the second half of her talk is a reality check. Drawing from her work as an AI strategist, Annie calls out the widespread lack of strategy among AI app developers and the troubling opacity of model providers. She takes aim at leaderboard manipulation, inflated model pricing, and the overreliance on foundation models when simpler, cheaper alternatives exist. With examples ranging from misused AI in educational tools to questionable performance claims from major players like OpenAI and xAI, Annie urges the audience to stop buying into the hype and start applying critical thinking. Her tools and free apps are designed to help practitioners navigate the complexity of model selection and evaluation without getting burned.

Recommended Resources:

DOWNLOAD THE DECK

Talk Highlights

AI can do tremendous good, but only with intentional application: 

From cancer diagnosis to endangered language preservation, there are real-world examples of AI driving meaningful, positive outcomes.

Most AI app developers lack strategy, and model providers often lack transparency: 

Misaligned model selection, inflated pricing, and unverifiable leaderboard claims can lead to wasted resources and poor outcomes if not carefully vetted.

Not every problem needs AI – sometimes simpler is smarter: 

Use traditional machine learning methods or APIs when they’re more cost-effective, reliable, and easier to maintain than flashy AI implementations.

Presentation Snackable


Transcript

Annie Cushing: Thank you. Yeah. This is my first conference since, like, 2018. I feel like I've been in witness protection. So thanks for having me. Okay. So I am going to talk about some of the hype, the hubris. I promise you I don't subscribe to the hysteria of, you know, hide your family, hide your kids, hide your wives, AI is killing everybody out there. But before I get into some of the issues, I wanna start with this: I mean, what a time to be alive. There is so much good that AI is accomplishing in the world.

Harvard Medical School developed an AI model that they dubbed CHIEF that reportedly achieved 94% accuracy in identifying cancer. And unlike other AI models in the cancer space that usually identify a handful of cancer types, this was successfully tested on 19 different cancer types. As a cancer survivor, that is really exciting. In fact, I was treated right here at Memorial Sloan Kettering. One of the scariest things my oncologist said to me was, Annie, you're really healthy, and that's good, but we have no data to guide your treatment. I was like, you have no idea how much that rocks my world. So I'm actually part of a 10-year study so that they will have data. Researchers at Rutgers University developed AI tools to help prevent collisions with whales in the North Atlantic. That's been a huge issue with the critically endangered North Atlantic right whale. There are only 70 reproducing whales left. That's how critically endangered it is. So with that preservation in mind, they combined two databases, and now they are using AI to guide ships so that they can avoid the whales. Then there was a children's hospital in Texas that donated AI software to a hospital in Malawi, Africa, which really struggled with stillbirths and infant mortality, specifically neonatal deaths. In the three years since they donated this software, there has been an 82% drop in stillbirths and neonatal deaths. This one is especially cool. Of the 4,000 indigenous languages worldwide, one dies every two weeks. So there's an AI engineer, he has the coolest name, Michael Running Wolf, and he founded Indigenous in AI, and his team has been building speech recognition models to preserve more than 200 endangered indigenous languages in North America.

You can track good news like this with my AI timeline. It's totally free. Specifically, there's an AI for Good tag. You can also use it to stay abreast of AI news and developments. You can filter by company. So if you wanna track DeepSeek, not saying why you would necessarily wanna track that, but you can filter by company. I have just under 100 tags that you can filter by, or you can query it. With all of the positives, there are challenges associated with AI that require vigilance. I'll make a case for two in particular that are specific to AI app development. So I work as an AI strategist. I've been out of the SEO space for quite a while now. But the two biggest issues I see are, one, lack of strategy, that's on the development side. And then lack of transparency, that's on the model provider side. And as you'll see in this presentation, those issues can literally cost companies millions of dollars.

The first issue I'm going to touch on is that those building AI apps many times just don't know how to pick the best model for a specific task. A model that might work swimmingly for one task might completely bite the dust for another task, and they can be hamstrung with issues like hallucinations, lag time, and unnecessary expense. We will see very specific examples of unnecessary expense. One term I'm going to be throwing around quite a bit is leaderboard. So let's establish what a leaderboard is. A leaderboard is just a dashboard that compares the performance of models against a set of evaluation tasks. Basically, what these leaderboard providers, these benchmark providers, are doing is taking a bunch of tasks and stress testing these models against them. Then they assign them scores for those tasks and rank them against other models. These leaderboards tend to be pretty low fidelity. They generally center around a table, and a few of them have charts, but most of them are a table, and they vary wildly from one leaderboard to another. So I spent the entire month of December just digging into all these different leaderboards to learn how to use them, to create an app that will help reduce your learning curve. It's also a free app. I'll show you that. This is a pretty typical leaderboard. This is the MathEval leaderboard. As the name suggests, this is very helpful if you're doing any kind of app that includes math, which I've worked on. Math is still very, very challenging for AI models. And this is pretty common. It's not even the worst example, but they have these filters at the top, and benchmarks, ability, and grade, those are all pretty straightforward. But then you get to shots, and there's nothing that guides the user, like, okay, what would be the benefit to me changing from the default, overall highest, to zero shot or few shot? And so as I was going through these, I was like, ugh, where are the info icons? I kept having to go back and forth with Claude asking, what does this mean? What does this do? How would I use this? And I just built this massive JSON object, and that's what I used to build the app, which you'll see in a minute. Here's another example. This is the VBench leaderboard. This is a leaderboard for video generation models. It's a decent leaderboard. Again, another table. But there's some ambiguity in the filters. So if you're looking at this, it's not immediately clear, like, what's the difference between select quality dimension and select semantic dimensions versus evaluation dimension? Things like that. I'm like, oh, if it just had an info icon with some tooltips, that would be really helpful, because then I would know, okay, from a business perspective, what would be the benefit to switching up this filter? This is very common. This is also from VBench. Not to pick on VBench, but you will typically see in these leaderboards that there are all of these benchmarks and, again, no info icons, no tooltips. You look at them, and some of them are pretty esoteric, like object class, multiple objects, human action. They are some kind of evaluation task that it's measuring the models on, but you're like, what does that actually mean? This particular leaderboard actually has a really good filter: evaluated by VBench team.
Again, I think it would be really impactful to have an info icon to explain what this means. What it's saying is that if you go with the default, it filters out self-reporting models. And that is one of the biggest issues causing transparency concerns with the model providers, because a lot of these leaderboards allow models to self-report.
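
Since the "shots" filter comes up on so many of these leaderboards, here is a quick illustration of the difference it measures: zero-shot means the model gets only the instruction, while few-shot prepends a handful of worked examples to the prompt. The prompts below are made-up placeholders, not taken from any benchmark.

```python
# Zero-shot vs. few-shot: the "shots" are worked examples prepended to the prompt.
# These strings are illustrative placeholders, not from any actual leaderboard task.
zero_shot = "Classify the sentiment of this review as positive or negative: 'The battery died in a day.'"

few_shot = """Classify the sentiment of each review as positive or negative.
Review: 'Love it, works perfectly.' -> positive
Review: 'Broke after one use.' -> negative
Review: 'The battery died in a day.' ->"""

# A zero-shot score stresses raw instruction following; a few-shot score shows
# how much the model improves when it is given examples in context.
```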

So there's no transparency. There's no accountability. These model providers are making outrageous claims. And as you'll see later in the presentation, you go back to these leaderboards and you're like, well, that's weird. Why don't I see this score that you proclaim and that all of these publishers just parrot? There's just no verification. So I personally wish that leaderboards would not allow any models to self-report. I just think they are putting way too much trust in humanity. But at minimum, if they do allow self-reporting, they should absolutely disclose that, in my opinion. When it comes to the info icons, now that I've built a handful of apps, I err on the side of including info icons and tooltips for everything. In my opinion, if you're going to err on one side or the other, err on the side of being obnoxiously helpful. Don't make people leave your tool and try to figure out what exactly it means, because these model decisions have significant implications. They impact the lag time of the apps. They impact the cost of the apps. They impact so many things downstream.

LiveCodeBench, okay, they are brutal. Those red highlights, those are not my highlights. What they used to do, which they no longer do, which makes me so sad, is flag models that they suspected of cheating. And in the AI world we don't actually call it cheating, we call it model contamination, because we are princesses. But it is cheating. So what they decided is that if a model ran its evaluation shortly after the validation dataset was published and it performed unusually well, or its answers were too close to the answers that they published, they were like, I think they cheated. And you can see in the ranks, they don't even assign a rank. If they thought you cheated, you didn't get a rank. So it was awesome, but they stopped doing that.

Okay. So I built this app. It's totally free. I built three apps to help people understand how you pick an AI model, then one you'll see later for how you pick a machine learning model, and then the AI timeline, which, like I said, is completely free. The way it works is you choose the task or tasks that your AI app needs to perform, and then you select what benchmarks you want to compare. Now, the tasks are listed alphabetically, but the benchmarks are actually listed by frequency, because all of these leaderboards have quality benchmarks, but very few of them include context window. And when you do that, it will generate this network graph. I love network graphs for datasets where you have a variable number of data points. So you can see in the bottom right-hand corner the HELM Lite leaderboard, which has a lot of benchmarks around it. The gray nodes, and node is just geek speak for dot, the gray dots are the benchmarks, and the green nodes are the leaderboards. But if you look up at cost and speed, you don't have as many benchmarks around those. When you generate this, depending on how many nodes you have on the screen, they can get a little tangled up. But it's totally interactive, so you can just click and drag them until you detangle them. And because I am a hardcore perfectionist, I didn't like how the chart then kind of canted to the left or the right, so I added an orientation slider, which you can minimize afterwards. That's more important on mobile, so I optimized all of my apps for mobile and wanted to optimize the space. Now, if you select the green nodes, it will trigger this leaderboard card. And this is a pretty typical leaderboard card. You have a brief summary of what that leaderboard does, then you have links to the leaderboard as well as the leaderboard's methodology. Most times that's going to be an arXiv paper. Pretty dreadful reads, but, you know, they're there for the taking. Then the middle column, those are the tips that I have compiled from just going through the leaderboard, finding the gotchas, like, oh, wow, okay, I wasn't expecting this, I'll add that. Any tips, like, a lot of times there are filters that are hidden under a more options link, and I'm like, ah, these filters are amazing, these should be surfaced to the top. And then any practical applications. I'm working with these AI apps quite a bit, so I'll just share some of my experience, like, hey, you know what, this is really worth the money on this one, you may wanna really focus on this, or, if you filter this chart this way, this could help you identify hallucinations, things like that. And then in the right column is just an alphabetized list of the benchmarks specific to that particular task. So if you're looking at quality for agentic models, they're gonna be specific to agentic leaderboards and quality.
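
To picture the structure behind a graph like that, here is a minimal sketch that builds a leaderboard-to-benchmark network with the networkx library and colors leaderboard nodes green and benchmark nodes gray. The leaderboard and benchmark names are placeholders, not the app's actual data, and the real app is interactive rather than a static plot.

```python
# Minimal sketch of a leaderboard/benchmark network graph.
# The leaderboard and benchmark names below are placeholders, not the app's data.
import matplotlib.pyplot as plt
import networkx as nx

leaderboard_benchmarks = {
    "HELM Lite": ["MMLU", "GSM8K", "IFEval"],
    "Artificial Analysis": ["MMLU", "Price per 1M tokens", "Output speed"],
    "VBench": ["Human action", "Multiple objects"],
}

G = nx.Graph()
for leaderboard, benchmarks in leaderboard_benchmarks.items():
    G.add_node(leaderboard, kind="leaderboard")   # green nodes in the app
    for bench in benchmarks:
        G.add_node(bench, kind="benchmark")       # gray nodes in the app
        G.add_edge(leaderboard, bench)

colors = ["green" if G.nodes[n]["kind"] == "leaderboard" else "gray" for n in G]
nx.draw_networkx(G, pos=nx.spring_layout(G, seed=42), node_color=colors, font_size=8)
plt.axis("off")
plt.show()
```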

Okay. Next issue. Organizations spend way too much money on foundation models. This is a huge, huge issue. We'll look at a dramatic example of how easy that is to do in a bit, but this is still somewhat dramatic here. This screenshot is from one of my favorite leaderboards. It's called the Artificial Analysis leaderboard. And they actually have many different leaderboards. So you go to Artificial Analysis, and this specific chart came from the text-to-image leaderboard. So these are all text-to-image models. And one of my favorite charts in that leaderboard is this quality versus price chart. I go right for that chart, whatever I'm doing. I love it because it puts price along the x-axis and whatever the performance measurement is on the y-axis. I explain what all these performance measurements actually mean, how they measure them, stuff like that, in the app. But one thing I love about this is that it provides these visual cues of, like, a wall of fame and a wall of shame. You can't really see the gray, yeah, you can see it more there. So the green quadrant shows you, okay, these are the models that are the top performers, and the gray, not so much. One thing I noticed, and this has been consistent ever since I started building the app, which was in December, is that you will frequently see along this line at about 1,050 multiple models that are performing at a very similar quality level. I talk with my hands too much. But Ideogram, Midjourney, DALL·E, they're coming in at about $80. Now, this particular leaderboard, they measure it per 1,000 images generated. By sharp contrast, FLUX.1, which is an open model, is coming in just under $8. Same quality, 10% of the price.

Now, one thing you need to be aware of when you're dealing with these cost metrics: they do a little bit of mathematical witchcraft to pull this in, because you typically have input tokens, output tokens, and different pricing models for them. All of the leaderboards I've included in my app operate under the assumption that output tokens are three times the cost of input tokens. They're actually closer to four times and north of that, but it's still better than having to have two different charts where you have input costs and output costs versus performance. So I'll allow it. Also, keep in mind that model pricing can vary pretty wildly from one API provider to another. What I did in this chart is filter for the Llama 3.3 70-billion-parameter model, that's Meta's open model. Each one of those data points is a separate API provider. And you can see the price varies from twenty cents per million tokens to ninety-four cents per million tokens. But this is a good example to highlight, because in a case like this there may be other factors that make that ninety-four cents worth it. So in the same tool I looked at the output speed. And the output speed for Cerebras, which was the API provider that came in at ninety-four cents, was more than ten times faster than the next closest provider. So if speed is particularly important for your model, the juice might be worth the squeeze.
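
As an illustration of how separate input and output prices get collapsed into the single "blended" number these leaderboards chart, here is a minimal sketch of that arithmetic. The token ratio is left as a parameter, since different leaderboards (and the description above) state the 3:1 assumption slightly differently, and the example prices are illustrative rather than quotes from any provider.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              input_tokens: float = 3.0,
                              output_tokens: float = 1.0) -> float:
    """Blend separate input/output prices (USD per 1M tokens) into one number,
    assuming a fixed ratio of input to output tokens (3:1 by default).
    Check each leaderboard's methodology for the ratio it actually uses."""
    total = input_tokens + output_tokens
    return (input_price * input_tokens + output_price * output_tokens) / total

# Example: $75 in / $150 out per 1M tokens blends to $93.75 at a 3:1 ratio,
# which lines up with the "just under $94" figure mentioned in the talk.
print(blended_price_per_million(75.0, 150.0))      # 93.75
# A much cheaper model, e.g. roughly $0.075 in / $0.30 out, blends to ~$0.13.
print(blended_price_per_million(0.075, 0.30))      # ~0.13
```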

Okay. Now this is where the inmates take over the asylum. This is an agent leaderboard called Galileo. It's a really good leaderboard. They combine four different datasets, so they have quite a wide range of tasks. Agentic models are models that perform some kind of task. Maybe it schedules something on your calendar, books a flight, runs a web search, like that. And here I drew the line at 90. So basically, all of these models up north of 90, they're like the geeks in your school who ruined the curve for everyone else. And you can see a number of models line up along that line or higher. What is crazy is you might wanna notice that x-axis: that's a log scale, specifically a log ten scale, and you use that when there's a really big differential in cost. You don't want to be the reason some analyst needs to throw the x-axis on a log scale. The y-axis, that's a different game. But if you look up here, Gemini 2.0 Flash-Lite, that model is coming in at thirteen cents per million tokens. By sharp contrast, at an even lower performance level, GPT-4.5 Preview is coming in at just under $94 per million tokens. So, by way of summary, Gemini 2.0 Flash-Lite achieved higher performance at 0.1% of the cost of GPT-4.5. Now, if you want a model that really knocks it out of the park, like Claude 3.7 Sonnet, right now that is the model to beat. I think it's coming in at about a 97. But it's only $6 per million tokens, which is still just 6% of GPT-4.5. So if you look over on that right side, it's OpenAI. OpenAI is the reason that some analyst had to put this scale on a log scale. Otherwise, it just would have gone on forever. Okay. So if you wanna check out the Galileo dashboard, you can just fire up a filter by chain agents. That's the task. And you can see with this particular dashboard, there are a lot of tips. It's a thick and chewy cookie.
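
To see why a log10 x-axis is the only way to keep a chart like that readable, here is a minimal matplotlib sketch. The prices and scores are rough values taken from the talk and should be treated as illustrative, not as authoritative leaderboard data.

```python
import matplotlib.pyplot as plt

# Approximate price (USD per 1M tokens) and agent-benchmark scores mentioned in
# the talk; treat these as illustrative placeholders, not leaderboard exports.
models = {
    "Gemini 2.0 Flash-Lite": (0.13, 91),
    "Claude 3.7 Sonnet": (6.00, 97),
    "GPT-4.5 Preview": (94.00, 88),
}

fig, ax = plt.subplots()
for name, (price, score) in models.items():
    ax.scatter(price, score)
    ax.annotate(name, (price, score), textcoords="offset points", xytext=(5, 5), fontsize=8)

# 0.13 / 94 ≈ 0.0014, i.e. roughly 0.1% of the cost, a ~700x price spread.
ax.set_xscale("log")  # without a log10 axis the cheap models collapse into one blob
ax.set_xlabel("Price (USD per 1M tokens, log scale)")
ax.set_ylabel("Agent benchmark score")
plt.show()
```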

Okay. Next up, overuse of AI when alternative options are actually better. And by better, I mean cheaper, more reliable, less prone to hallucination, etc. So I'm gonna share a few examples from apps I've worked on. The first one comes from the education space. I've actually worked on four different projects in the education space, because right now the education space is absolutely desperate for AI models. We were working on this one app, and it was really important that the app maintained compliance with US standards. So we were in an on-site meeting with a client, and someone on the client side said, how are you going to ensure that the app maintains compliance? And the lead data scientist on the project said, we're gonna use GPT-4. And I was like, no. I mean, it hadn't been updated at that point in, I think, about nine months. It didn't even perform web searches. But even if it did, and I mean no disrespect to the data scientist, we're all crawling along this learning curve, in this particular case it was a pretty egregious misstep, because for something that is updating all the time, like US state standards, you need an API. And there were two different APIs we could choose from. So we had to course correct on that one. Another education app, this was a student-facing homework app, and history is so boring, you know, for students. So we were bandying about ideas for how to make history more interesting. Someone threw out the idea, what if we used AI to generate political cartoons? I just did a quick search while we were in the meeting, because I was looking to see, is there an API that will actually return authentic political cartoons, where the people don't have six fingers, you know. And the Library of Congress has a database of more than nine thousand political cartoons from the late seventeen hundreds to current, and it's free. So I was like, alternatively, how about we use this API and pull in actual political cartoons from that day? So if you have a student who's learning about the Bay of Pigs disaster, they could just, you know, whimsically say, okay, I wanna see a political cartoon from that time.
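
As a rough idea of what "pull from the API instead of generating images" could look like, here is a minimal sketch against the Library of Congress's public JSON API (loc.gov pages accept an fo=json parameter). The query string, the c results-per-page parameter, and the response fields used here are assumptions to verify against the LOC API documentation rather than a vetted integration.

```python
# Sketch: fetch real political cartoons from the Library of Congress JSON API
# instead of generating them. Endpoint shape and response fields are assumptions
# to verify against https://www.loc.gov/apis/ before relying on them.
import requests

def search_cartoons(query: str, limit: int = 5):
    resp = requests.get(
        "https://www.loc.gov/search/",
        params={"q": f"{query} political cartoon", "fo": "json", "c": limit},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    # Each result is assumed to carry a title, a date, and an item URL.
    return [(r.get("date"), r.get("title"), r.get("url")) for r in results]

for date, title, url in search_cartoons("Bay of Pigs"):
    print(date, title, url)
```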

Another example: I was part of an email chain where, when prospective clients would come through, we would just talk about different ideas for how to approach a project. And there was a client, they weren't looking for AI. They just said, we have a customer churn issue. And in this chat, especially upper management, they were like, what? We'll sell them on AI. And I'm like, well, unless they wanna talk about their churn issue, I don't think this is an AI problem. This is garden-variety logistic regression. You can ask ChatGPT what that is. But in this particular case, you could use a machine learning model. You don't need to use AI. First of all, AI would be outrageously expensive, you know, to throw all of this data at, but it's just unnecessary. Now, if you wanted to add a chat component to it, you could do that at a fraction of the cost. And in that case, in the app, you would choose chat and look at the benchmarks specific to chat. One really good one is a benchmark called IFEval, and it's offered by Hugging Face. It's all in the app. But it just measures how well a model follows a specific request, including the formatting of the response.
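
For anyone who wants to see what "garden-variety logistic regression" for churn looks like in practice, here is a minimal scikit-learn sketch. The file name and column names are made-up placeholders, and a real churn model would need proper feature engineering, class-imbalance handling, and validation.

```python
# Minimal churn-prediction sketch with plain logistic regression (no LLM required).
# File and column names are illustrative placeholders, not a real customer dataset.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")  # hypothetical export with a binary 'churned' column
X = df[["tenure_months", "monthly_spend", "support_tickets", "logins_last_30d"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)[:, 1]  # churn probability per customer
print("ROC AUC:", roc_auc_score(y_test, probs))
```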

Okay. Last example. We had a client come through. They wanted to build an AI app for dermatologists that would help them diagnose rarer skin conditions. And they had some labeled images, pretty nasty ones of different skin diseases. And again, no disrespect to my teammates, but they were like, that's all on AI. I was like, alternatively, we could use a convolutional neural network. Being on these projects, I knew how challenging it was to always be the contrarian, because I'm really excited about AI, but if you can do something better for cheaper, then you should do that. So I built the last of my trifecta of apps, also totally free. It's a machine learning model picker. The way it works, I'll just walk through an example of trying to figure out what machine learning model would be best for labeled images. You just pick your category, and when you choose a category, the subcategories update, specific to that category. If you don't know what something is, you can just hover over it and you get an explain-it-like-I'm-five tooltip. And then when you select the generate flowchart button, it opens up this flowchart that asks you a series of yes/no questions. Depending on what you answer, it ultimately leads you to a model. So in this particular case: are you working with structured data? Well, structured data is like CSVs and stuff. So, no, we're working with images. So we go down here. Are you analyzing text data? No. We're analyzing images. If something isn't clear, you just click on it and get a tooltip. And then ultimately, it's like, oh, okay, you want a CNN. Now if you select that node, you open up a model card that has the name, a description, then I also threw in example use cases, even alternative models you could use, as well as the difficulty level. For the difficulty level, I have a methodology page in the footer for all of my apps, so you can see how I determine difficulty level. And then a link to a Python and an R tutorial.
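
For context on what that CNN alternative might look like, here is a minimal Keras sketch of a small convolutional network for classifying labeled images. The directory layout, image size, and class count are placeholder assumptions, and in practice a project like the dermatology app would more likely fine-tune a pretrained backbone than train from scratch.

```python
# Minimal CNN sketch for classifying labeled images (e.g., skin conditions).
# Directory layout, image size, and class count are placeholder assumptions.
import tensorflow as tf
from tensorflow.keras import layers

IMG_SIZE = (128, 128)
NUM_CLASSES = 5  # hypothetical number of labeled conditions

train_ds = tf.keras.utils.image_dataset_from_directory(
    "skin_images/train", image_size=IMG_SIZE, batch_size=32
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "skin_images/val", image_size=IMG_SIZE, batch_size=32
)

model = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=IMG_SIZE + (3,)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```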

Okay. Last up: accepting model creators' performance claims as facts. Again, this is a really big issue. So February 2nd of this year, OpenAI announced its deep research agent. We've heard a little bit about deep research already. They claimed it achieved a score of 26.6 on the Humanity's Last Exam benchmark. It's the most difficult benchmark for models. It's supposed to measure how close we get to AGI, which is artificial general intelligence. Everyone's saying, oh, you know, we're so close to AGI. Anyone who has ever used AI to code knows we are nowhere near AGI. Okay? So I hate to rain on that parade, but we are nowhere near AGI. They also claimed a 72.57 on the GAIA benchmark. I didn't find it until February 6th, but I was like, we will see. So I looked up Humanity's Last Exam, their leaderboard, didn't see it. Then I looked up the GAIA leaderboard and I'm like, what the hell? Didn't see it. One caveat is that Humanity's Last Exam has already said they are not going to update their leaderboard regularly, but GAIA is updated very regularly. So the next day, I emailed a member of the GAIA team. And this person said that OpenAI never submitted its deep research model to be tested and went so far as to question the veracity of their claims. I extracted one sentence from that email: their results have little legitimacy as, one, they did not submit officially, so we have no guarantee that they used the correct scripts; two, the validation set is public, so can be cheated on, often accidentally. They're being very gracious, very generous there. Whereas the test set has never been revealed. And that is the de facto standard. What these benchmark providers do is publish their validation dataset. And then the model creators just, like, pinky promise, we definitely didn't train our model on your validation dataset.

Okay. So there's one leaderboard, the MMLU leaderboard, that actually discloses the data source, and it discloses whether the model self-reported or submitted its model for independent testing. And I just highlighted the ones that actually submitted their model for independent testing. You can see they are in the majority. Now imagine if we allowed students to just, like, self-report their SAT results to universities. They'd be like, oh, dope. Okay. Yeah. Come on in. I mean, it's so crazy, but people just don't question it. So this is live footage of me. Every time a model announces some grandiose achievement, I'm like, alright, fire up this leaderboard. Not there. Fire up this leaderboard. And I used to be able to look at that one, the LiveCodeBench leaderboard, to see if they've cheated in other respects and kinda use that as a test of how honest they are. Now I can't do that. But this is another example of typical shenanigans. xAI, this was also February, they announced that they tested Grok 3 on the 2025 American Invitational Mathematics Examination, abbreviated AIME, seven days after the questions and answers were published. So this is a really big test, and all of the math models hover over this test like hawks, and the day after the test, they publish the questions and the answers. Again, we are operating under the assumption that these models aren't then saying, okay, thank you, and, like hungry little hippos, training their models on the test because they have the answers. But this was seven days after they were published. We are supposed to believe that in those seven days Grok did not touch that test. And they claimed that Grok 3, their Think model, achieved an astounding 93.3% on what is arguably the most difficult math test out there. Maybe. Probably not.

Okay. So, does this remind you of anything? I mean, we've basically come full circle. I used to be in SEO in the early days, when people were putting white text on a white background to boost their rankings. That's where we are, but for AI. So be vigilant. Thank you.
