When AI Makes Up Links: Measuring URL Hallucinations in ChatGPT and Gemini

Written by Ahmad Alobaid · Updated 17 March 2026 · 7-minute read

AI assistants like ChatGPT and Gemini generate a surprising number of fake links. We studied more than 100,000 generated URLs and analyzed them across different models, languages, sectors, intents, and prompting techniques. The results reveal unexpected patterns: English prompts hallucinate more than Arabic ones, smaller models can outperform larger ones, and advanced prompting techniques like chain-of-thought have little impact on hallucination.

Content

  1. Introduction
    1. Key Terms
    2. Key Questions
  2. Experiment
    1. Data Collection
    2. URL Hallucination
  3. Results
    1. Breakdown by status code
    2. Breakdown by model
    3. Breakdown by sector
    4. Breakdown by language
    5. Breakdown by intent
    6. Breakdown by prompting technique
  4. Limitations
  5. Findings Summary

Introduction

In the early days of BuzzSense, we were experimenting with many ways of getting URLs from AI assistants and observed a large number of hallucinated URLs. Hallucinated URLs are fake links generated by AI assistants. In this experiment, we used several models from OpenAI and Google. We collected more than 100,000 URLs and tested whether they were valid or invalid (hallucinated).

Key Terms

  • URL: Short for Uniform Resource Locator, the address of a resource on the web (most often a web page). The terms link and URL are often used interchangeably.
  • Hallucination: An invalid piece of information generated by a language model.
  • Hallucination Score: The percentage of URLs that resulted in either a network error or an HTTP error status code (e.g., 404 when the resource/page is not found).

Key Questions

In this blog post, we report our findings and answer the following questions:

  1. Who hallucinates more, Gemini or ChatGPT?
  2. Do hallucination rates vary across different sectors?
  3. Are hallucinated URLs more common in Arabic prompts or English prompts?
  4. Do certain prompt intents lead to more hallucinated URLs?

Experiment

We wrote many prompts covering different purposes and intents (see below). Experiments were conducted in two languages, spanning three different sectors, and tested on several models from OpenAI and Google.

Data Collection

The data we collected was from buzzsense.ai, which tracks brand visibility in AI assistants like ChatGPT and Gemini. Prompts were written for the following sectors: Technology, Plastic Surgery, and Beauty Salons. These areas were chosen due to business use cases related to BuzzSense. In total, we collected more than 100K URLs from different models.

URL Hallucination

URL reachability was automatically inspected, and different network errors were recorded (e.g., DNS, SSL, …). The status code of each HTTP response was also recorded as a sign of validity: status codes between 200 and 399 are considered valid, and anything else is considered invalid (a hallucination). The hallucination score is then the percentage of hallucinated URLs.
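The check described above can be sketched in Python. This is a minimal illustration using the requests library, not BuzzSense's actual pipeline; the function names are our own:

```python
import requests

VALID_STATUS = range(200, 400)  # 2xx success and 3xx redirects count as valid

def classify_status(status_code):
    """A URL is valid if its status code is in 200-399, otherwise hallucinated."""
    return "valid" if status_code in VALID_STATUS else "hallucinated"

def check_url(url, timeout=10):
    """Fetch a URL; network failures (DNS, SSL, timeouts, ...) count as hallucinations.
    Returns a (label, detail) pair, where detail is the status code or the error name."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return classify_status(resp.status_code), resp.status_code
    except requests.exceptions.RequestException as exc:
        return "hallucinated", type(exc).__name__

def hallucination_score(labels):
    """Percentage of URLs labeled hallucinated."""
    return 100.0 * sum(label == "hallucinated" for label in labels) / len(labels)
```

Following redirects (`allow_redirects=True`) matters here: a 301 that ultimately resolves should not be penalized.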

Results

We found that most of the suggested URLs were invalid. We consider any unreachable URL or one with an invalid status code (>= 400) as invalid or a hallucination. We break down the collected URLs and report the results for different dimensions.
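Each per-dimension breakdown below can be computed with a simple grouping. A sketch, where the record fields are illustrative rather than the study's actual schema:

```python
from collections import defaultdict

def score_by(records, dimension):
    """Hallucination score per value of one dimension (e.g., 'model', 'language').
    Each record is a dict such as {'model': ..., 'hallucinated': bool};
    the field names here are illustrative."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[dimension]].append(rec["hallucinated"])
    # True counts as 1, so sum(flags) is the number of hallucinated URLs
    return {key: 100.0 * sum(flags) / len(flags) for key, flags in groups.items()}
```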

Breakdown by status code

The status code is the numeric code of the HTTP response: codes between 200 and 299 indicate a successful request, codes between 300 and 399 are redirects, and the rest are generally errors. The meaning of each code is defined in the HTTP specification.

We also recorded the type of error whenever one was encountered. Most network errors were DNS errors. For more details on the meaning of these errors, see the requests library documentation.
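When a request fails, the exception class indicates the kind of network error. A sketch of how error types might be bucketed using the requests exception hierarchy (the category names are our own, not the exact labels used in the study):

```python
import requests

# Ordered coarse categories; order matters because SSLError
# subclasses ConnectionError in requests.
ERROR_CATEGORIES = [
    (requests.exceptions.SSLError, "SSL"),
    (requests.exceptions.ConnectionError, "DNS/connection"),  # DNS failures surface here
    (requests.exceptions.Timeout, "timeout"),
    (requests.exceptions.TooManyRedirects, "redirect loop"),
]

def categorize_error(exc):
    """Return a coarse category for a network error, or 'other' if unrecognized."""
    for cls, name in ERROR_CATEGORIES:
        if isinstance(exc, cls):
            return name
    return "other"
```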

In this study, we label URLs that resulted in successful status codes and redirects as valid URLs.

The majority of valid URLs returned a 200 status code, around 35K URLs in total. Hallucinated URLs fall into two groups: network errors and error status codes (client or server errors). For the latter, the most common status code was 404, indicating that the requested resource/web page was not found. Among network errors, DNS errors were by far the most common.

Breakdown by model

We tested six different models: two Gemini, two GPT-4, and two GPT-5. The GPT-4 models had the highest hallucination scores, and the score of gpt-5-nano fell between those of the two Gemini models.

Surprisingly, the lite version of Gemini (gemini-2.5-flash-lite) outperformed gemini-2.5-flash by 7%. The model with the lowest hallucination score was gpt-5-mini, at 51.7%.

Breakdown by sector

The experiment was performed on many target brands, all within the following sectors: Technology, Beauty, and Plastic Surgery. The Technology prompts focused mainly on software development and SaaS (Software as a Service). As most brands in the Beauty sector were beauty salons, those prompts focused mainly on pedicure and manicure services. Plastic Surgery prompts covered a wide range of topics, such as breast augmentation, facelift, and liposuction, among others.

The hallucination scores for the Plastic Surgery and Technology sectors were in the high 70s, while the Beauty sector had the lowest hallucination score, at 67%.


Breakdown by language

ICANN has long supported non-Latin domain names, called Internationalized Domain Names (IDNs). Nonetheless, the majority of domains in Arabic-speaking countries still rely almost exclusively on Latin domain names, and the URLs generated by AI assistants were consistent with this pattern. This means hallucinations in Arabic prompts are not due to IDN encoding errors. Surprisingly, hallucinated URLs were more common in English prompts, around 7% more than in Arabic ones.
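One way to verify that generated URLs use Latin-only domains is to check each hostname for non-ASCII characters or punycode (xn--) labels. A minimal sketch, with a function name of our own:

```python
def is_idn(hostname):
    """True if the hostname is an Internationalized Domain Name, i.e. it
    contains non-ASCII characters or punycode-encoded (xn--) labels."""
    return (not hostname.isascii()) or any(
        label.startswith("xn--") for label in hostname.lower().split(".")
    )
```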

Breakdown by intent

Intents are the reasons behind a query, or the “need behind [a prompt or search] query”. There are many ways of categorizing intents; we used the following taxonomy:

  • Brand: Focusing on the brand and the products/services it offers.
  • Category: Asking about a specific product or service category (e.g., best sunblock).
  • Problem: Seeking a solution to a problem that involves a tool, product, or service (e.g., I have flat feet, what running shoes should I buy?).
  • Alternative: The classic alternative query (e.g., what are the best alternatives to Product X or Service Y?).

The Alternative intent had the lowest hallucination score, at 68%, while all the others were in the 70% range. We cannot confirm it, but mentioning real brands may nudge the results toward more valid URLs by providing an anchor point.

Breakdown by prompting technique

The chain-of-thought (CoT) prompting technique was also tested. CoT breaks the complexity of a task down and provides the AI assistant with a step-by-step guide for performing it, in contrast to the direct approach, in which the final answer is requested outright (without breaking the task into smaller steps). The hallucination scores were almost identical for both.
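For illustration, the two styles differ only in the prompt text. The wording below is a made-up example in the spirit of the study's sectors, not the actual prompts used:

```python
# Direct prompting: ask for the final answer outright.
DIRECT_PROMPT = (
    "List the top 5 beauty salons in Riyadh, with a link to each salon's website."
)

# Chain-of-thought prompting: the same request, broken into explicit steps.
COT_PROMPT = (
    "Let's work step by step.\n"
    "1. List well-known beauty salons in Riyadh.\n"
    "2. For each salon, recall its official website if you know it.\n"
    "3. Keep only the salons whose website you are confident about.\n"
    "4. Output the top 5 salons with a link to each website."
)
```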


Limitations

  1. Prompt classification is subjective, and intents are not mutually exclusive: a prompt can conceptually belong to more than one intent. In such cases, we assigned the intent we judged most fitting.
  2. URL validity does not mean correctness. In other words, the test does not verify that the information cited actually comes from the provided/claimed URLs.
  3. Status codes change over time. Some checked URLs might return different status codes later, as websites commonly change their URLs due to redesigns, internationalization, revamps, or acquisitions, among other reasons.
  4. Limited model coverage. We did not run tests across all the available models from OpenAI and Google, which might yield different results.
  5. Imbalance in URL distribution. The number of URLs collected varied across dimensions (e.g., language, intent, sector, ...). This can be due to the varying number of URLs per prompt and business-related factors (when certain brands were added, the number of brands per sector, etc.).
  6. Limited reproducibility. Language models are non-deterministic, and repeating the experiment might not yield the same results.
  7. Crawler and access restrictions. Some websites, for security or other reasons, block crawlers. In such cases, a valid webpage might be labeled as a hallucinated URL.

Findings Summary

A large number of the URLs suggested by AI assistants were hallucinations rather than valid URLs. Among the tested models, gpt-5-mini had the lowest hallucination rate (51.7%), while the other models exhibited higher rates. The Beauty sector had the lowest hallucination score (roughly 10 percentage points lower than the Technology and Plastic Surgery sectors). Similarly, Arabic prompts resulted in fewer hallucinations (7% fewer) than English ones. Differences across intents and prompting techniques were minimal. After verifying more than 100,000 links across different dimensions, even in the best-performing case roughly half of the URLs suggested by AI assistants were hallucinations, which highlights the need for link verification.
