At Icecat, managing, enriching, and syndicating millions of product data-sheets across thousands of global brands is an intricate operation. Central to this mission is protecting the integrity of our repository, our database and copyrights, the copyrights of our sponsoring brands, and the immense bandwidth required to serve our channel partners. A critical tool in this effort is our robots.txt file, especially in the age of AI. Today, we share insight into why we disallow certain bots, why this doesn’t hinder AI search discovery, and why we require you, as an Icecat partner, to do the same.
If you inspect the robots.txt configuration on Icecat.biz, you will notice strict Disallow directives targeted at specific web crawlers. Beyond restricting standard search crawlers from hammering heavy internal backend systems and stopping bots from scanning our Search API, we aggressively block a specific breed of user-agents. These are the AI training bots, scrapers, and unvetted content aggregators. This includes user-agents like GPTBot (OpenAI), CCBot (Common Crawl), Google-Extended, Anthropic-ai, and various agents of Meta.
The choice to block these agents comes down to two major reasons. First, we must prioritize copyright and value protection. Our Open Icecat and Full Icecat repositories contain proprietary, high-quality rich media, specifications, and marketing assets, created by us or by brands. Allowing general AI models to freely scrape and ingest this data to train foundational models separates the data from its commercial context, which directly violates the intended use of the open catalog ecosystem. Second, mass data extraction by LLM scrapers places an immense, non-productive load on our server infrastructure, degrading performance. By blocking them, we ensure our server response times remain optimized for legitimate users who pull content dynamically to run their businesses.
A common concern among digital marketers is whether blocking AI user-agents will render product data invisible to modern, AI-powered search engines like Microsoft Copilot, Google Gemini, or Perplexity. The short answer is that it does not hurt your visibility at all. There is a clear technical distinction between an AI training bot and an AI search or indexing bot.
Training bots like GPTBot or Google-Extended collect vast amounts of data strictly to train future versions of LLMs. This means that they do not feed real-time search engine result pages. On the flip side, standard web-crawling bots like Googlebot and Bingbot are fully permitted to index Icecat product pages. Modern AI search engines generate their real-time summaries and citations by using the live index from traditional search crawlers. Because Googlebot and Bingbot can read the pages, your products remain fully visible to AI search assistants that browse the web in real time to answer buyer queries. We are simply stopping tech companies from using our data as free training material.
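The distinction can be illustrated with a minimal Python sketch that uses the standard library's urllib.robotparser. The rules, the helper name allowed_agents, and the example.com URL below are assumptions made for illustration; they are not a copy of the live Icecat.biz robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only: AI training agents are
# disallowed, and no rule targets regular search crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each user-agent name to True if the rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "Google-Extended", "CCBot", "Googlebot", "Bingbot"]
    # The URL is a placeholder, not a real product page.
    result = allowed_agents(ROBOTS_TXT, agents, "https://example.com/product/123")
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules, the training agents are reported as blocked, while Googlebot and Bingbot stay allowed because no rule addresses them.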
This protection cannot happen on Icecat.biz alone. Once our product content is syndicated live on your e-commerce platforms, your website becomes a new target for aggressive AI scrapers and third-party aggregators, if it is not one already. According to the Open Icecat Fair Use Policy, all users – whether utilizing a free Open Icecat subscription or integrating Full Icecat data – must actively help protect this ecosystem. The policy states that you must update your organization’s robots.txt to explicitly exclude the systematic download by crawlers that are potentially violating the copyrights of materials on your website for purposes other than promoting your organization’s business, including but not limited to the training of AIs, analytics, and content aggregation.
To remain compliant with Icecat’s Fair Use terms, you must ensure that your e-commerce store’s robots.txt file explicitly bars AI scrapers from harvesting your product pages. The following is an example of the entries you should include alongside your standard configuration:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
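One way to confirm that a store's robots.txt covers these agents is a small check like the Python sketch below. It is a sketch under stated assumptions, not an official Icecat tool: the function name unblocked_agents and the shop.example probe URL are hypothetical, and the agent list simply mirrors the example entries above. It probes a single URL, so it will not catch rules that block only part of a site.

```python
from urllib.robotparser import RobotFileParser

# Agent names taken from the example entries above.
REQUIRED_BLOCKED = [
    "GPTBot",
    "ChatGPT-User",
    "Google-Extended",
    "CCBot",
    "Anthropic-ai",
    "Claude-Web",
]


def unblocked_agents(robots_txt, agents=REQUIRED_BLOCKED,
                     probe_url="https://shop.example/products/sample"):
    """Return the agents that may still fetch probe_url under robots_txt.

    probe_url is a hypothetical product page used only as a test path.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, probe_url)]


if __name__ == "__main__":
    # A deliberately incomplete file: only two of the agents are blocked.
    partial = "User-agent: GPTBot\nDisallow: /\nUser-agent: CCBot\nDisallow: /\n"
    print("Still allowed:", unblocked_agents(partial))
```

An empty result means every listed agent is disallowed for the probed path; any names returned indicate entries that are still missing.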
By keeping your robots.txt up to date, you protect your own server bandwidth, defend the intellectual property of Icecat and the brands you sell, and meet your obligations under the Icecat license agreements.