From Cached to Near Real-Time XML Syndication

By Sergey Shvets

Big Data, Easy Access

Everything we do is about big data. We gather, process and syndicate huge quantities of product data. We started our PIM (product information management) platform in 2001. Young and innocent back then, we didn’t really have a clue that our database would one day grow to a billion records, downloaded by tens of thousands of e-commerce customers around the globe! For our data customers it’s crucial that the product data-sheets (PDSs) are up to date and contain the latest content versions. However sound some of the foundational choices we made back in 2001 were, the requirements of 2020 are markedly different.

Today, our platform deals with rich product content, which by nature means lots and lots of data and advanced digital asset types. Right now, we are storing information for 20 million different products, of which 7.3 million are well described. Across 50 languages, this already leads to between 350 million and 1 billion localized PDSs, depending on quality level! Further, for the well-described products there are approximately 730 million attribute values. Typically, brand and Icecat editors add around 150,000 to 250,000 enriched and described PDSs every month via the Icecat Vendor Central or PIM. To make things “worse”, we expect our PIM database to expand at least 50-fold in the coming years.

To continue to provide accurate PDSs to any of our current and future customers, easily and just in time, we need to drastically upgrade our syndication infrastructure once more.

Cached Localized Syndication

During the evolution of our PIM and syndication platform, we already had to make some marked adaptations. In the early days, we migrated from access to a single full database dump to PDS access at product level. Each product entry is accessible through a unique URL, through which one can download the product data in XML, or through other methods. These URLs are distributed through product indexes or real-time product calls.

So, in the second phase, we had one URL per product, with all the product content in all languages in the respective file. In the third phase, we differentiated per country to deliver content transformed and translated for specific local contexts. During this phase, country codes turned into locales (i.e., language and country combinations), as multiple languages might need to be supported within one country, and brands often have different messaging per locale.

Internally, we store master data in a SQL database. Product content has to be quickly accessible to our customers, which doesn’t leave us a lot of time to navigate through all of the data relations and collect every piece of data from every table. So the solution at the time was rather obvious: we decided to build a pre-cache of all product data in a NoSQL database, one record per product per locale, ready to be distributed to customers.
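
To make the "one record per product per locale" idea concrete, here is a minimal sketch. It is an illustration only: the in-memory dictionaries stand in for the SQL master data, the translations and the NoSQL pre-cache, and every name in it (MASTER_DATA, FEATURE_NAMES, build_precache) is hypothetical rather than part of the actual platform.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for the SQL master data and the translated labels.
MASTER_DATA: Dict[int, Dict[str, str]] = {
    1001: {"color": "black"},
    1002: {"color": "white"},
}
FEATURE_NAMES: Dict[str, Dict[str, str]] = {
    "color": {"en_US": "Color", "nl_NL": "Kleur", "de_DE": "Farbe"},
}
LOCALES: List[str] = ["en_US", "nl_NL", "de_DE"]


def build_precache(
    master: Dict[int, Dict[str, str]],
    names: Dict[str, Dict[str, str]],
    locales: List[str],
) -> Dict[Tuple[int, str], Dict[str, str]]:
    """Build one ready-to-serve record per (product, locale) pair."""
    precache: Dict[Tuple[int, str], Dict[str, str]] = {}
    for product_id, features in master.items():
        for locale in locales:
            # Resolve every feature label into the target locale up front.
            precache[(product_id, locale)] = {
                names[feature_id][locale]: value
                for feature_id, value in features.items()
            }
    return precache


precache = build_precache(MASTER_DATA, FEATURE_NAMES, LOCALES)
print(len(precache))  # 2 products x 3 locales = 6 records
```

Reads are cheap because each record is already localized, but the number of records grows with products times locales.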

As our product data footprint kept growing, we hit the issue of big data propagation: the growing number of product records is multiplied by the growing number of locales in our pre-cache.

Diagram of the update process as it is now

In the diagram above, you can see that the Publisher fetches locale-specific data from the Dictionary, so it has to process every locale separately and store the data in the pre-cache in a locale-specific state. That is exactly the root of the issue: at the current update rate we have to publish approximately two million products in 50+ languages per day, which makes 100 million documents to be republished daily. A 50-fold increase would blow up the process entirely.
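
The arithmetic behind these numbers is simple enough to check in a few lines. The sketch below uses exactly 50 locales (the text says "50+") and assumes, purely for illustration, that the daily update volume grows in proportion to the database size.

```python
PRODUCTS_UPDATED_PER_DAY = 2_000_000  # approximate figure from the text
LOCALES = 50                          # the text says "50+"; 50 is used here
GROWTH_FACTOR = 50                    # expected database growth


def daily_documents(products: int, locales: int) -> int:
    """Documents republished per day when every locale is stored separately."""
    return products * locales


current_load = daily_documents(PRODUCTS_UPDATED_PER_DAY, LOCALES)

# Assumption: updates scale linearly with the size of the database.
future_load = current_load * GROWTH_FACTOR

print(f"{current_load:,}")  # 100,000,000
print(f"{future_load:,}")   # 5,000,000,000
```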

Towards Real-time Syndication

In designing the solution, we had to keep in mind that we already have 80,000 data users on board, and a lot of them have an existing implementation of our XML or other export formats. So, we have to avoid changes to our pull APIs, or at least minimize them and keep them backward compatible.

Diagram of the update process as it will be

We decided to make the Publisher produce locale-independent content in a pre-cached JSON format. This reduces the document update rate by a factor of 50, while avoiding flooding the SQL database with data requests for content propagation. The result is a near real-time solution, whose freshness depends on the update frequency and smartness of the JSON pre-cache.
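
One possible reading of "locale-independent content" is sketched below: the pre-cached JSON document holds Dictionary keys instead of translated labels, and the locale is resolved only when a request comes in. This is an assumption about the design, not a description of the actual implementation, and the names DICTIONARY, DOCUMENT and localize are hypothetical.

```python
from typing import Any, Dict

# Hypothetical translation Dictionary, keyed by feature id and locale.
DICTIONARY: Dict[str, Dict[str, str]] = {
    "color": {"en_US": "Color", "nl_NL": "Kleur"},
}

# One locale-independent document per product: it stores Dictionary keys,
# so adding a locale does not add documents to the pre-cache.
DOCUMENT: Dict[str, Any] = {
    "product_id": 1001,
    "brand": "ExampleBrand",
    "features": {"color": "black"},
}


def localize(
    document: Dict[str, Any],
    dictionary: Dict[str, Dict[str, str]],
    locale: str,
) -> Dict[str, Any]:
    """Resolve feature labels for one locale at request time."""
    return {
        "product_id": document["product_id"],
        "brand": document["brand"],
        "locale": locale,
        "features": {
            dictionary[feature_id][locale]: value
            for feature_id, value in document["features"].items()
        },
    }


dutch = localize(DOCUMENT, DICTIONARY, "nl_NL")
print(dutch["features"])  # {'Kleur': 'black'}
```

With this shape, a product update rewrites one document instead of one document per locale, which is where the factor of 50 comes from.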

Storing the content in JSON format would give us the flexibility to perform customer-specific adaptations on the fly, including applying personalized content access permission rules. The latter will help to offload the creation of customer-specific XMLs by the Composer. By creating user-specific XMLs dynamically, we can gain another 10% in overall system efficiency. It also improves download efficiency for users that are present in multiple markets, as they get one single, tailored product XML including the locales they signed up for, forgoing the need to download a product XML for each locale. This reduces the download “stress” by around 20% overall.
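
As a rough sketch of composing a customer-specific XML on the fly, the example below builds one XML per product containing only the locales a customer signed up for, after a simple permission check. The JSON layout, the customer profile and the element names (Product, Locale, Title) are hypothetical and do not reflect the actual Icecat XML schema or the Composer's real logic.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Hypothetical pre-cached JSON document with per-locale titles.
DOCUMENT: Dict[str, Any] = {
    "product_id": 1001,
    "brand": "ExampleBrand",
    "titles": {
        "en_US": "Example laptop",
        "nl_NL": "Voorbeeldlaptop",
        "de_DE": "Beispiel-Laptop",
    },
}

# Hypothetical customer profile: permitted brands and subscribed locales.
CUSTOMER: Dict[str, Any] = {
    "allowed_brands": {"ExampleBrand"},
    "locales": ["en_US", "nl_NL"],
}


def compose_xml(document: Dict[str, Any], customer: Dict[str, Any]) -> str:
    """Build one tailored XML holding only the customer's locales."""
    if document["brand"] not in customer["allowed_brands"]:
        raise PermissionError("customer has no access to this brand")
    root = ET.Element(
        "Product",
        {"ID": str(document["product_id"]), "Brand": document["brand"]},
    )
    for locale in customer["locales"]:
        title = document["titles"].get(locale)
        if title is None:
            continue  # no content for this locale yet
        node = ET.SubElement(root, "Locale", {"code": locale})
        ET.SubElement(node, "Title").text = title
    return ET.tostring(root, encoding="unicode")


xml_text = compose_xml(DOCUMENT, CUSTOMER)
print(xml_text)
```

The customer in this sketch receives a single file with two locales instead of two separate downloads, and the German content is never sent.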

As a bonus, we will get the possibility to update very specific parts of a PDS instead of re-building a complete PDS, as we are doing right now. This smart updating is expected to lead to a further 60% efficiency gain and shall be the foundation for the PDS Update Journal.
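
A partial update of this kind could look roughly like the sketch below, which changes a single field of a JSON document and leaves the rest untouched. The function name apply_partial_update, the path notation and the version counter are assumptions made for illustration, not the platform's actual mechanism.

```python
import copy
from typing import Any, Dict, Sequence


def apply_partial_update(
    document: Dict[str, Any], path: Sequence[str], value: Any
) -> Dict[str, Any]:
    """Return a copy of the document with only the field at `path` changed."""
    updated = copy.deepcopy(document)
    target = updated
    for key in path[:-1]:
        # Walk down to the parent of the field, creating levels if needed.
        target = target.setdefault(key, {})
    target[path[-1]] = value
    # Hypothetical version counter, bumped on every change.
    updated["version"] = document.get("version", 0) + 1
    return updated


pds = {
    "product_id": 1001,
    "version": 3,
    "features": {"color": "black", "weight_g": 1200},
}
patched = apply_partial_update(pds, ["features", "color"], "silver")
print(patched["features"])  # {'color': 'silver', 'weight_g': 1200}
```

Each such change is a small, self-describing record (path, value, version), which is the kind of entry an update journal could store.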

Roll-out February 2020

We plan to roll out this change to near real-time syndication in February 2020. It should have little to no impact on the existing integrations of our customers. Instead, it will improve the time-to-market for PDS updates, make download connections far more efficient, and significantly simplify the delivery of PDS data.

Update Journal

The stuff above is pretty cool given the gains. But what about further simplifying data access for our customers? Our data footprint is only growing: right now, a full database sync requires our customers to process a data index of a few gigabytes of XML, and daily syncs involve indexes of hundreds of MB. That will only get bigger. So what can we do about that? We intend to make the PDS Update Journal available as an export, so that Icecat users can replay all updates made to their PDS database for a given time frame. Further, we see possibilities to improve our existing personalized indexes by adding Update Journal data to them in daily, weekly or monthly delta files, and by adding XML versions to our standard XML interfaces.
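
On the customer side, replaying a journal for a given time frame might look like the sketch below. The journal entry layout (timestamp, product id, field, value) is entirely hypothetical, since the export format of the PDS Update Journal is not specified here.

```python
from datetime import datetime
from typing import Any, Dict, List

# Hypothetical journal entries; the real export format may differ.
JOURNAL: List[Dict[str, Any]] = [
    {"at": datetime(2020, 2, 1, 9, 0), "product_id": 1001,
     "field": "color", "value": "silver"},
    {"at": datetime(2020, 2, 2, 9, 0), "product_id": 1002,
     "field": "title", "value": "New title"},
    {"at": datetime(2020, 2, 3, 9, 0), "product_id": 1001,
     "field": "color", "value": "gold"},
]


def replay(
    local_db: Dict[int, Dict[str, Any]],
    journal: List[Dict[str, Any]],
    since: datetime,
    until: datetime,
) -> int:
    """Apply entries with since <= timestamp < until, oldest first."""
    applied = 0
    for entry in sorted(journal, key=lambda e: e["at"]):
        if since <= entry["at"] < until:
            record = local_db.setdefault(entry["product_id"], {})
            record[entry["field"]] = entry["value"]
            applied += 1
    return applied


local_db: Dict[int, Dict[str, Any]] = {1001: {"color": "black"}}
applied = replay(local_db, JOURNAL, datetime(2020, 2, 1), datetime(2020, 2, 3))
print(applied)   # 2
print(local_db)  # {1001: {'color': 'silver'}, 1002: {'title': 'New title'}}
```

Only the changes inside the window are processed, so a daily sync touches a handful of fields instead of a full index.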

Feedback

I’d love to hear your feedback on our platform and the improvements you would like us to make. Feel free to respond to this post or reach out to me directly.

Sergey Shvets

Director of Technology at Icecat, the company behind the open product catalog with more than 5,832,475 product data-sheets of 24,533 brands. I am an IT entrepreneur with a deep technical background. Founder & CEO at Bintime, an outsourcing company, and Gepard, a product content syndication platform, empowering manufacturers and retailers to deliver the product information consumers demand. https://www.linkedin.com/in/sergeyshvets/
