Everything we do is about big data. We gather, process and syndicate huge quantities of product data. We started our PIM platform in 2001. Young and innocent back then, we didn’t really have a clue that our database would one day grow to a billion records, downloaded by tens of thousands of e-commerce customers around the globe! For our data customers it’s crucial that the product data-sheets (PDSs) are up to date and contain the latest content versions. However sound some of the foundational choices we made back in 2001 were, the requirements of 2020 are markedly different.
Today, our platform is dealing with rich product content, which by nature means lots and lots of data and advanced digital asset types. Right now, we are storing information for 20 million different products, of which 7.3 million are well-described. For 50 languages this already leads to 350 million to 1 billion localized PDSs, depending on quality level! Further, for the well-described products there are approximately 730 million attribute values. Typically, brand and Icecat editors add around 150,000 to 250,000 enriched and described PDSs every month via the Icecat Vendor Central or PIM. To make things “worse”, we expect our PIM database to expand at least 50-fold in the coming years.
To continue to provide accurate PDSs to any of our current and future customers, easily and just in time, we need to drastically upgrade our syndication infrastructure once more.
During the evolution of our PIM and syndication platform, we already had to make some marked adaptations. In the early days, we migrated from access through a single full database dump to PDS access at product level. Each product entry is accessible through a unique URL, through which one can download the product data in XML, or through other methods. These URLs are distributed through product indexes or real-time product calls.
So, in the second phase, we had one URL per product with all the product content in all languages in the respective file. In the third phase, we differentiated per country to deliver content transformed and translated for specific local contexts. During this phase, country codes evolved into locales (i.e., language and country combinations), as multiple languages might need to be supported within one country, and brands often have different messaging per locale.
Internally, we are storing master data in a SQL database. Product content has to be accessible quickly to our customers, which doesn’t leave us a lot of time to navigate through all of the data relations and collect every piece of data from every table. So the solution at the time was rather obvious. We decided to make a pre-cache of all product data in a NoSQL database, one record per product per locale, ready to be distributed to customers.
As our product data footprint was only increasing, we hit the issue of big data propagation, as the growing number of product records is multiplied by the growing number of locales in our pre-cache.
In the diagram above, you can see that locale-specific data are fetched by the Publisher from the Dictionary, so it has to process every locale separately and store the data in the pre-cache in a locale-specific state. And that is exactly the root of the issue: at the current update rate we have to publish approximately two million products in 50+ languages per day, which comes to 100 million documents to be republished daily. A 50-fold increase would just blow up the process entirely.
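The arithmetic behind those numbers can be sketched in a few lines. This is only a back-of-the-envelope illustration: it assumes the number of locales stays at 50 and that the daily update volume grows linearly with the database, neither of which is stated above.

```python
# Figures quoted in the text; the linear-growth assumption is ours.
PRODUCTS_REPUBLISHED_PER_DAY = 2_000_000
LOCALES = 50            # "50+ languages", rounded down
EXPECTED_GROWTH = 50    # expected expansion of the PIM database


def per_locale_documents(products: int, locales: int) -> int:
    """Documents written per day when each product is stored once per locale."""
    return products * locales


today = per_locale_documents(PRODUCTS_REPUBLISHED_PER_DAY, LOCALES)
after_growth = per_locale_documents(
    PRODUCTS_REPUBLISHED_PER_DAY * EXPECTED_GROWTH, LOCALES
)

print(f"today: {today:,} documents per day")
print(f"after 50-fold growth: {after_growth:,} documents per day")
```

Under these assumptions the per-locale pre-cache would go from 100 million to 5 billion document writes per day.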
While designing the solution, we had to keep in mind that we already have 80,000 existing data users on board, and a lot of them have an existing implementation of our XML or other export formats. So, we have to avoid changes to our pull-APIs, or at least minimize them and keep them backward compatible.
We decided to make the Publisher produce locale-independent content in a pre-cached JSON format. This reduces the document update rate by a factor of 50, while avoiding flooding the SQL database with data requests for content propagation. The result is a near real-time solution, depending on the update frequency and smartness of the JSON pre-cache.
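To make the idea of a locale-independent record concrete, here is a minimal sketch. The record layout, the `DICTIONARY` structure, the identifiers and the `localize` helper are all hypothetical and chosen for illustration; they are not the actual Icecat format. The point is only that the pre-cache stores one record per product with translation ids, and labels are resolved per locale when the record is read.

```python
import json

# Hypothetical Dictionary: translation id -> locale -> label.
DICTIONARY = {
    "attr.weight": {"en_US": "Weight", "nl_NL": "Gewicht"},
    "unit.kg": {"en_US": "kg", "nl_NL": "kg"},
}

# Hypothetical locale-independent pre-cache record: ids instead of labels.
PRECACHED = json.dumps({
    "product_id": 12345,
    "attributes": [
        {"name_id": "attr.weight", "value": "1.2", "unit_id": "unit.kg"},
    ],
})


def localize(record_json: str, locale: str, fallback: str = "en_US") -> dict:
    """Resolve translation ids for one locale at read time."""
    record = json.loads(record_json)

    def label(key: str) -> str:
        translations = DICTIONARY[key]
        # Fall back to a default locale when no translation exists.
        return translations.get(locale, translations[fallback])

    return {
        "product_id": record["product_id"],
        "locale": locale,
        "attributes": [
            {
                "name": label(a["name_id"]),
                "value": a["value"],
                "unit": label(a["unit_id"]),
            }
            for a in record["attributes"]
        ],
    }


print(localize(PRECACHED, "nl_NL"))
```

With this shape, a product update touches one stored record instead of one record per locale.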
Storing the content in JSON format would provide us with the flexibility to perform customer-specific adaptations on the fly, including applying personalized content access permission rules. The latter will help to offload the creation of customer-specific XMLs by the Composer. By creating user-specific XMLs dynamically, we can win another 10% in overall system efficiency. It also improves download efficiency for users that are present in multiple markets, as they get one single, tailored product XML including the locales they signed up for, forgoing the need to download a product XML for each locale. This reduces the download “stress” by around 20% overall.
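A customer-specific XML built on the fly could look roughly like the sketch below. The record fields, the customer profile, the permission rule (a set of allowed brands) and the XML element names are assumptions made for this example, not the real Icecat schema or access model.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical locale-independent record with texts keyed by locale.
RECORD = {
    "product_id": "12345",
    "brand": "ExampleBrand",
    "titles": {
        "en_US": "Laptop 15",
        "nl_NL": "Laptop 15 inch",
        "de_DE": "Laptop 15 Zoll",
    },
}

# Hypothetical customer profile: subscribed locales plus a simple permission rule.
CUSTOMER = {"locales": ["en_US", "nl_NL"], "allowed_brands": {"ExampleBrand"}}


def compose_xml(record: dict, customer: dict) -> Optional[str]:
    """Build one XML document holding only the locales the customer signed up for."""
    if record["brand"] not in customer["allowed_brands"]:
        return None  # the customer has no access to this product
    root = ET.Element("Product", id=record["product_id"], brand=record["brand"])
    for locale in customer["locales"]:
        title = record["titles"].get(locale)
        if title is not None:
            ET.SubElement(root, "Title", locale=locale).text = title
    return ET.tostring(root, encoding="unicode")


print(compose_xml(RECORD, CUSTOMER))
```

A customer active in two markets receives one document with two `Title` elements instead of downloading two separate files.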
As a bonus, we will get the possibility to update very specific parts of a PDS instead of re-building a complete PDS, as we are doing right now. This smart updating is expected to lead to a further 60% efficiency gain and shall be the foundation for the PDS Update Journal.
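As an illustration of what updating a specific part of a PDS might mean, the sketch below changes a single field of a JSON-like document and records the change. The document layout, the `apply_partial_update` helper and the shape of the journal entry are hypothetical and only meant to show the principle.

```python
import copy
from datetime import datetime, timezone


def apply_partial_update(document: dict, path: tuple, value, journal: list) -> dict:
    """Return a copy of `document` with one nested field changed, and log the change."""
    updated = copy.deepcopy(document)
    node = updated
    for key in path[:-1]:          # walk down to the parent of the target field
        node = node[key]
    old_value = node.get(path[-1])
    node[path[-1]] = value
    journal.append({
        "product_id": document["product_id"],
        "path": list(path),
        "old": old_value,
        "new": value,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return updated


pds = {"product_id": 12345, "attributes": {"weight": "1.2"}, "titles": {"en_US": "Laptop 15"}}
journal: list = []
pds_v2 = apply_partial_update(pds, ("attributes", "weight"), "1.3", journal)
print(pds_v2["attributes"], journal[0]["old"], "->", journal[0]["new"])
```

Only the changed path and its old and new values are recorded, which is the kind of entry a change journal could be built from.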
We plan to roll out this change to near real-time syndication as of February 2020. It should have little to no impact on the existing integrations of our customers, but instead will improve the time-to-market for PDS updates, make download connections far more efficient, and shall significantly simplify the delivery of PDS data.
The stuff above is pretty cool given the gains. But what about further simplification of data access for our customers? Our data footprint is only growing, and right now a full database sync requires our customers to process a data index of a few gigabytes of XML, and daily syncs involve indexes containing hundreds of MB. That will only get bigger. So what can we do about that? We intend to make the PDS Update Journal available as an export so that Icecat users can replay all updates done to their PDS database for a given time frame. Further, we see possibilities to improve our existing personalized indexes by adding Update Journal data to them in daily/weekly/monthly delta files, and by adding XML versions to our standard XML interfaces.
I’d love to hear your feedback on our platform and the improvements you would like us to make. Feel free to respond to this post or reach out to me directly.
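On the customer side, replaying a journal export for a time frame could look like the following sketch. The entry format (`product_id`, `field`, `new`, `at`) and the in-memory `store` are assumptions for illustration; the actual export format of the PDS Update Journal is not described here.

```python
from datetime import datetime

# Hypothetical journal export: one entry per changed field.
JOURNAL = [
    {"product_id": 1, "field": "title", "new": "Laptop 15", "at": datetime(2020, 2, 1, 9, 0)},
    {"product_id": 2, "field": "weight", "new": "0.4", "at": datetime(2020, 2, 2, 12, 0)},
    {"product_id": 1, "field": "title", "new": "Laptop 15 Pro", "at": datetime(2020, 2, 3, 8, 0)},
    {"product_id": 1, "field": "weight", "new": "1.3", "at": datetime(2020, 2, 9, 8, 0)},
]


def replay(store: dict, journal: list, start: datetime, end: datetime) -> dict:
    """Apply all journal entries with start <= at < end, oldest first."""
    for entry in sorted(journal, key=lambda e: e["at"]):
        if start <= entry["at"] < end:
            product = store.setdefault(entry["product_id"], {})
            product[entry["field"]] = entry["new"]
    return store


local_db: dict = {}
replay(local_db, JOURNAL, datetime(2020, 2, 1), datetime(2020, 2, 8))
print(local_db)
```

Replaying only the entries of the chosen week brings a local copy up to date without reprocessing a full index.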
Director of Technology at Icecat, the company behind the open product catalog with more than 5,832,475 product data-sheets of 24,533 brands. I am an IT entrepreneur with a deep technical background. Founder & CEO at Bintime, an outsourcing company, and Gepard, a product content syndication platform, empowering manufacturers and retailers to deliver the product information consumers demand. https://www.linkedin.com/in/sergeyshvets/