Creating Engaging Videos with AI-Driven Text and Image Conversion

By Elmo Hoogeveen

In today’s digital age, video content has become a powerful tool for marketing and product promotion. Creating compelling videos can be time-consuming and resource-intensive. However, with advancements in artificial intelligence (AI), it is now possible to automate the process of video generation from existing text and picture-based product pitches. In this blog post, I will explore the applications, challenges, and potential solutions for utilizing AI in video creation.

AI Aspects

The AI-driven video generation process involves two main aspects: asset scraping and voice-over. As an example, I created a product video for one of Icecat’s clients, using Icecat product data sheets along with database pictures and sales text. Asset scraping means extracting high-resolution product images from existing databases and product data sheets. Voice-over means converting product descriptions and sales texts into speech using AI-generated voices.

Potential Automation (Editing)

While asset scraping and voice-over can be automated effectively, the editing process remains a challenge. Currently, I know of no online editor capable of precise, autonomous video creation that meets the specific requirements of e-retail. To address this, a custom script or plug-in for an existing non-linear video editing program is needed. It should accept input formats such as .wav, .mp3, .png, and .jpeg, and export video encoded as H.264 and/or H.265.
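To make the input and output requirements concrete, here is a minimal sketch that builds a command line for ffmpeg. Note that ffmpeg is a command-line encoder, not a non-linear editor, so it only stands in for the custom program described above. The function name `build_ffmpeg_command` and the choice of AAC audio are assumptions for illustration.

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg"}
AUDIO_EXTS = {".wav", ".mp3"}
VIDEO_CODECS = {"h264": "libx264", "h265": "libx265"}


def build_ffmpeg_command(image: str, audio: str, output: str, codec: str = "h264") -> list[str]:
    """Return ffmpeg arguments that show one still image for the length of one audio file."""
    if Path(image).suffix.lower() not in IMAGE_EXTS:
        raise ValueError(f"unsupported image format: {image}")
    if Path(audio).suffix.lower() not in AUDIO_EXTS:
        raise ValueError(f"unsupported audio format: {audio}")
    return [
        "ffmpeg",
        "-loop", "1", "-i", image,    # repeat the still image as a video stream
        "-i", audio,                   # voice-over track
        "-c:v", VIDEO_CODECS[codec],   # H.264 or H.265 encoder
        "-pix_fmt", "yuv420p",         # widely compatible pixel format
        "-c:a", "aac",                 # assumed audio codec for the output file
        "-shortest",                   # stop when the audio ends
        output,
    ]
```

The returned list can be passed to `subprocess.run` on a machine where ffmpeg is installed. Fades, pans, and layered titles would still need a real editor or a longer filter graph.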

The Process

To demonstrate the AI video generation process, I utilized existing Icecat product data sheets, database pictures, and sales text. I scraped high-resolution images and used them directly in the video editor. It’s essential to ensure the accuracy and appropriateness of the images, as generative AI techniques may inadvertently distort or omit important details. Therefore, I did not use AI for image manipulation.

I converted the text content, excluding headlines, into speech using a text-to-speech program. The resulting audio may require human proof-listening, particularly when technical jargon or acronyms are present. I saved the text under each headline as a separate file, so that each one becomes a distinct video chapter. The headlines themselves serve as chapter indications, appearing at the bottom left or top left of the screen.
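The chapter split can be sketched as follows. This assumes the sales text arrives as plain text in which sections are separated by blank lines and the first line of each section is its headline. That layout, the `Chapter` class, and the file naming scheme are illustrative assumptions, not the format of an actual Icecat data sheet.

```python
import re
from dataclasses import dataclass


@dataclass
class Chapter:
    index: int
    headline: str     # shown on screen, not spoken
    speech_text: str  # sent to text-to-speech
    audio_file: str   # one audio file per chapter


def split_into_chapters(sales_text: str) -> list[Chapter]:
    """Split blank-line-separated sections into chapters (headline = first line)."""
    blocks = [b for b in re.split(r"\n\s*\n", sales_text.strip()) if b.strip()]
    chapters = []
    for i, block in enumerate(blocks, start=1):
        lines = block.strip().splitlines()
        headline = lines[0].strip()
        body = " ".join(line.strip() for line in lines[1:])
        # Turn the headline into a file-name-safe slug.
        slug = re.sub(r"[^a-z0-9]+", "-", headline.lower()).strip("-")
        chapters.append(Chapter(i, headline, body, f"{i:02d}-{slug}.wav"))
    return chapters
```

Keeping the headline out of `speech_text` matches the workflow above: headlines are displayed, and only the body text is spoken.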

Logo Placement and Image Treatment

To establish branding, I placed the company logo on the bottom left of the screen. Custom logos should be created or sourced for each company seeking to utilize AI video generation. Logos should preferably be black on a transparent background (alpha channel) in .png format. I made a distinction between product pictures (with a white background) and location pictures. Treating them differently avoids a monotonous feel within the video.
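If the distinction between product and location pictures had to be automated, one possible heuristic is to check whether the border of the image is almost entirely white. The sketch below works on a plain grid of (r, g, b) tuples so it stays self-contained. The thresholds `white_level` and `min_share` are guesses that would need tuning on real pictures.

```python
def border_pixels(image):
    """Collect the outermost pixels of a 2D grid of (r, g, b) tuples."""
    top, bottom = image[0], image[-1]
    left = [row[0] for row in image[1:-1]]
    right = [row[-1] for row in image[1:-1]]
    return list(top) + list(bottom) + left + right


def classify_picture(image, white_level=245, min_share=0.95):
    """Return "product" if the border is nearly all white, else "location"."""
    border = border_pixels(image)
    white = sum(1 for r, g, b in border if min(r, g, b) >= white_level)
    return "product" if white / len(border) >= min_share else "location"
```

A product shot that touches the edge of the frame would be misclassified, so a human check of the result remains advisable.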

Editing Techniques

The editing process involves specific techniques to enhance visual appeal. Product pictures with white backgrounds can smoothly transition into one another through fading effects. On the other hand, location pictures are best animated from right to left to create movement. Proper scaling and positioning of images within the frame are crucial to ensure seamless horizontal movement without revealing image edges.
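The scaling and positioning can be worked out with a little arithmetic. This sketch assumes pixel coordinates with the origin at the top-left corner of the frame and a default zoom of 110% of the frame width. The function name `pan_geometry` is made up for this example.

```python
def pan_geometry(img_w, img_h, frame_w, frame_h, zoom=1.10):
    """Return (scale, x_start, x_end, y) for a right-to-left pan that never shows an edge.

    x and y are the position of the image's top-left corner inside the frame.
    """
    # Scale so the image is `zoom` times the frame width, but never shorter than the frame.
    scale = max(zoom * frame_w / img_w, frame_h / img_h)
    scaled_w = img_w * scale
    scaled_h = img_h * scale
    x_start = 0.0                  # left edges aligned at the start
    x_end = frame_w - scaled_w     # right edges aligned at the end (negative value)
    y = (frame_h - scaled_h) / 2   # vertically centred
    return scale, x_start, x_end, y
```

For a 4000 x 2250 picture in a 1920 x 1080 frame, the image is scaled to roughly 2112 pixels wide, which leaves about 192 pixels of horizontal travel.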

Image Duration and Movement

White-background product images typically last 2-4 seconds, while the intro and ending images last 4 seconds. In-between images (location pictures) need to be zoomed in to approximately 110% of the total frame width, with logarithmic movement over several seconds. Fading effects accompany image transitions, and the movement speed gradually increases during the initial 20% and decreases during the final 20% of the movement duration.
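One way to read the speed profile described above is a trapezoid: the speed ramps up linearly over the first 20% of the duration, stays constant, and ramps down over the last 20%. The sketch below implements that reading. It is an assumption about the curve, not a reproduction of any particular editor's easing preset.

```python
def ease_in_out(t: float, ramp: float = 0.20) -> float:
    """Map normalised time t in [0, 1] to normalised position in [0, 1].

    `ramp` is the share of the duration spent accelerating (and again decelerating);
    it must be greater than 0 and at most 0.5.
    """
    t = min(max(t, 0.0), 1.0)
    peak = 1.0 / (1.0 - ramp)  # constant speed in the middle section
    if t < ramp:
        return peak * t * t / (2.0 * ramp)
    if t > 1.0 - ramp:
        return 1.0 - peak * (1.0 - t) ** 2 / (2.0 * ramp)
    return peak * (t - ramp / 2.0)


def x_at(seconds, duration, x_start, x_end, ramp=0.20):
    """Horizontal image position at a given time during the pan."""
    return x_start + (x_end - x_start) * ease_in_out(seconds / duration, ramp)
```

With a 5-second pan from x = 0 to x = -192, `x_at(2.5, 5, 0, -192)` lands at the midpoint of the travel, because the profile is symmetric.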

Required Video Layers

To achieve the desired visual composition, the minimum required video layers, from top to bottom, include:

  1. Company logo (with 30% opacity)
  2. Headline titles
  3. Background for headline title
  4. Images
  5. White background
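For a scripted editor, the layer stack above could be captured as data. The `Layer` class and the layer names are illustrative; the order and the 30% logo opacity come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    z: int                # higher z is drawn on top
    opacity: float = 1.0


LAYERS = [
    Layer("company_logo", z=5, opacity=0.30),
    Layer("headline_title", z=4),
    Layer("headline_background", z=3),
    Layer("images", z=2),
    Layer("white_background", z=1),
]


def draw_order(layers):
    """Return layer names in compositing order, bottom layer first."""
    return [layer.name for layer in sorted(layers, key=lambda layer: layer.z)]
```

Compositing runs from the bottom of the list upward, so the white background is drawn first and the logo last.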

Titles and Styling

For title elements, multiple styles can be utilized. One approach is to fade in the title after its underlying darker layer appears. Alternatively, the text can slowly appear alongside the underlying darker layer from left to right. The darker layer is essential to ensure text visibility, especially when using white or black text on varying image backgrounds.

Tools Used

During the AI video generation process, I employed various tools. Initially, I used Speechify for text-to-speech conversion. However, it had limitations with acronyms and specific technical language, as can be heard in words such as:
“High-res”, pronounced “High reis” by the AI
“i9 processor”, pronounced “eynin processor” by the AI
“Matte”, pronounced “mattie” by the AI

Happily, I found that Google’s Cloud Text-to-Speech (TTS) tool provided more convincing results. Google Cloud TTS may still have minor pronunciation issues, but in my tests it handled acronyms and technical terms better than the alternatives I tried.
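Remaining pronunciation issues can often be worked around before the text reaches the speech engine. SSML, which Google Cloud TTS accepts as input, has a `<sub alias="...">` element that tells the engine to speak a substitute text. The sketch below wraps known problem terms in that element. The aliases in `PRONUNCIATIONS` are guesses for illustration and would need to be tuned by ear.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical spoken replacements for terms that were mispronounced.
PRONUNCIATIONS = {
    "High-res": "high rez",
    "i9": "i nine",
    "Matte": "mat",
}


def to_ssml(text, lexicon=PRONUNCIATIONS):
    """Wrap known terms in <sub alias="..."> and return a complete SSML document."""
    lookup = {term.lower(): alias for term, alias in lexicon.items()}
    if not lookup:
        return "<speak>" + escape(text) + "</speak>"
    # Longest terms first, so a longer term wins over a shorter one it contains.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE)
    parts, pos = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))  # plain text, XML-escaped
        alias = lookup[match.group(0).lower()]
        parts.append(f"<sub alias={quoteattr(alias)}>{escape(match.group(0))}</sub>")
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "<speak>" + "".join(parts) + "</speak>"
```

The resulting string would be sent as SSML input instead of plain text. Human proof-listening is still needed to confirm that each alias actually sounds right.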


Once the necessary tools and workflows are established, the process can be scaled effectively. By inputting the link to a product, each aspect of the AI video generation pipeline can be automatically processed and prepared for integration into the video editor.
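The scaled-up workflow could be organised roughly as below. Everything here is hypothetical: `fetch_sheet` and `synthesize` are placeholders passed in by the caller for the scraping and text-to-speech steps, and the dictionary keys `images` and `sections` are an assumed shape, not the structure of a real Icecat response.

```python
from dataclasses import dataclass, field


@dataclass
class VideoJob:
    product_url: str
    images: list = field(default_factory=list)
    headlines: list = field(default_factory=list)
    audio_files: list = field(default_factory=list)


def prepare_job(product_url, fetch_sheet, synthesize):
    """Collect everything the video editor needs for one product.

    fetch_sheet(url) must return {"images": [...], "sections": [(headline, body), ...]}.
    synthesize(text, filename) must write the audio and return the path it wrote.
    """
    sheet = fetch_sheet(product_url)
    job = VideoJob(product_url, images=list(sheet["images"]))
    for i, (headline, body) in enumerate(sheet["sections"], start=1):
        job.headlines.append(headline)  # displayed as the chapter title
        job.audio_files.append(synthesize(body, f"chapter-{i:02d}.wav"))
    return job
```

Passing the two steps in as functions keeps the sketch independent of any particular scraper or speech service, and makes it easy to test with stand-ins.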


AI video generation from existing text and picture-based product pitches offers a practical solution for creating numerous videos across various products. While the current state of pre-trained AI models may not support the reliable generation of 3D imagery, this approach allows for efficient automation of the video creation process. As technology continues to advance, we can anticipate further improvements and opportunities in AI-driven video generation for marketing and promotional purposes.

Elmo Hoogeveen is a cinematographer and photographer with a Bachelor of Arts in Cinematography and Film/TV production. He is the founder of Kip Media, which focuses on business video productions.
