A text-to-video generation workflow that actually has a use case.
At my startup, @Gloo, we ran into a fascinating problem while developing @GlooTV, a solution for controlling televisions and engaging viewers in customer experience centers such as salons, restaurants, and co-working spaces.
When businesses were given control of the television, they kept choosing to display their own promotions or partner advertising. This created a significant issue for the end customer (the viewer): boredom and irritation towards the television itself.
To tackle this challenge, we realized that simply showing promotions was not the answer. We needed to engage the viewers, but how? How can we captivate the attention of viewers who are already bombarded with stimuli in an out-of-home location?
Moreover, viewers often have their mobile phones or a group of friends as alternative sources of entertainment. So, how do we compete with that? The solution lies in managing expectations, specifically our own.
We understood that we didn’t need to create an extravagant HBO or Netflix special. Our goal was to provide engaging content that would pique curiosity within viewers and encourage them to follow the content on the television screen for a few minutes during their visit.
But here’s where things got tricky. As a tech-focused team, we excel in areas like content marketing through blogs with SEO and basic Instagram-style content. However, we had no experience in generating high-quality video content.
With this challenge in mind, I decided to take a break during a late-night development sprint and delve into the world of generative video models and video tag detection systems. The aim was to develop a workflow that could partially address our content problem – ensuring that video content existed and remained contextual to the viewers. However, the real challenge was creating content that was truly engaging.
Here’s how I approached the problem:
Step 1: Tagging Existing Video & Image Content
To tag the existing video and image content, I used Google Vision AI’s Video API and Image API. These APIs return labels for the elements detected in the content, along with a confidence score for each label.
For instance, a video might be tagged with labels such as "food", "dish", and "cuisine", each with its own confidence score:
Label: food Confidence: 0.97248906
Label: dish Confidence: 0.9149506
Label: cuisine Confidence: 0.9079056
Label: cooking Confidence: 0.8895474
Label: indian cuisine Confidence: 0.81907773
Label: recipe Confidence: 0.73818874
Label: sauces Confidence: 0.5377984
Label: chicken meat Confidence: 0.4836464
Label: spice Confidence: 0.4687535
Label: vegetable Confidence: 0.3413351
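A minimal sketch of how label output in the shape shown above could be parsed and filtered before it is used in a prompt. The function names and the 0.5 cutoff are assumptions for illustration only; the example script further down also draws on lower-scoring tags such as "chicken meat" and "spice", so the original workflow may not have filtered at all.

```python
import re

# Label output copied from the example above.
RAW_LABELS = """
Label: food Confidence: 0.97248906
Label: dish Confidence: 0.9149506
Label: cuisine Confidence: 0.9079056
Label: cooking Confidence: 0.8895474
Label: indian cuisine Confidence: 0.81907773
Label: recipe Confidence: 0.73818874
Label: sauces Confidence: 0.5377984
Label: chicken meat Confidence: 0.4836464
Label: spice Confidence: 0.4687535
Label: vegetable Confidence: 0.3413351
"""

# Matches one "Label: <name> Confidence: <score>" pair; names may contain spaces.
LABEL_PATTERN = re.compile(r"Label:\s*(.+?)\s+Confidence:\s*([0-9.]+)")


def parse_labels(raw):
    """Return (label, confidence) pairs in the order they appear."""
    return [(name, float(score)) for name, score in LABEL_PATTERN.findall(raw)]


def confident_labels(labels, min_confidence=0.5):
    """Keep label names whose score meets the cutoff (0.5 is an assumed default)."""
    return [name for name, score in labels if score >= min_confidence]


if __name__ == "__main__":
    print(confident_labels(parse_labels(RAW_LABELS)))
```

Lowering `min_confidence` keeps more of the long-tail tags, which gives the script more variety at the cost of occasionally including a wrong label.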
Step 2: Prompt Preparation with Tags from Step 1
Using the tags obtained from the previous step, I prepared prompts that would guide the generation of engaging content aligned with the viewer’s interests and preferences. These prompts served as the blueprint for crafting the video script, ensuring that it would captivate the audience and hold their attention.
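As an illustration of this step, here is a minimal sketch of turning tags into a script-writing prompt. The function name, the prompt wording, and the defaults of 200 frames with 10 frames per scene are assumptions chosen to match the layout of the example script, not the prompt that was actually used.

```python
def build_script_prompt(tags, total_frames=200, frames_per_scene=10):
    """Assemble a script-writing prompt from content tags (hypothetical wording)."""
    if not tags:
        raise ValueError("at least one tag is required")
    scene_count = total_frames // frames_per_scene
    tag_list = ", ".join(tags)
    return (
        f"Write a video script of {scene_count} scenes covering frames 1-{total_frames}. "
        f"Label each scene with its frame range, starting with 'Frame 1-{frames_per_scene}:'. "
        f"Keep every scene visually descriptive and build it around these tags: {tag_list}."
    )


if __name__ == "__main__":
    print(build_script_prompt(["food", "dish", "cuisine", "cooking", "indian cuisine"]))
```

Asking for explicit frame ranges in the prompt keeps the generated script easy to split into per-scene prompts later.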
Here’s an example of the generated script based on the provided tags:
Frame 1-10: The scene opens in a bustling kitchen, filled with the aroma of spices. The chef, with a confident smile, starts preparing an exquisite Indian cuisine. The camera pans across various ingredients and utensils, highlighting the art of cooking.
Frame 11-20: As the chef continues cooking, the focus shifts to the vibrant sauces simmering on the stovetop. Each sauce is carefully crafted, blending spices and flavors to perfection. The camera captures the sizzling sounds and the enticing aroma.
Frame 21-30: The scene transitions to a serene vegetable garden. The camera zooms in on the lush greenery and showcases a variety of fresh vegetables. The chef is seen carefully picking the choicest vegetables, emphasizing the importance of quality ingredients.
Frame 31-40: Back in the kitchen, the chef starts chopping the vegetables with precision. The colorful assortment of vegetables adds a visual appeal to the cooking process. The camera captures the chef's skillful knife work and the vibrant mix of flavors.
Frame 41-50: A montage of spices being added to the dish follows. The camera focuses on the chef's hands as they sprinkle a medley of spices, infusing the dish with rich aromas and complex flavors. Close-up shots showcase the vibrant colors of the spices.
Frame 51-60: The dish starts to take shape as the chef expertly combines the marinated chicken meat with the aromatic spices. The sizzling sound fills the kitchen as the flavors come together, creating a tantalizing experience for the senses.
Frame 61-70: The camera zooms in on the dish as it cooks to perfection. The bubbling curry, the golden-brown chicken, and the fragrant steam create an irresistible visual and olfactory feast. The anticipation builds as the dish nears completion.
Frame 71-80: The scene transitions to a beautifully set dining table. The camera captures the dish being carefully plated, with garnishes and sauces added to enhance its presentation. The chef's attention to detail is evident in every element of the presentation.
Frame 81-90: The camera focuses on the first bite of the dish. As a fork pierces through the succulent chicken and dips into the flavorful sauce, a look of delight appears on the diner's face. The taste transports them to a world of culinary bliss.
Frame 91-100: The satisfied diner savors each bite, relishing the explosion of flavors. Close-up shots capture their expressions, highlighting the pleasure derived from this gastronomic journey. The camera pans out, revealing the chef's contented smile.
Frame 101-110: The scene shifts to a cookbook, open to a recipe page. The camera zooms in on the recipe for the showcased dish, capturing the ingredients and instructions. The recipe's simplicity and clear steps inspire viewers to try it themselves.
Frame 111-120: The camera pans across a collection of spices, highlighting their vibrant colors and distinct aromas. The narrator explains how spices play a crucial role in Indian cuisine, adding depth and complexity to the dishes.
Frame 121-130: The focus turns to a spice grinder, where the chef is seen grinding fresh spices. The camera captures the transformation of whole spices into a fine powder, enhancing the flavors and aromas of the dishes.
Frame 131-140: The chef introduces another dish from the rich Indian cuisine. The camera showcases the process of marinating chicken meat with a blend of spices, highlighting the flavors that infuse the meat, making it tender and flavorful.
Frame 141-150: A montage of vegetables being chopped follows. The camera captures the chef's hands swiftly preparing an assortment of vegetables, showcasing their freshness and variety. The vibrant colors add visual appeal to the cooking process.
Frame 151-160: The chef skillfully combines the marinated chicken meat and chopped vegetables in a sizzling pan. The camera focuses on the ingredients as they sizzle and release their flavors, creating an aromatic symphony.
Frame 161-170: The scene transitions to a family gathering around a dinner table. The camera captures their joyful expressions as they are served the Indian dish. The shared experience of savoring the food strengthens their bonds and creates lasting memories.
Frame 171-180: The camera zooms in on the diners' plates, showcasing the beautifully plated dishes. The narrator highlights the importance of presentation in Indian cuisine, where visual appeal complements the explosion of flavors.
Frame 181-190: The diners express their delight with every bite, their faces lighting up with satisfaction. The camera captures their animated conversations and the harmonious blend of flavors that make the meal unforgettable.
Frame 191-200: As the meal concludes, the camera pans out to reveal the chef receiving heartfelt appreciation from the diners. Their applause and smiles are a testament to the chef's culinary skills and the memorable experience they had.
Step 3: Video Generation from the Script
Next, a text-to-video model had to bring the generated script to life. We chose deforum-stable-diffusion, which produces animations with Stable Diffusion from text prompts assigned to frame numbers, a natural fit for a script already divided into frame ranges.
With this model, we could create visually appealing videos matched to the viewer’s interests. The aim was to improve the television content on offer and raise customer engagement at these establishments.
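To connect the script to the model, the "Frame a-b:" blocks can be turned into prompts keyed by their starting frame, which is the general shape Deforum-style animation prompts take. The sketch below is an assumption about how that conversion might look, not the code we ran: the function name is hypothetical, and mapping "Frame 1-10" to key 0 assumes zero-based frame keys.

```python
import re

# One "Frame <start>-<end>: <text>" block, up to the next block or the end of the script.
FRAME_PATTERN = re.compile(
    r"Frame (\d+)-(\d+):\s*(.*?)(?=\s*Frame \d+-\d+:|\Z)", re.DOTALL
)


def script_to_keyframes(script):
    """Map each block's starting frame (shifted to zero-based) to its description."""
    keyframes = {}
    for start, _end, text in FRAME_PATTERN.findall(script):
        # Collapse line breaks and repeated spaces inside a description.
        keyframes[int(start) - 1] = " ".join(text.split())
    return keyframes


if __name__ == "__main__":
    sample = (
        "Frame 1-10: The scene opens in a bustling kitchen. "
        "Frame 11-20: The focus shifts to the vibrant sauces simmering on the stovetop."
    )
    for frame, prompt in script_to_keyframes(sample).items():
        print(frame, prompt)
```

Descriptions this long will likely need shortening before use, since Stable Diffusion's text encoder only reads a limited number of tokens per prompt.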
In conclusion, our journey at @Gloo to address the challenge of engaging television content led us to understand the importance of captivating viewers and keeping them interested. Through intelligent tagging, prompt preparation, and leveraging advanced text-to-video models, we aimed to transform the out-of-home viewing experience and create a win-win situation for businesses and viewers alike.
By providing crisp and captivating content on television screens, we aimed to hold viewers’ attention and provide an alternative to their mobile phones or other distractions. Our mission was not just about delivering promotions but creating an immersive and enjoyable experience for viewers, leaving a lasting impact on their minds.
Stay tuned as we continue to explore innovative solutions to create engaging content and enhance customer experiences in the ever-evolving world of television and customer experience centers.