Multi Media with Gemini

"The Muse takes note of our dedication.
She approves.
We have earned favor in her sight.
When we sit down and work,
we become like a magnetized rod
that attracts iron filings.
Ideas come. Insights accrete."

- Steven Pressfield, The War of Art

I have been playing with AI for a while now, but the promise of Vibe Coding is still lost on me. Most modern LLMs help with small coding tasks, and the chatbots can help with ideation. However, as soon as you want to do something more complex, the output turns into an unmitigated disaster of word salad. Sometimes I wonder how many Kamala Harris speeches were written by AI, or ended up, accidentally or deliberately, in the training sets of California-based AI providers. Okay, politics aside and back to Gemini. We have Gemini Pro as part of our Google subscription, and since it came with the package there was an incentive to try it out.

Things that, at the time of writing, seem to work really well with Gemini Pro:

Synthesizing small code snippets for easy problems
Draft diagramming in Mermaid
Summarizing text
Studio Ghibli style image generation (although input filters are a lot more restrictive now than when it was first released), or concept art (see above)

While the deep research feature is nice, it falls apart when you ask for anything that you could not resolve yourself within a few Google searches. Google should not call it “deep research”; “deep survey” would be more accurate.

Also, coding in agent mode is still a disaster. In addition to Gemini, we use GitHub Copilot. Do not ever give any of these tools access to your code repository. They will happily start making changes that break your codebase. It then becomes an insidious game of whack-a-mole to fix the codebase while the AI keeps breaking it. The pain varies by language. Python seems to kind of work, but as soon as you venture into C, C++, TypeScript, or Rust, it becomes a nightmare. These agents struggle to understand complex build toolchains, dependencies, or code inlining for performance. The best way to use AI for coding is to fence it off to small tasks by suggestion or auto-completion, but be sure to know your IDE’s shortcut key for disabling it quickly.

So the above is all well and good, but it does not really justify the valuations of these AI companies. Can we use it for something useful?

The two areas where I have found Gemini Pro actually useful are:

Bad handwriting recognition, even for foreign languages like German and Japanese
Summarizing long videos and podcasts into text

Handwriting Recognition

I have terrible handwriting. However, since my teens I have kept a journal, probably mostly inspired by films like Dances with Wolves.

Over the years, I have accumulated a few thick notebooks full of handwritten notes in German. My kids struggle to read my handwriting, and they are not fully fluent in German yet. So in the interest of digital inheritance, I was wondering what the best way to digitize my handwritten notes would be. I tried a few OCR tools, but Gemini Pro blew them all out of the water. By just taking a picture of the handwritten notes, Gemini Pro was able to transcribe them with amazing accuracy. As part of the prompt, specify the input language and it will transcribe accordingly.

This also works via their API. With a little bit of coding and prompt engineering, you can potentially “revolutionize” your note-taking workflow at work too, especially if you have terrible handwriting like me.
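As a rough illustration of the API route, here is a minimal Python sketch. It assumes the `google-genai` SDK is installed, that an API key is available in a `GEMINI_API_KEY` environment variable, and that a model named `gemini-2.5-pro` is available to your account; the model name, the helper names, and the prompt wording are all my assumptions for illustration, not something prescribed by Google.

```python
import mimetypes
import os
from pathlib import Path


def build_transcription_prompt(language: str) -> str:
    """Build a prompt that names the input language explicitly."""
    return (
        f"Transcribe the handwritten {language} text in this image. "
        "Keep the original language, preserve paragraph breaks, "
        "and mark unreadable words as [illegible] instead of guessing."
    )


def guess_image_mime(path: str) -> str:
    """Return the image MIME type for a file name, or raise ValueError."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    return mime


def transcribe_page(path: str, language: str, model: str = "gemini-2.5-pro") -> str:
    """Send one photographed page to Gemini and return the transcription."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai
    from google.genai import types

    # Assumption: the API key lives in the GEMINI_API_KEY environment variable.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    image = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_image_mime(path),
    )
    response = client.models.generate_content(
        model=model,  # assumed model name; adjust to what your account offers
        contents=[image, build_transcription_prompt(language)],
    )
    return response.text
```

Calling `transcribe_page("journal_page.jpg", "German")` on a hypothetical photo would return the transcribed text as a string. Asking the model to mark illegible words rather than guess is a design choice on my part, so that doubtful passages stay visible for later proofreading.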

Video and Podcast Summarization

Over the decades there have been numerous tools to scrape various multimedia content from the web. However, what has been missing is a way to make that content indexable and searchable. Especially when someone recommends a podcast or technical lecture and you do not want to sit through the entire thing, it would be nice to have a summary of the content.

AI has made this possible. I have been using Whisper for a couple of years to transcribe audio content into text, or to generate makeshift subtitles for videos. However, this either incurs OpenAI cloud charges or requires a local model installation. For me, Whisper still takes considerable time on an Nvidia 5060 to transcribe larger batches of audio content.
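For reference, the local Whisper route for makeshift subtitles can be sketched roughly as follows. This assumes the open-source `openai-whisper` package and `ffmpeg` are installed; the `base` model size and the helper names are my choices for illustration, not a record of my exact setup.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn Whisper-style segments (start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe a local audio or video file and return SRT subtitles."""
    # Imported lazily: loading Whisper pulls in PyTorch and is slow.
    import whisper

    model = whisper.load_model(model_name)  # assumed size; larger is slower
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

Writing the return value of `transcribe_to_srt("lecture.mp3")` to a file such as `lecture.srt` gives a subtitle file that most video players can load alongside the media.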

Since we are already in the Google ecosystem, I decided to try Gemini Pro for this task. The process is fairly straightforward:

Download the video or podcast audio locally
Upload it to Gemini
Use Gemini Pro to transcribe the content, and explicitly prompt it not to omit any sections
For large files there seems to be an option to stage them via Google Drive too

Since the output is Markdown, this can be easily recycled as documentation or wiki content.
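The steps above can also be scripted against the API instead of the chat interface. The following is a minimal sketch, assuming the `google-genai` SDK, an API key in a `GEMINI_API_KEY` environment variable, and a model named `gemini-2.5-pro`; the prompt wording and helper names are hypothetical, and the Google Drive staging option is not covered here.

```python
import os
import time
from pathlib import Path


def build_transcript_prompt() -> str:
    """Prompt that asks for a complete Markdown transcript, nothing omitted."""
    return (
        "Transcribe this recording in full as Markdown. "
        "Do not omit or condense any section. "
        "Add a short summary under a 'Summary' heading at the end."
    )


def markdown_path_for(media_path: str) -> Path:
    """Place the Markdown output next to the media file."""
    return Path(media_path).with_suffix(".md")


def transcribe_media(media_path: str, model: str = "gemini-2.5-pro") -> Path:
    """Upload a local media file, request a transcript, save it as Markdown."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai

    # Assumption: the API key lives in the GEMINI_API_KEY environment variable.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    uploaded = client.files.upload(file=media_path)

    # Uploaded media may be processed asynchronously; poll until it is ready.
    while uploaded.state is not None and uploaded.state.name == "PROCESSING":
        time.sleep(5)
        uploaded = client.files.get(name=uploaded.name)
    if uploaded.state is not None and uploaded.state.name == "FAILED":
        raise RuntimeError(f"processing failed for {media_path}")

    response = client.models.generate_content(
        model=model,  # assumed model name; adjust to what your account offers
        contents=[uploaded, build_transcript_prompt()],
    )
    out_path = markdown_path_for(media_path)
    out_path.write_text(response.text, encoding="utf-8")
    return out_path
```

Running `transcribe_media("talks/lecture.mp4")` on a hypothetical download would write `talks/lecture.md`, which can then be dropped into a wiki or documentation folder.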

Conclusion

I think, like many other folks, that AI (specifically LLMs) is probably overhyped right now. However, there are some specific use cases where they can be really useful. There is lots of potential for specialized applications that were computationally infeasible a few years ago. For example, think how far image classification has come since the early days of CNNs, even though neural networks have been around for decades.

I feel like we are still in the early days of LLMs, and there is a lot of room for improvement. The misconception seems to be that LLMs should be commercialized as general-purpose agents that can do everything. However, the reality is that they are still quite limited in their capabilities. The key is to find the right use cases where they can add real value. For me, handwriting recognition and multimedia summarization are two such areas where Gemini Pro shines, along with constrained technical tasks.

Since so many people are handing off their creativity to AI, I fear we are on a trajectory that puts us closer to what Idiocracy envisioned 20 years ago, rather than replacing Pressfield’s Muse with something better. Time will tell.

Published: 2025-10-10
Updated : 2025-10-10
