What I'm trying to accomplish with my weekend side project
An update on my "automatic AI-podcasts for Substack authors" project.
In my last post, I mentioned my side project, which generates AI podcasts for blogs. Today, I’m going to talk a bit more about that project. There have been some developments in that world that shake things up.
So, firstly, what does my app do? Let’s say you’re the author of a blog. You find my site and sign up for an account. First, you’ll be asked for the URL of your blog. Our servers will then fetch your blog (making sure it’s really a blog) and present a list of your posts, on our site, in a podcast-looking feed.
You’ll then be asked to select a default AI voice. I’m a fan of old British male voices (thank you, epic fantasy audiobooks, for that fetish), but that doesn’t have to be true for you. Lots and lots of options there.
For any posts that you want turned into a podcast episode, you’ll click a little button to prepare it. (Later versions of this will be more automated.) This makes our servers fetch the individual article in order to do a few things.
Obviously, we need the content. What we have is text in HTML format, so we have to do some operations to extract just the important part. That means removing pictures, links, tweets, and other things like that. This step is actually pretty complicated. Thankfully, I’ve been experimenting with the best way to do this since 2020 – even having authored an open-source JavaScript package to do just this.
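To give a flavor of that extraction step, here’s a minimal sketch using only Python’s standard library. The real open-source package handles far more edge cases (tweets, embeds, captions, nested markup); the particular tag skip list below is my illustrative assumption, not the package’s actual rules.

```python
from html.parser import HTMLParser

class ArticleTextExtractor(HTMLParser):
    """Keep readable article text; drop scripts, styles, figures, and embeds.
    The SKIP list is an illustrative assumption, not the real package's."""
    SKIP = {"script", "style", "figure", "iframe", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0          # >0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Link *text* is kept (it should be read aloud); only the markup is lost.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ArticleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Even this toy version shows why the step is complicated: deciding what counts as “the important part” is a judgment call the parser has to encode somehow.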
Beyond that, we need to figure out how to split up the content. Sure, maybe you (the author) have split it neatly into paragraphs, but those alone aren’t sufficient. This is where the custom model I’ve trained comes in. It knows when and where to add a 1.15-second pause, mid-paragraph, for dramatic effect.
It also knows how to handle things like headings and bulleted lists. Typically, neither of those typographical elements ends with punctuation, and that does funny things to AI voices. Sometimes the fix is as simple as adding a period to the end of the text, and sometimes it’s a little more complicated.
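As an illustration of the simple end of that spectrum, here’s a hedged sketch of punctuation normalization. The bullet-marker pattern and the punctuation set are my assumptions for the example, not the model’s actual rules.

```python
import re

def normalize_for_tts(line: str) -> str:
    """Sketch: strip bullet markers and ensure terminal punctuation so the
    TTS voice ends the phrase naturally instead of trailing off."""
    text = line.strip()
    if not text:
        return text
    text = re.sub(r"^[-*\u2022]\s*", "", text)   # drop "-", "*", or "•" markers
    if text[-1] not in ".!?:;":
        text += "."                              # headings/list items get a period
    return text
```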
One-paragraph digression on a pet peeve of mine… Have you used the voice chat function in the ChatGPT app? It’s impressive, but one thing that drives me nuts is that the default response format (for many of these LLMs) is to present things as a list. Why? Who prefers it this way? It’s not how humans speak to each other, and I find it greatly reduces my ability to comprehend.
Once my fine-tuned AI model processes the article, you get access to an edit page containing a list of segments. A segment is sort of like a paragraph, and each one displays the amount of silence that will follow it. Scrolling through this list, you can double-check my model’s calculations. For example, perhaps you need to delete an advertisement that was accidentally ingested. You can also choose a different voice for a certain segment, say, a quote.
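A segment like the ones on that edit page could be modeled with something like the following sketch; the field names are hypothetical, not my actual schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One unit of the edit page (hypothetical field names)."""
    text: str
    voice_id: str = "default"    # override per segment, e.g. for a quote
    pause_after_s: float = 0.5   # silence after this segment, in seconds
```

The per-segment voice override and per-segment pause are exactly the two knobs the edit page exposes.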
Once my music-adding functionality is implemented, you’ll also be able to see where the model thinks music should go. Of course, you’ll be able to modify that stuff as well; though hopefully you won’t need to do that too much.
Finally, you can click a button to have the server convert it to a podcast episode. I’ve experimented with various ways to do this. Presently, I’m just using ElevenLabs’ API for this because their voices are just a cut above the rest. I can easily swap this out in the future, once the open-source models have caught up. ElevenLabs is quite expensive: a 10-minute article will cost at least a couple dollars to generate.
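Under the hood, “convert it to a podcast episode” amounts to generating one audio clip per segment and joining them with the chosen pauses. Here’s a minimal sketch of the joining step, assuming uncompressed mono WAV clips for simplicity; the real pipeline works with MP3 and a TTS API, so treat this as the shape of the idea, not the implementation.

```python
import io
import wave

def stitch_with_pauses(clips, pauses_s, rate=22050, sampwidth=2):
    """Concatenate mono WAV clips, inserting the requested silence after each.
    Sketch only: assumes every clip shares the same sample rate and width."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(sampwidth)
        w.setframerate(rate)
        for clip_bytes, pause in zip(clips, pauses_s):
            with wave.open(io.BytesIO(clip_bytes), "rb") as clip:
                w.writeframes(clip.readframes(clip.getnframes()))
            # Silence = zeroed frames; this is where the 1.15-second
            # dramatic pauses get rendered.
            w.writeframes(b"\x00" * int(pause * rate) * sampwidth)
    return out.getvalue()
```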
Once our servers generate the audio clips for each segment and stitch them together into a .mp3 file, it’s added to your podcast feed. We give you a link to add your podcast to any of the major podcast platforms, which you can share with your audience. So anyone subscribed to your podcast will get your new episode in their podcast app.
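The feed itself is ordinary podcast RSS: each episode becomes an item whose enclosure points at the .mp3 file. A sketch of that one piece, using the standard RSS 2.0 tag names (a real feed also needs channel metadata and the iTunes extensions the platforms expect):

```python
import xml.etree.ElementTree as ET

def episode_item(title: str, mp3_url: str, size_bytes: int) -> ET.Element:
    """Build one RSS <item> with the <enclosure> podcast apps look for."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "enclosure", url=mp3_url,
                  length=str(size_bytes), type="audio/mpeg")
    return item
```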
New developments in the world of AI text-to-speech
So, I found out about ElevenLabs’ “Reader” app this week, and I installed it. Holy crap, it’s good. It does a lot of what my app is trying to do, albeit in a far less automated way.
You can copy and paste any text – or provide any website via the “share” interface – to the app, choose a voice, and have it read to you. This is amazing in combination with the Substack app. You find any article that lacks voiceover, share it to Reader, and listen to it while you do chores.
Crazily enough, it’s temporarily free – they’re calling this a beta test or some such thing. That means instead of paying hundreds of dollars for the hours of content I’ve already listened to (or what it would cost right now via their API), I’ve paid nothing.
So, how is my app different? First of all, the lack of intelligent pauses is noticeable, and… stressful? I don’t understand how no one gets that part right. Let the sentences breathe! I’m guessing they’ll figure that out eventually, but that’s one way I’m ahead of the curve.
Another big way it’s different is automation: both the notion of pulling from your blog’s feed once you make a new post, and the notion of automatically adding the resultant audio file to your podcast feed. I don’t think they’ll add that anytime soon, but Substack certainly might.
Finally, the notion of adding music to only certain portions of the episode (I always reference The Sunday Read as my aim here) makes my app unique. Even if Substack were to add automatic audio, I doubt they’d include that feature.
The elephant in the room, so to speak, is whether I’m wasting my weekends working on this app, given these developments. It’s possible. I’d like to think I’m mature enough to see the writing on the wall, at this point in my “solopreneur” journey… if the permanent marker really has been brandished.
If nothing else, this project has given me an excuse to learn how to fine-tune LLMs, and that alone will be valuable in the future.
So that’s where I stand and how it’s been going. I hope you found this interesting.