
AI Voice Cloner? I’m testing Bark.

But why?

  • I want to narrate my videos in English and avoid embarrassing mistakes in my pronunciation.
  • My English still has room for improvement, and I would need to record multiple times to make it sound fluent.
  • I want to avoid post-editing speech errors, excessive ‘uhs,’ and long pauses.
  • I want the speech to be easily understandable while still sounding human.

Why Bark?

  • Open Source – Bark is free and allows developers to modify and expand the code. However, it’s not the easiest to install, takes longer to generate speech, and requires time to achieve a decent voice clone.
  • Multimodality – It can not only clone voices but also simulate emotions and even background noises. However, sometimes you don’t want that, and removing unwanted emotional interpretations or inappropriate sounds can be tricky.

Installation

Thanks to these two helpful YouTube tutorials – and with the help of ChatGPT – the installation finally worked:
1. Local installation of Bark: https://youtu.be/jUlPlCBQOC8
2. Clone voices: https://youtu.be/o8-1hb7hFTI
A small note on point 2: I had to specify the full path to the checkpoint – not just "bark/", but "/Users/(your username)/bark/". It took me forever to figure that out!
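
The difference is simply relative versus absolute paths: "bark/" is resolved against whatever directory Python is started from, while the full path always points to the same place. A small illustration (the file name below is just a placeholder, not from the tutorial):

    import os

    # Relative path: resolved against the current working directory at runtime,
    # so it breaks if the script is started from somewhere else.
    print(os.path.abspath("bark/"))

    # Absolute path: independent of where the script is started.
    checkpoint_dir = "/Users/yourname/bark/"  # replace "yourname" with your username
    checkpoint_path = os.path.join(checkpoint_dir, "checkpoint.pt")  # placeholder file name
    print(checkpoint_path)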

Challenges I encountered

Speaker ID:

  • Keep the audio file used as the speaker ID as short as possible (for me, a single sentence was enough). Longer recordings resulted in complete nonsense.
  • The recording must already be in the target language. (At first, I used a German audio file and was confused by the strange output.)
  • I simply entered the same text I used for the speaker ID as the output text in the Python script.
  • If I record the speaker ID in heavily accented English as a German speaker, that accent will carry over to all generated outputs.
  • To avoid this, I should choose a sentence I can pronounce relatively accent-free.
  • Using a German speaker ID for an English text sometimes works, but it increases the chance of a noticeable German accent—unless that’s the intended effect.
Playback Clarity & Sample Rate:

  • I noticed that my cloned voice sounds much clearer when played back at a faster speed.
  • To improve clarity, I’ll try saving it with a 48,000 Hz sample rate; this might be the solution.
  • However, this only applies to exporting and playback; in the Python script, the sample rate should remain at 24,000 Hz, or the voice will sound like Mickey Mouse.
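
A sketch of what this looks like in practice (assuming the standard Bark API and scipy): generate and save at Bark’s native 24,000 Hz, and only upsample a copy for export or playback.

    import numpy as np
    from bark import SAMPLE_RATE, generate_audio  # SAMPLE_RATE is 24,000 Hz
    from scipy.io.wavfile import write as write_wav
    from scipy.signal import resample_poly

    audio = generate_audio("A short test sentence.")

    # Inside the script: always Bark's native rate, otherwise pitch and speed shift (Mickey Mouse).
    write_wav("native_24k.wav", SAMPLE_RATE, audio)

    # For export/playback only: upsample to 48,000 Hz (adds no detail, just a higher rate).
    audio_48k = resample_poly(audio, up=2, down=1).astype(np.float32)
    write_wav("export_48k.wav", 48000, audio_48k)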

Recording Quality Tips:

  • Record the audio used as the speaker ID as cleanly as possible: short, well-articulated samples with no background noise make a big difference.
  • Experiment with different sentences or neutral texts. At first, I put too much emphasis on pronunciation.
  • I keep noticing a slight noise in my cloned voice, but I want a clean output. ChatGPT suggested inserting the language tag [en] into the text, but this made the voice drift further from my natural sound, so I eventually removed the tags.

Exporting in the Correct Format:

  • After countless failed attempts, ChatGPT finally revealed how crucial the export format is. Even if the audio sounds decent, record it in an audio app like Adobe Audition or Audacity and export it with these settings:
    • Format: WAV (WAVE PCM)
    • Sample Rate: 44,100 Hz (I haven’t tested 48,000 Hz yet; 24,000 Hz didn’t work well for me.)
    • Channel Mode: Mono
    • Bit Depth: 16-bit
  • These parameters weren’t mentioned in the installation videos (or I missed them), but they are extremely important!
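
If you would rather convert an existing recording than re-export it by hand, the same settings can be applied in code. A sketch using the pydub library (which needs ffmpeg installed; file names are examples):

    from pydub import AudioSegment  # pip install pydub, requires ffmpeg

    # Convert any recording to WAV, 44,100 Hz, mono, 16-bit.
    recording = AudioSegment.from_file("raw_recording.m4a")
    speaker_id = (
        recording
        .set_frame_rate(44100)   # sample rate
        .set_channels(1)         # mono
        .set_sample_width(2)     # 16-bit = 2 bytes per sample
    )
    speaker_id.export("speaker_id.wav", format="wav")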

Important Parameters in the Python File

1. seed

The parameter seed=42 is commonly used to control the random number generation process in models and algorithms.

Meaning of seed=42:

  • A seed is an initial value that initializes the random number generation. Although it sounds random, a seed actually produces deterministic results, meaning that using the same seed will always reproduce the same outputs.
  • The number 42 itself is completely arbitrary. However, it has become an unofficial standard in programming, partly due to its cult status from The Hitchhiker’s Guide to the Galaxy, where 42 is the answer to the “ultimate question.”

Usage in the Context of Bark:

  • With seed=42, you ensure that the outputs (e.g., generated voice or audio) remain reproducible as long as all other parameters and inputs stay the same. This is especially useful when testing and comparing different configurations.
  • Different seed values can introduce small variations in the output, such as changes in tone, timing, or expression.

Conclusion:

If you want consistent results, stick to a fixed seed like 42. If you want to experiment with more variation, simply change the seed value (or don’t fix one at all).
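
As far as I can tell, generate_audio itself has no seed parameter in the standard Bark API, so the seed is typically fixed globally before generation. A minimal sketch (assuming NumPy and PyTorch, which Bark builds on):

    import random
    import numpy as np
    import torch

    def set_seed(seed: int = 42):
        # Fix all relevant random number generators so repeated runs
        # with identical inputs and parameters produce the same audio.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)

    set_seed(42)
    # audio = generate_audio(text, history_prompt=speaker)  # now reproducible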

2. waveform_temp

waveform_temp=0.7 (Waveform Temperature):

  • This parameter influences the randomness in audio waveform generation. Higher values create more variation, while lower values produce more conservative results. It may help if pauses and intonation sound too monotonous, but it is not the main factor affecting speed.
  • Optimizing the parameter: experiment with higher waveform_temp values (e.g., 0.8 or 0.9) to encourage more variation, reduce monotone pauses, and make pacing sound more fluid and natural. You’ll need to experiment to find the sweet spot.

3. temperature

temperature=0.7 (Text Temperature):

  • This setting affects randomness in text output. A higher value can lead to more unpredictable results, while lower values make the generated speech more stable and coherent. However, this parameter primarily influences text structure, not speech speed.
  • Similar to waveform_temp, a slightly increased value (e.g., 0.8) can make speech patterns sound more natural by introducing subtle variations.
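
Note that in the standard Bark API these two values are passed to generate_audio as text_temp and waveform_temp. A hedged sketch of how they might be varied (the speaker path is an example):

    from bark import generate_audio, preload_models

    preload_models()

    speaker = "/Users/yourname/bark/speaker.npz"  # example path to a cloned speaker ID
    text = "This is a short test sentence."

    # Conservative, stable output:
    audio_stable = generate_audio(text, history_prompt=speaker, text_temp=0.7, waveform_temp=0.7)

    # Slightly more variation, which can sound less monotonous:
    audio_varied = generate_audio(text, history_prompt=speaker, text_temp=0.8, waveform_temp=0.9)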

Conclusion

I wanted… / I have…

  • I wanted: to narrate my videos in English and avoid embarrassing mistakes in my pronunciation.
    I have: If you don’t check the outputs, embarrassing results can still happen. But overall, it’s great.
  • I wanted: to make my English sound fluent without recording multiple times (it still has room for improvement).
    I have: It takes a lot (!) of repetitions to get everything just right, create a good speaker ID, and fine-tune the parameters, at least on the first try.
  • I wanted: to avoid post-editing speech errors, too many “uhs,” and long pauses.
    I have: The “uhs” actually aren’t that bad; Bark even adds them intentionally from time to time, though of course there shouldn’t be too many. The pauses in Bark can be annoying, just like Bark’s own text interpretations, which can be both amusing and refreshing.
  • I wanted: the speech to be clear and understandable while still sounding human.
    I have: Bark does a really good job at that, although by now I can recognize the Bark laugh everywhere.

Is My Voice Cloned Convincingly?

To be honest, not really (yet) in my case. There are some similarities, but then it sounds different again. However, it’s definitely better than Bark’s default English female voice, which (at least for now) only comes in a single speaker version.

But maybe I’ll still figure it out…

A Few Final Tips

Instead of generating just one sentence at a time (to avoid overwhelming Bark), there is also a way, as mentioned in the second video, to automatically generate multiple sentences in sequence.

If Bark Creates Pauses That Are Too Long, You Can:

  • Use shorter text inputs. Long texts often introduce pauses intentionally to create structure.
  • Adjust punctuation for rhythm. For example, a period (.) usually works better than a comma (,) for shorter pauses.
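
For illustration, two hypothetical prompts with identical words but different punctuation rhythm:

    # Hypothetical example prompts: same words, different punctuation.
    text_choppy = "Welcome back. Today I am testing Bark. Let's get started."
    text_flowing = "Welcome back, today I am testing Bark, let's get started."
    # The first version tends to produce shorter, cleaner pauses.
    # audio = generate_audio(text_choppy, history_prompt=speaker)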

Emotions and Nuances

  • Bark is particularly good at conveying emotions. If you want to simulate specific emotions, you can adjust your input accordingly.
    For example:
    “I am so excited to tell you about this!” will sound different from
    “I am very calm today.”
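
According to Bark’s documentation, a few nonverbal cues can also be written directly into the prompt, for example:

    # Nonverbal cues embedded in the prompt (tags as listed in Bark's README).
    text_with_cues = "I am so excited to tell you about this! [laughs] Okay... let me calm down. [sighs]"
    # audio = generate_audio(text_with_cues, history_prompt=speaker)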

Interim Results

Here you can see (or hear) the first results:

A single sentence:

Multiline Audio Output (with surprising help from space!):
There’s still a lot to adjust here, such as overly long pauses… The sentences were probably too long.

The Python Scripts I Used

If you’re anything like me, you’ll appreciate having a code snippet to test when you just can’t figure out the error on your own.

But beware: while following the second installation video, I ended up with another folder named bark inside my base folder bark, so you’ll need to adjust the paths accordingly. And, of course, don’t forget to replace the username.

Python script for a simple clone output:
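
In outline, it looks like this (a minimal sketch using the standard Bark API; paths and file names are examples, so adjust them to your setup):

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()

    # Speaker ID created from my own recording (full path, see the installation note above).
    speaker = "/Users/yourname/bark/speaker.npz"

    text = "This is the sentence I also used for the speaker ID."

    audio = generate_audio(text, history_prompt=speaker, text_temp=0.7, waveform_temp=0.7)

    # Keep 24,000 Hz here; higher rates inside the script cause the Mickey Mouse effect.
    write_wav("clone_output.wav", SAMPLE_RATE, audio)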

Python Script for a Multi-Line Output:

I had ChatGPT generate the code for multi-line output based on my simple script.
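
In essence, it splits the text into sentences, generates each one separately, and joins the audio pieces with a short pause. A sketch of that structure (example paths; the details of my version may differ):

    import numpy as np
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()

    speaker = "/Users/yourname/bark/speaker.npz"

    sentences = [
        "This is the first sentence.",
        "This is the second sentence.",
        "And this is the last one.",
    ]

    # A quarter second of silence inserted between sentences.
    silence = np.zeros(int(0.25 * SAMPLE_RATE), dtype=np.float32)

    chunks = []
    for sentence in sentences:
        audio_chunk = generate_audio(sentence, history_prompt=speaker)
        chunks.append(audio_chunk)
        chunks.append(silence)

    full_audio = np.concatenate(chunks)
    write_wav("multiline_output.wav", SAMPLE_RATE, full_audio)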

While reviewing it, I wondered what this newly integrated code means:

1. import numpy as np

Meaning:
This command imports the NumPy library, a powerful library for numerical computations in Python.

  • as np: This sets an alias (np), so instead of writing numpy.array, you can simply use np.array. This saves typing effort and keeps the code cleaner.

Use in Bark:
NumPy is often used for mathematical operations, manipulating arrays, or efficiently processing data—such as generating and handling audio waveforms.

2. What is np?

np is just an alias (short name) for the NumPy library. It’s a convention widely used in Python to keep code consistent across different projects.

3. What is a chunk?

chunk is a general term used in various contexts, especially in data processing:

  • In Bark: A chunk could refer to a piece of data being processed in a loop. For example, audio data may be split into smaller parts for efficient analysis or generation.
  • In general: It often describes a subset of data, such as a segment of an audio signal, a text, or a file.

Bark or ElevenLabs?

That’s a question worth considering…

Bark

Advantages:
  • Open Source: Bark is free and allows developers to modify and extend the code.
  • Multimodality: It can not only clone voices but also simulate emotions and even background noises.
  • Flexibility: Supports a wide range of use cases.

Disadvantages:
  • Complexity: Setup and usage require technical knowledge, especially for beginners.
  • Performance: The generated voices can sometimes sound less natural compared to commercial tools.
  • Speed: Processing can be slower, especially for longer texts or complex requirements.

ElevenLabs

Advantages:
  • User-Friendly: The platform is easy to use, even for non-technical users.
  • High Quality: The generated voices sound very natural and are often indistinguishable from real ones.
  • Speed: Processing is fast, making it ideal for real-time applications.
  • Additional Features: Offers professional options like Text-to-Speech and Speech-to-Speech with customizable voices.

Disadvantages:
  • Cost: ElevenLabs is paid, and prices can be high for large projects.
  • Limited Customization: Compared to open-source tools like Bark, it offers less flexibility for developers.
  • Licensing: Usage requires compliance with strict licensing conditions, especially for commercial applications.

Conclusion

  • Bark is ideal for developers and experimental projects where flexibility and cost-free usage are key.
  • ElevenLabs is perfect for professional applications where ease of use and high audio quality are essential.

Comparison Example – ElevenLabs Output (Standard Voice, Not Cloned)

Super easy, clean, and quickly generated! Of course, if I use French expressions, it tries to pronounce them in English. Also, a poorly structured sentence is simply adopted as is. 😉

I haven’t tried the cloned voice yet since it’s a paid feature. But I will test it soon and add my experience here!

How I Finally Did It:

  1. I had YouTube generate a transcript of my German spoken text and made some revisions to the subtitles.
  2. Then, I let YouTube automatically translate it into English.
  3. I exported the texts and used Bark to generate audio files in blocks.
  4. These were never convincingly cloned, so I only used them as a base to train my pronunciation.
  5. I then recorded my own audio files using this text, with my German accent.
  6. I inserted these audio files into the video and realized that resynchronizing a video in a foreign language is a huge challenge. But with a lot of adjustments, I somehow managed to make it work. You can see the result here: https://majoc.digital/the-numerical-secret-of-my-1998-calendar/
  7. For me, the most beneficial approach is clearly to record the texts myself in English or even French. This helps me train my pronunciation, and as long as it’s understandable, an accent can actually be quite charming—just like the occasional “uhm.”
  8. If voice cloning works and doesn’t sound robotic, it can be a huge relief when you need to quickly record texts in a completely unfamiliar foreign language. However, someone who actually understands the language should listen to it before it’s published. The same applies to my own dubbing—it should imitate a human voice but not sound robotic.
    But if it’s a language you’d like to speak yourself, I recommend going through the challenging learning process. In the end, you’ll retain more than just a professional tool subscription. 😉
  9. For next time, I’ve come up with a different system that I’m now planning to test…
