Voice Cloning

Voice cloning lets a character sound as consistent as it looks. Influgen routes voice requests through a voice provider layer and exposes ElevenLabs-style controls (stability, similarity, style, and speaker boost) when the configured backend supports them.

What voice cloning is used for

Once a voice clone is attached to a character, it can be used for:

  • voice previews
  • generated speech clips
  • talking head videos
  • narration layered onto content items

The settings used with the clone can also become the defaults for future voice generations.

Two ways to create a voice clone

Upload samples

Use multipart form upload when you have raw files. Influgen accepts sample fields named samples or sample.

This is the best choice when:

  • you have clean WAV or MP3 files on hand
  • you want the browser workflow
  • you are building the voice directly from the web app
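As a rough sketch of the upload path, the helper below shapes audio into multipart file fields. Only the field names samples and sample come from this page; the helper name, the empty-file check, and the fallback content type are illustrative assumptions, and no Influgen endpoint is implied.

```python
from mimetypes import guess_type

# Field names this page says Influgen accepts for uploaded audio.
ACCEPTED_FIELDS = ("samples", "sample")


def sample_form_fields(samples, field_name="samples"):
    """Turn (filename, audio_bytes) pairs into multipart file fields.

    Hypothetical helper. It returns a list of
    (field, (filename, data, content_type)) tuples, the shape that HTTP
    clients such as `requests` accept in their `files=` argument.
    """
    if field_name not in ACCEPTED_FIELDS:
        raise ValueError(f"field must be one of {ACCEPTED_FIELDS}")
    fields = []
    for filename, data in samples:
        if not data:
            # Assumption: an empty file is never a usable sample.
            raise ValueError(f"{filename} is empty")
        # Guess the MIME type from the extension; fall back to a generic one.
        content_type = guess_type(filename)[0] or "application/octet-stream"
        fields.append((field_name, (filename, data, content_type)))
    return fields
```

With the requests library, the result could be passed as files=fields alongside the other clone inputs in data=, against whatever upload URL your deployment uses.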

Provide sample URLs

Use sample_urls when the audio already lives remotely. This is especially useful in mobile workflows, where the voice lab expects direct audio sample URLs if you are not uploading files.

This is the best choice when:

  • you are using the mobile app
  • your asset pipeline already stores speech samples in cloud storage
  • you want to clone voices from a backend automation flow
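For the URL path, a backend job might assemble the request body along these lines. This is a minimal sketch: the function name, the JSON encoding, and the rule that a "direct" URL means a plain http(s) link are assumptions, not documented Influgen behavior.

```python
import json
from urllib.parse import urlparse


def clone_body_from_urls(name, sample_urls):
    """Build a JSON body that points at remote samples via sample_urls."""
    if not sample_urls:
        # Assumption: a clone needs at least one sample.
        raise ValueError("at least one sample URL is required")
    for url in sample_urls:
        parts = urlparse(url)
        # Assumption: "direct" means an http(s) link with a host,
        # not a storage URI such as s3://... or a bare filename.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a direct http(s) URL: {url!r}")
    return json.dumps({"name": name, "sample_urls": list(sample_urls)})
```

Rejecting non-HTTP URIs up front keeps a bad link from surfacing later as a failed clone.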

[screenshot: Voice clone panel with sample upload area, tuning sliders, and preview player]

Voice clone inputs

You can provide:

  • name
  • description
  • labels
  • sample_urls
  • model_id
  • stability
  • similarity_boost
  • style
  • use_speaker_boost

Keep the labels practical. They are most useful for tracking accent, language, pacing, or intended use, not for writing long descriptions.
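One way to keep these inputs together in client code is a small dataclass. The field names mirror the list above; the types, the choice to treat name as required, and the key/value shape of labels are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class VoiceCloneInput:
    """Hypothetical container for the clone inputs listed above."""
    name: str                                        # assumed required
    description: Optional[str] = None
    labels: dict = field(default_factory=dict)       # assumed short key/value tags
    sample_urls: list = field(default_factory=list)
    model_id: Optional[str] = None
    stability: Optional[float] = None
    similarity_boost: Optional[float] = None
    style: Optional[float] = None
    use_speaker_boost: Optional[bool] = None

    def to_payload(self):
        """Return only the fields that were set; unset ones are omitted."""
        return {k: v for k, v in asdict(self).items() if v not in (None, {}, [])}


clone = VoiceCloneInput(name="Ava", labels={"accent": "en-GB", "use": "narration"})
payload = clone.to_payload()
```

Explicit values such as 0.0 or False are kept in the payload; only fields left at None or empty are dropped.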

Best practices for sample quality

  • Use dry speech without music or room echo.
  • Keep the microphone quality reasonably consistent.
  • Avoid stacking multiple speakers in one sample.
  • Aim for natural pacing rather than exaggerated performance.
  • Use the same language and accent you expect to publish with.

Poor samples create a clone that sounds unstable no matter how much tuning you do later.

Preview vs. generate

Influgen exposes two voice actions:

  • Preview: generate a short test clip from text
  • Generate: create speech for a full asset or a linked content item

If you generate without a content_item_id, the call runs synchronously and returns the voice output immediately. If you include a content_item_id, Influgen queues the generation so it moves through the job system alongside the rest of your content workflow.
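The synchronous-versus-queued rule can be sketched as a small request builder. Only content_item_id comes from this page; the text field and the "sync" and "queued" labels are hypothetical names used to make the branch visible.

```python
from typing import Optional


def generate_request(text: str, content_item_id: Optional[str] = None):
    """Return (body, mode) for a Generate call.

    mode is "sync" when the caller should expect the voice output in the
    response, and "queued" when the work is handed to the job system.
    """
    body = {"text": text}  # "text" is an assumed field name
    if content_item_id is None:
        return body, "sync"
    body["content_item_id"] = content_item_id
    return body, "queued"
```

Client code can branch on the mode: read the audio from the response in the synchronous case, or track the job in the queued case.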

Tuning controls

Different providers interpret these values slightly differently, but the working intent is consistent:

  • stability: reduces variation between takes
  • similarity_boost: pushes the output closer to the cloned identity
  • style: increases expressive shaping
  • use_speaker_boost: helps preserve speaker character

If a clone sounds flat, adjust style before adding more samples. If it sounds unstable, improve sample quality before increasing similarity_boost.
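The troubleshooting order above can be written down as a small helper. This is a sketch under stated assumptions: the 0 to 1 range, the default values, the step size, and the reading of "adjust style" as "raise style" are illustrative, since providers interpret these controls differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Tuning:
    # Default values are placeholders, not Influgen or provider defaults.
    stability: float = 0.5
    similarity_boost: float = 0.75
    style: float = 0.0
    use_speaker_boost: bool = True


def clamp01(value):
    """Clamp a control into 0..1 (an assumed range)."""
    return max(0.0, min(1.0, float(value)))


def respond_to(symptom, tuning, step=0.1):
    """Return (new_tuning, advice) for a symptom heard in a preview."""
    if symptom == "flat":
        # Try more expressive shaping before collecting more samples.
        return replace(tuning, style=clamp01(tuning.style + step)), "raised style"
    if symptom == "unstable":
        # Leave similarity_boost alone until the samples are cleaner.
        return tuning, "improve sample quality first"
    return tuning, "no change"
```

Calling respond_to("flat", Tuning()) returns a copy with a higher style value, while "unstable" deliberately leaves every control untouched.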

Where voice shows up next

After cloning a voice, the next high-leverage workflows are: