How to transcribe a focus group without losing the speaker labels

A practical guide to recording, transcribing, and tagging focus group sessions so the speaker labels survive the journey from audio to analysis.

May 3, 2026 · 9 min read · 2,073 words · By Andrei

If you have ever opened a focus-group transcript and seen Speaker 1, Speaker 2, Speaker 1, Speaker 4, Speaker 2, Speaker 1… for nine pages, you know the problem this guide solves.

Focus groups are difficult to transcribe well — not because the words are hard, but because the speaker labels are. When eight people share one room, one microphone, and a habit of finishing each other's sentences, even a careful AI transcription tool will scramble who said what within ten minutes. By the time you reach analysis, you cannot tell whether "the parents really hated the price" came from one outspoken person or six different ones, and the entire qualitative finding is suspect.

This piece walks through how researchers actually solve this (before, during, and after the session) and what to look for in a transcription tool. It is written for UX researchers, market researchers, academics, and anyone running their first focus group with a budget that does not stretch to a human transcriber charging several dollars per audio minute.

Why focus group transcripts go wrong

Three problems compound:

Crosstalk. Two people talk at once. The microphone records a single audio waveform that the model has to untangle. Even state-of-the-art speaker-diarization models lose accuracy steeply once five or more voices share close acoustic conditions.

Similar voices. Pitch, accent, and speaking pace are the cues machine listeners use to separate people. Two adult women with similar regional accents, sitting equidistant from the same microphone, are essentially the same voice to the model. It will collapse them into one speaker label, then guess wrong every few minutes about who is who.

Tools that label everyone "Speaker 1". Most consumer-grade transcription tools were built for one-on-one interviews and meetings. They handle the two-speaker case well, then degrade fast. By six speakers most tools either give up and label everyone as one speaker, or rotate labels arbitrarily so the same person becomes Speaker 3 early in the session and Speaker 7 later. Neither failure mode can be fixed automatically afterwards; the only remedy is relabelling by hand against the audio.

The result is a transcript that reads as a chronological list of words but cannot answer the question that matters most in qualitative analysis: who said this, and what else did they say?

Before you hit record

Most of the speaker-label problem is solvable before the session starts. Three preparations do most of the work.

Microphone placement

A single tabletop microphone in the centre of the table will record everyone, but it will also flatten the audio so the model has nothing to discriminate on. If your budget allows, lavalier microphones for each participant are the gold standard — they capture each voice on a near-isolated channel. Software like Riverside.fm, SquadCast, or even a multi-track Zoom recording will preserve those channels separately, and modern transcription tools can transcribe each track independently with the speaker label baked in.
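A minimal sketch of what "speaker label baked in" means in practice, assuming each isolated track has already been transcribed into (start time, text) pairs. The `Segment` type, `merge_tracks`, `to_lines`, and the input layout are hypothetical illustrations, not the export format of any particular tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the recording
    speaker: str  # taken from the track name, not guessed from the voice
    text: str


def merge_tracks(tracks: dict[str, list[tuple[float, str]]]) -> list[Segment]:
    """Interleave per-speaker segments into one chronological transcript.

    `tracks` maps a speaker name to the (start_seconds, text) pairs
    produced by transcribing that speaker's track on its own.
    """
    segments = [
        Segment(start, speaker, text)
        for speaker, items in tracks.items()
        for start, text in items
    ]
    # sorted() is stable, so segments with identical start times keep
    # the order in which the tracks were supplied.
    return sorted(segments, key=lambda segment: segment.start)


def to_lines(segments: list[Segment]) -> list[str]:
    """Render segments as 'Speaker: text' lines."""
    return [f"{segment.speaker}: {segment.text}" for segment in segments]
```

Because every segment inherits its label from the track it came from, there is no voice-matching step to get wrong. Overlapping speech simply shows up as adjacent lines from different speakers.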

If individual microphones are not realistic, two or three boundary microphones spread across the table are the next best thing. They give the model directional information it can use to separate voices.

A single phone propped in the middle of the table, recording mono, is the worst case. The transcript will be readable, but the speaker labels will be a coin flip. If that is your only option, the tactics in the next two sections matter more, not less.

Recorded introductions

Ask each participant to introduce themselves on the recording, in turn, in the first 90 seconds: name, role, one short sentence. This serves three purposes:

It gives the transcription tool a clean voice sample for each person, anchored to a name they speak aloud.
It catches participants who arrived late or were missed during off-the-record introductions.
It creates a permanent, time-stamped audio record of who was in the room, which is useful for both analysis and consent compliance.

Modern AI transcription tools that support named-speaker labelling (more on this in the tools section) can pick up the introductions and use the spoken names as labels for the rest of the transcript. This is typically faster and less error-prone than relabelling numbered speakers by hand after the fact.

A naming sheet

Even if you record introductions, write down each participant's name in seating order. If a participant prefers a pseudonym for the transcript, note that too. This sheet is your fallback if the audio introductions are inaudible or get cut, and it is what you will paste into the "speaker names" field of any tool that supports it.

For sessions where participants will not be on the record by name (common in market research), pre-decide pseudonyms — P1, P2, Marketing Manager A, Parent #3, whatever fits your protocol — and use those consistently from the moment of recording.
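If you go with codes, the mapping can be generated straight from the naming sheet. The sketch below is one way to do it; `assign_codes` and the default `P` prefix are illustrative choices, and the names in the usage note are made up:

```python
def assign_codes(seating_order: list[str], prefix: str = "P") -> dict[str, str]:
    """Map each real name to a code (P1, P2, ...) in seating order."""
    codes: dict[str, str] = {}
    for name in seating_order:
        key = name.strip()
        if not key:
            continue  # skip blank rows on the sheet
        if key in codes:
            # Two participants with the same name need a manual tie-break
            # (for example "Sara K." and "Sara M.") before coding starts.
            raise ValueError(f"duplicate name on the sheet: {key!r}")
        codes[key] = f"{prefix}{len(codes) + 1}"
    return codes
```

Calling `assign_codes(["Maya", "Daniel", "Sara"])` maps Maya to P1, Daniel to P2, and Sara to P3. Store the resulting mapping away from the transcript itself so the codes stay confidential.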

During the session

Two habits in the room dramatically improve transcript quality:

Ask the moderator to use speakers' names early on. When the moderator says "Sara, what about you?" the model gets another name-to-voice anchor for free. This is the single highest-leverage thing a moderator can do for downstream transcription. Aim for two or three uses of each name in the first 15 minutes.

Keep a backup recording. Phone next to the laptop, voice recorder app on the moderator's tablet — anything. Audio loss in focus groups is more common than people expect, especially with cloud-recording tools that depend on a stable internet connection. A second device that captured the session is the difference between rescheduling eight participants and a 30-minute hassle.

Crosstalk discipline is harder to enforce, but small interventions help: a moderator who occasionally says "let me come back to you in a sec, Daniel — Sara, finish your thought" makes the resulting audio cleaner without breaking the conversational flow.

Choosing a transcription approach

You have three real options, with different cost / quality / control tradeoffs.

Option 1: Manual transcription

A human transcriber types the transcript by listening to the audio. For a focus group with clear audio and 6 participants, expect 6 to 8 hours of typing per recorded hour, billed at $1.50 to $4.00 per audio minute (US rates). A 90-minute session will cost roughly $135 to $360 and arrive in 2 to 4 days.
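The arithmetic behind those figures, as a small sketch. The function names are hypothetical; the default rates and ratios are the ones quoted above:

```python
def manual_cost_range(
    audio_minutes: float,
    low_rate: float = 1.50,   # USD per audio minute, low end
    high_rate: float = 4.00,  # USD per audio minute, high end
) -> tuple[float, float]:
    """Return the (low, high) cost in USD for a recording."""
    return (audio_minutes * low_rate, audio_minutes * high_rate)


def typing_hours_range(
    audio_minutes: float,
    low_ratio: float = 6.0,   # hours of typing per recorded hour, low end
    high_ratio: float = 8.0,  # hours of typing per recorded hour, high end
) -> tuple[float, float]:
    """Return the (low, high) hours of transcriber effort for a recording."""
    recorded_hours = audio_minutes / 60
    return (recorded_hours * low_ratio, recorded_hours * high_ratio)
```

For a 90-minute session this gives $135 to $360 and roughly 9 to 12 hours of typing, which is why turnaround is measured in days rather than hours.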

Manual transcription remains the gold standard for legal-grade or publication-grade transcripts where every speaker attribution must be defensible. For most qualitative research, where you will spend more time coding the transcript than reading it, manual is overkill.

Option 2: AI transcription with speaker diarization

Speaker diarization is the technical name for "the model groups segments by which voice produced them." Tools like Otter, Notta, and the basic AI plans from Rev offer this. They will give you a transcript labelled Speaker 1, Speaker 2, etc.

Diarization works moderately well for two-speaker interviews and degrades quickly past four speakers. For an eight-person focus group, expect to spend 30 to 60 minutes manually relabelling speakers in the resulting transcript before you can analyse it. That is sometimes acceptable, sometimes ruinous, depending on how many sessions you are running.

The cost is low — usually $8 to $20 per recorded hour at consumer pricing — but the manual cleanup time eats most of the savings.

Option 3: AI transcription with named-speaker labelling

Some newer AI tools can use spoken introductions or pre-supplied speaker names to label the transcript with actual names instead of numbers. When the participant says "Hi, this is Maya," the tool labels Maya's lines as Maya for the rest of the recording, even an hour in, long after she last said her name.

This is the right approach for focus groups when you can get clean introductions on the recording. The cost is similar to diarization-only tools, and the cleanup time drops from "up to an hour per session" to "ten minutes per session", usually just verifying that the model got the matching right.

This is the approach we built CleanScribe.ai around. We support both: spoken-introduction matching, and a free-text "speaker names" field at upload time so you can pre-load the names from your prep sheet. For multi-hour focus groups (3 or 4 hours of audio is not unusual when participants run long), we also handle the file in a single pass without forcing you to split it.

If you want to try it on a session you already have: the free tier is 120 minutes per month, no credit card. That is enough for a focus group or two before you decide whether to upgrade.

After the transcript: cleanup and coding

Even with a well-labelled transcript, three small post-processing steps save time during analysis.

A 5-minute QA pass

Open the transcript at the 20-minute mark (not the start — the start is always the cleanest). Read for two minutes. Look for:

Speaker labels switching mid-sentence (a sign the model lost the thread)
The moderator's lines being attributed to participants
Long monologues attributed to one person that obviously came from a discussion (a sign of crosstalk that defeated the model)

If you see these, scan the rest of the transcript at the points where new participants speak up for the first time. That is where most labelling errors compound.
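Two of those warning signs can be roughly pre-screened before you start reading. The sketch below is a heuristic only, assuming a plain-text transcript with one "Name: text" turn per line and no leading timestamps; `parse_turns`, `qa_flags`, and the 250-word threshold are hypothetical choices, and moderator lines attributed to participants still need a human read:

```python
import re

# A speaker label is any short run of non-colon characters before the first colon.
LINE_RE = re.compile(r"^(?P<speaker>[^:\n]{1,40}):\s*(?P<text>.+)$")

SENTENCE_ENDINGS = (".", "?", "!", "…", '"')


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, text) turns, skipping unlabelled lines."""
    turns = []
    for line in transcript.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            turns.append((match["speaker"].strip(), match["text"].strip()))
    return turns


def qa_flags(turns: list[tuple[str, str]], max_words: int = 250) -> list[tuple[int, str]]:
    """Return (turn_index, reason) pairs worth checking against the audio."""
    flags = []
    for i, (speaker, text) in enumerate(turns):
        if len(text.split()) > max_words:
            flags.append((i, "long monologue"))
        if i > 0:
            prev_speaker, prev_text = turns[i - 1]
            prev_unfinished = not prev_text.endswith(SENTENCE_ENDINGS)
            # A new speaker starting in lowercase right after an unfinished
            # sentence suggests the label changed mid-sentence.
            if speaker != prev_speaker and prev_unfinished and text[:1].islower():
                flags.append((i, "label switch mid-sentence"))
    return flags
```

Treat the flagged turn indexes as places to listen, not as confirmed errors: genuine interruptions produce the same pattern as a lost speaker label.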

Standardise speaker names

Decide your label format before you start coding, not after. Common formats:

Real first names (Maya, Daniel) for studies where participants are on the record
Codes (P1, P2) for confidential studies; map them in a separate password-protected file
Roles (Moderator, Parent A, Parent B) for studies where the role matters more than the person

Find-and-replace is your friend. If you set this up before you import to your analysis tool, you will not have to redo it.
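A plain find-and-replace can also rewrite a name where it is quoted inside someone's sentence. A slightly safer variant only touches labels at the start of a line. This is a sketch, assuming one "Label: text" turn per line; `relabel` is a hypothetical helper and the labels in the usage note are made up:

```python
import re


def relabel(transcript: str, mapping: dict[str, str]) -> str:
    """Replace speaker labels at the start of each line.

    Names mentioned inside the spoken text are left untouched.
    """
    if not mapping:
        return transcript
    alternatives = "|".join(re.escape(label) for label in mapping)
    # The trailing colon keeps "Speaker 1" from matching "Speaker 10".
    pattern = re.compile(rf"^({alternatives}):", flags=re.MULTILINE)
    # One pass over the text, so swapping two labels with each other is safe.
    return pattern.sub(lambda match: f"{mapping[match.group(1)]}:", transcript)
```

For example, `relabel(text, {"Speaker 1": "Maya", "Speaker 2": "Daniel"})` renames both labels in one pass and leaves any label missing from the mapping as it was.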

Export for analysis tools

Most qualitative analysis tools (NVivo, Atlas.ti, MAXQDA, Dovetail, Reduct) accept plain-text transcripts with speaker labels at the start of each line. The format "Maya: Lorem ipsum…" works in all of them. Tools that produce JSON or proprietary formats will need a one-time export script; ask the tool vendor or check their docs.
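As an illustration of such a one-time export script, here is a sketch that assumes a JSON export shaped like `{"segments": [{"speaker": ..., "text": ...}]}`. That schema is hypothetical; real tools name and nest these fields differently, so check the vendor's docs before adapting it:

```python
import json


def json_to_plain_text(raw_json: str) -> str:
    """Convert a segment-based JSON transcript to 'Speaker: text' lines."""
    data = json.loads(raw_json)
    turns: list[list[str]] = []  # each entry is [speaker, text]
    for segment in data["segments"]:
        speaker = segment["speaker"].strip()
        text = " ".join(segment["text"].split())  # collapse stray whitespace
        if not text:
            continue
        if turns and turns[-1][0] == speaker:
            # Consecutive segments from the same speaker become one turn.
            turns[-1][1] += " " + text
        else:
            turns.append([speaker, text])
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)
```

Merging consecutive segments keeps one turn per line, which is the unit most analysis tools code against.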

Keep the audio file alongside the transcript in the same folder. When a quote is ambiguous during coding, you want to hear the inflection, the pause, the laughter — and the audio is the only way to do that. A clean transcript and the original recording together are the working pair.

A short checklist

If you remember nothing else from this piece, this is the working checklist:

Before the session:

Decide on lavalier vs. boundary vs. tabletop mic
Write the participant naming sheet
Backup recording device ready
Plan the introduction round

During the session:

Each participant introduces themselves on the recording
Moderator uses each participant's name two or three times in the first 15 minutes
Backup recorder running

After the session:

Choose a transcription tool that supports named speakers (or budget the manual cleanup time)
Pre-fill speaker names at upload if your tool supports it
5-minute QA pass at the 20-minute mark
Standardise label format before coding
Keep the original audio next to the transcript for ambiguous quotes

Where CleanScribe fits

CleanScribe.ai was built for exactly this use case: long-form qualitative recordings with multiple speakers, where the speaker labels matter as much as the words. Three things we did differently:

Named speakers, not numbers. When a participant introduces themselves on the recording, we use that name. You can also pre-fill names at upload.
Long files in one pass. Up to 8 hours per upload, no splitting. Long focus groups go in as one file and come out as one transcript.
Polished prose, not a recording in text. We strip the umms, the false starts, and the repeated half-sentences so the transcript reads as a conversation. The meaning stays; the noise goes. The original audio is still there if you want to listen back to a specific quote.

The free tier is 120 minutes per month. No credit card. Try it on a session you have already transcribed somewhere else, and compare the speaker labelling side by side.

→ Start free at cleanscribe.ai/for/researchers

Have a transcription workflow tip we should add to this guide? Email us — we update this piece as new tools and techniques become standard.