
How to transcribe a long-form interview without losing the quote you need

A practical guide to recording and transcribing journalist interviews so the quote you need is still findable on deadline.

May 12, 2026 | 12 min read | 2,728 words | By CleanScribe Editorial

If you have ever sat in a coffee shop the night before deadline scrubbing through a 90-minute interview looking for one sentence the source said about halfway through, you know the problem this guide solves.

Long-form interview transcription is a different problem from transcribing a five-minute clip or a short Q&A. The audio is longer, which means more opportunity for errors to compound. The quote you need is buried without a signpost. And deadline pressure turns a 20-minute cleanup job into an hour of anxious searching. This guide walks through how journalists actually solve this — before the recording starts, during the session, and after the transcript comes back — so the quote you need is one search away instead of half a lost evening.

Why a 90-minute interview becomes 12 hours of work

Most journalists who have lived through this problem know it viscerally, but it helps to name the individual causes. Four problems compound, and the result is worse than the sum of the parts.

Word accuracy drift across long files. A 90-minute interview is a much harder transcription job than a five-minute clip, not just because there is more audio, but because the model's frame of reference accumulates noise over time. Proper nouns, place names, and technical terms that appeared correctly in the first 15 minutes start being rendered phonetically by the time you reach the 60-minute mark. A source who said "the Bergmann report" at minute 12 is being transcribed as "the Bergman report" by minute 45, and then "the big man report" by minute 70. If you are searching the transcript for the phrase you remember, you will not find it.

Speaker labels that flip mid-recording. Speaker diarization — the technique that distinguishes who is speaking — works by tracking voice characteristics over time. In a 90-minute session, small audio events (the source leaning forward, the room HVAC cycling on, a passing truck) can cause the model to lose its thread. When it picks up again, it has reset its speaker count. The result is a transcript where your source is labelled Speaker 1 for the first 40 minutes, then Speaker 2 from 42 minutes onward, with the labels occasionally swapping back. By the time you need to pull a quote, you cannot tell which lines are theirs.

No full-text search across the audio. Audio is a sequential medium. To find something you remember, you have to start at an approximate timestamp and scrub forward and back. If you are wrong about where in the interview it happened — and you often are, because memory flattens the timeline — you add 20 minutes to the search. A transcript with full-text search collapses that to seconds. But only if the transcript is accurate enough for the remembered phrase to be findable.

No reliable way to cite the exact moment. Fact-checkers and editors increasingly want to know not just what a source said, but when. A timestamp to the nearest second — "she said this at 42:17" — is a defensible citation. A transcript with no time references, or with timestamp granularity measured in minutes, forces you back to the audio every time a quote is questioned.

Result: an unusable transcript that wastes the reporting work that produced it.

Before the interview

Most of the problems above are solvable before you hit record. Four preparations do most of the work.

Mic placement

The single most important technical decision you make before an interview is microphone placement. A lavalier clipped to the source's lapel captures their voice clean, isolated from the room. Pair that with a backup recorder on the table — a second phone running a voice memo app, a handheld digital recorder, anything — and you have two independent captures of the session. If the lavalier battery dies at minute 67, the backup catches the rest. If the backup malfunctions, the lavalier has everything.

When a lavalier is not possible — an ambush interview, a source who objects to the clip, a phone call — prop your phone close to the source rather than halfway between you. Your own voice matters far less for the transcript than theirs; you already know what you asked. A second backup device further away gives you the room and catches your questions for context.

Filename and folder

Name the file with the source's name and the interview date before you start recording, if your recorder allows it, or immediately after if not. Future-you, searching for a quote three months later, searches by source name and story. A file named "2026-05-12 Maya Chen — re: housing piece" is findable in seconds. A file named "voice-memo-072" requires that you remember the number, the device, and the approximate date: three things you will not remember at the same time, under deadline, in six months.

A folder per story, with the audio file and the eventual transcript both inside it, means the working pair stays together. The most common reason journalists lose audio is not file corruption — it is forgetting where they saved it.
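
If you name files by script rather than by hand, the convention above can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed tool: the function names, the ".m4a" extension, and the plain hyphen used as a separator are all assumptions made for illustration.

```python
from datetime import date
from pathlib import Path
import re


def _clean(text: str) -> str:
    """Drop characters that are unsafe in filenames on common filesystems."""
    return re.sub(r'[\\/:*?"<>|]', "", text).strip()


def interview_stem(source_name: str, story: str, when: date) -> str:
    """Build a filename root: ISO date, source name, then story slug."""
    return f"{when.isoformat()} {_clean(source_name)} - {_clean(story)}"


def interview_paths(root, source_name: str, story: str, when: date,
                    audio_ext: str = ".m4a") -> dict:
    """Return one folder per story, with audio and transcript sharing a root."""
    folder = Path(root) / _clean(story)
    stem = interview_stem(source_name, story, when)
    return {
        "folder": folder,
        "audio": folder / f"{stem}{audio_ext}",   # hypothetical extension
        "transcript": folder / f"{stem}.txt",
    }


paths = interview_paths("interviews", "Maya Chen", "housing piece", date(2026, 5, 12))
print(paths["audio"].name)  # 2026-05-12 Maya Chen - housing piece.m4a
```

Putting the ISO date first means an alphabetical sort is also a chronological sort, and the shared root keeps the audio and transcript next to each other in any file browser.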

Asking the source for a spoken introduction

At the top of every recorded interview, before the first substantive question, ask the source to state their name and role on the record. "For the recording, can you introduce yourself — your full name and your title?" This takes eight seconds and pays dividends three ways.

First, it anchors named-speaker labelling for the rest of the recording. When the transcript model hears "I'm Maya Chen, director of housing policy at the city," it has a clean voice sample linked to a name. Every subsequent line Maya speaks can be attributed to Maya by name, not to Speaker 1.

Second, it creates a consent record. When a source later disputes attribution ("I never said that" or "you're taking me out of context"), the audio introduction establishes that they knew they were being recorded from the first moment. This has legal implications in two-party consent jurisdictions, and practical implications when your editor asks whether the story can still run despite a source dispute.

Third, it catches the case where the source's preferred name differs from the booking name. If you booked an interview with "William Chen" and the source opens with "I'm Billy Chen," you now have a transcript labelled correctly from the start instead of one you have to search-and-replace before sharing.

Backup recorder running

Every journalist who has lost audio has learned this lesson once, and learned it badly. Laptop batteries die. USB recorders corrupt files. The voice memo app on your primary phone crashes when you get an incoming call. A second device — a second phone, a dedicated recorder, a tablet on the table — running the entire session is the difference between a missed deadline and a 30-minute hassle when the primary fails.

During the interview

Three habits in the room improve interview transcript quality significantly.

Use the source's name two or three times in the first ten minutes. The transcription model is calibrating its speaker model in the early minutes. When you say "Maya, can you walk me through what happened next?" you give the model an anchor: the person who responds immediately after their name is Maya. "And then?" gives the model nothing. The source's name appearing in your questions is the single highest-leverage thing you can do during the session for downstream transcript quality. Two or three natural uses in the first ten minutes — "So Maya, when you first saw the proposal…" — is enough.

For multi-source interviews — round-tables, panel pieces, executive interviews with a comms handler present — introduce everyone on the record at the start. The comms handler who interjects at minute 22, the deputy who answers three questions while the principal steps out: if they are not introduced on the recording, the transcript will label them as new unknown speakers and scramble the surrounding attribution. A round of introductions takes 60 seconds and prevents 40 minutes of relabelling afterward.

Do not talk over the source. This sounds obvious, but deadline pressure and enthusiasm make it a common failure. Crosstalk — two voices occupying the same audio moment — is the hardest problem in speaker diarization, and it degrades accuracy steeply. When the source is finishing a thought, wait the half-second. The silence is fine. It reads as attentive listening and produces a cleaner transcript. The quote you need was probably said in the last sentence of a long answer; talking over the final clause is the most expensive mistake in interview transcription.

Choosing a transcription approach

You have three real options, with different cost, quality, and control tradeoffs.

Option 1: Manual transcription

A human transcriber listens to the audio and types the transcript. For a long-form interview with clear audio and one or two speakers, expect 6 to 8 hours of transcription per recorded hour — so a 90-minute interview is a 9-to-12-hour job at $1.50 to $4.00 per audio minute (US market rates). That is roughly $135 to $360 for the session, arriving in 2 to 4 days.

Manual transcription remains the gold standard for publication-grade quote verification on a major piece: an investigative cover story, or anything where a misheard word means a retraction. The human transcriber listens twice, uses context to resolve ambiguous words, and can flag inaudible passages honestly. For daily reporting, it is overkill, and the 2-to-4-day turnaround makes it useless on a breaking deadline.
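
The arithmetic behind those figures is easy to rerun for a recording of any length. The sketch below uses only the ranges quoted in this section (6 to 8 hours of effort per recorded hour, $1.50 to $4.00 per audio minute); the function name is made up for illustration, and your local rates may differ.

```python
def manual_estimate(audio_minutes: float,
                    hours_per_audio_hour=(6, 8),
                    rate_per_minute=(1.50, 4.00)):
    """Return ((min_hours, max_hours), (min_cost, max_cost)) for manual transcription."""
    audio_hours = audio_minutes / 60
    # Transcriber effort scales with the length of the recording.
    effort = tuple(audio_hours * h for h in hours_per_audio_hour)
    # Cost is billed per minute of audio, not per hour of effort.
    cost = tuple(audio_minutes * r for r in rate_per_minute)
    return effort, cost


effort, cost = manual_estimate(90)
print(f"Effort: {effort[0]:.0f} to {effort[1]:.0f} hours")  # Effort: 9 to 12 hours
print(f"Cost: ${cost[0]:.0f} to ${cost[1]:.0f}")            # Cost: $135 to $360
```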

Option 2: AI transcription with speaker diarization

The mainstream AI transcription tools — and there are many — offer speaker diarization: they group audio segments by voice and label them Speaker 1, Speaker 2, and so on. This works well for a solo interview (one reporter, one source) and degrades for anything more complex. Two sources with similar vocal profiles, or a source who is also being interviewed by a second reporter in the room, will collapse or swap labels. You will spend 20 to 40 minutes after the transcript arrives manually relabelling who is who.

For a journalist on deadline with a solo interview, this is often fine. The words are right, the labels are numbers instead of the source's name, and a quick find-and-replace handles it. For a panel interview or a source who keeps handing the floor to a colleague, the relabelling overhead eats most of the time saved.
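
For the solo-interview case, the find-and-replace step can be scripted if your tool exports segments you can read into Python. The sketch below assumes a hypothetical export of (start_seconds, speaker_label, text) tuples and a simple "I'm Firstname Lastname" pattern with a straight apostrophe. It is a rough heuristic for illustration, not how any particular transcription product matches introductions.

```python
import re

# Matches "I'm Maya Chen" or "I am Maya Chen": two or more capitalised words.
INTRO = re.compile(r"\bI(?:'m| am) ([A-Z][a-z]+(?: [A-Z][a-z]+)+)")


def names_from_intros(segments, window_seconds=120):
    """Map numbered labels to names spoken in the opening minutes."""
    mapping = {}
    for start, label, text in segments:
        if start > window_seconds:
            continue  # only trust introductions at the top of the recording
        match = INTRO.search(text)
        if match and label not in mapping:
            mapping[label] = match.group(1)
    return mapping


def relabel(segments, mapping):
    """Swap numbered labels for names, leaving unmatched labels untouched."""
    return [(start, mapping.get(label, label), text)
            for start, label, text in segments]


segments = [
    (5, "Speaker 2", "For the recording, can you introduce yourself?"),
    (9, "Speaker 1", "I'm Maya Chen, director of housing policy at the city."),
    (4800, "Speaker 1", "That was three years before the decision."),
]
mapping = names_from_intros(segments)
print(mapping)  # {'Speaker 1': 'Maya Chen'}
print(relabel(segments, mapping)[2][1])  # Maya Chen
```

This only helps when the numbered labels are stable for the whole recording. If the labels flipped partway through, renaming them just gives the wrong lines a more convincing name.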

Option 3: AI transcription with named-speaker labelling

Newer tools can use spoken introductions or pre-supplied speaker names to label the transcript with actual names rather than numbers. When the source says "I'm Maya Chen" at the top of the recording, the tool labels Maya's lines as Maya for the rest of the transcript — even 80 minutes later when she has long stopped introducing herself.

This is the right approach for most journalism interview transcription. The cost is similar to diarization-only tools. The transcript arrives with names in place. Cleanup time drops from 30 minutes of relabelling to 5 to 10 minutes of verification: you scan for the spots where the model lost the thread rather than doing a full attribution pass. For a reporter who transcribes 10 interviews a month, this is the difference between 5 hours of monthly cleanup and roughly 50 to 100 minutes.

This is the approach we built CleanScribe.ai around: spoken-introduction matching and a free-text speaker names field at upload time so you can pre-load names from your notes before the transcript arrives.

If you want to try it on an interview you already have: the free tier is 120 minutes per month, no credit card. That covers one 90-minute interview, or two or three shorter sessions, before you decide whether to upgrade.

After the transcript: finding the quote, citing the second

The transcript has arrived. The deadline is tomorrow. Here is how to move from raw transcript to published quote.

5-minute QA pass at the 20-minute mark. The first 15 minutes of a recording are usually the cleanest: the audio is warmest and both parties are speaking carefully at the start. Errors compound from about the 15-minute mark onward. Open the transcript at the 20-minute mark, not at the beginning, and read for two minutes. Look for speaker labels switching mid-sentence, your own questions being attributed to the source, and proper nouns starting to drift into phonetic approximations. If you see more than two or three errors, also scan the transcript at the points where the source shifted topics or tone, since these are the moments most likely to have confused the model. The whole pass takes about five minutes and tells you how much cleanup is ahead.
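
Part of that pass can be pre-screened by script before you read. The sketch below assumes the same hypothetical (start_seconds, speaker_label, text) export and flags two of the symptoms named above inside the two-minute window: a label that never appeared in the first 15 minutes, and a question attributed to the source. Both checks are crude heuristics (sources ask rhetorical questions too), so treat the output as a reading list, not a verdict.

```python
def qa_window(segments, source_label,
              start=20 * 60, length=2 * 60, settle=15 * 60):
    """Flag suspicious segments in the window [start, start + length) seconds."""
    # Labels heard while the recording was still clean.
    early_labels = {label for t, label, _ in segments if t < settle}
    flags = []
    for t, label, text in segments:
        if not (start <= t < start + length):
            continue
        if label not in early_labels:
            flags.append((t, label, "label not seen in the first 15 minutes"))
        elif label == source_label and text.rstrip().endswith("?"):
            flags.append((t, label, "question attributed to the source"))
    return flags


segments = [
    (30, "Speaker 1", "I'm Maya Chen, director of housing policy at the city."),
    (45, "Speaker 2", "Maya, can you walk me through what happened next?"),
    (1205, "Speaker 1", "And what did the council say?"),
    (1215, "Speaker 3", "They said the Bergmann report was late."),
    (1400, "Speaker 1", "Does that make sense?"),  # outside the window
]
for flag in qa_window(segments, source_label="Speaker 1"):
    print(flag)
```

Here the script would surface the segment at 1205 seconds (a question under the source's label) and the one at 1215 seconds (a third speaker in a two-person interview), which is where you would start reading.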

Full-text search to find the remembered phrase. You remember that the source said something about "three years before the decision" or "my mother always said." Search the transcript for a fragment of that phrase. Found in seconds. Without a transcript, you would scrub the audio for 20 minutes and might still miss it by 30 seconds. This is the core value of interview transcription and the reason the word accuracy section above matters: a transcript that rendered "three years before" as "free years before" will not surface on that search. Good interview transcription is the difference between a 10-second search and a 20-minute scrub.
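
When an exact search comes up empty, a fuzzy search can still surface a phrase the transcript rendered slightly wrong. The sketch below uses Python's standard difflib module over the same hypothetical (start_seconds, speaker_label, text) export. The 0.8 threshold and the function names are assumptions to tune against your own transcripts, and a hit only tells you where to listen.

```python
import re
from difflib import SequenceMatcher


def fmt_ts(seconds) -> str:
    """Format seconds as MM:SS for a citation such as 42:17."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def _words(text: str):
    """Lowercase and strip punctuation so formatting does not block a match."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def find_phrase(segments, phrase, threshold=0.8):
    """Return (score, timestamp, speaker, text) for segments close to the phrase."""
    query = _words(phrase)
    target = " ".join(query)
    hits = []
    for start, speaker, text in segments:
        words = _words(text)
        best = 0.0
        # Slide a window the size of the query across the segment.
        for i in range(max(1, len(words) - len(query) + 1)):
            window = " ".join(words[i:i + len(query)])
            best = max(best, SequenceMatcher(None, target, window).ratio())
        if best >= threshold:
            hits.append((round(best, 2), fmt_ts(start), speaker, text))
    return sorted(hits, reverse=True)


segments = [
    (610, "Maya Chen", "We met at the council office in March."),
    (2537, "Maya Chen", "It was free years before the decision, honestly."),
]
for score, ts, speaker, text in find_phrase(segments, "three years before the decision"):
    print(ts, speaker, text)
```

The misheard "free years before" still comes back, with a timestamp you can jump to in the audio. Severe drift will fall below any sensible threshold, which is why accuracy upstream still matters.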

Citation discipline. When you find the quote you need, verify the published version against the audio before you file. A single misheard word ("billions" instead of "millions," "won't" instead of "will") can cost you a retraction and a source relationship. Find the quote in the transcript, go to the timestamp, listen to the source say the words, and transcribe what you hear, not what the transcript says. The transcript found the moment; the audio is the source of truth for the exact words.

Keep the audio. Source disputes happen months after publication. The audio is the record of what was said; the transcript is a working document. Archive the audio alongside the transcript, in the same folder, with the same filename root. The journalist who can produce the audio when a source says "I never said that" is in a very different position than the journalist who cannot find the file.
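
A quick script can confirm that the pairing holds before you archive a story folder. This is a minimal sketch assuming transcripts are saved as .txt files and audio uses one of a few common extensions; adjust both to match your own setup.

```python
from pathlib import Path

# Assumed audio extensions; extend this set to match your recorder.
AUDIO_EXTS = {".m4a", ".mp3", ".wav"}


def audio_missing_transcript(folder):
    """List audio files in a story folder with no transcript sharing their filename root."""
    folder = Path(folder)
    transcript_roots = {p.stem for p in folder.glob("*.txt")}
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.suffix.lower() in AUDIO_EXTS and p.stem not in transcript_roots
    )
```

Calling audio_missing_transcript on a story folder returns the names of recordings that still need a transcript, or an empty list when every audio file has a partner.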

A short checklist

Before the interview:

Lavalier on the source + backup recorder
Filename with source name + date
Plan to ask for spoken introduction at the top

During the interview:

Source states name + role on the record
Interviewer uses the source's name in the first 10 minutes
No talking over

After:

Choose a transcription tool that supports named speakers
5-minute QA pass at the 20-minute mark
Search for the remembered phrase
Verify the published quote against the audio
Keep the audio archived alongside the transcript

Where CleanScribe fits

CleanScribe.ai was built for exactly this use case: long recordings with one or more named sources, where accurate speaker attribution and findable quotes matter as much as the words themselves. Three things we did differently:

Named speakers, not numbers. When the source introduces themselves on the recording, we use that name. You can also pre-fill names at upload if you have a structured interview with multiple sources. The transcript arrives with "Maya Chen:" at the start of her lines, not "Speaker 1:", so no relabelling pass is required.
Long files in one pass. Up to 8 hours per upload, no splitting. A 90-minute interview is one file in, one transcript out. A three-hour investigative session goes through the same way — no seam where the model's speaker model resets between chunks.
Polished prose, not a recording in text. We strip the umms, the false starts, and the repeated half-sentences so the transcript reads as the conversation you actually had. The meaning stays; the noise goes. The original audio is still there if you want to verify a specific quote.

The free tier is 120 minutes per month. No credit card. Try it on your last interview, and see whether the quote you need is one search away.

→ Start free at cleanscribe.ai/for/journalists

Have an interview-transcription workflow tip we should add to this guide? Email us — we update this piece as new tools and techniques become standard.