How to Transcribe
I don’t have time for this post, mostly because I have about a dozen interview transcripts I need to create. But it is the perfect opportunity to talk about transcribing using Dragon NaturallySpeaking v 12 (DNSv12) as well as different techniques for creating transcripts. I hope it helps you. I’ve talked about the need to create transcripts before, but this is about actually creating them.
Dragon NaturallySpeaking v12 works in two ways that I find helpful to different degrees. First, it reads text aloud, which, if you’ve visited my state constitutions website, is critical for voicing impossibly long text documents. It recognizes punctuation, knows when sentences begin and end, and uses rudimentary but adequate accenting and emphasis for words and sentences. This it does very well. The other thing I need it for, transcribing, it doesn’t do as intuitively. I tweeted DNS and suggested that their next version include a timed transcription feature. Their response seemed less than personal and a little rushed. I guess they’re really busy.
I’ll be using an interview I just did with a political candidate as the example. Candidate Eric Squires is running for HD 26 in Oregon.
DNSv12 lets you drop .wav audio files (not .mp3) into a box called “DragonPad” that turns the audio file into words. The problem is the software rarely hears punctuation, so it doesn’t know when to end sentences and it doesn’t capitalize words at the start of sentences. Here is the intro to the interview with Mr. Squires as the software heard it:
“I’m Don Merrill and I’m talking with Eric squires and squires is a Democrat who is running for the seat in House District 26 the district includes Wilsonville Sherwood Kings table Mountain and Greenville squires think very much for talking thanks Raven here.”
This is how the text should read:
“I’m Don Merrill and I’m talking with Eric Squires. Mr. Squires is a Democrat who is running for the seat in House District 26. The District includes Wilsonville, Sherwood, Kings City, Bull Mountain and Greenville. Mr. Squires, thank you very much for talking with me. Thanks for having me here.”
You may have noticed something else the software doesn’t do. It doesn’t put each person on their own line. To fix that along with the other things it doesn’t do, you can go through the text manually after the software has read it. But that means that you also have to listen to the audio as you’re reading the text so you can fix it as you go. Otherwise, you’ll miss words that you don’t see but are in the audio that the software missed. If you download the entire audio file, you have to consider the total length of the audio you’re listening to and add the time it takes to make corrections to the text. In the end, you might spend up to twice as long or longer creating an accurate transcript.
There is another way to transcribe your audio. If you use DNSv12, it can recognize spoken commands like punctuation, line changes, proper nouns, etc. And once you set it up on your PC with a microphone, you can repeat the interview into the microphone and DNSv12 will create an accurate transcript. But there is a trick to this.
The audio you want to transcribe has to be on another playback device like a microcassette player or digital recorder. Put on headphones from that playback device and start the DNSv12 software on your PC. Repeat (in your own voice) into the DNSv12 microphone what you hear through your headphones. And when you speak, remember that you have to speak in a very specific way that the DNSv12 software can understand. Let’s go back to the Eric Squires intro. When speaking into the microphone, this is how you have to speak to get an accurate transcript like the one following the example below:
COMMAND: Capital DM
COMMAND: colon
COMMAND: space
I’m Don Merrill
COMMAND: comma
and I’m talking with Eric Squires
COMMAND: period
COMMAND: space
Mr Squires is a Democrat who is running for the seat in House District 26
COMMAND: period
The District includes Wilsonville
COMMAND: comma
Sherwood
COMMAND: comma
Kings City
COMMAND: comma
Bull Mountain and Greenville
COMMAND: period
Mr Squires
COMMAND: comma
thank you very much for talking with me
COMMAND: period
COMMAND: new line
COMMAND: Capital ES
COMMAND: colon
COMMAND: space
Thanks for having me here
COMMAND: period
NOTE: The word “Command” is only there to tell you a command is necessary for the software. Don’t say the word “Command”.
And this is how it should look:
DM: “I’m Don Merrill, and I’m talking with Eric Squires. Mr. Squires is a Democrat who is running for the seat in House District 26. The District includes Wilsonville, Sherwood, Kings City, Bull Mountain and Greenville. Mr. Squires, thank you very much for talking with me.”
ES: “Thanks for having me here.”
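To make the idea of spoken commands concrete, here is a minimal sketch in Python of how a stream of dictated words and commands could be turned into formatted text. It is only an illustration of the technique described above, not how Dragon works internally: the `COMMANDS` table, the `render` function and the "say"/"cmd" labels are all hypothetical names made up for this example, and the sample covers just part of the intro.

```python
# Hypothetical mapping of spoken commands to the characters they produce.
# These mirror the examples in this post, not Dragon's real command set.
COMMANDS = {
    "comma": ",",
    "period": ".",
    "colon": ":",
    "space": " ",
    "new line": "\n",
}


def render(steps):
    """Turn (kind, value) steps into transcript text.

    kind is "say" for dictated words or "cmd" for a spoken command.
    A command of the form "capital xy" produces the letters in upper case.
    """
    out = []
    for kind, value in steps:
        if kind == "cmd":
            if value.startswith("capital "):
                out.append(value[len("capital "):].upper())
            else:
                out.append(COMMANDS[value])
        else:
            out.append(value)
    return "".join(out)


# A shortened version of the Eric Squires intro.
steps = [
    ("cmd", "capital dm"), ("cmd", "colon"), ("cmd", "space"),
    ("say", "I'm Don Merrill"), ("cmd", "comma"), ("cmd", "space"),
    ("say", "and I'm talking with Eric Squires"), ("cmd", "period"),
    ("cmd", "new line"),
    ("cmd", "capital es"), ("cmd", "colon"), ("cmd", "space"),
    ("say", "Thanks for having me here"), ("cmd", "period"),
]

print(render(steps))
```

The point of the sketch is that every piece of punctuation and every line change has to arrive as its own step, which is exactly why you have to say them out loud.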
The advantage of this form of transcription is that, once you get the technique down, you can complete an accurate and properly formatted transcript in one reading. If you’re repeating an hour-long interview, it can get tedious, which is why it requires concentration. But the trick, besides knowing how to speak in a way the software recognizes, is to learn how to speak exactly what you hear without thinking about it. The words almost have to flow into your ears and out of your mouth into the microphone. If you do too much thinking, you lose your place, you have to start over and you get frustrated.
In both cases, the software first has to be “trained” to recognize your voice. The disadvantage of the first technique, dropping an audio file into DragonPad, is that in conversation, we tend to not speak as precisely as we would if we were reading specifically for the software. Normal conversation has slurs, shortcuts and slang that the computer would have a hard time recognizing even if it could recognize our voices perfectly. Plus, remember that the software is not trained to recognize other voices. So in the end, you have to do all of the work anyway. It seems to me that the second technique for transcribing is the better of the two.
When I have transcribed using the first technique, this is the process I usually go through and how long each part takes. On average, a 30 minute interview is around six pages long.
Transcribing Process
1. 00:01:00 – Convert .mp3 file to .wav file in Audacity
2. 00:09:00 – Machine Transcription of a 30:00 .wav interview in Dragon
3. 00:05:00 – Transfer to Open Office/MS Word Document
4. 00:30:00 – Spacing between interviewer/interviewee segments and adding periods.
5. 00:05:00 – Identifying speakers by initials at beginning of sound bites
6. 00:05:00 – Timing pages
7. 00:16:00 – Deep Corrections Page #1 – Capitalization, punctuation, correcting words and spelling
7b. 00:14:00 – Deep Corrections Page #2 – same
7c. 00:11:00 – Deep Corrections Page #3 – same
7d. 00:18:00 – Deep Corrections Page #4 – same
7e. 00:12:00 – Deep Corrections Page #5 – same
7f. 00:14:00 – Deep Corrections Page #6 – same
8. 00:05:00 – Retime pages
Total Transcribing Time – 2:25:00
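If you want to check the arithmetic, or plug in your own timings, a few lines of Python will do it. This is just a sketch: the durations are copied from the list above, while the shortened labels and the names `STEPS`, `parse` and `total` are made up for the example.

```python
from datetime import timedelta

# Step durations copied from the list above, as (label, "HH:MM:SS").
STEPS = [
    ("Convert .mp3 to .wav", "00:01:00"),
    ("Machine transcription", "00:09:00"),
    ("Transfer to document", "00:05:00"),
    ("Spacing and periods", "00:30:00"),
    ("Identify speakers", "00:05:00"),
    ("Timing pages", "00:05:00"),
    ("Deep corrections page 1", "00:16:00"),
    ("Deep corrections page 2", "00:14:00"),
    ("Deep corrections page 3", "00:11:00"),
    ("Deep corrections page 4", "00:18:00"),
    ("Deep corrections page 5", "00:12:00"),
    ("Deep corrections page 6", "00:14:00"),
    ("Retime pages", "00:05:00"),
]


def parse(hms):
    """Convert an "HH:MM:SS" string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def total(steps):
    """Add up the durations of all steps."""
    return sum((parse(duration) for _, duration in steps), timedelta())


audio = parse("00:30:00")  # length of the interview itself
spent = total(STEPS)

print(spent)                    # 2:25:00
print(round(spent / audio, 1))  # 4.8
```

By these numbers, the first technique took almost five times the length of the audio.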
DNSv12 has lots of features, including backspacing, delete and repeat, among others. And its text read function is wonderful. Its transcribing function leaves something to be desired, but it is better than manual transcription, which can be hellish.
The Audio Doesn’t Lie
This is a quickie.
The interesting thing about interviewing is that you can very easily tell where the passion is in someone’s views, arguments, whatever, and it’s not always where you think it would be. As an interviewer listening to someone voice an opinion, you would think that if you follow their logic, their thinking would lead you to a conclusion and it is in that conclusion where their greatest passion and truth would lie. But as an editor, watching the waveform of them speaking, you can see the most heat isn’t always at the end of a reasoned and well-lit conclusion.
At the outset, I want to say that of course it is important to take the natural rise and fall in a person’s speaking style into account. But, with that said, I find myself wondering about the conviction a speaker may have for whatever points they are making when I start paying attention to the volume of their voice as they speak. When the needle gets peaked or buried in places you don’t expect, I go back and listen to what they were saying and ask why, if that is where they imply they are most affected, doesn’t their emotion reflect that? Alternatively, something that seems insignificant turns out to be a source of their real passion.
When they talk in hushed tones about something that they say is important, is it them being reverential or unsure? When they swing loudly upward, are they showing conviction or insecurity? I would expect a researcher could have a lot of fun comparing points of the highest and lowest volume of a speaker’s voice against where those speakers place their most relevant logical arguments. Anecdotally though, they don’t always match, which sometimes makes me wonder about the sincerity of the message.
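For anyone curious how that comparison might start, here is a rough sketch in Python that finds the loudest and quietest stretches in a run of audio samples by measuring the RMS (root mean square) level of each window. Everything here is an assumption for illustration: the function names are made up, and the samples are synthetic stand-ins rather than a real interview, which would first have to be read from an audio file.

```python
import math


def window_rms(samples, window):
    """Return the RMS level of each consecutive, non-overlapping window."""
    levels = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        levels.append(math.sqrt(sum(x * x for x in chunk) / window))
    return levels


def loudest_and_quietest(samples, window):
    """Return the window indexes with the highest and lowest RMS level."""
    levels = window_rms(samples, window)
    indexes = range(len(levels))
    loudest = max(indexes, key=levels.__getitem__)
    quietest = min(indexes, key=levels.__getitem__)
    return loudest, quietest


# Synthetic stand-in for audio: a quiet stretch, a loud one, a medium one.
samples = [0.1, -0.1] * 50 + [0.9, -0.9] * 50 + [0.4, -0.4] * 50

print(loudest_and_quietest(samples, 100))  # (1, 0)
```

The window indexes could then be lined up against the transcript to see what the speaker was saying at the peaks and in the valleys.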
As an editor though, all I really care about is that riding those highs and lows isn’t too much work for the listener. So, I usually end up smoothing those peaks and valleys out with compression or leveling software. Fortunately for me, all I have to worry about is turning that picture into a story for the listener to interpret for themselves.
But it is one of those things that make you go, “Hmmmmmmmmmmmmm …”