Audio Recording on Linux

Notes on recording audio on my Linux box, using the M-Audio Audiophile 2496 card and ALSA drivers. This is a "pro" class card, and it behaves a bit oddly.

I did my best to turn up inputs HW1 and HW2 in envy24control. It appears that alsamixer was already at full volume. Then I ran:

     arecord -f dat -d 2800 audioDAT.wav

This records at DAT quality (48 kHz, 16-bit stereo) for a maximum of 2800 seconds.
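
As a rough check on disk space, the size of such a recording follows directly from the format. A minimal Python sketch (the helper name pcm_bytes is made up for illustration, and the WAV header is ignored):

```python
# Size of uncompressed PCM audio: rate * channels * bytes per sample * seconds.
# pcm_bytes is a hypothetical helper, not part of arecord or sox.
def pcm_bytes(rate_hz, channels, bytes_per_sample, seconds):
    return rate_hz * channels * bytes_per_sample * seconds

# "-f dat" means 48000 Hz, 2 channels, 16-bit (2 bytes) samples.
size = pcm_bytes(48000, 2, 2, 2800)
print(size, "bytes, roughly", round(size / 1e6), "MB")
```

So a full 2800 second run needs a bit over half a gigabyte of free space.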

Next, I convert to 16 kHz mono (I am digitizing speech). The downsampling and channel averaging should reduce noise. It appears that my system was only using about half of the dynamic range, so I increased the volume at this stage.

     nice sox -V audioDAT.wav -c 1 -r 16000 -v 1.8 audio.wav stat &

The stat effect prints statistics for the run. The default resampling seems to be pretty good, but we could alter the resample parameters if we wanted.
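
To make the three operations in that command concrete (channel averaging, 48 kHz to 16 kHz rate reduction, and a volume factor), here is a deliberately crude Python sketch. It is NOT how sox resamples: averaging each group of 3 samples is only a stand-in for a real low-pass filter, and the function name to_mono_16k is invented for this example.

```python
def to_mono_16k(stereo_frames, gain=1.8):
    """stereo_frames: list of (left, right) 16-bit ints sampled at 48 kHz.

    Returns 16-bit mono samples at 16 kHz. Crude sketch only: a boxcar
    average over 3 samples stands in for a proper resampling filter.
    """
    # Average the two channels into one.
    mono = [(left + right) / 2.0 for left, right in stereo_frames]
    out = []
    # 48000 / 16000 = 3, so every 3 input samples become 1 output sample.
    for i in range(0, len(mono) - 2, 3):
        avg = sum(mono[i:i + 3]) / 3.0
        scaled = int(round(avg * gain))
        # Clip to the signed 16-bit range instead of wrapping around.
        out.append(max(-32768, min(32767, scaled)))
    return out
```

The clipping step shows why the volume factor has to be chosen with care: anything that would exceed full range after scaling is flattened.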

It might be nice to use the auto-volume-normalizer in sox here. Unfortunately, I do not understand the use of the compand parameters. Here is a guess at what that might look like:

     sox -V c1bDAT.wav -c 1 -r 16000 e.wav compand 0.5,3 -70,-70,-50,-20,0,0 1 0 stat

My understanding is that this has an attack time of 0.5 sec (it takes about 0.5 seconds to reduce gain for a loud event) and a decay time of 3 seconds (it takes at least 3 seconds of silence to increase gain). Actually, I'd bet that these are some sort of corner frequencies for a feedback loop, but that is not documented. I think that the -70,-70 part of the transfer function says that if the input is below -70 dB in volume, leave it alone. Expand the parts at -50 dB up to -20 dB, and leave the loudest parts alone. For my slightly underrecorded speech audio (volume factor roughly 1.6) this seemed to give reasonable expansion. I started with a gain of 1 dB because I knew I was slightly underrecorded, and with an initial sound level of 0 since I know that the selections all start with a silence period. So, for recorded speech, I could try a combined command like:
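
To see what the transfer function -70,-70,-50,-20,0,0 does to a given level, here is a small Python sketch that treats the points as (input dB, output dB) pairs joined by straight lines. It models only the static curve, not the attack/decay behaviour, and the handling of levels below the first point is an assumption taken from the reading above ("leave it alone"). The name compand_out_db is made up.

```python
# (input dB, output dB) pairs from "-70,-70,-50,-20,0,0".
POINTS = [(-70.0, -70.0), (-50.0, -20.0), (0.0, 0.0)]

def compand_out_db(in_db, points=POINTS):
    """Static output level in dB for a given input level in dB."""
    if in_db <= points[0][0]:
        # Assumption: below the first point the level is left unchanged.
        return in_db
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if in_db <= x1:
            # Linear interpolation, in dB, between neighbouring points.
            return y0 + (y1 - y0) * (in_db - x0) / (x1 - x0)
    # Above the last point: hold the last output level.
    return points[-1][1]

for level in (-80, -60, -50, -25, 0):
    print(level, "dB in ->", compand_out_db(level), "dB out")
```

Under this reading, quiet speech around -60 dB comes out near -45 dB, while material already close to 0 dB is barely touched.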

     arecord -f dat - | sox -V -t wav - -c 1 -r 16000 speech.wav compand 0.5,3 -70,-70,-50,-20,0,0 1 0 stat & (sleep 3200; killall sox) &

For some reason, if arecord ends before sox, sox DELETES the output file. This is bad. Hence, we cannot kill arecord or give it a duration. We must send a signal to the sox process instead.

This is getting too complicated. Besides, the compand seems to be making some artifacts that I don't care for too much. So perhaps it will be better to just downsample to floats as we record. This will give an initial data compression of 3:1, which will help disk usage a bit while not adding unnecessary quantization error.
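
The 3:1 figure can be checked with a couple of lines of Python, assuming the floats are 4 bytes each (the helper name bytes_per_second is invented for the example):

```python
def bytes_per_second(rate_hz, channels, bytes_per_sample):
    return rate_hz * channels * bytes_per_sample

dat = bytes_per_second(48000, 2, 2)       # 48 kHz, stereo, 16-bit
floats = bytes_per_second(16000, 1, 4)    # 16 kHz, mono, assumed 32-bit float
print(dat, floats, dat / floats)          # 192000 64000 3.0
```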

I got some distortions trying this for my application. Some might be from my audio card output when using full range... but I'm not sure. I wrote a script called bookRecC which will use the SoX compand function for level normalization on speech.

Note: elsewhere in these notes I describe a problem where older versions of sox removed the output file. At least as of version 12.17.3, this has been fixed. Please ignore the steps taken to avoid it where they are mentioned below.

New approach: I wrote a little program called agc which will apply an automatic gain control to a stream of floats. It looks like appropriate settings for my typical environment are:

     agc 3300 0.000001 1900

I seem to get pretty close to full-range output on speech if I try to control output levels to an RMS of about 3300 counts. I use a very low AGC gain, so that the gain reacts very, very slowly to a change in volume. The agc program does NOT know about sample rates, or even whether the incoming data is mono or stereo, so the units of the gain parameter (0.000001) are arbitrary. However, since I am using a very low AGC gain, I need to seed the procedure with an accurate estimate of the input data RMS value. (The control loop is very slow to converge to an updated estimate of RMS level.) You can compute this separately if you wish, but for my recording system set to nearly MAX input range, this value is only around 1900. I'm not sure why, but even with the mixer showing full range, the output seems to be only about half range.
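
The agc source is not shown here, so the following Python sketch is only one plausible way a three-parameter gain control of this kind could work, not the actual program. It assumes the three arguments are a target RMS, a per-sample loop gain, and a seed for the input RMS estimate, and it tracks a running mean-square level. All names are invented.

```python
import math

def agc_sketch(samples, target_rms, loop_gain, initial_rms):
    """Yield gain-controlled samples. Hypothetical sketch, not the real agc."""
    # Running estimate of the mean-square input level, seeded by the caller.
    mean_square = float(initial_rms) ** 2
    for x in samples:
        # First-order update: a tiny loop_gain makes the estimate move slowly,
        # which is why a good seed value matters so much.
        mean_square += loop_gain * (x * x - mean_square)
        rms = math.sqrt(max(mean_square, 1e-12))
        # Scale so that the estimated input RMS lands on the target RMS.
        yield x * target_rms / rms

# Example with the settings quoted above, on a made-up test signal.
out = list(agc_sketch([1900.0, -1900.0] * 4, 3300, 0.000001, 1900))
print(out[:2])
```

Because the update is applied once per sample with no notion of time, the loop gain has no physical unit here either, which matches the remark about arbitrary units.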

A fancy, complete command to do what I want for encoding my speech tapes will look like:

     arecord -f dat - | \
         sox -V -t wav - -t raw -c 1 -r 16000 -f - | \
         agc 3300 0.000001 1900 | \
         sox -t raw -r 16000 -c 1 -f - -s -w speech.wav & \
         (sleep 3200; killall sox) &

For some reason, sox produces no output file when its input pipe is killed (in some ways). I cannot use duration on arecord for this reason. It appears that a clever kill command will work. I now use a script called bookRec to control recordings. You can see that it sets off a process to terminate the recording at the proper time before it executes the complicated recording command. Use killall sleep to force the clean termination procedure for bookRec. Warning: without killing the recording cleanly, sox may delete its output file!
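
The bookRec script itself is not reproduced here. As a sketch of the same idea (start the recorder, and have a timer send it a signal at the deadline rather than letting its input pipe vanish), here is a small Python version. The function name run_with_deadline is made up, and the child process in the example is just a sleeping stand-in for the real recording command.

```python
import signal
import subprocess
import sys
import threading

def run_with_deadline(cmd, seconds):
    """Run cmd, sending it SIGTERM if it is still running after `seconds`."""
    proc = subprocess.Popen(cmd)
    # The timer plays the role of "(sleep N; killall sox) &".
    timer = threading.Timer(seconds, proc.send_signal, args=[signal.SIGTERM])
    timer.start()
    try:
        return proc.wait()
    finally:
        # If the command ended early by itself, make sure the timer never fires.
        timer.cancel()

# Stand-in for the recording pipeline: a child that would run for 30 seconds.
stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]
print("exit status:", run_with_deadline(stand_in, 0.5))
```

Cancelling the timer corresponds to the "killall sleep" trick: the deadline process goes away cleanly instead of firing later at some unrelated sox.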

Another attempt to use agc. While testing, I'm just recording the downsampled data as float with a command like:

     arecord -f dat -d 3200 - | sox -V -t wav - -t raw -c 1 -r 16000 -f -l audio.float stat &

The simplest thing to do is to read the volume reported by the above SoX command, and feed it back for the conversion from float to short:

     sox -V -t raw -r 16000 -l -f -c 1 audio.float -v 1.45 -s -w audio.wav stat

Where 1.45 was the volume adjustment reported by the sox command that downsampled to float.
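
As far as the sox documentation describes it, the "Volume adjustment" figure printed by stat is the largest -v factor that will not clip, which is just full scale divided by the peak sample. A minimal Python sketch of that calculation (the function name is invented, and samples are assumed to be normalised so that full scale is 1.0):

```python
def volume_adjustment(samples, full_scale=1.0):
    """Largest gain factor that keeps every sample within full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        raise ValueError("silent input: no finite adjustment exists")
    return full_scale / peak

# Made-up samples whose loudest value is half of full scale.
print(volume_adjustment([0.1, -0.5, 0.25]))   # 2.0
```

A single loud click sets the peak, so this factor can be much smaller than what the typical speech level would suggest.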

If I want to experiment with Automatic Gain Control, I can experiment with the parameters in the following command:

     nice -19 agc 6000 0.00003 4900 0.000001 < a.float | sox -t raw -r 16000 -c 1 -s -w - a.wav stat &

Where "a" is the base name of the file being processed, with a target output pseudo-RMS of 6000 and an estimated input pseudo-RMS of 4900. The feedback factor of 0.00003 on the RMS integrator is fairly fast, and this noisy pseudo-RMS estimate is smoothed by the slow response of the much smaller factor on the gain smoother (the last argument). The latest version of sox (12.17.3 as of this writing) has trouble with float raw, but no longer deletes files when input stops suddenly. So the latest version of agc has signed-short output, and the above normalizing command works. Kill sox or agc, and you still get a valid WAV file.
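
The two-stage arrangement described above (a fairly fast level integrator feeding a much slower gain smoother) might look something like the Python sketch below. This is a guess at the structure, not the agc source: in particular, taking "pseudo-RMS" to mean a smoothed absolute value is purely an assumption, and all names are made up. The argument order follows the command line shown above.

```python
def agc2_sketch(samples, target, level_rate, initial_level, gain_rate):
    """Yield signed 16-bit samples. Hypothetical two-stage AGC sketch."""
    level = float(initial_level)   # stage 1 state: input level estimate
    gain = target / level          # stage 2 state: smoothed gain
    for x in samples:
        # Stage 1: fast, noisy level tracker (assumed: smoothed |x|).
        level += level_rate * (abs(x) - level)
        # Stage 2: move the applied gain slowly toward the wanted gain.
        wanted = target / max(level, 1e-9)
        gain += gain_rate * (wanted - gain)
        y = int(round(x * gain))
        # Signed-short output, clipped rather than wrapped.
        yield max(-32768, min(32767, y))

out = list(agc2_sketch([4900.0, -4900.0] * 3, 6000, 0.00003, 4900, 0.000001))
print(out)
```

With an input that already sits at the estimated level, the gain stays at target/level and the output sits at the target.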

It might be good (someday) to put a buffer on agc so that it can look ahead when computing gain and initial RMS. However, it could be tricky, because the buffer width should be related to the lag from the gain integrator... and it would take some thinking to derive what that delay should be ;-)

SoX also has some pitch-invariant time expansion/contraction effects that might be interesting to use. I may play with those someday.

Now, for use in a portable MP3 player, try these:

     lame audio.wav -V 4 audio.mp3

     lame audio.wav -h -b 32 audio.mp3

If the player can handle VBR encoding, use the first command; otherwise use the second. I found this acceptable for audio books.

Someday, if vocoders (GSM? CELP?) get common, I'll want to compress that way. For now, my goal is just to get an entire audio book on a single CD.
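
For the fixed-rate "-b 32" encoding, the playing time that fits on a disc is easy to estimate. The Python sketch below assumes a nominal 700 MB disc (taken here as 700,000,000 bytes, an assumed figure) and ignores filesystem and MP3 framing overhead. The VBR setting cannot be estimated this way because its bit rate depends on the material.

```python
def hours_on_disc(disc_bytes, kbit_per_s):
    """Approximate hours of constant-bit-rate audio that fit in disc_bytes."""
    bytes_per_s = kbit_per_s * 1000 / 8
    return disc_bytes / bytes_per_s / 3600

# Assumed capacity: 700e6 bytes. 32 kbit/s is 4000 bytes per second.
print(round(hours_on_disc(700e6, 32), 1), "hours")
```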

At this time, I have a program called stats which I can use to display recording levels in ASCII graphics. This program can be used to locate blank spaces in *.wav files.

     stats 8000 -s 16000 -m -a -r < side4a.wav | less

can be used to display the recording level every 0.5 seconds for a 16 kHz sample rate, 16-bit WAV file, in native byte order. The arguments mean: an 8000-sample analysis interval, a sample rate of 16000, mono, ASCII graphics, in report format.
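
The stats program is not listed here, but the idea of a per-interval level display can be sketched in a few lines of Python. This version takes the samples as a plain list of 16-bit integers, reports the RMS of each window, and draws a bar scaled to full range. The function name, the output layout, and the bar width are all invented for the example.

```python
import math

def level_report(samples, window=8000, rate=16000, width=50, full_scale=32768):
    """Return one text line per window: start time, RMS level, ASCII bar."""
    lines = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        # Bar length is proportional to RMS relative to full scale.
        bar = "#" * int(round(width * rms / full_scale))
        lines.append("%7.1fs %6.0f %s" % (start / rate, rms, bar))
    return lines

# Half a second of silence followed by half a second at half of full scale.
for line in level_report([0] * 8000 + [16384] * 8000):
    print(line)
```

Windows with an empty bar are the candidate blank spaces between sections of the recording.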

I have now written a script wavVol which displays volume stats (using the above stats program) for wav files. Usage is like:

     wavVol myFile.wav | less

One can use xmms to play side4a.wav to check which silence periods are good places to break up the large file into tracks.

After that, a simple program called wavsplit can be used to split side4a.wav into tracks.

     wavsplit side4a.wav 0:07 3:47 10:23 14:58

The above command would create a directory called side4a containing the first 7 seconds of the recording in file 01.wav, the next 3 minutes, 40 seconds in 02.wav, etc.
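
For readers without wavsplit, the same kind of split can be sketched with Python's standard wave module. This is an illustration, not the wavsplit source: the names parse_time and split_wav are made up, and writing whatever follows the last cut time as one more track is an assumption about the intended behaviour.

```python
import os
import wave

def parse_time(text):
    """Convert "m:ss" into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)

def split_wav(path, cut_times, out_dir=None):
    """Split path at the given "m:ss" times into out_dir/01.wav, 02.wav, ..."""
    out_dir = out_dir or os.path.splitext(path)[0]
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Cut positions in frames, never past the end of the file.
        cuts = [min(total, int(round(parse_time(t) * rate))) for t in cut_times]
        bounds = [0] + cuts + [total]
        for n, (start, end) in enumerate(zip(bounds, bounds[1:]), 1):
            if end <= start:
                continue  # nothing to write for an empty segment
            src.setpos(start)
            name = os.path.join(out_dir, "%02d.wav" % n)
            with wave.open(name, "wb") as dst:
                # The frame count in the header is corrected when dst closes.
                dst.setparams(params)
                dst.writeframes(src.readframes(end - start))
            written.append(name)
    return written
```

A call such as split_wav("side4a.wav", ["0:07", "3:47", "10:23", "14:58"]) would then mirror the command above.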

These tracks can then be compressed to mp3 if desired.

Aaron Birenboim

Last modified: Wed Nov 23 08:17:01 MST 2005