PocketSphinx setup

These instructions are for installing PocketSphinx on Raspbian Stretch. They should translate easily to other Debian-based distros such as Ubuntu and Mint. If you are comfortable with your system, you should be able to use these instructions with almost any distro. In many cases the package names will be the same; where they differ, you should be able to locate the package by searching for its name or for the name of a file within it.

Test the microphone ("hello, can you hear me?")

Install alsa-utils, then use the commands below to record a three-second sample with arecord and play it back with aplay. While you are speaking, the level indicator that arecord -vv draws at the bottom of the screen should peak at about 60%. Use alsamixer to adjust your recording and playback levels.

When you play the recording back, make sure the audio does not contain any hissing or popping.

Later in these instructions, we will use Phonetisaurus to prepare PocketSphinx to transcribe this audio.

[~]$ sudo apt install alsa-utils

[~]$ alsamixer

[~]$ arecord -vv -r16000 -fS16_LE -c1 -d3 test.wav

[~]$ aplay test.wav

If you are on a Raspberry Pi, most likely when you use the arecord command, you will get an error such as "arecord: main:788: audio open error: No such file or directory". This is because the first sound device (card 0) is output only. You will need to specify the recording device. To get a list of recording devices, use "arecord -l". This will return something like this:

[~]$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 1: Phone [PH USB Speaker Phone], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

This means that audio card 1, device 0 is capable of recording audio. Usually you will reference the device as either hw:1,0 or plughw:1,0. hw:1,0 accesses the device more directly, while plughw:1,0 includes a translation layer that allows it to record in formats the device does not support natively. You can use "arecord -L" to see which interfaces are available:

[~]$ arecord -L
null
    Discard all samples (playback) or generate zero samples (capture)
default:CARD=Phone
    PH USB Speaker Phone, USB Audio
    Default Audio Device
sysdefault:CARD=Phone
    PH USB Speaker Phone, USB Audio
    Default Audio Device
dmix:CARD=Phone,DEV=0
    PH USB Speaker Phone, USB Audio
    Direct sample mixing device
dsnoop:CARD=Phone,DEV=0
    PH USB Speaker Phone, USB Audio
    Direct sample snooping device
hw:CARD=Phone,DEV=0
    PH USB Speaker Phone, USB Audio
    Direct hardware device without any conversions
plughw:CARD=Phone,DEV=0
    PH USB Speaker Phone, USB Audio
    Hardware device with all software conversions
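
The card and device numbers reported by "arecord -l" are the two numbers that go into the hw: and plughw: names. As a minimal sketch, the listing can be turned into plughw names like this. The function name capture_devices is made up for this example, and the parsing assumes the English "card N: ..., device M: ..." layout shown earlier, so it may not match on a system with a different locale.

```shell
# capture_devices: read `arecord -l` output on stdin and print one
# plughw:CARD,DEVICE name per capture device.
# Assumption: lines look like "card N: ..., device M: ..." as shown above.
capture_devices() {
    sed -n 's/^card \([0-9][0-9]*\):.*, device \([0-9][0-9]*\):.*/plughw:\1,\2/p'
}

# Usage (needs alsa-utils and at least one capture device):
#   arecord -l | capture_devices
```

For the listing shown earlier this would print plughw:1,0, which is the name used with "-D" in the rest of this section.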

Use "-D" to specify the device, and "--dump-hw-params" to get more information about what formats the device supports:

[~]$ arecord -Dhw:1,0 --dump-hw-params
Recording WAVE 'stdin' : Unsigned 8 bit, Rate 8000 Hz, Mono
HW Params of device "hw:1,0":
--------------------
ACCESS:  MMAP_INTERLEAVED RW_INTERLEAVED
FORMAT:  S16_LE
SUBFORMAT:  STD
SAMPLE_BITS: 16
FRAME_BITS: 32
CHANNELS: 2
RATE: 16000
PERIOD_TIME: [1000 8192000]
PERIOD_SIZE: [16 131072]
PERIOD_BYTES: [64 524288]
PERIODS: [2 1024]
BUFFER_TIME: [2000 16384000]
BUFFER_SIZE: [32 262144]
BUFFER_BYTES: [128 1048576]
TICK_TIME: ALL
--------------------
arecord: set_params:1299: Sample format non available
Available formats:
- S16_LE

The important bits here are "CHANNELS: 2", "RATE: 16000" and "Available formats: - S16_LE". The rate and format match the format that Naomi expects audio to be captured in, but we need mono audio, not stereo, so we will most likely need to use the plughw version.

[~]$ arecord -Dhw:1,0 -vv -r16000 -fS16_LE -c1 -d3 test.wav
Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
arecord: set_params:1305: Channels count non available

[~]$ arecord -Dplughw:1,0 -vv -r16000 -fS16_LE -c1 -d3 test.wav
Recording WAVE 'test.wav' : Signed 16 bit Little Endian, Rate 16000 Hz, Mono
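
If you want to script the check that led to plughw above, the values can be pulled out of the hardware parameter dump. This is only a sketch: hw_param is a hypothetical helper name, and it assumes the "NAME: value" layout shown in the dump above.

```shell
# hw_param NAME: read an `arecord --dump-hw-params` dump on stdin and
# print the value reported for NAME (for example CHANNELS, RATE or FORMAT).
# Assumption: each parameter is on its own line as "NAME: value".
hw_param() {
    awk -v key="$1" -F': *' '$1 == key { print $2; exit }'
}

# Usage (the dump appears to be written to stderr, so fold it into
# stdout first; -d1 stops the recording after one second if it starts):
#   arecord -Dhw:1,0 --dump-hw-params -d1 /dev/null 2>&1 | hw_param CHANNELS
```

If CHANNELS comes back as anything other than 1, the device cannot record mono directly and plughw is the safer choice.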

Install Phonetisaurus

Build and install openfst

[~]$ sudo apt install gcc g++ make python-pip autoconf libtool
[~]$ wget http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.6.9.tar.gz
[~]$ tar -zxvf openfst-1.6.9.tar.gz
[~]$ cd openfst-1.6.9
[~/openfst-1.6.9]$ autoreconf -i
[~/openfst-1.6.9]$ ./configure --enable-static --enable-shared --enable-far --enable-lookahead-fsts --enable-const-fsts --enable-pdt --enable-ngram-fsts --enable-linear-fsts --prefix=/usr
[~/openfst-1.6.9]$ make
[~/openfst-1.6.9]$ sudo make install
[~/openfst-1.6.9]$ cd

Build and install mitlm-0.4.2

Building mitlm is only necessary because we are training our own fst model a little further on.

[~]$ sudo apt install git gfortran autoconf-archive
[~]$ git clone https://github.com/mitlm/mitlm.git
[~]$ cd mitlm
[~/mitlm]$ ./autogen.sh
[~/mitlm]$ make
[~/mitlm]$ sudo make install
[~/mitlm]$ sudo ldconfig
[~/mitlm]$ cd

Build and install Phonetisaurus

[~]$ git clone https://github.com/AdolfVonKleist/Phonetisaurus.git
[~]$ cd Phonetisaurus
[~/Phonetisaurus]$ ./configure --enable-python
[~/Phonetisaurus]$ make
[~/Phonetisaurus]$ sudo make install
[~/Phonetisaurus]$ cd python
[~/Phonetisaurus/python]$ cp -iv ../.libs/Phonetisaurus.so ./
[~/Phonetisaurus/python]$ sudo python setup.py install
[~/Phonetisaurus/python]$ cd

Build and install CMUCLMTK

[~]$ sudo apt install subversion
[~]$ svn co https://svn.code.sf.net/p/cmusphinx/code/trunk/cmuclmtk/
[~]$ cd cmuclmtk
[~/cmuclmtk]$ ./autogen.sh
[~/cmuclmtk]$ make
[~/cmuclmtk]$ sudo make install
[~/cmuclmtk]$ sudo ldconfig
[~/cmuclmtk]$ cd
[~]$ sudo pip install cmuclmtk

Install Pocketsphinx

Build and install sphinxbase

[~]$ sudo apt install swig libasound2-dev bison
[~]$ git clone --recursive https://github.com/cmusphinx/pocketsphinx-python.git
[~]$ cd pocketsphinx-python/sphinxbase

Now, the next line will be different depending on where your python library is located.

If you used the naomi-setup.sh script to install naomi and chose option 1, it will look something like this:

[~/pocketsphinx-python/sphinxbase]$ PYTHON="/home/pi/.naomi/local/bin/python" PYTHON_VERSION=3.5 ./autogen.sh LDFLAGS="-L/home/pi/.naomi/local/lib"

If you installed directly on your base python using apt, then you probably just need

[~/pocketsphinx-python/sphinxbase]$ PYTHON="/usr/bin/python3" PYTHON_VERSION=3.5 ./autogen.sh

Moving on:

[~/pocketsphinx-python/sphinxbase]$ make
[~/pocketsphinx-python/sphinxbase]$ sudo make install
[~/pocketsphinx-python/sphinxbase]$ cd ..

Build and install pocketsphinx

[~/pocketsphinx-python]$ cd pocketsphinx

Again, the next line will be different depending on where your python library is located.

If you used the naomi-setup.sh script to install naomi on a Raspberry Pi, it will look something like this:

[~/pocketsphinx-python/pocketsphinx]$ PYTHON="/home/pi/.naomi/local/bin/python" PYTHON_VERSION=3.5 ./autogen.sh LDFLAGS="-L/home/pi/.naomi/local/lib"

If you installed directly on your base python using apt, then you probably just need

[~/pocketsphinx-python/pocketsphinx]$ PYTHON="/usr/bin/python3" PYTHON_VERSION=3.5 ./autogen.sh

Moving on:

[~/pocketsphinx-python/pocketsphinx]$ make
[~/pocketsphinx-python/pocketsphinx]$ sudo make install
[~/pocketsphinx-python/pocketsphinx]$ cd ..

Install python PocketSphinx library

Again, you may need to adjust this line depending on the location of your python executable.

If you installed using naomi-setup.sh:

[~/pocketsphinx-python]$ sudo ~/.naomi/local/bin/python setup.py install

Otherwise:

[~/pocketsphinx-python]$ sudo python3 setup.py install

Format cmudict.dict and train model.fst

I'm not exactly sure why this is, but apparently it is necessary to reformat the default cmudict-en-us.dict file before training. The perl command below does the following to each line:

  • When there are multiple pronunciations for a word, it removes the trailing "(n)" from the word.
  • Then it compresses multiple white spaces into a single space.
  • Then it removes white space from the beginning and end of the line.
  • Finally, it replaces the first space on the line with a tab character.

[~/pocketsphinx-python]$ cd pocketsphinx/model/en-us
[~/pocketsphinx-python/pocketsphinx/model/en-us]$ cat cmudict-en-us.dict | perl -pe 's/^([^\s]*)\(([0-9]+)\)/$1/;s/\s+/ /g;s/^\s+//;s/\s+$//; @_=split(/\s+/); $w=shift(@_);$_=$w."\t".join(" ",@_)."\n";' > cmudict-en-us.formatted.dict
[~/pocketsphinx-python/pocketsphinx/model/en-us]$ phonetisaurus-train --lexicon cmudict-en-us.formatted.dict --seq2_del
[~/pocketsphinx-python/pocketsphinx/model/en-us]$ cd
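
To see what the reformatting does to individual lines, here is the same sequence of four steps written as one sed expression per step. This is an illustrative sketch, not the command used above; format_dict is a made-up name, and the sample entries in the usage comment are only there to show the shape of the input.

```shell
# format_dict: apply the four clean-up steps to dictionary lines on stdin.
format_dict() {
    # A literal tab character, so the sed script does not rely on "\t".
    tab=$(printf '\t')
    sed -E \
        -e 's/^([^[:space:]]*)\([0-9]+\)/\1/' \
        -e 's/[[:space:]]+/ /g' \
        -e 's/^ //' \
        -e 's/ $//' \
        -e "s/ /$tab/"
}

# Usage with two sample entries (alternate pronunciation marked "(2)"):
#   printf 'hello  HH AH L OW\nhello(2)   HH EH L OW \n' | format_dict
```

Both sample lines come out as the word "hello", a tab, and then the phonemes separated by single spaces, which is the layout passed to phonetisaurus-train above.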

Test

[~]$ mkdir test
[~]$ cd test
[~/test]$ echo "<s> hello can you hear me </s>" > test_reference.txt
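
The remaining steps call tools from several of the packages built above, so it can save time to confirm they are all on your PATH first. A small sketch, assuming a POSIX shell; require_tools is a hypothetical helper, not part of any of these packages.

```shell
# require_tools NAME...: report every NAME that is not found on PATH
# and return non-zero if any are missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: the commands used in the rest of this test
#   require_tools text2wfreq wfreq2vocab text2idngram idngram2lm \
#       phonetisaurus-g2pfst pocketsphinx_continuous
```

If anything is reported missing, revisit the matching build section and make sure "sudo make install" and "sudo ldconfig" completed.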

Create test.vocab

[~/test]$ text2wfreq < test_reference.txt | wfreq2vocab > test.vocab

Create test.idngram

[~/test]$ text2idngram -vocab test.vocab -idngram test.idngram < test_reference.txt

Create test.lm

[~/test]$ idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm

Create test.formatted.dict

[~/test]$ phonetisaurus-g2pfst --model=`ls ~/pocketsphinx-python/pocketsphinx/model/en-us/train/model.fst` --nbest=1 --beam=1000 --thresh=99.0 --accumulate=true --pmass=0.85 --nlog_probs=false --wordlist=./test.vocab > test.dict
[~/test]$ cat test.dict | sed -rne '/^([[:lower:]])+\s/p' | perl -pe 's/([0-9.])+//g;s/\s+/ /g;@_=split(/\s+/);$w=shift(@_);$_=$w."\t".join(" ",@_)."\n";' > test.formatted.dict

Test with audio file

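Set -samprate to the sample rate of the recording: 16000 for the test.wav recorded above, or 8000 or 48000 if you recorded at one of those rates.

[~/test]$ pocketsphinx_continuous -hmm ~/pocketsphinx-python/pocketsphinx/model/en-us/en-us -lm ./test.lm -dict ./test.formatted.dict -samprate 16000 -infile ~/test.wav 2>/dev/null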
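
The -samprate value needs to match the sample rate of the file. If you are not sure what a recording contains, the rate can be read from the WAV header. This is a rough sketch that assumes the plain 44-byte PCM header layout (fmt chunk starting at byte 12); files with extra chunks before the fmt chunk would need a real parser. wav_info is a hypothetical helper name.

```shell
# wav_info FILE: print "channels rate bits" from a canonical PCM WAV header.
# Assumption: channels are at bytes 22-23, sample rate at bytes 24-27 and
# bits per sample at bytes 34-35, all little-endian.
wav_info() {
    # Bytes 22..35 as unsigned decimals become $1..$14.
    set -- $(od -An -v -tu1 -j22 -N14 "$1")
    echo "$(( $1 + $2 * 256 ))" \
         "$(( $3 + $4 * 256 + $5 * 65536 + $6 * 16777216 ))" \
         "$(( ${13} + ${14} * 256 ))"
}

# Usage:
#   wav_info ~/test.wav
```

A file recorded with the arecord options used earlier (-r16000 -fS16_LE -c1) should report one channel, a rate of 16000 and 16 bits.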

Test with microphone

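Set -samprate to a rate your microphone supports (16000 for the device shown in the hardware dump above; 8000 and 48000 are other common values).

[~/test]$ pocketsphinx_continuous -hmm ~/pocketsphinx-python/pocketsphinx/model/en-us/en-us -lm ./test.lm -dict ./test.formatted.dict -samprate 16000 -inmic yes 2>/dev/null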

Here's what this section of the profile.yml looks like:

active_stt:
  engine: sphinx
pocketsphinx:
  fst_model: /home/pi/pocketsphinx-python/pocketsphinx/model/en-us/train/model.fst
  hmm_dir: /home/pi/pocketsphinx-python/pocketsphinx/model/en-us/en-us
  phonetisaurus_executable: phonetisaurus-g2pfst
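
A wrong path in this section is easy to miss, so it may be worth checking that the two locations exist before starting Naomi. A minimal sketch; check_sphinx_paths is a hypothetical helper, and you pass it the same values you put in fst_model and hmm_dir.

```shell
# check_sphinx_paths FST_MODEL HMM_DIR: verify that the fst_model file and
# the hmm_dir directory named in profile.yml exist.
check_sphinx_paths() {
    status=0
    [ -f "$1" ] || { echo "fst_model not found: $1" >&2; status=1; }
    [ -d "$2" ] || { echo "hmm_dir not found: $2" >&2; status=1; }
    return "$status"
}

# Usage with the values from the profile.yml above:
#   check_sphinx_paths \
#       /home/pi/pocketsphinx-python/pocketsphinx/model/en-us/train/model.fst \
#       /home/pi/pocketsphinx-python/pocketsphinx/model/en-us/en-us
```

If fst_model is reported missing, the phonetisaurus-train step probably did not finish; it is what creates train/model.fst.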