Phonetic Error Analysis

Let there be a sequence of audio-transcription tuples \((x_i, y_i)\), \(x_i \in \mathcal{X}\), \(y_i \in \mathcal{Y}\), indexed by \(i \in \mathcal{T} = \mathcal{T}_\text{training} \cup \mathcal{T}_\text{dev} \cup \mathcal{T}_\text{test}\), where \(y_i\) is the orthographic string represented by the speech \(x_i\). Let a parametric classifier \(f(x; \theta)\) produce a probability distribution over \(\mathcal{Y}\) given audio \(x\). Let \(\phi(x_i, y_i)\) return the phonetic transcription of the audio-transcription tuple; for simplicity, assume that distinct orthographic strings have distinct phonetic transcriptions, i.e., \(\forall i, j \in \mathcal{T}: y_i \neq y_j \implies \phi(x_i, y_i) \neq \phi(x_j, y_j)\), which holds for limited-vocabulary speech recognition. Let \(\boldsymbol D'\) be the distance matrix induced by the edit distance string metric \(\delta(s_1, s_2)\) among all distinct phonetic transcriptions \(\mathcal{Z}\), with rows and columns ordered lexicographically by the index set \(\mathcal{I}\). Let \(\boldsymbol C\) be the confusion matrix from evaluating a speech classification model on the test set, with rows matching those of \(\boldsymbol D'\) and columns \(\mathcal{J}\) indexing the orthographic predictions \(\mathcal{Y}\), where \(\boldsymbol C_{ij}\) is the number of times the model erroneously predicted the \(i^\text{th}\) true phonetic transcription as the \(j^\text{th}\) orthographic transcription.
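A minimal sketch of the building blocks, assuming phonetic transcriptions are represented as tuples of phone symbols; the ARPAbet-like entries below are hypothetical placeholders, not data from any real system:

```python
import numpy as np

def edit_distance(s1, s2):
    """Levenshtein distance delta(s1, s2) between two phone sequences,
    computed with the standard two-row dynamic program."""
    n = len(s2)
    prev = list(range(n + 1))
    for i in range(1, len(s1) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

# Hypothetical distinct phonetic transcriptions Z, in lexicographic order.
Z = [("B", "AY"), ("HH", "EH", "L", "OW"), ("Y", "EH", "L", "OW")]

# D': the pairwise edit-distance matrix over Z.
D_prime = np.array([[edit_distance(a, b) for b in Z] for a in Z])
```

The confusion matrix \(\boldsymbol C\) would then be accumulated per evaluation example, incrementing the cell at (row of the true phonetic transcription, column of the predicted orthographic label) whenever the prediction is wrong.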

Phonetic confusion correlation coefficient (PC3): the expected Pearson’s correlation coefficient between the phonetic edit distance and the number of confusions, with the confusions drawn from some distribution (e.g., the traffic): \[ \rho = \mathbb{E}[r(D_Z, \hat{S}_Z) \mid \hat{S}_Z \neq 0], \] where \(Z\) is the true phonetic transcription, \(D_Z\) is the phonetic edit distance between \(Z\) and the model prediction, \(r\) is Pearson’s correlation function, and \(\hat{S}_Z\) is the number of erroneous confusions for \(Z\). Let \(\Phi(y)\) produce the indices in \(\mathcal{I}\) of the possible phonetic transcriptions for the string \(y\). Since the model outputs orthographic transcriptions, the phonetic distance \(D_Z\) of a prediction is not directly observed; we treat it as a latent variable that depends on \(f(x; \theta)\). For simplicity, we assume that this latent variable is the minimum possible phonetic distance between \(Z\) and the set of candidates indexed by \(\Phi(f(x; \theta))\), i.e., the model has no intraclass phonetic bias, only interclass phonetic bias. Let \(\boldsymbol D_{ij} = \min_{k \in \Phi(y_j)} \delta(z_i, z_k)\) be this minimum possible phonetic edit distance between \(z_i, i \in \mathcal{I}\), and the candidates for the \(j^\text{th}\) orthographic prediction. Our estimator is hence computed as \[ \hat{\rho} = \sum_{i \in \mathcal{I}} w_i r_i,\hspace{5mm} w_i = \frac{\sum_{j : \boldsymbol C_{ij} \neq 0} \boldsymbol C_{ij}}{\sum_{k \in \mathcal{I}} \sum_{j : \boldsymbol C_{kj} \neq 0} \boldsymbol C_{kj}},\hspace{5mm} r_i = \frac{\sum_{j : \boldsymbol C_{ij} \neq 0}(\boldsymbol D_{ij} - \bar{\boldsymbol D}_i) (\boldsymbol C_{ij} - \bar{\boldsymbol C}_i)}{\sqrt{\sum_{j : \boldsymbol C_{ij} \neq 0}(\boldsymbol D_{ij} - \bar{\boldsymbol D}_i)^2}\sqrt{\sum_{j : \boldsymbol C_{ij} \neq 0}(\boldsymbol C_{ij} - \bar{\boldsymbol C}_i)^2}}, \] where \(\bar{\boldsymbol D}_i\) and \(\bar{\boldsymbol C}_i\) are the means over the columns \(j\) with \(\boldsymbol C_{ij} \neq 0\). By construction, \(\hat{\rho} \in [-1, 1]\), with positive values indicating that the model makes more errors on phonetically dissimilar transcripts and negative values indicating more errors on similar transcripts.
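As a sketch, assuming \(\boldsymbol D\) and \(\boldsymbol C\) are dense NumPy arrays with matching rows, the estimator can be computed directly from the two matrices. One implementation choice not fixed by the definition: rows with fewer than two nonzero confusions, or with zero variance in either masked vector, have an undefined \(r_i\) and are skipped here (their weight is dropped rather than renormalized):

```python
import numpy as np

def pc3(D, C):
    """Estimate rho-hat: the weighted mean of per-row Pearson correlations
    between phonetic distance D[i, j] and confusion count C[i, j],
    restricted to columns where the model actually confused row i."""
    total = C[C != 0].sum()
    rho = 0.0
    for i in range(C.shape[0]):
        mask = C[i] != 0
        if mask.sum() < 2:
            continue  # r_i undefined with fewer than two points
        d, c = D[i, mask], C[i, mask]
        dd, cc = d - d.mean(), c - c.mean()
        denom = np.sqrt((dd ** 2).sum()) * np.sqrt((cc ** 2).sum())
        if denom == 0:
            continue  # r_i undefined under zero variance
        rho += (c.sum() / total) * (dd * cc).sum() / denom
    return rho
```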
To compute confidence intervals, we use bootstrapping for now, though intervals could likely be derived from normality assumptions and/or exact confidence distributions. If sampling according to the traffic distribution doesn’t matter (i.e., examples are drawn uniformly at random), this is as simple as applying Fisher’s \(r\)-to-\(z\) transformation to each \(r_i\).
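A sketch of both interval procedures, using only NumPy and the standard library. `fisher_ci` gives the standard \(r\)-to-\(z\) interval for a single per-row \(r_i\) (it assumes the usual bivariate-normality conditions and \(n > 3\)); `bootstrap_ci` is a plain percentile bootstrap over the nonzero (distance, count) pairs of one row, which is one of several ways the resampling could be set up:

```python
import numpy as np
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05):
    """Fisher r-to-z confidence interval for a correlation r
    estimated from n pairs; requires n > 3."""
    z = np.arctanh(r)
    h = NormalDist().inv_cdf(1 - alpha / 2) / np.sqrt(n - 3)
    return float(np.tanh(z - h)), float(np.tanh(z + h))

def bootstrap_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for Pearson's r, resampling the
    (distance, count) pairs of one row with replacement."""
    rng = np.random.default_rng(seed)
    pairs = np.asarray(pairs, dtype=float)
    rs = []
    for _ in range(n_boot):
        s = pairs[rng.integers(0, len(pairs), len(pairs))]
        if s[:, 0].std() == 0 or s[:, 1].std() == 0:
            continue  # degenerate resample: correlation undefined
        rs.append(np.corrcoef(s[:, 0], s[:, 1])[0, 1])
    lo, hi = np.quantile(rs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```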