CONSTRUCTION AND EVALUATION OF A ROBUST MULTIFEATURE SPEECH/MUSIC
DISCRIMINATOR
Eric Scheirer
Malcolm Slaney
Interval Research Corp., 1801-C Page Mill Road, Palo Alto, CA, 94304 USA
ABSTRACT
We report on the construction of a real-time computer system
capable of distinguishing speech signals from music signals over
a wide range of digital audio input. We have examined 13 fea-
tures intended to measure conceptually distinct properties of speech
and/or music signals, and combined them in several multidimen-
sional classification frameworks. We provide extensive data on
system performance and the cross-validated training/test setup used
to evaluate the system. For the datasets currently in use, the best
classifier classifies with 5.8% error on a frame-by-frame basis, and
1.4% error when integrating long (2.4 second) segments of sound.
1. OVERVIEW
The problem of distinguishing speech signals from music signals
has become increasingly important as automatic speech recogni-
tion (ASR) systems are applied to more and more “real-world”
multimedia domains. If we wish to build systems that perform
ASR on soundtrack data, for example, it is important to be able to
distinguish which segments of the soundtrack contain speech.
There has been some previous work on this topic [1], [2]. Some
of this work has suggested features which prove valuable for incor-
poration into a multidimensional framework, and we have included
them when possible. This paper extends that work in several ways:
by considering multiple features, by examining powerful classifi-
cation methods, and by describing a principled approach to training
and testing the system.
The rest of our paper is divided into three sections: a description
of the features examined in our system; a discussion of the different
multivariate classification frameworks which we have evaluated;
and results of a careful training and evaluation phase in which we
present the performance characteristics of the system in its current
state.
2. FEATURES
Thirteen features have been evaluated for use in the system. Each
of them was intended to be a good discriminator on its own; as
we shall show, not all of them end up adding value to a multivariate
classifier. Of the thirteen, five are “variance” features, consisting
of the variance in a one-second window of an underlying measure
which is calculated on a single frame. If a feature has the property
that it gives very different values for voiced and unvoiced speech,
but remains relatively constant within a window of musical sound,
then the variance of that feature will be a better discriminator than
the feature itself.
It is also possible that other statistical analyses of “underlying”
features, such as second or third central moments, skewness, kur-
tosis, and so forth, might make good features for discriminating
classes of sound. For example, Saunders [2] bases four features on
Eric Scheirer is currently at the MIT Media Laboratory, Cambridge,
MA, USA, eds@media.mit.edu
Malcolm Slaney can be reached at malcolm@interval.com
the zero-crossing rate, using the variance of the derivative, the third
central moment, the thresholded value, and a skewness measure.
The features used in this system are:
4 Hz modulation energy: Speech has a characteristic energy
modulation peak around the 4 Hz syllabic rate [3]. We use a
portion of the MFCC algorithm [4] to convert the audio signal
into 40 perceptual channels. We extract the energy in each
band, bandpass filter each channel with a second order filter
with a center frequency of 4 Hz, then calculate the short-term
energy by squaring and smoothing the result. We normalize
each channel’s 4 Hz energy by the overall channel energy in
the frame, and sum the result from all channels. Speech tends
to have more modulation energy at 4 Hz than music does.
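As an illustration, the per-channel normalization step can be sketched as follows. This is a simplified, single-channel sketch in Python/NumPy: it estimates the 4 Hz band energy with an FFT of the frame-energy envelope rather than the second-order bandpass filter described above, and the function name and `width` parameter are ours.

```python
import numpy as np

def four_hz_modulation_energy(envelope, frame_rate=50.0, center=4.0, width=1.0):
    """Energy of envelope modulation near `center` Hz, normalized by the
    total (DC-removed) envelope energy, for one channel's energy track."""
    env = np.asarray(envelope, dtype=float)
    env = env - env.mean()                       # remove DC before measuring modulation
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(env.size, d=1.0 / frame_rate)
    band = (freqs >= center - width) & (freqs <= center + width)
    total = spec.sum()
    return float(spec[band].sum() / total) if total > 0 else 0.0
```

In the full feature, this value would be computed for each of the 40 perceptual channels and summed.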
Percentage of “Low-Energy” Frames: The proportion of
frames with RMS power less than 50% of the mean RMS
power within a one-second window. The energy distribution
for speech is more left-skewed than for music—there are more
quiet frames—so this measure will be higher for speech than
for music [2].
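This proportion can be sketched directly from the definition above (Python/NumPy; 50 frames per window corresponds to one second at the 50 Hz frame rate used later in the paper; the function name is ours):

```python
import numpy as np

def low_energy_fraction(rms, window=50):
    """Fraction of frames whose RMS power is below 50% of the mean RMS
    power in their one-second window (50 frames at a 50 Hz frame rate)."""
    rms = np.asarray(rms, dtype=float)
    fractions = []
    for start in range(0, rms.size - window + 1, window):
        w = rms[start:start + window]
        fractions.append(np.mean(w < 0.5 * w.mean()))
    return float(np.mean(fractions))
```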
Spectral Rolloff Point: The 95th percentile of the power spec-
tral distribution. This measure helps distinguish voiced from
unvoiced speech: unvoiced speech has a high proportion of its
energy contained in the high-frequency range of the spectrum,
whereas most of the energy for voiced speech and music is
contained in lower bands. This is a measure of the “skewness”
of the spectral shape; the value is higher for right-skewed
distributions.
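A sketch of the rolloff computation on a single frame's power spectrum (Python/NumPy; for simplicity this returns a bin index rather than a frequency in Hz):

```python
import numpy as np

def spectral_rolloff(power_spectrum, fraction=0.95):
    """Lowest spectral bin below which `fraction` of the total power lies
    (the 95th-percentile point of the power spectral distribution)."""
    cumulative = np.cumsum(np.asarray(power_spectrum, dtype=float))
    return int(np.searchsorted(cumulative, fraction * cumulative[-1]))
```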
Spectral Centroid: The “balancing point” of the spectral
power distribution. Many kinds of music involve percussive
sounds which, by including high-frequency noise, push the
spectral mean higher. In addition, excitation energies can be
higher for music than for speech, where pitch stays in a fairly
low range. This measure gives different results for voiced and
unvoiced speech.
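The centroid is simply the power-weighted mean frequency; a minimal sketch:

```python
import numpy as np

def spectral_centroid(power_spectrum, freqs):
    """Power-weighted mean frequency: the 'balancing point' of the spectrum."""
    p = np.asarray(power_spectrum, dtype=float)
    return float(np.dot(np.asarray(freqs, dtype=float), p) / p.sum())
```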
Spectral “Flux” (Delta Spectrum Magnitude): The 2-norm of
the frame-to-frame spectral amplitude difference vector,
|| |X_i| - |X_{i+1}| ||. Music goes through more drastic
frame-to-frame changes than speech does; this value is higher
for music than for speech. Note that speech alternates periods
of transition (consonant-vowel boundaries) and periods of
relative stasis (vowels), whereas music typically has a more
constant rate of change. This method is somewhat similar to
Hawley’s, which attempts to detect harmonic continuity in
music [1].
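The flux of one frame pair follows directly from the definition; a sketch:

```python
import numpy as np

def spectral_flux(mag_prev, mag_cur):
    """2-norm of the frame-to-frame spectral magnitude difference vector."""
    return float(np.linalg.norm(np.asarray(mag_cur, dtype=float)
                                - np.asarray(mag_prev, dtype=float)))
```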
Zero-Crossing Rate: The number of time-domain zero-
crossings within a speech frame [2]. This is a correlate of
the spectral centroid. Kedem [5] calls it a measure of the
dominant frequency in a signal.
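A sketch of the zero-crossing count on a time-domain frame:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Number of sign changes in a time-domain frame."""
    x = np.asarray(frame, dtype=float)
    return int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))
```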
Cepstrum Resynthesis Residual Magnitude: The 2-norm of
the vector residual after cepstral analysis, smoothing, and
resynthesis. If we do a real cepstral analysis [6] and smoothing
of the spectrum, then resynthesize and compare the smoothed
to unsmoothed spectrum, we’ll have a better fit for unvoiced
speech than for voiced speech or music, because unvoiced
speech better fits the homomorphic single-source-filter model
than music does. In the voiced speech case, we are filtering
out the pitch “ripple” from the signal, giving higher values for
the residual.
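A simplified sketch of this residual (Python/NumPy): we lifter the log power spectrum by zeroing all but the first few quefrency coefficients, a stand-in for the full cepstral analysis, smoothing, and resynthesis chain described above; `n_keep` and the function name are illustrative.

```python
import numpy as np

def cepstral_residual(power_spectrum, n_keep=5):
    """2-norm of (log spectrum - cepstrally smoothed log spectrum).
    Smoothing keeps only the first `n_keep` quefrency coefficients."""
    log_spec = np.log(np.asarray(power_spectrum, dtype=float) + 1e-12)
    quefrency = np.fft.rfft(log_spec)
    quefrency[n_keep:] = 0.0                     # keep only the smooth part
    smoothed = np.fft.irfft(quefrency, n=log_spec.size)
    return float(np.linalg.norm(log_spec - smoothed))
```

A slowly varying spectral envelope is well captured by the low quefrencies and leaves a small residual; fast "ripple" (such as pitch harmonics) is not, and leaves a large one.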
Pulse metric: A novel feature that uses long-time bandpassed
autocorrelations to determine the amount of “rhythmicness” in
a 5-second window. It does a good job of detecting a strong,
driving beat (e.g., techno, salsa, straight-ahead rock-and-roll)
in the signal. It cannot detect rhythmic pulse in signals with
rubato or other tempo changes.
The observation used is that strong beat leads to broadband
rhythmic modulation in the signal as a whole. That is, no
matter what band of the signal you look in, you see the same
rhythmic regularities. So the algorithm divides the signal into
six bands and finds the peaks in the envelopes of each band;
these peaks correspond roughly to perceptual onsets. We
then look for rhythmic modulation in each onset track using
autocorrelations, and select the autocorrelation peaks as a
description of all the frequencies at which we find rhythmic
modulation in that band.
We compare band-by-band to see how often we find the same
pattern of autocorrelation peaks in each. If many peaks are
present at similar modulation frequencies across all bands, we
give a high value for the pulse metric.
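A much-simplified sketch of the band-agreement step (Python/NumPy): we autocorrelate each band's envelope directly and score the overlap of the strongest lags across bands, omitting the onset-detection stage described above; `top_k` and all names are illustrative.

```python
import numpy as np

def pulse_metric(band_envelopes, top_k=3):
    """Score how consistently the strongest autocorrelation lags agree
    across bands: 1.0 when every band shows the same top_k periodicities."""
    lag_sets = []
    for env in band_envelopes:
        e = np.asarray(env, dtype=float)
        e = e - e.mean()
        ac = np.correlate(e, e, mode="full")[e.size - 1:]   # lags 0..N-1
        ac[0] = 0.0                                         # ignore the trivial lag
        lag_sets.append({int(l) for l in np.argsort(ac)[-top_k:]})
    shared = set.intersection(*lag_sets)                    # lags found in every band
    return len(shared) / top_k
```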
We use the variances of the rolloff point, spectral centroid, spec-
tral flux, zero-crossing rate, and cepstral resynthesis residual mag-
nitude as features as well. In practice, we are using log transforma-
tions on all thirteen features; this has been empirically determined
to improve their spread and conformity to normal distributions.
As an example, Figure 1 shows two of the features and their two-
dimensional joint distribution. As we can see, there is significant
overlap in the marginal probability distributions, but much less
when the features are considered in conjunction.
3. CLASSIFICATION FRAMEWORKS
We have examined in depth a multidimensional Gaussian maximum
a posteriori (MAP) estimator, a Gaussian mixture model (GMM)
classifier, a spatial partitioning scheme based on k-d trees, and a
nearest-neighbor classifier. We will describe them here and con-
trast the way they divide the feature space; we will provide data for
performance comparison in the Evaluation section.
Multidimensional MAP Gaussian classification works by mod-
eling each class of data, speech and music, as a Gaussian-shaped
cluster of points in feature space (for example, the 13-dimensional
space consisting of the parameters described above). We form
estimates of the parameter means and covariances within each class in
a
supervised training phase, and use the resulting parameter es-
timates to classify incoming samples based on their proximity to
the class means using a Mahalanobis, or correlational, distance
measurement.
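This classifier can be sketched as follows (Python/NumPy; class priors are omitted, which matches MAP only for balanced classes such as the data set used here, and the class name is ours):

```python
import numpy as np

class MAPGaussianClassifier:
    """One full-covariance Gaussian per class; classify by Mahalanobis
    distance to each class mean (equal priors assumed)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        self.inv_covs_ = {c: np.linalg.inv(np.cov(X[y == c], rowvar=False))
                          for c in self.classes_}
        return self

    def predict(self, X):
        def mahalanobis(x, c):
            d = x - self.means_[c]
            return float(d @ self.inv_covs_[c] @ d)
        return np.array([min(self.classes_, key=lambda c: mahalanobis(x, c))
                         for x in X])
```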
A Gaussian mixture model (GMM) models each class of data as
the union of several Gaussian clusters in the feature space. This
clustering can be iteratively derived with the well-known EM algo-
rithm [7]. In contrast to the MAP classifier, the individual clusters
are not represented with full covariance matrices, but only the diag-
onal approximations. That is, the resulting Gaussian “blobs” have
their axes oriented parallel to the axes of the feature space.
Classification using the GMM uses a likelihood estimate for each
model, which measures how well the new data point is modeled
by the entrained Gaussian clusters. An incoming point in feature
space is assigned to whichever class is the best model of that point
(whichever class the point is
most likely to have come from).
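The likelihood of one axis-aligned component can be sketched as follows (Python/NumPy; a full GMM sums several such weighted components per class, with parameters estimated by EM, which we do not reproduce here):

```python
import numpy as np

def diag_gaussian_loglik(X, mean, var):
    """Per-sample log-likelihood under one diagonal-covariance Gaussian:
    an axis-aligned 'blob' like the individual GMM components."""
    X = np.atleast_2d(X)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
```

Classification then assigns a point to whichever class gives it the higher likelihood.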
The nearest-neighbor estimator simply places the points of the
training set in feature space. To classify new points, we examine
the local neighborhood of feature space to determine which training
point is closest to the test point, and assign the class of this “nearest
neighbor”. Spatial partitioning schemes [8] are often used to make
the search for the closest training point more efficient; we are
using the k-d tree algorithm.

[Figure 1: scatter plot with marginal histograms; x-axis: Variance
of Spectral Flux (−6 to 4); y-axis: Pulse Metric (−2 to 6).]
Figure 1. Marginal and joint probability distributions for two of
the features examined in the system. The darker density cloud and
histogram outline represent speech data; the lighter, music data.
The ellipses are contours of equal Mahalanobis distance at 1 and 2
standard deviations. The data shown are a random sample of 20%
of the training data. Each axis has been log-transformed.
We have also investigated several common variants of the sim-
ple nearest-neighbor framework. The k-nearest-neighbor classifier
conducts a class vote among the nearest k neighbors to a point;
what we call k-d spatial classification approximates the k-nearest-
neighbor approach by voting only among those training points in
the particular region of space grouped together by the k-d tree parti-
tioning. These points are nearby each other, but are not necessarily
strictly the closest neighbors. This approximate algorithm is much
faster than the true nearest-neighbor schemes.
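A brute-force sketch of the k-nearest-neighbor vote (Python/NumPy; in the system above, the k-d tree serves only to speed up or approximate this neighbor search):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=5):
    """Class vote among the k training points nearest to x in feature space."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]
```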
The difference between the power of these classification schemes
is obvious when the partition boundaries are considered. The
Gaussian model creates a hyper-quadric surface boundary in the
feature space (e.g., a hypersphere, hyperellipsoid, or hyperpara-
boloid). The Gaussian mixture model induces a decision boundary of
a union of hyper-quadrics, where each hyper-quadric is oriented
with the axes of the feature space.
The k-d spatial partitioning draws arbitrary “Manhattan seg-
ment” decision boundaries, whose complexity depends on the
amount of training data and number of partitioning “bins”. The
nearest-neighbor and k-nearest schemes are attempting to estimate
the local probability density within every area of the feature space,
and so arbitrarily complex decision boundaries can be drawn, de-
pending on the manifold topology of the training data.
Returning to Figure 1, we can see that Gaussians (or, perhaps,
the union of a small number of Gaussians) are not a bad model
for the individual features, but the joint distribution is not so easily
described. In particular, there are clear locations in space where
the data distribution is somewhat homogeneous (for example, near
the point (−1, 3.5) on the scatter plot) that do not fall into the
main cluster regions.
Feature            Latency    CPU Time    Error
4 Hz Mod Energy    1 sec      18 %        12 ± 1.7 %
Low Energy         1 sec       2 %        14 ± 3.6 %
Rolloff            1 frame    17 %        46 ± 2.9 %
Var Rolloff        1 sec      17 %        20 ± 6.4 %
Spec Cent          1 frame    17 %        39 ± 8.0 %
Var Spec Cent      1 sec      17 %        14 ± 3.7 %
Spec Flux          1 frame    17 %        39 ± 1.1 %
Var Spec Flux      1 sec      17 %        5.9 ± 1.9 %
Zero-Cross Rate    1 frame     0 %        38 ± 4.6 %
Var ZC Rate        1 sec       3 %        18 ± 4.8 %
Ceps Resid         1 frame    46 %        37 ± 7.5 %
Var Ceps Res       1 sec      47 %        22 ± 5.7 %
Pulse Metric       5 sec      38 %        18 ± 2.9 %
Table 1. Latency, CPU time required, and univariate discrimination
performance for each feature. Each data point represents the mean
and standard deviation of the proportion of frames misclassified
over 10 cross-validated training runs. See text for details on the
testing procedure. CPU time is the proportion of “real time” a
feature takes for processing on a 120 MHz R4400 Silicon Graphics
Indy workstation. Note that many features can share work when
classifying multidimensionally, so the total CPU time required is
nonadditive.

We can estimate theoretical bounds on the performance of some
classifiers. The error rate achieved by the nearest-neighbor rule is
no greater than 2P*, where P* is the optimal (Bayes) error rate
(see [9], pp. 100–102, for details on this derivation). This bound
tells us that no matter
what classifier we use, we can never do better than to cut the error
rate in half over the nearest-neighbor classifier (and, in fact, the
k-nearest-neighbor method is even more strictly bound), assuming
our testing and training data are representative of the underlying
feature space topology. Any further improvements have to come
from using better features, more or better training data, or by adding
higher-level knowledge about the long-term behavior of the input
signal.
4. TRAINING, TESTING, AND EVALUATION
We evaluated the models using labeled data sets, each 20 minutes
long, of speech and music data. Each set contains 80 15-second-
long audio samples. The samples were collected by digitally sam-
pling an FM tuner (16-bit monophonic samples at a 22.05 kHz
sampling rate), using a variety of stations, content styles, and noise
levels, over a three-day period in the San Francisco Bay Area.
We made a strong attempt, especially for the music data, to
collect a data set which represented as much of the breadth of
available input signals as possible. Thus, we have both male and
female speakers, both “in the studio” and telephonic, with quiet
conditions and with varying amounts of background noise in the
speech class; and samples of jazz, pop, country, salsa, reggae,
classical, various non-Western styles, various sorts of rock, and
new age music, both with and without vocals, in the music class.
For each classification model and several different subsets of
features, we used a cross-validated testing framework to evaluate
the classification performance. In this method, 10% (4 min.) of
the labeled samples, selected at random, are held back as test data,
and a classifier is trained on the remaining 90% (36 min.) of the data.
This classifier is then used to classify the test data, and the results
of the classification are compared to the labels to determine the
accuracy of the classifier. By iterating this process several times
and evaluating the classifier based on the aggregate average, we can
ensure that our understanding of the performance of the system is
not dependent on the particular test and training sets we have used.
Note that we are selecting or holding back blocks of points
corresponding to whole audio samples; the frames from a single
speech or music case will never be split into partially training and
partially testing data. This is important since there is a good deal
of frame-to-frame correlation in the sample values, and so splitting
up audio samples would give an incorrect estimate of classifier
performance for truly novel data.
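The sample-level hold-out described above can be sketched as follows (Python/NumPy; names are ours):

```python
import numpy as np

def sample_level_split(sample_ids, test_fraction=0.1, rng=None):
    """Hold back whole audio samples as test data; a sample's frames are
    never split between training and testing."""
    if rng is None:
        rng = np.random.default_rng()
    ids = np.asarray(sample_ids)
    unique = np.unique(ids)
    n_test = max(1, int(round(test_fraction * unique.size)))
    test_ids = rng.choice(unique, size=n_test, replace=False)
    test_mask = np.isin(ids, test_ids)
    return ~test_mask, test_mask
```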
Framework      Speech Error   Music Error   Total Error
MAP G’ss’n     2.1 ± 1.2 %    9.9 ± 5.4 %   6.0 ± 2.6 %
GMM: 1 G       3.0 ± 1.1 %    8.7 ± 6.6 %   5.8 ± 2.9 %
     5 G       3.2 ± 1.1 %    8.4 ± 6.8 %   5.8 ± 2.9 %
     20 G      3.4 ± 1.5 %    7.7 ± 6.2 %   5.6 ± 2.4 %
     50 G      3.0 ± 1.4 %    8.2 ± 6.6 %   5.6 ± 2.6 %
kNN: k = 1     4.3 ± 1.7 %    6.6 ± 6.4 %   5.5 ± 3.6 %
     k = 5     4.2 ± 1.8 %    6.4 ± 6.2 %   5.3 ± 3.5 %
     k = 11    4.2 ± 1.9 %    6.5 ± 6.1 %   5.4 ± 3.5 %
     k = 25    4.2 ± 1.9 %    6.5 ± 6.1 %   5.4 ± 3.5 %
k-d: b = 5     5.2 ± 1.0 %    6.1 ± 2.5 %   5.7 ± 1.5 %
     b = 11    5.4 ± 1.1 %    6.1 ± 2.8 %   5.8 ± 1.6 %
     b = 21    5.7 ± 1.4 %    5.9 ± 2.9 %   5.8 ± 1.9 %
     b = 51    5.9 ± 1.8 %    5.5 ± 2.7 %   5.7 ± 2.0 %
     b = 101   6.1 ± 1.7 %    5.3 ± 2.6 %   5.7 ± 1.8 %
Table 2. Performance (mean and standard deviation frame-by-
frame error) for various multidimensional classifiers. For the k-d
spatial classifier, the b parameter is the number of data points in the
“leaves” of the data structure, and thus the spatial resolution of the
classifier. A higher b value represents larger bins, and more samples
to vote among in each bin. Note that GMM with one Gaussian is
not the same as MAP, as GMMs use diagonalized covariances only.
We have used this cross-validation framework to evaluate the
univariate and multivariate classification performance of the vari-
ous models. The results are shown in Tables 1 and 2 respectively.
In Table 1, “latency” refers to the amount of past input data
required to calculate the feature. Thus, while the zero-crossing
rate can be calculated on a frame-by-frame basis, the
variance
of the zero-crossing rate is referring to the last second of data.
The frame rate was 50 Hz for the performance measures shown
here. The effect of varying the frame size and window overlap
on classification performance has not been examined, but is not
expected to be large. The error rates are calculated using a spatial
partitioning classifier.
In Table 2, performance differences between the classifiers and
the effects of parameter settings for the parameterized classifiers
are examined. For the k-nearest-neighbor classifiers, we varied the
number of neighbors involved in the voting. For the k-d spatial
classification procedure, we varied the number of data points col-
lected in each data “bucket” or “leaf”. Thus, higher values of b
partition the feature space into broader groups. For the Gaussian
mixture model classifier, we varied the number of Gaussians in
each class.
Several results are apparent upon examination of Table 2. First,
there is very little difference between classifiers, or between pa-
rameter settings for each classifier type. This suggests that the
topology of the feature space is rather simple, and indicates the use
of a computationally simple algorithm such as spatial partitioning
for use in implementations. Second, it is generally more difficult
to classify music than to classify speech; that is, it is easier to
avoid mistaking music for speech than to avoid mistaking speech
for music. This is not unexpected, as the class of music data, in the
world and in our data set, is much broader than the class of speech
samples.
Further, some of the classifiers differ in their behavior on the
individual speech and music classes. For example, the MAP Gaus-
sian classifier does a much better job rejecting music from the
speech class than vice-versa, while the k-d spatial classifier with
medium-sized buckets has nearly the same performance on each
class. Thus, the different classifiers might be indicated in situations
with different engineering goals.
We also tested several feature subsets using the spatial partition-
ing classifier; the results are summarized in Table 3. The “best
8” features are the variance features, plus the 4 Hz modulation,
low-energy frame percentage, and pulse metric. The “best 3” are
the 4 Hz energy, variance of spectral flux, and pulse metric. The
“fast 5” features are the five basic features which look only at a
single frame of data, and thus have low latency.

[Figure 2: classification raster; x-axis: frame number (100–700);
y-axis: trial number (10–60).]
Figure 2. Classifications by trial and frame for one training/test
partitioning of the data set. Each white point is a frame classified as
music; each black point is a frame classified as speech. The trials
in the upper region (trials 1–20) correspond to speech samples
(although the classifier does not know this); the bottom trials are
music. Trials 9 and 11 are speech with background music and were
not used in the other experiments. The high error rate in the first
50 frames is due to the use of low-latency features only.
We can see from these results that not all features are necessary
to perform accurate classification, and so a real-time system may
gain improved performance by using only some of the features.
We can understand more fully the behavior of the algorithm by
examining the distribution of errors. Figure 2 shows the classifi-
cations of test frames for one training/test partitioning. A number
of features are apparent in this plot. First, there is a trial-by-trial
difference; some samples are easy to classify, and some hard. This
is true of both the speech region and the music region. Second, the
errors are not independently distributed; they occur in long “runs”
of misclassified frames. Finally, there are many more errors made
in the early startup (frames 1-50), before the variance features can
be collected, than in the later regions.
Finally, for comparison with results reported previously [2], we
calculated a long-term classification by averaging the results of
the frame-by-frame spatial partitioning classification in nonover-
lapping 2.4 second windows. Using this testing method, the error
rate drops to 1.4%. Thus, the frame-by-frame errors, while not
distributed independently, are separate enough that long-term av-
eraging can eliminate many of them.
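The long-term averaging step can be sketched as a majority vote over nonoverlapping 120-frame windows (Python/NumPy, assuming binary 0/1 frame labels; the function name is ours):

```python
import numpy as np

def smooth_decisions(frame_labels, frame_rate=50, window_seconds=2.4):
    """Majority vote over nonoverlapping windows of binary frame decisions
    (2.4 s = 120 frames at the 50 Hz frame rate)."""
    labels = np.asarray(frame_labels, dtype=float)
    w = int(round(window_seconds * frame_rate))
    return [int(round(labels[s:s + w].mean()))
            for s in range(0, labels.size - w + 1, w)]
```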
Subset               Speech Error   Music Error   Total Error
All features         5.8 ± 2.1 %    7.8 ± 6.4 %   6.8 ± 3.5 %
Best 8               6.2 ± 2.2 %    7.3 ± 6.1 %   6.7 ± 3.3 %
Best 3               6.7 ± 1.9 %    4.9 ± 3.7 %   5.8 ± 2.1 %
Var Spec Flux only   12 ± 2.2 %     15 ± 6.4 %    13 ± 3.5 %
Fast 5               33 ± 4.7 %     21 ± 6.6 %    27 ± 4.6 %
Table 3. Performance (mean and standard deviation frame-by-
frame error) for various subsets of features. The k-d spatial clas-
sifier was used in all cases. The “Var Spec Flux only” data is not
directly comparable to that in Table 1, since in Table 1 “cannot
classify” points were ignored, but here they are treated as errors.
5. CONCLUSION
The multidimensional classifiers we have built provide excellent
and robust discrimination between speech and music signals in
digital audio. We note especially that the performance results
presented are causal error rates; segmentation performance, where
we know a priori that there are long stretches of alternating speech
and music, would be significantly higher.
There are many interesting directions in which to continue
pursuing this work. For example, a simple three-way classifier
using this feature set to discriminate speech, music, and simulta-
neous speech and music achieved only about 65% accuracy. Also,
this feature set does not seem adequate
to distinguish among genres of music. More research is needed
on the methods humans use to solve these sorts of classification
problems, and how to best implement those or other strategies in
pattern-recognition systems.
REFERENCES
[1] Michael Hawley.
Structure out of Sound. PhD thesis, MIT
Media Laboratory, 1993.
[2] John Saunders. Real time discrimination of broadcast
speech/music. In
Proc. 1996 ICASSP, pages 993–996, 1996.
[3] T. Houtgast and H. J. M. Steeneken. The modulation transfer
function in room acoustics as a predictor of speech intelligibil-
ity.
Acustica, 28:66–73, 1973.
[4] M. J. Hunt, M. Lennig, and P. Mermelstein. Experiments in
syllable-based recognition of continuous speech. In
Proc. 1980
ICASSP, pages 880–883, 1980.
[5] Benjamin Kedem. Spectral analysis and discrimination by
zero-crossings.
Proc. IEEE, 74(11):1477–1493, 1986.
[6] B. P. Bogert, M. J. R. Healy, and J. W. Tukey. The Que-
frency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-
autocovariance, Cross-Cepstrum, and Saphe Cracking, pages
209–243. John Wiley and Sons, New York, 1963.
[7] Todd K. Moon. The expectation-maximization algorithm. IEEE
Signal Processing Magazine, pages 47–70, Nov. 1996.
[8] Stephen M. Omohundro. Geometric learning algorithms. Tech-
nical Report 89-041, International Computer Science Institute,
Berkeley, CA, 1989.
[9] Richard O. Duda and Peter E. Hart.
Pattern Classification and
Scene Analysis. John Wiley and Sons, New York, 1973.