Problem
My objective was to investigate neural network techniques for analysing sequences. While predicting how a sequence continues over time is relatively straightforward and well understood, categorizing or characterizing entire sequences is considerably more difficult. In particular, my goal was to apply neural network techniques to the analysis of music, using frequency spectrum data sampled over time, and to evaluate how effective the algorithms are at musical genre identification.
For the purposes of this project I limited myself to genres of electronica (house, techno, drum and bass, trance, experimental), with the possibility of individual artist identification. My motivation was a combination of the regular, predictable nature of most electronic music, the wide range of musical styles within the overall genre (if that is in fact an appropriate label), and an abundance of data already in digital format (mp3), readily available for training and testing.
First Attempt
The data was first run through an FFT in 1/30th-second increments to extract frequency spectrum data. Each 30-second block (900 spectral frames) was then run through another FFT to extract 'beat' information relative to frequency, resulting in a 24x32 matrix that characterizes the block. Each 30-second block was treated as an independent input vector. Here are some sample vectors.
The first network was trained on two sample sets: recordings of two 'house' and two 'jungle' (or 'drum and bass') DJ sets, each set representing a little less than 2 hours of music. A 768-300-1 backprop network (24x32 = 768 inputs) was trained to output 0 for jungle and 1 for house. After 1000 epochs it reached an MSE of .04 and correctly classified over 95% of the training set and over 95% of a test data set (taken from similar recordings and DJ mix tapes). (Unfortunately I lost the .mat file for the network, so I don't have detailed figures.)
Second Attempt
The second network was a 768-350-5 backprop network trained to output a 1-hot encoding for each of five different classifications: house, jungle/drum and bass, trance, ambient, and Liz Phair/Ani DiFranco (for kicks). It was similarly trained with about 2 hours of input from each genre.
While thresholding the output values at .5 yielded marginal results for classification, taking the max of the output values gave respectable results. Since genre classification is a non-discrete process, with samples often resembling multiple classifications, the output vector as a whole was much more revealing. However, the results were mixed: some test sets were classified well and others were not. The network also tended to falsely identify input as drum and bass, despite jungle's characteristic ~180bpm tempo (vs 120bpm for most everything else). This can probably be attributed to the similarly characteristic 'noisy' nature of jungle, which probably results in what looks like pretty random data to the network.
Shortcomings
These results point to an important observation: test inputs not sufficiently similar to the trained classifications yield erratic output (as with 'fro glo'). The underlying problem is that the classifications are chosen intuitively beforehand to represent the full spread of style, based on preconceived (and possibly arbitrary) notions about genre. Likewise, in order to train the network, definitive works must be chosen for each classification, and these will invariably fail to capture the full spread of style within the genre.
There are two parts, then, to the original objective of teaching a neural network musical genres. One is simply classification, which (given a definitive training set of pitch/beat information and predetermined classes) a simple backprop network can do quite well. The second is the formation of the classifications themselves, which it would seem only an unsupervised self-organizing network could attack.
The other clear flaw with this design is the input set. Although it seems to yield distinctive input vectors for the genres I chose, it poorly characterises instruments and aspects of musical progression or syncopation, and it is especially sensitive to changes in tempo.
Third Network
The final experiment was to build a partially recurrent Elman network, which is essentially a feedforward network with a feedback loop added from each layer's outputs back into the inputs. It's trained like a backpropagation network.
Again the input set was run through an FFT in blocks of 1/30th of a second, but this time each 30-second input block was kept as a sequence of 900 FFT frequency vectors. Unfortunately the training process places a number of constraints on the design, due to the large size of the sequences themselves and the training method, which calculates error values for each element in the sequence but does not backpropagate until the entire sequence is completed. In the end I trained a 70-50-2 network to output a 1-hot encoding for either house or Ani DiFranco/Liz Phair. In order to train it on a machine with only 512MB of RAM, only 50 input vectors were presented to it at a time for each increment of 10 epochs.
After about 100 epochs the MSE was down to .20 (and still decreasing, but time constraints forced me to stop there). Averaging the network output over the length of the sequence and taking the max of the output, it achieved 85.3% accuracy. More precisely, when it was presented with house music it got it right 99.6% of the time, but it only correctly classified Ani DiFranco/Liz Phair 71.1% of the time. Thinking that it may take a bit for the network output to become meaningful, I tried looking only at the output over the second half of the sequence and ended up getting 100% accuracy for house but only 67.8% for Ani/Liz, for an overall accuracy of 83.9%. My suspicion is that the house music, being far more regular and predictable, is much easier for the network to recognize and more readily generates a high output value, while the acoustic guitar/vocal stuff is more difficult to recognize meaningfully. Perhaps the second half of the sequence did worse for Ani because the positive classifications can be partially attributed to the network not yet responding to the regular house music, so cutting off the beginning of the sequence's output removes the artificially high Ani/Liz 'norm'. Here is some sample output.
This is evident in the network output for the test set, where it generalized well enough for house (classifying 56 of 63 sequences correctly) but poorly to other rock (only associating 10 of 29 samples of The Cure with Liz Phair/Ani DiFranco). Although that may very well be a fair assessment of their real similarity, most would agree they have more in common with each other than with house.
Either way, it seems clear that better results might be obtained by comparing genres of electronica to each other, as they all share the strong beat and rhythm that would hopefully fare well in a recurrent network. It also seems likely that, given more training and perhaps a broader input set, the network could have performed much better.
Conclusions
From this exercise it seems abundantly clear to me that neural nets can be trained to recognize particular elements of music that would typically define a particular style. Applying this to the problem of genre identification is not quite that simple, however. Barring a self-organizing approach (which I did investigate), the classification problem is inherently limited by the problem designer's choice of genre class specifications and the sample data they choose to represent those classes. That is, given a set of buckets and samples from each, it can classify whatever you throw at it with reasonable accuracy (to whatever extent that accuracy can be fully defined), but coming up with the buckets is yet another problem.
In terms of the different approaches, the recurrent Elman network seems to be the most robust simply because it (theoretically, at least) can define its own mechanisms for characterizing beat and tempo without being limited to classification based solely on that information. Unfortunately the practical issues surrounding the training process make it difficult to fully evaluate whether or not it can indeed demonstrate that kind of performance.
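As a rough illustration of the recurrent approach (this is not the original Matlab code, which isn't reproduced here), the Python sketch below shows an Elman-style forward pass together with the decision rule described for the third network: average the outputs over the sequence, optionally skipping an initial fraction, then take the max. The function names, the tanh hidden units, the linear output layer, the omission of bias terms, and the choice to feed back only the hidden layer's previous activations are all assumptions made for illustration.

```python
import math


def elman_forward(sequence, w_in, w_ctx, w_out):
    """Run an Elman-style network over a sequence of input vectors.

    w_in  : hidden x input weights
    w_ctx : hidden x hidden weights applied to the previous hidden state
    w_out : output x hidden weights
    Returns one output vector per time step. Biases are omitted (assumption).
    """
    hidden = [0.0] * len(w_ctx)  # context starts at zero
    outputs = []
    for x in sequence:
        # New hidden state depends on the current input and the previous hidden state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * h for w, h in zip(w_ctx[j], hidden))
            )
            for j in range(len(w_ctx))
        ]
        outputs.append([sum(w * h for w, h in zip(row, hidden)) for row in w_out])
    return outputs


def classify_sequence(outputs, skip_fraction=0.0):
    """Average per-step outputs (after skipping a leading fraction) and take the max.

    Returns (winning class index, list of mean outputs per class).
    """
    start = min(int(len(outputs) * skip_fraction), len(outputs) - 1)
    kept = outputs[start:]
    n_classes = len(kept[0])
    means = [sum(o[c] for o in kept) / len(kept) for c in range(n_classes)]
    winner = max(range(n_classes), key=means.__getitem__)
    return winner, means
```

Calling classify_sequence(outputs) corresponds to averaging over the whole sequence, and classify_sequence(outputs, skip_fraction=0.5) to looking only at the second half. Training is not shown.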
Code
Winamp was used to decode mp3s to mono 44kHz wavs. All of the networks and the rest of the data preparation for this project were done in Matlab.
- fftwav.m - build fft data from a wav on 1/30th second blocks
- dfftwav.m - build 30-second vectors by ffting 1/30th second blocks and then 30 second blocks
- fftdir.m - do an entire directory
- net2.m - code scraps used to build the second network
- sfftwav.m - build 30-second fft data sequences for a wav and return as cell matrix
- sfftdir.m - do an entire directory
- net3.m - code scraps used to build and train the Elman network
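The .m files themselves are not reproduced here. As a rough illustration of the two-stage transform that dfftwav.m is described as performing, the following Python sketch takes the spectrum of each short frame and then, for each frequency band, takes a second transform across frames to expose periodic ('beat') structure. It is a sketch under assumptions, not the original code: the function names are hypothetical, a direct DFT stands in for an FFT routine to keep the example dependency-free, and which axis of the 24x32 matrix holds frequency bands and which holds beat bins is not stated in the write-up, so both sizes are left as parameters.

```python
import cmath
import math


def dft_magnitudes(samples, n_bins):
    """Magnitudes of the first n_bins DFT coefficients (direct, unoptimized DFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)))
        for k in range(n_bins)
    ]


def beat_matrix(signal, frame_len, n_bands, n_beat_bins):
    """Two-stage transform: rows are frequency bands, columns are beat bins (assumed layout)."""
    n_frames = len(signal) // frame_len
    # Stage 1: spectrum of each short frame (1/30th of a second in the write-up).
    spectra = [
        dft_magnitudes(signal[f * frame_len:(f + 1) * frame_len], n_bands)
        for f in range(n_frames)
    ]
    # Stage 2: for each band, transform its magnitude across frames.
    return [
        dft_magnitudes([spectra[f][b] for f in range(n_frames)], n_beat_bins)
        for b in range(n_bands)
    ]


def flatten(matrix):
    """Flatten the matrix row by row into a single network input vector."""
    return [v for row in matrix for v in row]


# Toy signal: a tone in band 4 whose loudness pulses 3 times over the block.
frame_len, n_frames = 64, 60
signal = [
    (1 + math.cos(2 * math.pi * 3 * f / n_frames)) * math.cos(2 * math.pi * 4 * n / frame_len)
    for f in range(n_frames)
    for n in range(frame_len)
]
features = beat_matrix(signal, frame_len, n_bands=8, n_beat_bins=6)
```

In the toy example the pulsing tone shows up as a peak in row 4 at beat bin 3 (ignoring bin 0, which only reflects the average level). At the scale described in the write-up there would be 900 frames per 30-second block, each roughly 1470 samples long if '44khz' means 44.1 kHz, and a real FFT routine would be needed for reasonable speed.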
Presentation slides
First presentation, Partially Recurrent Networks
Some links: