Its been almost 2 years since I have been an amateur drummer (which apparently is also the time since my last blog) and I have always felt that it would be great to have something that can provide me with drum transcriptions from a given music source. I researched a bit and came across a library that provides an executable as well as an API that can be used to generate drum tabs (consisting of hi-hats, snare and the kick drum) from a music source. Its called ADTLib. This isn’t extremely accurate when one tests it and I’m sure the library will only get better with more data sources available to train the neural networks better but this was a definitely a good place to learn a bit about neural networks and libraries like Tensorflow. This blog post is basically meant serve as my personal notes wrt how Tensorflow has been used in ADTLib.
So to start off, the ADTLib source code actually doesn’t train any neural networks. What ADTLib essentially does is feed a music file through a pre-trained neural network to give us the automatic drum transcriptions in text as well as PDF form. We will need to start off by looking at two methods and one function inside https://github.com/CarlSouthall/ADTLib/blob/master/ADTLib/utils/__init__.py
- methods create() and implement() belonging to class SA
- function system_restore()
In system_restore() we initialise an instance of SA and call the create() method. There are a lot of parameters that are initialised when we create the neural network graph. We’ll not go into the details of those. Instead let’s look at how Tensorflow is used inside the SA.create() method. I would recommend reading this article on getting started with Tensorflow before going ahead with the next part of this blog post.
If you’ve actually gone through that article you’d know by now that Tensorflow creates graphs that implement a series of Tensorflow operations. The operations flow through the units of the graphs called ‘tensors’ and hence the name ‘tensorflow’. Great. So, getting back to the create() method, we find that first a tf.reset_default_graph()
is called. This resets the global variables of the graph and clears the default graph stack.
Next we call a method weight_bias_init(). As the name suggests this method initialises the weights and biases for our model. In a neural network, weights and biases are parameters which can be trained so that the neural network outputs values that are closest to the target output. We can use ‘variables’ to initialise these trainable parameters in Tensorflow. Take these examples from the weight_bias_init() code:
self.biases = tf.Variable(tf.zeros([self.n_classes]))
self.weights =tf.Variable(tf.random_normal([self.n_hidden[(len(self.n_hidden)-1)]*2, self.n_classes]))
self.biases
is set to variable with an initial value which is defined by the tensor returned by tf.zeros()
(returns a tensor with dimension=2 where all elements are set to 0. self.n_classes is set to 2 in the ADTLib code). self.weights
is initialised to a variable defined by the tensor returned by tf.random_normal()
. tf.random_normal() returns a tensor of the mentioned shape with random normal values (type float32) with a mean of 0.0 and a standard deviation of 1.0. These weights and biases are trained based on the type of optimisation function later on. In ADTLib no training is actually done wrt to these weights and biases. These parameters are loaded from pre-trained neural networks as I’ve mentioned before. However we need these tensors defined in order to be able to implement the neural network on input music source.
Next we initialise a few ‘placeholders’ and ‘constants’. Placeholders and constants are again ‘tensors’ and resemble units of the graph. Example lines from the code:
self.biases = tf.Variable(tf.zeros([self.n_classes]))
self.seq=tf.constant(self.truncated,shape=[1])
Placeholders are used when a graph needs to be provided external inputs. They can be provided values later on. In the above example we define a placeholder that is supposed to hold ‘float32’ values in an array of dimension [1, 1000, 1024]. (Don’t worry about how I arrived at these dimensions. Basically if you check the init() method for class SA, you’ll understand that ‘self.batch’ is structure of dimension [1000, 1024]). Constants as the name suggests hold constant values. In the above example, self.truncated is initialised to 1000. ‘shape’ is an optional paramaters that specifies the dimension of the resulting tensor. Here the dimension is set to [1].
Now, ADTLib uses a special type of recurrent neural networks called bidirectional recurrent neural networks (BRNN). Here neurons or cells of a regular RNN are split into two directions, one for positive time direction(forward states), and another for negative time direction(backward states). Inside the create()
method, we come across the following code:
self.outputs, self.states= tf.nn.bidirectional_dynamic_rnn(self.fw_cell,
self.bw_cell, self.x_ph,sequence_length=self.seq,dtype=tf.float32)
This creates the BRNN with the two types of cells provided as parameters, the input training data, the length of the sequence (which is 1000 in this case) and the data type. self.outputs
is a tuple (output_fw, output_bw) containing the forward and the backward RNN output Tensor.
The forward and backward outputs are concatenated and fed to the second layer of the BRNN as follows:
self.first_out=tf.concat((self.outputs[0],self.outputs[1]),2)
self.outputs2, self.states2= tf.nn.bidirectional_dynamic_rnn(self.fw_cell2,
self.bw_cell2,self.first_out,sequence_length=self.seq2,dtype=tf.float32)
We now have the graph that defines how the BRNN should behave. These next few lines of code in the create()
method deals with something called as soft-attention. This answer on stack overflow provides an easy introduction to this concept. Check it out if you want to but I’ll not go much into those details. But what happens essentially is that the forward and backward output cells from the second layer are again concatenated and then furthur processed to ultimately get a self.presoft
value which resembles (W*x+b) as seen below.
self.zero_pad_second_out=tf.pad(tf.squeeze(self.second_out),[[self.attention_number,self.attention_number],[0,0]])
self.attention_m=[tf.tanh(tf.matmul(tf.concat((self.zero_pad_second_out[j:j+self.batch_size],tf.squeeze(self.first_out)),1),self.attention_weights[j])) for j in range((self.attention_number*2)+1)]
self.attention_s=tf.nn.softmax(tf.stack([tf.matmul(self.attention_m[i],self.sm_attention_weights[i]) for i in range(self.attention_number*2+1)]),0)
self.attention_z=tf.reduce_sum([self.attention_s[i]*self.zero_pad_second_out[i:self.batch_size+i] for i in range(self.attention_number*2+1)],0)
self.presoft=tf.matmul(self.attention_z,self.weights)+self.biases
Next we come across self.pred=tf.nn.softmax(self.presoft)
. This basically decides what activation function to use for the output layer. In this case softmax activation function is used. IMO this is a good reference for different kind of activation functions.
We now move onto the SA.implement()
method. This function takes an input audio data, processed by madmom to create a spectrogram. Next self.saver.restore(sess, self.save_location+'/'+self.filename)
loads the respective parameters from pre-trained neural network files for respective sounds (hi-hat/snare/kick). These Tensorflow save files can be found under ADTLib/files. Once the parameters are loaded, the Tensorflow graph is executed using sess.run()
as following:
self.test_out.append(sess.run(self.pred, feed_dict={self.x_ph: np.expand_dims(self.batch,0),self.dropout_ph:1}))
When this function is executed we get the test results and further processing is done (this process is called peak-picking) to get the onsets data for the different percussive components.
I guess that’s it. There are a lot of details that I have omitted from this blog, mostly because it would make the blog way longer. I’d like to thank the author of ADTLib (Carl Southall) who cleared some icky doubts I had wrt to the ADTLib code. There is also a web version of ADTLib that has been developed with an aim to gather more data to train the networks better. So contribute data if you can!