Mid-level:
Examples: pitch- and beat-related descriptors, such as note onsets, fluctuation patterns, MFCCs
Low-level:
Ways to categorize audio features: level of abstraction, temporal scope, music aspect, signal domain, ML approach
Temporal scope
Applies to music as well as non-music audio
Instantaneous (~50 ms), Segment level (seconds), Global
Signal domain
Time domain
Frequency domain: Band energy ratio
Time-frequency representation: Spectrogram, Mel-spectrogram, constant-Q transform
Spectrogram: obtained via the short-time Fourier transform (STFT)
Traditional machine learning
Amplitude envelope, root-mean-square energy, zero-crossing rate, band energy ratio, spectral centroid,
spectral flux, spectral spread, spectral roll-off
Time domain feature pipeline
Audio \(\rightarrow\) ADC (sampling and quantization) \(\rightarrow\) framing (overlapping)
Reason for framing: 1 sample at 44.1 kHz lasts 1/44100 s ≈ 0.0227 ms, far below the ear's temporal resolution (~10 ms); frames overlap so that samples attenuated at a frame's edges by the window are still captured by the neighbouring frame.
Frame size (e.g., 128) is chosen as a power of 2: if the number of samples is a power of 2, the fast Fourier transform is faster.
If frame size \( \Uparrow \): frequency resolution \( \Uparrow \), time resolution \( \Downarrow \) (more frequency bins, but each frame spans more time)
If frame size \( \Downarrow \): frequency resolution \( \Downarrow \), time resolution \( \Uparrow \)
Windowing function: Hann window \( w(k) = 0.5 \left(1 - \cos\left(\frac{2 \pi k}{K-1}\right)\right), \; k = 0, \dots, K-1 \)
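As a quick sketch, the Hann formula above matches NumPy's built-in window (window length chosen arbitrarily):

```python
import numpy as np

K = 8
k = np.arange(K)
w = 0.5 * (1 - np.cos(2 * np.pi * k / (K - 1)))  # Hann window from the formula
print(np.allclose(w, np.hanning(K)))  # True: identical to np.hanning
```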
If we plot the raw spectrogram, most values are too small to be visible.
Therefore we plot the log of the amplitude (librosa's power_to_db function, which expects power, i.e. the squared amplitude, as input).
That fixes the amplitude axis, but there is one more catch: we do not perceive frequency linearly either, so the frequency axis should also be logarithmic.
Therefore the steps can be summarized as follows:
Given the signal, compute the STFT
Compute the absolute value of the output
Square the absolute values and convert to dB
Plot with a log-scaled frequency axis
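The steps above can be sketched with plain NumPy (a minimal STFT; in practice librosa.stft and power_to_db would be used — the frame size, hop length, and 440 Hz test tone are arbitrary choices):

```python
import numpy as np

def log_spectrogram(signal, frame_size=1024, hop_length=512):
    """STFT -> magnitude -> power -> dB."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop_length
    frames = np.stack([signal[i * hop_length : i * hop_length + frame_size] * window
                       for i in range(n_frames)])
    stft = np.fft.rfft(frames, axis=1)            # complex spectrum per frame
    power = np.abs(stft) ** 2                     # squared magnitude
    db = 10 * np.log10(np.maximum(power, 1e-10))  # power to dB
    return db.T                                   # shape: (frequency bins, frames)

sr = 22050
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)  # 1 s of a 440 Hz tone
S = log_spectrogram(sig)
print(S.shape)  # (513, 42): frame_size // 2 + 1 bins, 42 frames
```

The energy concentrates near bin 440 / (22050 / 1024) ≈ 20, as expected for the tone.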
Mel-spectrograms
Psychoacoustic experiment
1st sample: C2-C4 (65 - 262 Hz)
2nd sample: G6-A6 (1568-1760 Hz)
Both pairs span roughly 200 Hz, but when we listen, we perceive the first pair (two octaves) as a far larger jump than the second (one whole tone).
We have better resolution at low frequency than the higher frequency.
Humans perceive frequency logarithmically.
Ideal audio feature:
Time-frequency representation (simple spectrogram can do)
Perceptually-relevant amplitude representation (simple spectrogram can do)
Perceptually-relevant frequency representation (cannot do)
Mel scale: perceptually relevant scale for pitch.
\( m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) \)
\( f = 700 \cdot (10^{m/2595} - 1) \)
1000 Hz = 1000 Mel
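A quick sketch of the two conversion formulas (note the base-10 log): they are exact inverses, and 1000 Hz lands at roughly 1000 mel:

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

print(round(hz_to_mel(1000)))             # 1000: ~1000 mel at 1000 Hz
print(round(mel_to_hz(hz_to_mel(440))))   # 440: round trip recovers the input
```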
Recipe to extract mel spectrogram:
Extract STFT
Convert amplitude to DBs
Convert frequencies to Mel scale (steps below):
Choose the number of mel bands (a hyperparameter; typically 40-130)
Construct mel filter banks: (multi-step)
Apply mel filter banks to spectrogram
Steps to construct mel filter banks:
Convert lowest/highest frequency to mel using below formula
\( m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right) \)
Create equally spaced points on the mel scale: number of bands + 2 points (e.g., 8 points for 6 bands)
Convert points back to hertz
\( f = 700 \cdot (10^{m/2595} - 1) \)
Round to nearest frequency bin
Create triangular filters
Mel spectrogram = matrix multiplication of the mel filter banks with the spectrogram: \( M Y \)
Mel spectrogram shape: (# bands, # frames)
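The filter-bank steps above can be sketched from scratch with NumPy (sample rate, FFT size, and band count are arbitrary choices; in practice librosa.filters.mel does this):

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mel_filter_bank(sr=22050, n_fft=1024, n_mels=10):
    # 1. convert lowest/highest frequency to mel
    low_mel, high_mel = hz_to_mel(0), hz_to_mel(sr / 2)
    # 2. n_mels + 2 equally spaced points on the mel scale
    mel_points = np.linspace(low_mel, high_mel, n_mels + 2)
    # 3. convert points back to hertz, 4. round to the nearest frequency bin
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    # 5. build one triangular filter per band
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

M = mel_filter_bank()
print(M.shape)  # (10, 513): (# bands, # frequency bins)
# mel spectrogram = M @ Y, where Y is the power spectrogram of shape (# bins, # frames)
```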
Mel spectrogram applications: Audio classification, Automatic mood recognition, music genre classification, music instrument classification
Mel-Frequency cepstral coefficients
cepstral: from cepstrum — "ceps" is "spec" reversed, i.e. cepstrum \(\leftrightarrow\) spectrum
The cepstrum was developed in the 1960s while studying echoes in seismic signals.
Audio feature of choice for speech recognition/identification (1970s)
Music processing (2000s)
quefrency \(\leftrightarrow\) frequency
liftering \(\leftrightarrow\) filtering
rhamonic \(\leftrightarrow\) harmonic
Computing the cepstrum
\( C(x(t)) = \mathcal{F}^{-1}\big[\log \left| \mathcal{F}[x(t)] \right| \big] \) (the real cepstrum takes the log of the magnitude spectrum)
Speech generation: glottal pulses \(\rightarrow\) Vocal Tract \(\rightarrow\) Speech Signal
The glottal pulse carries pitch information; the vocal tract carries the timbre (vowels and consonants).
Log spectrum \( \rightarrow \) take the spectral envelope \( \rightarrow \) its peaks are called formants \( \rightarrow \) these carry the identity of the sound
The spectral envelope reflects the vocal tract frequency response (its impulse response)
Speech is the convolution of the glottal pulse with the vocal tract impulse response:
\( x(t) = e(t)\otimes h(t) \)
\( X(f) = E(f) \cdot H(f) \)
\( \log X(f) = \log E(f) + \log H(f) \)
We do not observe these two components separately; the goal is to separate them.
We are interested in the spectral envelope (the formants), not in the glottal excitation.
The envelope varies slowly across the log power spectrum, so we keep its low "quefrency" components and discard the high ones: a low-pass filter in the cepstral domain, called liftering.
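A minimal sketch of the cepstrum computation in NumPy (the test signal and lifter length are arbitrary choices):

```python
import numpy as np

def real_cepstrum(x):
    # inverse FFT of the log magnitude spectrum
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)  # small constant avoids log(0)
    return np.fft.ifft(log_mag).real

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
ceps = real_cepstrum(x)
# "liftering": keep only the low-quefrency coefficients (the spectral envelope)
n_keep = 30
envelope_ceps = ceps[:n_keep]
print(ceps.shape)  # (8000,): one coefficient per quefrency bin
```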
Then \( y_t^k \) is interpreted as the probability of observing label \( k \) at time \( t \), which defines a distribution over the set \( L'^T \) of length \( T \) sequences over the alphabet \( L' = L \cup \{\text{blank}\} \): \( p(\pi|x) = \prod_{t=1}^{T} y_t^{\pi_t} \).
From now on, we refer to the elements of \( L'^T \) as paths, and denote them \( \pi \).
The next step is to define a many-to-one map \( B : L'^T \rightarrow L^{\leq T} \), where \( L^{\leq T} \) is the set of possible labellings (i.e. the set of sequences of length less than or equal to \( T \) over the original label alphabet \( L \)).
We do this by simply removing all blanks and repeated labels from the paths (e.g. \( B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab \), where \( - \) denotes blank). Intuitively, this corresponds to outputting a new label when the network switches from predicting no label to predicting a label, or from predicting one label to another.
Finally, we use B to define the conditional probability of a given labelling \(l \in L^{≤T}\) as the sum of the probabilities of all the paths corresponding to it.
The probability of a given labelling is defined as: \( p(l|x) = \sum_{\pi \in B^{-1}(l)} p(\pi|x) \)
The forward variable \( \alpha_t(s) \) sums the probabilities of all paths that, after collapsing, match the first \( s \) symbols of the blank-augmented label sequence \( l' \) at time \( t \), multiplying the label probabilities along each path.
It is computed recursively: \( \alpha_t(s) = \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big) \, y_t^{l'_s} \) if \( l'_s = \text{blank} \) or \( l'_s = l'_{s-2} \); otherwise \( \alpha_t(s) = \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big) \, y_t^{l'_s} \).
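A sketch of the CTC forward computation, checked against brute-force path enumeration (the label alphabet, blank index, and random per-frame probabilities are assumptions for illustration):

```python
import numpy as np
from itertools import product

BLANK = 0

def ctc_forward(y, label):
    """p(label | x) via the CTC forward algorithm.
    y: (T, K) per-frame label probabilities; label: sequence without blanks."""
    ext = [BLANK]                      # l': label with blanks interleaved
    for s in label:
        ext += [s, BLANK]
    T, S = y.shape[0], len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, BLANK]
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]  # skip transition allowed
            alpha[t, s] = a * y[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

def collapse(path):
    # the map B: merge repeats, then drop blanks
    out, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            out.append(p)
        prev = p
    return tuple(out)

rng = np.random.default_rng(0)
y = rng.random((4, 3))
y /= y.sum(axis=1, keepdims=True)  # normalize each frame's distribution
brute = sum(np.prod([y[t, p] for t, p in enumerate(pi)])
            for pi in product(range(3), repeat=4) if collapse(pi) == (1, 2))
print(np.isclose(ctc_forward(y, [1, 2]), brute))  # True
```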
def my_decorator(func):
    def wrapper():
        print("before the function call")
        func()
        print("after the function call")
    return wrapper
@my_decorator
def say_hello():
    print("Hello!")

say_hello()
\[
f(x; a, b) = \frac{x^{a-1} (1-x)^{b-1}}{B(a,b)} = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1}
\]
Inspect code (find the source file of a function from its call site)
x = model.classify_batch(signal)
How to find where this function is defined?
import inspect
# assuming `model.classify_batch` is a regular method defined in a source file
print(inspect.getsourcefile(model.classify_batch))
Custom functions in app script
function onOpen() {
  var ui = SpreadsheetApp.getUi();
  ui.createMenu('Sumit Menu')
      .addItem('Insert the date', 'insertDate')
      .addToUi();
}
/**
* Multiplies the input value by 2.
*
* @param {number} input The value to multiply.
* @return The input multiplied by 2.
* @customfunction
*/
function DOUBLE(input) {
  return input * 2;
}
function insertDate() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  // var cell = sheet.getRange('B2');
  var cell = sheet.getActiveRange();
  cell.setValue(new Date());
}
Monte Carlo Estimation
1. Identify the Expectation or Integral
Let's say you want to compute the average (expectation) of some function \( f(x) \) over a probability distribution \( p(x) \). Mathematically, this looks like: \( \mathbb{E}_{p(x)}[f(x)] = \int f(x) \, p(x) \, dx \)
This is the expected value of \( f(x) \) under the distribution \( p(x) \).
2. Draw Random Samples
Instead of solving the integral directly, you draw a number of random samples \( x_1, x_2, \dots, x_n \) from the distribution \( p(x) \). These samples represent possible values from the distribution.
3. Evaluate the Function
For each sample \( x_i \), you compute the value of the function \( f(x_i) \). This gives you a bunch of values \( f(x_1), f(x_2), \dots, f(x_n) \).
4. Take the Average
Once you have evaluated the function for all the samples, you compute the average of these values. This average is your Monte Carlo estimate: \( \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i) \)
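The four steps can be sketched for \( f(x) = x^2 \) under a standard normal \( p(x) \), whose true expectation is 1 (the sample count and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)  # step 2: draw x_i ~ p(x) = N(0, 1)
values = samples ** 2                   # step 3: evaluate f(x_i) = x_i^2
estimate = values.mean()                # step 4: average = Monte Carlo estimate
print(round(estimate, 2))               # close to the true E[x^2] = 1
```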
\( \nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\big[ f(z) \, \nabla_\phi \log q_\phi(z) \big] \)
The gradient of the expectation \( \nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] \) is now written as the expectation of \( f(z) \) multiplied by the gradient of the log-probability \( \log q_\phi(z) \). This step is crucial because it moves the gradient operator inside the expectation, making it feasible to estimate this expression using Monte Carlo sampling.
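A sketch of this score-function (log-derivative) estimator for a simple case: \( q_\mu(z) = \mathcal{N}(\mu, 1) \) and \( f(z) = z^2 \), where the true gradient is \( \frac{d}{d\mu}(\mu^2 + 1) = 2\mu \) (the choice of distribution, \( f \), and sample count are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0
z = rng.normal(mu, 1.0, 1_000_000)  # z ~ q_mu(z) = N(mu, 1)
score = z - mu                      # grad_mu log q_mu(z) = (z - mu) / sigma^2, sigma = 1
grad_est = np.mean(z ** 2 * score)  # Monte Carlo estimate of grad_mu E[z^2]
print(round(grad_est, 1))           # near the true gradient 2 * mu = 2.0
```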
Jacobian
If you have a transformation from \( \epsilon \) to \( z \) given by:
$$ z = g_{\phi}(\epsilon, x) $$
then the volume elements are related by:
$$ d z = |J| \, d \epsilon $$
Reparameterization technique
\[z = g_{\phi}(\epsilon, x)
\]
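A sketch for a Gaussian: with \( \epsilon \sim \mathcal{N}(0,1) \) and \( g_\phi(\epsilon) = \mu + \sigma\epsilon \), samples of \( z \) follow \( \mathcal{N}(\mu, \sigma^2) \) while all randomness stays in \( \epsilon \) (the values of \( \mu \) and \( \sigma \) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                # the parameters phi
eps = rng.standard_normal(100_000)  # eps ~ p(eps) = N(0, 1), parameter-free noise
z = mu + sigma * eps                # z = g_phi(eps): deterministic in phi
print(round(z.mean(), 2), round(z.std(), 2))  # approximately (2.0, 0.5)
```

Because \( z \) is a deterministic function of \( \phi \), gradients with respect to \( \mu \) and \( \sigma \) can flow through the samples.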
Change of Variables Formula Proof
Given a deterministic transformation \( z = g_\phi(\epsilon, x) \), the probability densities of \( z \) and \( \epsilon \) are related by the formula: \( q_\phi(z|x) = p(\epsilon) \left| \det \frac{\partial z}{\partial \epsilon} \right|^{-1} \)
The change of variables formula in probability theory states that if we apply a transformation \( z = g_\phi(\epsilon, x) \), then the probability density must satisfy:
$$ q_\phi(z|x) dz = p(\epsilon) d\epsilon $$
This states that the probability mass in the \( z \)-space is equal to the probability mass in the \( \epsilon \)-space.
Step 2: Jacobian Matrix and Volume Transformation
To account for how small changes in \( \epsilon \) propagate to changes in \( z \), we compute the Jacobian matrix, which is defined as:
$$ J = \frac{\partial z}{\partial \epsilon} $$
The Jacobian matrix captures the partial derivatives of the components of \( z \) with respect to \( \epsilon \). The volume elements transform as: \( dz = |\det J| \, d\epsilon \)
This proof demonstrates how the probability density function transforms under a deterministic mapping using the change of variables formula. The Jacobian determinant accounts for the scaling of the volume element when moving from the auxiliary variable \( \epsilon \) to the transformed variable \( z \).
Kernel density estimation (KDE) using a Parzen window
Imagine you have a bunch of little candles (representing data points). If you light each candle, the flame spreads out a bit (that’s the kernel). Now, if you stand back, you’ll see that the overall glow of all the candles shows where most of the candles are grouped (this is the KDE). The size of each flame (bandwidth) affects how smooth the glow looks.
In summary, the Parzen window in KDE helps you smooth out data points by putting little curves over them and combining them to estimate the shape of the data distribution. The width of the curve controls how smooth or detailed the estimate will be.
Instead of using bars like in a histogram, KDE uses smooth curves to show the distribution.
In kernel density estimation (KDE), the term "Parzen window" (sometimes called a Parzen–Rosenblatt window) refers to a non-parametric technique used for estimating the probability density function (PDF) of a random variable. It's essentially a method of smoothing data to generate a continuous distribution from discrete data points.
The density estimate is \( \hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) \), where \( K \) is the kernel — a non-negative function — and \( h > 0 \) is a smoothing parameter called the bandwidth or simply width. A kernel with subscript \( h \) is called the scaled kernel and is defined as \( K_h(x) = \frac{1}{h} K\left(\frac{x}{h}\right) \).
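A minimal sketch with a Gaussian kernel (the data points, evaluation grid, and bandwidth are arbitrary choices):

```python
import numpy as np

def gaussian_kde(x, data, h):
    # scaled kernel K_h(u) = (1/h) K(u/h), with K the standard normal pdf
    u = (x[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h  # average the scaled kernels over the data points

data = np.array([-1.0, 0.0, 0.2, 1.0])  # the "candles"
grid = np.linspace(-3.0, 3.0, 7)
density = gaussian_kde(grid, data, h=0.5)
print(np.round(density, 3))
```

A smaller `h` makes the estimate spikier around each point; a larger `h` smooths the glow.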
message = "This is a string"
message[0] = 'p'  # TypeError: strings are immutable
Lists are mutable
Tuples are immutable
# Example with a mutable list
lst = [1, 2, 3]
print(f"ID before change: {id(lst)}")
lst.append(4)
print(f"ID after change: {id(lst)}") # ID will not change because the list is mutable
# Example with an immutable tuple
tup = (1, 2, 3)
print(f"ID before change: {id(tup)}")
tup = tup + (4,) # Creating a new tuple, not modifying the original
print(f"ID after change: {id(tup)}") # ID will change because the tuple is immutable
Decorators: A decorator is a function that accepts a function and returns a function
from dataclasses import dataclass

@dataclass
class ExampleClass:
    """A class using the dataclass decorator, which generates __init__ automatically."""
    title: str
    description: str

example = ExampleClass("Dataclass", "A class without init method")
print(example)
print(example.title)
Staticmethod decorator
class ExampleClass:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    @staticmethod
    def example_method(age):
        print("This is a static method, no need of self")
        print(f"Age is {age}")

ExampleClass.example_method(25)
Get attribute in Python
class Person:
    def __init__(self, name):
        self.name = name

p = Person('sumit')
print(getattr(p, 'name', 'new'))  # Output: 'sumit' ('new' would be returned if the attribute were missing)
Understanding Pynini
import pynini
###
# Example 1: Composing two FSTs
fst = pynini.union("dog", "cat", "mouse")
t = pynini.accep("dog")
# Perform the composition
result = t @ fst
# Method 1: print as a string if it's an acceptor
if result.string():
    print("Result (as string):", result.string())
# Method 2: iterate over paths and print
for path in result.paths().items():
    print("Input:", path[0])
    print("Output:", path[1])
###
# Example 2: Composing two FSTs
union_fst = pynini.union("hello", "world")
print(pynini.compose("hello", union_fst).num_states() > 0) # True
print(pynini.compose("world", union_fst).num_states() > 0) # True
print(pynini.compose("other", union_fst).num_states() > 0) # False
###
# Example 3: Composing two FSTs
fst_union = pynini.union("apple", "banana", "cherry")
fst_union.draw("union_fst.dot", title="Union Example")
####
# Equivalent Notation: The | operator is a shorthand for pynini.union
fst_union = pynini.union("dog", "cat")
fst_union_alt = pynini.accep("dog") | pynini.accep("cat")
####
# Example 4: Transducer
import pynini
# Define a transducer that replaces 'a' with 'b' and 'c' with 'd'
fst = pynini.cross("a", "b") | pynini.cross("c", "d")
test_input = "c"  # or "a"
test_output = pynini.compose(test_input, fst)
print(test_output.string())
####
import pynini
# Define a transducer that replaces 'a' with 'b' and 'c' with 'd'
fst = pynini.cross("a", "b") | pynini.cross("c", "d")
# Use closure to handle multiple occurrences
transducer = pynini.closure(fst)
# Input string with multiple 'c's
test_input = "ccccc"
# Apply the transducer
test_output = pynini.compose(test_input, transducer)
# Print the output string
print(test_output.string())  # Expected output: 'ddddd'
###
import pynini
# Define a list of mappings (input -> output)
mapping = [("hello", "hi"), ("world", "earth")]
# Create a transducer using string_map
fst = pynini.string_map(mapping)
# Test the transducer
input_str = "world"
output_str = pynini.compose(input_str, fst).string()
print(output_str)  # Output: 'earth'
###
import pynini
# Define a list of mappings (input -> output)
mapping = [("hello", "hi"), ("world", "earth")]
# Create a transducer using string_map
fst = pynini.string_map(mapping)
# Test the transducer with a multi-word string
input_str = "hello world"
# Split the input string into words
words = input_str.split()
# Apply the transducer to each word
transformed_words = []
for word in words:
    word_fst = pynini.accep(word)  # create an acceptor for each word
    transformed_word_fst = pynini.compose(word_fst, fst)  # apply the transducer
    transformed_word_str = transformed_word_fst.string()  # convert back to string
    transformed_words.append(transformed_word_str)
# Join the transformed words back into a single string
output_str = " ".join(transformed_words)
print(output_str) # Should output: 'hi earth'
###
import pynini
# Create the transducer that defines the rewrite rule
rewrite_rule = pynini.cross("a", "b")
# Define the left and right contexts
left_context = pynini.accep("x") # 'x' must precede 'a'
right_context = pynini.accep("y") # 'y' must follow 'a'
chars = ([chr(i) for i in range(1, 91)] +
         ["\\[", "\\\\", "\\]"] +  # '[', '\', ']' must be escaped for pynini
         [chr(i) for i in range(94, 256)])
# Apply the context-dependent rewrite rule
sigma_star = pynini.string_map(chars).closure()
output_fst = pynini.cdrewrite(rewrite_rule, left_context, right_context, sigma_star)
# Define the input string
input_fst = pynini.accep("Sumitxay")
output = pynini.compose(input_fst, output_fst).string()
# Convert the output FST to a string and print it
print(output)  # 'Sumitxby': the 'a' is replaced by 'b' because it is surrounded by 'x' and 'y'