Introduction

The link below contains notes on speech signal processing.

Lecture-wise notes

Important formulas

Pitch

Intensity, loudness and timbre

Audio signals

Analog to Digital conversion

How to record sound (ADC), how to reproduce sound (DAC)

Audio Features

Level of abstraction

Temporal scope

  • Applies to music as well as non-music audio
  • Instantaneous (~50 ms), Segment level (seconds), Global
  • Signal domain

    Time domain feature pipeline

    Frequency domain feature pipeline

    Time domain audio features

    Fourier Transform

    Introduction

    Discrete Fourier Transform

    From DFT to Fast Fourier Transform

    Short-Time Fourier Transform

    Mel-spectrograms

    Mel-frequency cepstral coefficients (MFCCs)

    PyTorch for audio applications

    Torchaudio

    Connectionist Temporal Classification (CTC).

    Derivation:

    Chi Square Distribution

    Formulation:

    Likelihood \( \mathcal{L} = P( \mathbf{D}|\boldsymbol{\theta}, M) =\displaystyle \prod_{i=1}^N \dfrac{\exp\left(\dfrac{-r_i^2}{2\sigma_i^2}\right)}{\sqrt{2\pi}\,\sigma_i} \)

    \( \log \mathcal{L} = \frac{1}{2} \displaystyle \sum_{i=1}^N \bigg( -\log(2\pi) - \log(\sigma_i^2) - \frac{r_i^2}{\sigma_i^2} \bigg) \)

    If \( \sigma_i \) is constant, then

    \( \log \mathcal{L} = c - \frac{1}{2} \chi^2 \)

    It’s often more convenient to work with \( \log \mathcal{L} \) rather than \( \mathcal{L} \) itself; maximizing the likelihood is then equivalent to minimizing \( \chi^2 \).

    If \( Z_1, \ldots, Z_k \) are independent, standard normal random variables, then the sum of their squares,

    \( Q = \displaystyle \sum_{i=1}^{k} Z_i^2, \)

    is distributed according to the chi-squared distribution with \( k \) degrees of freedom.
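    A quick empirical check of this property, as a sketch using only the standard library: the sum of \( k \) squared standard normals should have mean \( k \) and variance \( 2k \).

```python
import random
import statistics

random.seed(0)

# Q = sum of k squared standard normals should follow the chi-squared
# distribution with k degrees of freedom: mean k, variance 2k.
k, n = 5, 20000
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
           for _ in range(n)]

mean = statistics.mean(samples)      # close to k = 5
var = statistics.variance(samples)   # close to 2k = 10
```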

    Markov Chain Monte Carlo

    Metropolis algorithm:
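    A minimal sketch of the Metropolis algorithm, assuming a standard normal target density and a symmetric uniform proposal (the target, step size, and burn-in below are illustrative choices):

```python
import math
import random

random.seed(0)

def target(x):
    # Unnormalized density of the (standard normal) target distribution
    return math.exp(-0.5 * x * x)

x, samples = 0.0, []
for _ in range(50000):
    # Symmetric proposal: a uniform step around the current state
    proposal = x + random.uniform(-1.0, 1.0)
    # Accept with probability min(1, p(proposal) / p(current));
    # on rejection, keep the current state
    if random.random() < min(1.0, target(proposal) / target(x)):
        x = proposal
    samples.append(x)

kept = samples[5000:]  # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
```

    Because the proposal is symmetric, the acceptance ratio needs only the target density ratio, so the normalization constant cancels.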

    Weighted Finite State transducer (WFST)

    Example: writing the regular expression ab*cd+e as a WFST
    [WFST diagram for ab*cd+e]
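    As a sanity check on what the expression accepts, the same pattern in Python's re module (b may repeat zero or more times, d must appear at least once):

```python
import re

pattern = re.compile(r"ab*cd+e")

for s in ["acde", "abbcdde", "ace", "abcd"]:
    print(s, bool(pattern.fullmatch(s)))
```

    "acde" and "abbcdde" match; "ace" fails (no d) and "abcd" fails (no trailing e).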

    Code to build a WFST with pynini and write it to a file

    
        import pynini
        
        teens = pynini.string_map([
            ("onze", "11"),
            ("douze", "12"),
            ("treize", "13"),
            ("quatorze", "14"),
            ("quinze", "15"),
            ("seize", "16"),
        ])
        
        teens.write("teens.fst")
            

    Command to Generate High-Resolution Image

    
        fstdraw --isymbols=ascii.syms -portrait teens.fst | dot -Tpng -Gdpi=300 > teens_high_res.png
            
    FST files

    Arguments in Python

    
    def example_function(pos1, pos2, *args, kw1, kw2="default", **kwargs):
        print(f"pos1: {pos1}, pos2: {pos2}")
        print(f"args: {args}")
        print(f"kw1: {kw1}, kw2: {kw2}")
        print(f"kwargs: {kwargs}")
    
    # Calling the function
    example_function(1, 2, 3, 4, 5, kw1="value1", extra1="extra", extra2="extra2")
    output:
    pos1: 1, pos2: 2
    args: (3, 4, 5)
    kw1: value1, kw2: default
    kwargs: {'extra1': 'extra', 'extra2': 'extra2'}
    

    Decorator example

    
    def my_decorator(func):
        def wrapper():
            print("before the function call")
            func()
            print("after the function call")
        return wrapper
    
    @my_decorator
    def say_hello():
        print("Hello!")
    
    say_hello()
    

    Sort a dictionary based on key or value

    
            sorted(d.items(), key=lambda x: x[1])  # sort by value
            sorted(d.items(), key=lambda x: x[0])  # sort by key
        

    EM algorithm

    Convert the probability into an expectation and then use Jensen's inequality; \( z \) is the hidden variable.

    \( \log p(x; \theta) = \log \sum_z p(x, z; \theta) \)

    \( = \log \sum_z \frac{Q(z)}{Q(z)} p(x, z; \theta) \)

    \( = \log E_{z \sim Q} \left[ \frac{p(x, z; \theta)}{Q(z)} \right] \)

    \( \geq E_{z \sim Q} \left[ \log \frac{p(x, z; \theta)}{Q(z)} \right] \)

    \( = \text{ELBO}(x; Q, \theta) \)

    Corollary

    \( \log p(x; \theta) = \text{ELBO}(x; Q, \theta) \iff Q(z) = p(z | x; \theta) \)

    Additional Equations

    \( E_{z \sim Q} [g(z)] = \sum_z Q(z) g(z) \)

    \( g(z) = \frac{p(x, z; \theta)}{Q(z)} \)

    EM Algorithm Steps

    Randomly initialize \( \theta \)

    Loop until convergence:

    • E-step: set \( Q(z) = p(z \mid x; \theta) \), which makes the ELBO tight (by the corollary above)
    • M-step: update \( \theta \leftarrow \arg\max_{\theta} \, \text{ELBO}(x; Q, \theta) \)
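    As a concrete sketch, EM for a two-component Gaussian mixture with unit variances; the data, initialization, and iteration count below are made up for illustration:

```python
import math
import random

random.seed(0)

# Hypothetical data: two unit-variance Gaussian clusters at -2 and 3
data = ([random.gauss(-2.0, 1.0) for _ in range(200)] +
        [random.gauss(3.0, 1.0) for _ in range(200)])

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

mu1, mu2, pi1 = -1.0, 1.0, 0.5  # rough initialization of theta
for _ in range(50):
    # E-step: responsibilities Q(z) = p(z | x; theta)
    r1 = []
    for x in data:
        p1 = pi1 * normal_pdf(x, mu1)
        p2 = (1 - pi1) * normal_pdf(x, mu2)
        r1.append(p1 / (p1 + p2))
    # M-step: maximize the ELBO with respect to theta
    n1 = sum(r1)
    mu1 = sum(r * x for r, x in zip(r1, data)) / n1
    mu2 = sum((1 - r) * x for r, x in zip(r1, data)) / (len(data) - n1)
    pi1 = n1 / len(data)
```

    The estimated means should recover the true cluster centers (about -2 and 3) and the mixing weight should be close to 0.5.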

    Beta function equation:

    \[ \mathrm{B}(\alpha_{1},\alpha_{2},\ldots \alpha_{n}) = \frac{\Gamma(\alpha_{1}) \, \Gamma(\alpha_{2}) \cdots \Gamma(\alpha_{n})}{\Gamma(\alpha_{1} + \alpha_{2} + \cdots + \alpha_{n})} \]

    \[ \mathrm{B}(z_{1}, z_{2}) = \int_{0}^{1} t^{z_{1}-1} (1-t)^{z_{2}-1} \, dt \]

    \[ \mathrm{B}(z_{1}, z_{2}) = \frac{\Gamma(z_{1}) \, \Gamma(z_{2})}{\Gamma(z_{1} + z_{2})} \]

    \[ f(x; a, b) = \frac{x^{a-1} (1-x)^{b-1}}{B(a,b)} = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1} \]
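    The identity \( \mathrm{B}(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b) \) can be checked directly with math.gamma:

```python
import math

def beta(a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

# Gamma(2) = 1!, Gamma(3) = 2!, Gamma(5) = 4!, so B(2, 3) = 2/24 = 1/12
value = beta(2, 3)
```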

    Inspect a code (find location of a function from where it is called)

    Given x = model.classify_batch(signal), how do you find where classify_batch is defined?
    
        import inspect
        # Assuming `model.classify_batch` is a method
        print(inspect.getsourcefile(model.classify_batch))
    

    Custom functions in Apps Script

        
        function onOpen() {
          var ui = SpreadsheetApp.getUi();
          ui.createMenu('Sumit Menu')
            .addItem('Insert the date', 'insertDate')
            .addToUi();
        }

        /**
         * Multiplies the input value by 2.
         *
         * @param {number} input The value to multiply.
         * @return The input multiplied by 2.
         * @customfunction
         */
        function DOUBLE(input) {
          return input * 2;
        }

        function insertDate() {
          var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
          // var cell = sheet.getRange('B2');
          var cell = sheet.getActiveRange();
          cell.setValue(new Date());
        }
        
    

    Monte Carlo Estimation

    1. Identify the Expectation or Integral

    Let's say you want to compute the average (expectation) of some function \( f(x) \) over a probability distribution \( p(x) \). Mathematically, this looks like:

    \[ \mathbb{E}_{p(x)}[f(x)] = \int f(x) p(x) \, dx \]

    This is the expected value of \( f(x) \) under the distribution \( p(x) \).

    2. Draw Random Samples

    Instead of solving the integral directly, you draw a number of random samples \( x_1, x_2, \dots, x_n \) from the distribution \( p(x) \). These samples represent possible values from the distribution.

    3. Evaluate the Function

    For each sample \( x_i \), you compute the value of the function \( f(x_i) \). This gives you a bunch of values \( f(x_1), f(x_2), \dots, f(x_n) \).

    4. Take the Average

    Once you have evaluated the function for all the samples, you compute the average of these values. This average is your Monte Carlo estimate:

    \[ \mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i) \]

    This average gives you an approximation of the true expectation.
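    The four steps above can be sketched as follows, estimating \( \mathbb{E}[x^2] \) under a standard normal, whose true value is 1:

```python
import random

random.seed(0)

# f(x) = x^2 under a standard normal p(x); E[f(x)] = Var(x) = 1
n = 100000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]  # step 2: draw samples
values = [x * x for x in samples]                     # step 3: evaluate f
estimate = sum(values) / n                            # step 4: average
```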

    Log derivative trick

    \[ \nabla_\phi \log q_\phi(z) = \frac{ \nabla_\phi q_\phi(z) }{ q_\phi(z) } \]

    We start with the gradient of the probability distribution \( q_\phi(z) \):

    \[ \nabla_\phi q_\phi(z) = q_\phi(z) \nabla_\phi \log q_\phi(z) \]

    We want to approximate:

    \[ \nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \nabla_\phi \int f(z) q_\phi(z) \, dz \]

    Using the log-derivative trick, we apply the identity:

    \[ \nabla_\phi q_\phi(z) = q_\phi(z) \nabla_\phi \log q_\phi(z) \]

    Thus, the expression becomes:

    \[ \nabla_\phi \int f(z) q_\phi(z) \, dz = \int f(z) q_\phi(z) \nabla_\phi \log q_\phi(z) \, dz \]

    Now, recognizing that \( q_\phi(z) \) is the probability density, we can re-express this integral as an expectation:

    \[ \nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)} \left[ f(z) \nabla_\phi \log q_\phi(z) \right] \]

    ### Interpretation:

    The gradient of the expectation \( \nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] \) is now written as the expectation of \( f(z) \) multiplied by the gradient of the log-probability \( \log q_\phi(z) \). This step is crucial because it moves the gradient operator inside the expectation, making it feasible to estimate this expression using Monte Carlo sampling.
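    A minimal sketch of the resulting score-function estimator, assuming \( q_\phi = \mathcal{N}(\phi, 1) \) and \( f(z) = z \), so the true gradient \( \nabla_\phi \mathbb{E}[f(z)] = \nabla_\phi \phi = 1 \):

```python
import random

random.seed(0)

# q_phi = N(phi, 1) and f(z) = z, so grad_phi E[f(z)] = 1 exactly.
# The score function is grad_phi log q_phi(z) = (z - phi).
phi, n = 1.5, 200000
samples = [random.gauss(phi, 1.0) for _ in range(n)]
grad_estimate = sum(z * (z - phi) for z in samples) / n
```

    The Monte Carlo average of \( f(z)\,\nabla_\phi \log q_\phi(z) \) recovers the true gradient without differentiating through the sampling step.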

    Jacobian

    If you have a transformation from ε to z given by:

    $$ z = g_{\phi}(\epsilon, x) $$

    For a transformation from ε to z:

    $$ d z = |J| \, d \epsilon $$

    Reparameterization technique

    \[z = g_{\phi}(\epsilon, x) \]

    Change of Variables Formula Proof

    Given a deterministic transformation \( z = g_\phi(\epsilon, x) \), the probability densities of \( z \) and \( \epsilon \) are related by the formula:

    $$ q_\phi(z|x) = p(\epsilon) \left| \det \left( \frac{\partial \epsilon}{\partial z} \right) \right| $$

    Step-by-Step Proof

    Step 1: Change of Variables Theorem

    The change of variables formula in probability theory states that if we apply a transformation \( z = g_\phi(\epsilon, x) \), then the probability density must satisfy:

    $$ q_\phi(z|x) dz = p(\epsilon) d\epsilon $$

    This states that the probability mass in the \( z \)-space is equal to the probability mass in the \( \epsilon \)-space.

    Step 2: Jacobian Matrix and Volume Transformation

    To account for how small changes in \( \epsilon \) propagate to changes in \( z \), we compute the Jacobian matrix, which is defined as:

    $$ J = \frac{\partial z}{\partial \epsilon} $$

    The Jacobian matrix captures the partial derivatives of the components of \( z \) with respect to \( \epsilon \). The volume elements transform as:

    $$ dz = \left| \det \left( \frac{\partial z}{\partial \epsilon} \right) \right| d\epsilon $$

    This means the volume element in the \( z \)-space is scaled by the absolute value of the Jacobian determinant.

    Step 3: Apply the Change of Variables Formula

    Now we can substitute the expression for \( dz \) into the change of variables formula:

    $$ q_\phi(z|x) \left| \det \left( \frac{\partial z}{\partial \epsilon} \right) \right| d\epsilon = p(\epsilon) d\epsilon $$

    This shows how the probability densities are related through the transformation.

    Step 4: Solve for \( q_\phi(z|x) \)

    We can cancel the \( d\epsilon \) terms on both sides of the equation, leaving:

    $$ q_\phi(z|x) \left| \det \left( \frac{\partial z}{\partial \epsilon} \right) \right| = p(\epsilon) $$

    Finally, solving for \( q_\phi(z|x) \) gives us the desired result:

    $$ q_\phi(z|x) = p(\epsilon) \left| \det \left( \frac{\partial \epsilon}{\partial z} \right) \right| $$

    Conclusion

    This proof demonstrates how the probability density function transforms under a deterministic mapping using the change of variables formula. The Jacobian determinant accounts for the scaling of the volume element when moving from the auxiliary variable \( \epsilon \) to the transformed variable \( z \).
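    A numerical sanity check of the formula, assuming the affine transformation \( z = \mu + \sigma \epsilon \) (so the Jacobian determinant is simply \( \sigma \)); the values of \( \mu, \sigma, \epsilon \) are arbitrary:

```python
import math

def std_normal_pdf(e):
    return math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)

# Affine reparameterization z = mu + sigma * eps, so |det(dz/deps)| = sigma
mu, sigma, eps = 0.7, 2.0, -0.3
z = mu + sigma * eps

# Change of variables: q(z) = p(eps) * |det(deps/dz)| = p(eps) / sigma
q_z = std_normal_pdf(eps) / sigma

# Direct evaluation of the N(mu, sigma^2) density at z agrees
direct = (math.exp(-0.5 * ((z - mu) / sigma) ** 2)
          / (sigma * math.sqrt(2.0 * math.pi)))
```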

    Kernel density estimation using a Parzen window

    In kernel density estimation (KDE), the term "Parzen window" (sometimes called a Parzen–Rosenblatt window) refers to a non-parametric technique for estimating the probability density function (PDF) of a random variable. It is essentially a method of smoothing data to generate a continuous distribution from discrete data points: instead of using bars like a histogram, KDE uses smooth curves to show the distribution.

    An analogy: imagine you have a bunch of little candles (the data points). If you light each candle, the flame spreads out a bit (that's the kernel). Standing back, the overall glow of all the candles shows where most of the candles are grouped (this is the KDE). The size of each flame (the bandwidth) affects how smooth the glow looks.

    In summary, the Parzen window in KDE smooths the data by placing a small curve over each point and summing the curves to estimate the shape of the data distribution; the width of the curve controls how smooth or detailed the estimate is.

    \[ \widehat{f}_{h}(x) = \frac{1}{n} \sum_{i=1}^{n} K_{h}(x - x_{i}) = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x - x_{i}}{h} \right) \]

    Where \( K \) is the kernel — a non-negative function — and \( h > 0 \) is a smoothing parameter called the bandwidth or simply width. A kernel with subscript \( h \) is called the scaled kernel and is defined as:

    \[ K_h(x) = \frac{1}{h} K\left( \frac{x}{h} \right) \]
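    A direct implementation of the estimator above, with a Gaussian kernel; the data points and bandwidth are toy values for illustration:

```python
import math

def K(u):
    # Gaussian kernel
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def f_hat(x, data, h):
    # f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h)
    return sum(K((x - xi) / h) for xi in data) / (len(data) * h)

data = [1.0, 1.2, 2.5, 3.1]  # toy data points (the "candles")
h = 0.5                      # bandwidth: the size of each "flame"

# The estimate is a proper density: a Riemann sum over a wide grid is ~1
step = 0.01
total = sum(f_hat(-5.0 + i * step, data, h) for i in range(1500)) * step
```

    The estimate is high near the clustered points and low far away, and it integrates to one like any density.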

    MethodType in Python

    
    from types import MethodType

    class Configurable:
        def __init__(self, mode):
            self.mode = mode
        
        def process(self):
            return "Default processing"
    
    def special_processing(self):
        return f"Processing in {self.mode} mode"
    
    obj = Configurable("test")
    print(obj.process()) ## Output: Default processing
    obj.process = MethodType(special_processing, obj)
    
    print(obj.process())  # Output: Processing in test mode
    

    Debug in espnet

    If you want to enable pdb in espnet2 https://github.com/espnet/espnet/pull/3941

    Useful vscode extensions

    Github notes

    
        git checkout -b  origin/
        

    Python fundamentals

    Understanding Pynini

    
        import pynini
    
        ###
        # Example 1: Composing two FSTs
        fst = pynini.union("dog", "cat", "mouse")
        t = pynini.accep("dog")
        
        # Perform the composition
        result = t @ fst
        # Method 1: Print as a string if it's an acceptor
        if result.string():
            print("Result (as string):", result.string())
        
        # # Method 2: Iterate over paths and print
        # Iterate over paths using list()
        for path in result.paths().items():
            
            print("Input:", path[0])
            print("Output:", path[1])
        
        ###
        # Example 2: Composing two FSTs
        union_fst = pynini.union("hello", "world")
        print(pynini.compose("hello", union_fst).num_states() > 0)  # True
        print(pynini.compose("world", union_fst).num_states() > 0)  # True
        print(pynini.compose("other", union_fst).num_states() > 0)  # False
        
        ### 
        # Example 3: Composing two FSTs
        fst_union = pynini.union("apple", "banana", "cherry")
        fst_union.draw("union_fst.dot", title="Union Example")
        
        #### 
        # Equivalent Notation: The | operator is a shorthand for pynini.union
        fst_union = pynini.union("dog", "cat")
        fst_union_alt = pynini.accep("dog") | pynini.accep("cat")
        
        
        #### 
        # Example 4: Transducer
        import pynini
        
        # Define a transducer that replaces 'a' with 'b' and 'c' with 'd'
        fst = pynini.cross("a", "b") | pynini.cross("c", "d")
        test_input = "c"  # or "a"
        
        test_output = pynini.compose(test_input, fst)
        print(test_output.string())
        
        
        
        #### 
        
        import pynini
        
        # Define a transducer that replaces 'a' with 'b' and 'c' with 'd'
        fst = pynini.cross("a", "b") | pynini.cross("c", "d")
        
        # Use closure to handle multiple occurrences of 'c'
        transducer = pynini.closure(fst)
        
        # Input string with multiple 'c's
        test_input = "ccccc"
        
        # Apply the transducer
        test_output = pynini.compose(test_input, transducer)
        
        # Print the output string
        print(test_output.string())  # Expected output: 'ddddd'
        
        
        
        ###
        import pynini
        # Define a list of mappings (input -> output)
        mapping = [("hello", "hi"), ("world", "earth")]
        # Create a transducer using string_map
        fst = pynini.string_map(mapping)
        # Test the transducer
        input_str = "world"
        output_str = pynini.compose(input_str, fst).string()
        print(output_str)  # Output: 'earth'
        
        
        ###
        import pynini
        
        # Define a list of mappings (input -> output)
        mapping = [("hello", "hi"), ("world", "earth")]
        
        # Create a transducer using string_map
        fst = pynini.string_map(mapping)
        
        # Test the transducer with a multi-word string
        input_str = "hello world"
        
        # Split the input string into words
        words = input_str.split()
        
        # Apply the transducer to each word
        transformed_words = []
        for word in words:
            word_fst = pynini.accep(word)  # Create an acceptor for each word
            transformed_word_fst = pynini.compose(word_fst, fst)  # Apply the transducer
            transformed_word_str = transformed_word_fst.string()  # Convert back to string
            transformed_words.append(transformed_word_str)
        
        # Join the transformed words back into a single string
        output_str = " ".join(transformed_words)
        print(output_str)  # Should output: 'hi earth'
        
        ###
        import pynini
        # Create the transducer that defines the rewrite rule
        rewrite_rule = pynini.cross("a", "b")
        # Define the left and right contexts
        left_context = pynini.accep("x")  # 'x' must precede 'a'
        right_context = pynini.accep("y")  # 'y' must follow 'a'
        chars = ([chr(i) for i in range(1, 91)] +
                 [r"\[", r"\\", r"\]"] +
                 [chr(i) for i in range(94, 256)])
        # Apply the context-dependent rewrite rule
        sigma_star = pynini.string_map(chars).closure()
        output_fst = pynini.cdrewrite(rewrite_rule, left_context, right_context, sigma_star)
        # Define the input string
        input_fst = pynini.accep("Sumitxay")
        output = pynini.compose(input_fst, output_fst).string()
        # Convert the output FST to a string and print it
        print(output)  # Should output "Sumitxby": "a" is replaced by "b" when surrounded by "x" and "y"
        