High-level overview of how TimbreCLIP works. One encoder takes text and the other takes audio of single instrument notes. Both modalities are projected into a shared latent space. The encoders are trained so that text and audio that belong together are projected to nearby points in that space.
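
As a rough illustration of the training objective described in this caption, the sketch below computes a symmetric contrastive loss over a batch of paired text and audio embeddings. It is an assumption that TimbreCLIP uses this CLIP-style loss; the caption only states that matching pairs should end up close in the latent space. The function names, the temperature value of 0.07, and the random vectors standing in for encoder outputs are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    # Row i of text_emb and row i of audio_emb are assumed to belong together.
    t = l2_normalize(text_emb)
    a = l2_normalize(audio_emb)
    logits = t @ a.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(t))
    # Text-to-audio: each text should pick out its own audio clip (rows).
    loss_t2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Audio-to-text: each audio clip should pick out its own text (columns).
    loss_a2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2a + loss_a2t)

# Stand-ins for encoder outputs (hypothetical): 4 pairs in an 8-dim latent space.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
matched_text = audio + 0.01 * rng.normal(size=(4, 8))  # pairs that lie close together
shuffled_text = matched_text[::-1]                      # pairs deliberately misaligned

loss_matched = symmetric_contrastive_loss(matched_text, audio)
loss_shuffled = symmetric_contrastive_loss(shuffled_text, audio)
```

In this toy setup the loss is lower when paired embeddings lie close together than when the pairing is scrambled, which is the property a gradient-based optimizer would exploit when updating the two encoders.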