[Figure: conceptual overview of TimbreCLIP]

High-level overview of how TimbreCLIP works. One encoder takes text and the other takes audio of single-instrument notes. Both modalities are projected into a shared latent space, and the encoders are trained so that text and audio that belong together map to nearby points in that space.
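The objective described in the caption is a CLIP-style contrastive loss over matched text-audio pairs. Below is a minimal sketch of such an objective, assuming PyTorch; the function name, embedding normalization, and temperature value are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a CLIP-style contrastive objective for text-audio pairs.
# Hyperparameters (temperature, embedding dim) are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb: torch.Tensor,
                          audio_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched (text, audio) pairs.

    text_emb, audio_emb: (batch, dim) embeddings from the two encoders,
    where row i of each tensor comes from the same instrument note.
    """
    # Normalize so the dot product is cosine similarity in the shared space.
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares text i to audio j.
    logits = text_emb @ audio_emb.t() / temperature

    # Matching pairs sit on the diagonal, so the target for row i is class i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy pulls matched pairs together and pushes
    # mismatched pairs apart in both retrieval directions.
    loss_t2a = F.cross_entropy(logits, targets)
    loss_a2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2a + loss_a2t)
```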

I) Text-driven EQ parameterisation

II) Timbre-to-image generation

III) Text attributes used in training