We propose a novel audio-guided image manipulation approach for artistic paintings that generates semantically meaningful latent manipulations given an audio input. To the best of our knowledge, our work is the first to explore generating semantically meaningful image manipulations from various audio sources. Our proposed approach consists of two main steps. First, we train a set of encoders, one per modality (i.e., audio, text, and image), to produce matched latent representations. Second, we use direct code optimization to modify a source latent code in response to a user-provided audio input. This methodology enables various manipulations of artistic paintings conditioned on driving audio inputs, such as wind, fire, explosion, thunderstorm, rain, folk music, and Latin music.
Submitted to NeurIPS Machine Learning for Creativity and Design Session
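
The second step, direct code optimization, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the generator and image encoder are collapsed into a single hypothetical linear map `W`, the audio embedding is a random vector standing in for the output of a trained audio encoder, and the objective (cosine distance plus an L2 penalty that keeps the code near the source) along with the names `optimize_latent`, `lam`, and `lr` are assumptions chosen for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def optimize_latent(w_src, audio_emb, W, steps=300, lr=0.05, lam=0.01):
    """Gradient descent on a latent code w, starting from w_src.

    Assumed loss: (1 - cos(W @ w, audio_emb)) + lam * ||w - w_src||^2.
    W is a hypothetical stand-in for "generate an image, then embed it".
    """
    w = w_src.copy()
    a_norm = np.linalg.norm(audio_emb)
    for _ in range(steps):
        e = W @ w                      # embedding of the current latent code
        e_norm = np.linalg.norm(e)
        cos = e @ audio_emb / (e_norm * a_norm)
        # Analytic gradient of the cosine similarity with respect to e.
        grad_cos_e = audio_emb / (e_norm * a_norm) - cos * e / e_norm**2
        # Chain rule through W, plus the gradient of the proximity penalty.
        grad = -W.T @ grad_cos_e + 2.0 * lam * (w - w_src)
        w -= lr * grad
    return w


# Toy data: all values are synthetic placeholders, not real embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))        # latent dim 16 -> embedding dim 8
w_src = rng.normal(size=16)         # source latent code of a painting
audio_emb = rng.normal(size=8)      # embedding of the driving audio

w_new = optimize_latent(w_src, audio_emb, W)
before = cosine(W @ w_src, audio_emb)
after = cosine(W @ w_new, audio_emb)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In a real pipeline the linear map would be replaced by a pretrained generator followed by the image encoder, and the gradient would come from automatic differentiation rather than the hand-derived expression used here. The penalty weight `lam` trades off how strongly the audio drives the edit against how much of the source painting is preserved.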