
Better SVG generative models! Vector image generation models lag far behind raster image models, as of January 2024. If one wants a good vector image, it works better to generate a raster image and then use a vectorizer to convert it.

Why are vector images harder? They seem to be hard for two reasons. First, vectors, to be useful, tend to be highly symbolic, modularized, and programming-language-like.

It is trivial to convert a raster image into a degenerate vector image with one vector shape per pixel, or to do something similar like differentiably optimizing thousands of vector curves into an approximation of the pixel image, but this would generally defeat the point: it wouldn’t upscale/downscale well, and it wouldn’t have meaningful semantic chunks, like a background, which could be deleted or edited by a human. (Whereas a pixel image can be made out of blobs and textures smashed together until they look good, with no expectation that the image be made of discrete modular parts.) Quite aside from the basic challenge of writing out thousands of SVG text tokens without any syntax errors or other issues, writing out a good vector image seems more challenging, and on a higher semantic level, than a raster equivalent.
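
To make the degenerate case concrete, here is a minimal sketch of the per-pixel conversion (function name and input format are made up for illustration). The output is syntactically valid SVG, but it has none of the modularity that makes vectors worth having:

```python
def raster_to_degenerate_svg(pixels, cell=1):
    """Convert a 2D grid of RGB tuples into a 'degenerate' SVG:
    one <rect> per pixel. Valid SVG, but it defeats the point of
    vectors: no modular parts, no semantics, no clean rescaling."""
    h, w = len(pixels), len(pixels[0])
    rects = []
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            rects.append(
                f'<rect x="{x*cell}" y="{y*cell}" width="{cell}" '
                f'height="{cell}" fill="rgb({r},{g},{b})"/>'
            )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{w*cell}" height="{h*cell}">'
            + "".join(rects) + "</svg>")

# A 2x2 image already becomes 4 <rect>s; a real photo becomes
# millions of tokens of unstructured markup.
svg = raster_to_degenerate_svg([[(255, 0, 0), (0, 255, 0)],
                                [(0, 0, 255), (0, 0, 0)]])
```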

Secondly, and related to the previous difficulty, complicated high-quality vector images are hard to make and scarce compared to raster images. Anyone can pick up their smartphone and shoot a high-quality photograph of something in front of them, like their cat; but to produce a vector illustration of said cat is quite another thing. Even producing a simple UI icon for a web page or an app is a fairly specialized skill. So, it’s no surprise that while it’s easy to scrape billions upon billions of images from the Internet, vector datasets remain in the low millions, generally, and have severe quality & diversity issues. (Almost all of them will be icons, for example, while the ones which represent large complex scenes or illustrations may be poorly auto-converted from raster images and liabilities for training.)

And if there is a dearth of SVG files, then there are even fewer SVGs with adequate text captions describing them. So we are in a bad place for any ‘text2vector’ generative model. We have so many raster images, many raster images’ text captions, some SVGs, and just a few SVGs’ text captions. And we want to go from the least common to the second-least common type, where that type is also an intrinsically difficult type of data.

It is no wonder that vector generative models tend to be highly limited, low-quality, or aim at other tasks like converting raster to SVG.

My suggestion for handling these problems is to generalize & scale: create a single generative model which handles all these modalities and translates between them, treating generation as an autoregressive sequence modeling problem, like DALL·E 1 & CogView.

The original DALL·E was a GPT-3 trained on sequences of text tokens concatenated with image ‘tokens’, so it learned to take a text description and predict the image tokens of the corresponding image. CogView noted that there was no reason one could not simply reverse this: instead of [TEXT, IMAGE], [IMAGE, TEXT], which yields an image captioner. And since this is the same domain, one can use the same model for both and share the knowledge. (Just train it on both ways of formatting the data.) And one could keep on adding modalities to this. One could add audio of people describing images, and train on [AUDIO, TEXT, IMAGE] + [TEXT, AUDIO, IMAGE] + etc. Now you could record someone talking about what they see, predict the text caption, and then predict the image.
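
The [TEXT, IMAGE] / [IMAGE, TEXT] formatting is simple to sketch. The token ids and separator convention here are invented for illustration; in a real setup the text tokenizer and an image quantizer (eg. a VQ-VAE) would supply them:

```python
# Hypothetical special-token ids; a real tokenizer would define these.
BOS, SEP = 0, 1

def make_sequences(text_tokens, image_tokens):
    """Format one (text, image) pair both ways, DALL-E/CogView style.
    Training on both orderings teaches a single model to act as both
    a text->image generator and an image captioner."""
    return [
        [BOS] + text_tokens + [SEP] + image_tokens,   # [TEXT, IMAGE]
        [BOS] + image_tokens + [SEP] + text_tokens,   # [IMAGE, TEXT]
    ]

seqs = make_sequences([10, 11, 12], [500, 501])
```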

This is particularly useful because some modalities can be easier than others: we may have much more of one kind of data, or be able to construct pairs of data. We can apply various kinds of roundtrip techniques like backtranslation or data augmentation, or exploit synthetic data to reverse a direction. (For example, we could feed a dataset of text into a voice synthesizer to get audio descriptions to train on.)

In the case of vector generation, we would train on all permutations of [SVG, raster image, text caption]. The benefit here is that it covers all of our desired use-cases like SVG → image or text → SVG, and enables bootstraps and synthetic data. For example, if we have a random unlabeled SVG, we can still train on it directly, and we can also fill it out: given an SVG, we can create its raster image easily, and we can then use an image captioner to caption the raster image. Now we have all the pairs.
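
A sketch of the permutation trick, assuming three already-tokenized modalities; the modality-marker tokens are invented for illustration, and in practice a rasterizer and an image captioner would fill in whichever modalities an SVG lacks:

```python
from itertools import permutations

def training_sequences(svg_tokens, image_tokens, text_tokens):
    """Emit all 6 orderings of [SVG, image, text], each modality
    prefixed by a marker token, so that one autoregressive model
    covers every direction: text->SVG, SVG->image, image->text, etc."""
    modalities = {"svg": svg_tokens, "img": image_tokens, "txt": text_tokens}
    seqs = []
    for order in permutations(modalities):
        seq = []
        for name in order:
            seq.append(f"<{name}>")   # modality-marker token
            seq.extend(modalities[name])
        seqs.append(seq)
    return seqs

seqs = training_sequences(["M", "0,0"], ["p1"], ["a", "cat"])
```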

We can improve the generator by roundtrips: SVG → image → SVG should yield near-identical SVGs, and vice-versa. We can especially exploit synthetic data: we can superimpose SVG images on top of each other in specific relationships, and learn to generate them unconditionally (which would encourage the generator to learn to cleanly write separate objects, even for extremely complicated or cluttered scenes), render them into images and train to generate the original SVG, or generate a text description and SVG (and rendered image) all together.

For example, one could imagine a little Euclidean domain-specific language, perhaps like the infamous SHRDLU ‘blocks world’, which generates sets of blocks like “a triangle on top of a cylinder to the left of a sphere”; one can render that scene as an SVG, and this would help with the sort of relational reasoning that existing generative models often struggle with when you ask for ‘A to the left of B’. This can be arbitrarily augmented: add an SVG object of a growling lion, for example, and now you can have “a lion behind 3 blocks”.

We can use all sorts of transformations which ought to commute or be identity functions—eg. corrupt the SVGs or images, and roundtrip through an image. (‘Corruption’ here can include lossy transformations like upscaling & downscaling: if I resize a 1024px PNG to 256px, they should yield nearly-identical SVGs, and if I grayscale it, it should yield something perceptually & textually similar to the SVG of the color image which has been grayscaled in SVG-space.)
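
A toy version of such a blocks-world DSL, to show how cheap matched (caption, SVG) pairs are to mass-produce; shapes, colors, and coordinates are all placeholder choices. Because ‘left of’ is true by construction, every pair supervises exactly the relational reasoning that generative models tend to get wrong:

```python
import random

def blocks_scene(rng=random.Random(0)):
    """Sample a tiny two-object scene and emit a matched (caption, SVG)
    pair. The spatial relation in the caption holds by construction:
    the first shape is always drawn to the left of the second."""
    shapes = ["circle", "square", "triangle"]
    colors = ["red", "green", "blue"]
    a, b = rng.choice(shapes), rng.choice(shapes)
    ca, cb = rng.sample(colors, 2)
    caption = f"a {ca} {a} to the left of a {cb} {b}"

    def draw(shape, color, x):
        if shape == "circle":
            return f'<circle cx="{x}" cy="60" r="20" fill="{color}"/>'
        if shape == "square":
            return f'<rect x="{x-20}" y="40" width="40" height="40" fill="{color}"/>'
        return f'<polygon points="{x},40 {x-20},80 {x+20},80" fill="{color}"/>'

    svg = ('<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">'
           + draw(a, ca, 50) + draw(b, cb, 150) + "</svg>")
    return caption, svg

caption, svg = blocks_scene()
```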

Or indeed, why not train on more than one ‘SVG’ modality? We can define arbitrarily many ‘kinds’ of ‘SVG’: ‘SVG’ as generated by specific tools like potrace, or after going through an optimizer/minifier tool, or as generated by previous versions of the model. (Simply prefix all the metadata to the tokens.) So one could train the model to generate [SVG, minified-SVG], or [syntactically-wrong SVG, SVG]. Broken or bad SVGs can be generated automatically by adding noise to SVGs, but also collected from the model’s own errors during training; and if a user submits an example of a text prompt which yielded a bad SVG along with the intended good SVG, one can train on [text, good-SVG] but also [text, bad-SVG, good-SVG].
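
A minimal sketch of automatic corruption for manufacturing [bad-SVG, good-SVG] repair pairs; the two corruptions here (dropping a closing bracket, jittering coordinates) are just the cheapest examples, and the marker tokens are invented:

```python
import random
import re

def corrupt_svg(svg, rng=random.Random(0)):
    """Synthesize a broken or degraded SVG from a good one, for
    training [bad-SVG, good-SVG] pairs. Either introduce a syntax
    error by dropping a closing '/>', or jitter numeric coordinates."""
    if rng.random() < 0.5:
        i = svg.rfind("/>")
        return svg[:i] + svg[i+2:] if i != -1 else svg
    return re.sub(r"\d+",
                  lambda m: str(int(m.group()) + rng.randint(-5, 5)),
                  svg, count=3)

good = '<svg><rect x="10" y="10" width="40" height="40"/></svg>'
bad = corrupt_svg(good, random.Random(1))      # seeded for reproducibility
pair = ("<bad-svg>", bad, "<svg>", good)       # marker tokens prefixed
```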

Obviously, the rendered images provide useful training signal to the model trying to generate SVGs by informing it how the SVG should look, but because we can encode more metadata, we could go beyond simply presenting the pairs or triplets to the autoregressive model. We could, for example, include human ratings of image quality, as is now standard in preference-learning. But we can go even further, and use a pre-existing image model like CLIP to add in metadata about how far away from the intended image a generated SVG is, by running CLIP on the rendered image of that SVG versus CLIP on the ground-truth image, and then encoding that distance into the metadata to condition on. (“Here is an SVG, which yields an image which doesn’t look anything like it’s supposed to, it’s off by 0.35 in embedding space. On the other hand, here’s an SVG which looks nearly-identical to what it’s supposed to look like, a distance of 0.00.”)

This might be particularly useful for tricky SVG tasks: if we generate a hundred possible SVGs, and they all fail to yield the target image, that’s not much of a training signal; but if they all turn into new training data, and they all include an objective absolute measure of how far away from the target image they were, that is rich feedback which will improve future attempts; and this could be done repeatedly. (This would potentially enable the model to slowly bootstrap itself up to a specific image, by trying out many possible SVGs, training on the results after being run through the SVG renderer, and trying slightly better SVGs the next time, until it ‘predicts’ an SVG distance of 0.00 and is correct.)
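
A sketch of the distance-conditioning idea. To keep it self-contained, a toy byte-histogram embedding stands in for a real image encoder like CLIP, and the `<dist=…>` token format is invented; only the shape of the scheme matters (embed both images, measure the gap, quantize it into a metadata token to prefix the sequence, then ask for `<dist=0.00>` at sampling time):

```python
import math

def embed(image_bytes):
    """Stand-in for a real image encoder like CLIP: a normalized
    byte-histogram, so the sketch runs without any model weights."""
    v = [0.0] * 8
    for b in image_bytes:
        v[b % 8] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def distance_token(rendered, target):
    """Embed the rendered SVG output and the ground-truth image,
    measure how far apart they landed, and quantize that into a
    metadata token the model conditions on: every failed attempt
    becomes training data labeled with exactly how wrong it was."""
    d = math.sqrt(sum((a - b) ** 2
                      for a, b in zip(embed(rendered), embed(target))))
    return f"<dist={min(d, 1.99):.2f}>"

tok = distance_token(b"rendered image bytes", b"rendered image bytes")
# identical images embed identically, so the token is <dist=0.00>
```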