(If you read fast)
In which we will learn:
- WTF a token is
- Why each Stable Diffusion prompt gets 77 tokens but you only get 75
- Some of the horrible things that can happen if you run out of tokens
- Why commas in prompts are stupid and what to use instead
- How Styles use tokens in WOMBO’s Dream app & Discord bot Wombot and what happens to those styles as they run out of tokens.
- Some artists’ names that use just one token and one that uses 30
A warning
to those of you that stuff your prompts full of “512k, ultramegasuperdetailed, hyperrealistic, in the style of Ilya Kuvshinov, William-Adolphe Bouguereau, and King Kamehameha, trending on Country Living and Metal Injection”: Stable Diffusion and all the apps that use its API, like Wombot and Dream by WOMBO, have a maximum of 77 tokens available per prompt. It takes one token to start the prompt and one to end it, so a user really only gets 75 to play with. And you just spent more than 40 of them.
Oh, and be careful how you use “in the style of Ilya Kuvshinov.” That name may be a teeny tiny bit cursed.
King Kamehameha is cool, though.
But WTF is a token?
Simply put, a token is a word or part of a word in a prompt. As of version 1.5, Stable Diffusion had a list of approximately 30,000 words that it knew. If your prompt contains a word that is on that list, that word will use one token. When you use a word that is not on that list, Stable Diffusion breaks the word up into chunks until it gets to words it knows, like “Art” and “Station” for “Artstation”, or short word parts called morphemes, like “rut,” “kow,” and “ski” for prompter favorite “Greg Rutkowski.” Often, it will use a mix of words and morphemes, as in “hippo,” “potom,” “onst,” “roses,” “quip,” “pedal,” “io,” and “phobia” for hippopotomonstrosesquippedaliophobia, the fear of long words. You can find the list of all 50,000 of Stable Diffusion’s one-token words, morphemes, phrases, and symbols here.
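To make the chunking idea concrete, here is a toy sketch in Python. Big caveats: the vocabulary below is a tiny made-up stand-in for the real list, and the greedy longest-match splitting is a simplification swapped in for illustration (the real tokenizer works from byte-pair-encoding merge rules). The names `TOY_VOCAB`, `toy_tokenize`, and `count_tokens` are all hypothetical.

```python
# Toy vocabulary: a made-up stand-in for the real ~50,000-entry list.
TOY_VOCAB = {"greg", "rut", "kow", "ski", "art", "station", "picasso"}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split one word into the longest known chunks, left to right.

    This is greedy longest-match, NOT the real byte-pair-encoding
    algorithm; it only illustrates "break it up until you hit
    pieces you know".
    """
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining chunk first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matched: emit a single character.
            pieces.append(word[start])
            start += 1
    return pieces


def count_tokens(prompt, vocab=TOY_VOCAB):
    """Total token count for a whitespace-separated prompt."""
    return sum(len(toy_tokenize(word, vocab)) for word in prompt.split())


print(toy_tokenize("Artstation"))      # ['art', 'station']
print(count_tokens("Greg Rutkowski"))  # 4
print(count_tokens("Picasso"))         # 1
```

A word that is on the list costs one token; anything else costs as many tokens as it takes to spell it out of known pieces.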
What gets to be on the list would be worth a post all to itself. There is an endless* number of sports teams, TV shows, and special occasions on the list. Someone included every Premier League football team (the round-ball kind; sorry, no Steelers, 49ers, or Redskins on the list). And I would love to meet the person who added “WineWednesday,” “WomanCrushWednesday,” and “WynonnaEarp” to Stable Diffusion’s “These Are The 30,000 Most Important Things In The World” list. I bet I can convince her to put ToplessVelociraptor on the next list. I can definitely convince whoever put “WeTheNorth” on the list. That is some serious Raptor love there. And it explains why I keep accidentally generating images of a shirtless Chris Boucher. Not that I’m complaining.
*Endless, in this case, being the rabbit concept of “hrair,” or any number greater than 4.
Full names usually use at least two tokens
and most get broken into more. Depending on how famous they are, a few artists and celebrities only need one. For instance, prompter favorite “Greg Rutkowski” requires four tokens, for “Greg,” “Rut,” “kow,” and “ski,” but “Michelangelo,” “Leonardo,” and “Raphael” only need one each. At the other extreme, “Pablo Diego José Francisco de Paula Juan Nepomuceno María de los Remedios Cipriano de la Santísima Trinidad Ruiz y Picasso” takes thirty tokens. (“Picasso” by itself only takes one, though.)
Sucks to be Donatello.
Only a few celebrities have one-token names on this list. Those with one name like Aaliyah and Zendaya use one token, but superstars like “elonmusk,” “priyankachopra,” and “kimkardashian” only need one token as well. Scarlett Johansson, Ryan Reynolds, and Emma Watson all require two tokens each. And there are many celebrities, artists, and even words that are not on the official list, but that Stable Diffusion still knows.
If you want a picture of “Charlize Theron and Djimon Hounsou, wearing Edwardian clothes in the style of William-Adolphe Bouguereau,” Stable Diffusion can make one, even though the only name on the one-token list out of all of those is “William.” It might decide to give Mr. Hounsou boobs, though. Apparently, Bouguereau didn’t draw a lot of men.
(I tried as best as I could to generate a boob-less Hounsou. The AI just kept making him more and more feminine. At some point, he went full-on woman. I tried to counteract this by specifying that he is bald. So of course, the AI gave me this. And I have to say, I’m not sad about it.)
One easy way to save tokens
is to write numbers as words. Numbers use one token per digit – “ten billion” only uses two tokens, but 10000000000 uses eleven. And “ten billion dollars” uses three tokens, but “$10,000,000,000” uses fifteen, because punctuation marks and symbols like “$” and “@” use a token each as well. This matters a lot for prompt engineering, because many people, not knowing any better, use commas (or sometimes dashes) to separate the phrases in their prompts. I certainly used to.
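The arithmetic above can be checked with a rough estimator. This is only a sketch of the rule of thumb as stated here, not the real tokenizer: it assumes every plain word is on the one-token list, and charges one token for each digit and each symbol. The function name `estimate_tokens` is made up for this example.

```python
import re

# One match per estimated token:
#   a run of letters (assumed to be a known one-token word),
#   OR a single digit, OR a single punctuation mark / symbol.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d|[^\w\s]|_")


def estimate_tokens(text):
    """Rough token estimate under the rule of thumb described above."""
    return len(TOKEN_PATTERN.findall(text))


for sample in ("ten billion", "10000000000",
               "ten billion dollars", "$10,000,000,000"):
    print(f"{sample!r}: {estimate_tokens(sample)}")
# 'ten billion': 2
# '10000000000': 11
# 'ten billion dollars': 3
# '$10,000,000,000': 15
```

Real counts will be higher for any word that is not on the list, since that word gets split into several pieces.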
Here is an example of how I used to write prompts: “a painting of a beautiful woman, long brown hair, blue eyes, wearing a sundress, smiling and laughing, in the Neoclassical, Academic style of John William Waterhouse, John William Godward, and William-Adolphe Bouguereau.” This prompt is forty-eight tokens long, but it uses more than twenty percent of those tokens on punctuation: eight tokens for commas and one each for the dash in “William-Adolphe” and the period at the end. Stable Diffusion ignores all of these. It can understand “a painting of a beautiful woman long brown hair blue eyes wearing a sundress smiling and laughing in the Neoclassical Academic style of John William Waterhouse John William Godward and William Adolphe Bouguereau” just as well.
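For anyone who wants to check the punctuation math, here is a small Python sketch. It only counts punctuation characters in the example prompt; the forty-eight-token total is taken from the text above rather than computed, and `punctuation_count` is a name invented for this example.

```python
import string

PROMPT = (
    "a painting of a beautiful woman, long brown hair, blue eyes, "
    "wearing a sundress, smiling and laughing, in the Neoclassical, "
    "Academic style of John William Waterhouse, John William Godward, "
    "and William-Adolphe Bouguereau."
)
TOTAL_TOKENS = 48  # figure quoted in the text, not computed here


def punctuation_count(prompt):
    """Count punctuation characters, each assumed to cost one token."""
    return sum(1 for ch in prompt if ch in string.punctuation)


wasted = punctuation_count(PROMPT)
share = wasted / TOTAL_TOKENS
print(f"{wasted} of {TOTAL_TOKENS} tokens ({share:.0%}) go to punctuation")
# 10 of 48 tokens (21%) go to punctuation
```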
Still, that prompt is difficult for humans to read. Here is another option:
Painting of beautiful woman
long brown hair
blue eyes
sundress
smiling laughing
by John William Waterhouse John William Godward William Adolphe Bouguereau
Anytime you would put a comma or other punctuation
Just hit return instead
It keeps prompts organized
And Stable Diffusion sees returns as spaces
So they don’t use tokens
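If you have a pile of old comma-riddled prompts, the swap is easy to automate. The sketch below is one possible way to do it in Python, with made-up helper names; the second function simply collapses all whitespace to single spaces to mimic the claim above that returns are treated as spaces, which is an assumption about the model's preprocessing rather than something verified here.

```python
import re


def commas_to_returns(prompt):
    """Replace commas, semicolons, and periods with line breaks."""
    parts = re.split(r"[,;.]", prompt)
    # Drop empty pieces (e.g. after a trailing period) and stray spaces.
    return "\n".join(part.strip() for part in parts if part.strip())


def as_single_line(prompt):
    """Collapse returns and repeated spaces into single spaces."""
    return " ".join(prompt.split())


old_style = "beautiful woman, long brown hair, blue eyes."
new_style = commas_to_returns(old_style)
print(new_style)
# beautiful woman
# long brown hair
# blue eyes
print(as_single_line(new_style))
# beautiful woman long brown hair blue eyes
```

Note that this splits on every period, so a prompt containing something like "1.5" would need a gentler pattern.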
You probably also noticed that several of the words in the original prompt are missing. Stable Diffusion doesn’t need articles like “a” or “the,” or conjunctions like “and.” It is also smart enough to figure out that “sundress” is something you wear, and that all these painters use a Neoclassical Academic art style. Using “by” instead of “in the style of” saves another three tokens. Don’t leave out the “by,” or you’ll get a picture of three old dudes who paint, wearing sundresses.
To save still more tokens, you could remove “painting of,” because Stable Diffusion knows you want a painting when you ask for an image by painters; the first and middle names of the painters themselves, because their last names are unique; “long brown hair” because nearly all of the women in Neoclassical paintings have long brown hair; and “laughing,” because I couldn’t for the life of me get any of these women to laugh. You can even take out “woman” – Stable Diffusion can figure out that you want an image of a woman just from “beautiful blue eyes sundress.”
Removing all unnecessary words and punctuation, the prompt
Beautiful
Blue eyes
Sundress
Smiling
By Waterhouse Godward Bouguereau
uses only thirteen tokens. Quite a drop from the original forty-eight!
And yes, I realize that “Blue eyes” was about as useful as “laughing.” Like long brown hair, the vast majority of neoclassical paintings’ subjects had brown eyes. But I left it in anyways – because
1. it made the dress a lovely shade of blue instead,
2. just mentioning the eyes made them prettier, and
3. I was afraid that without mentioning her eyes, I would get a beautiful smiling sundress, and that sounded creepy AF.
Why are short prompts important?
The shorter the prompt, the faster the render. And the more room you have to add in details. What if you wanted a red dress? Or wanted your subject in a library overlooking a beach? Or you absolutely needed her to have blue eyes? Now you have room to brute force them into being: just keep repeating “Blue eyes” until you get some. You can say it another 31 times without going over the 75-token limit.
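That “31 times” is just leftover-budget arithmetic, sketched here in Python. The constants come from the numbers in this post (77 total tokens, 2 reserved, a 13-token prompt, 2 tokens for “Blue eyes”); the function name `repeats_that_fit` is invented for the example.

```python
MAX_TOKENS = 77               # total window per prompt
USER_TOKENS = MAX_TOKENS - 2  # minus the start and end tokens


def repeats_that_fit(tokens_used, phrase_cost, budget=USER_TOKENS):
    """How many more copies of a phrase fit in the remaining budget."""
    remaining = max(budget - tokens_used, 0)
    return remaining // phrase_cost


# 13-token prompt, and "Blue eyes" costs 2 tokens per repeat.
print(repeats_that_fit(tokens_used=13, phrase_cost=2))  # 31
```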
If you do use more than 75 tokens, things will start to go poorly for you. Tacking on a long list of so-called “beautifying phrases” like “in the style of Greg Rutkowski, trending on Artstation and CGSociety” at the end of your prompts when you don’t have the necessary 16 tokens left for it can be a waste of time (at best), can knock something crucial out of the rest of the prompt (at worst), or can cause the prompt to go wrong in unpredictable ways (at weirdest). Who cares if your velociraptor looks like Greg Rutkowski painted it if that means that the crucial earlier phrase “stunning enormous goazongas” might get ignored?
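Here is a sketch of the simplest of those failure modes, the one where the overflow just gets thrown away. It assumes plain truncation at the end of the prompt, which is only one possible behavior; the apps discussed here may handle overflow differently. `fit_to_window` is a hypothetical helper, and the token lists are ordinary Python strings standing in for real tokens.

```python
START, END = "<|startoftext|>", "<|endoftext|>"


def fit_to_window(tokens, window=77):
    """Frame a token list with start/end markers, cutting off any overflow.

    Returns (framed_tokens, dropped_tokens). Assumes simple truncation
    of whatever does not fit, which is an illustration, not a spec.
    """
    room = window - 2  # two slots are reserved for START and END
    kept, dropped = tokens[:room], tokens[room:]
    return [START] + kept + [END], dropped


# A made-up 70-token prompt with a 16-token "beautifying" tail tacked on.
prompt_tokens = [f"word{i}" for i in range(70)]
tail_tokens = [f"tail{i}" for i in range(16)]

framed, dropped = fit_to_window(prompt_tokens + tail_tokens)
print(len(framed))   # 77
print(len(dropped))  # 11
```

In this version only the tail suffers, which is the “waste of time” case. The nastier cases, where something earlier in the prompt gets knocked out, are not modeled here.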
In most ways, WOMBO’s Dream app
and Wombot Discord bot are simpler than Stable Diffusion. The only thing that’s possibly more complicated is how prompts and styles interact. Where Stable Diffusion has you pick a model, the Dream app gives every image a prompt and a style. Your end image is the result of
- 1. Your prompt and
- 2. How the style you are using changes the image generated by that prompt.
Dream has a built-in failsafe against over-tokening.
Users get 200 characters per prompt. It is still technically possible to go over 75 tokens by, say, using nothing but two- and three-character lexemes (dang it, I can’t think of a two- or three-letter word for “letter” or “word”) throughout the entirety of your prompt, but that is trickier than it sounds.
Scratch that. After many, many complaints, WOMBO caved and increased the prompt length in Dream to 350 characters. And people are still complaining. I try to tell them that unless they are unceasingly utilizing extraordinarily elongated lexemes, their commandment will exhaust its supply of tokens a considerable duration prior to the consummation of their maximum character allotment. Especially because – and I’ll get to this later – styles are token thieves. But first, let’s talk about Wombot.
Wombot has More Style than Dream
On WOMBO’s Discord server, their art bot Wombot adds a style to your prompts just like on the Dream app, but it can use two kinds of styles: the basic ones provided by WOMBO, or optional custom styles built by users on top of the official styles. At last count, there were just over ten bajillion custom styles, ranging from several hundred takes on anime to dozens of futuristic cyberpunk apocalypse styles to a couple of styles that make Funko Pops. There is even one style – and only one, AFAIK – that turns everything it touches into topless velociraptors.
With a custom style, your end image is the result of
- 1. Your prompt and
- 2. How the base WOMBO style changes the image generated by that prompt AND
- 3. How the custom style changes the image generated by THAT prompt.
For each style, the image created in the step before it is used as an input image, and your prompt is used as the prompt. Custom styles can have as many as three steps within each style. At each step, an image is generated, and becomes the input image for the next step’s prompt. Each step gets its own style and prompt – you could start by generating a Neoclassical woman in a blue dress in the first step, and then make her into a cartoon, or a steampunk engineer, or a velociraptor in the second or third step.
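The chaining described above can be sketched as a simple loop. To be clear, this is a guess at the shape of the process based only on the description here, not Wombot's actual code: `run_custom_style`, `fake_generate`, and the three-step cap as a constant are all stand-ins, and the “images” are just strings so the example can run anywhere.

```python
MAX_STEPS = 3  # custom styles can have as many as three steps


def run_custom_style(step_prompts, generate):
    """Feed each step's output image into the next step as its input."""
    if not 1 <= len(step_prompts) <= MAX_STEPS:
        raise ValueError(f"expected 1 to {MAX_STEPS} steps")
    image = None  # the first step starts with no input image
    for prompt in step_prompts:
        image = generate(prompt, image)
    return image


def fake_generate(prompt, input_image):
    """Stand-in for a real image generator: returns a description."""
    if input_image is None:
        return f"[{prompt}]"
    return f"[{prompt} <- {input_image}]"


result = run_custom_style(
    ["Neoclassical woman in a blue dress", "cartoon"],
    fake_generate,
)
print(result)
# [cartoon <- [Neoclassical woman in a blue dress]]
```

Because every step has its own prompt, every step also has its own chance to run out of tokens.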
If you run out of tokens while using a custom style,
the style stops working and reverts to whichever of WOMBO’s styles it is based on. Interestingly, this doesn’t happen all at once.
This is a style I created based on WOMBO’s 3D Anime-v2 style and the 2D art style of 80’s pop art icon Patrick Nagel, the greatest American artist ever to be killed by aerobics. I named it, imaginatively, “Anime+Nagel v2.”
The pictures look less and less like Nagel’s 2D style and more and more like Anime-v2’s realistic 3D style as the prompt uses more tokens. Once it uses the custom style’s allotment of tokens, the image almost entirely abandons Nagel in favor of Anime-v2. If the prompt keeps burning through tokens, Anime-v2 will stop working as well, and the resulting image will look almost entirely realistic.
2 tokens
4 tokens
(This is where things really start to break down. From here on out, it’s all 3D, Anime-v2 until that runs out of tokens as well.) 15 tokens.
Notice that 44 tokens < 75 tokens
In Wombot and Dream, you almost never get your full 75 tokens. There are a few styles – notably Realistic-V2, Buliojourney-V2, and possibly Illustrated-V2 – that are “pure” styles straight from Stable Diffusion, and supposedly do not use up any tokens. I have been told that Anime-V2 is a pure style as well, and maybe it is on Dream, but it definitely uses up tokens on Wombot. All of WOMBO’s other styles – and especially custom styles – use tokens just as if they were prompts. When I make a complicated style, my prompts have to be much simpler or the style will stop working the way the Nagel style above stopped working. The more token-heavy the style, the faster it will lose influence. For example, take my custom style Ooze, which makes translucent blue slimewomen. Wombot really doesn’t understand the concept of translucent blue people, and needs to be told separately that their skin, hair, face, and eyes are all both blue and translucent. This takes so many tokens that if the prompt asks for more than one piece of clothing, the style will start to break, and the translucent blue slimewoman will start to become a normal, opaque, non-blue woman.
style: ooze-anime
style: ooze-anime
style: ooze-anime
As you can see, Slimewoman in a Pond seems completely translucent. Wearing a sundress and hat, she is still blue, but her skin is decidedly opaque. Her hair, hat, and sundress are still translucent. Add gloves and sunglasses to that, and Woman in a Pond loses her blue skin and hair entirely. Only her clothes remain translucent. It’s a good thing slimewomen don’t need clothes!
Does that clear things up?
Do you still have questions about how tokens work? Want to know what all the weird symbols at the bottom of Stable Diffusion’s token list are? Leave a comment, and when I figure it out myself, I’ll let you know. I know the symbols are letters that English doesn’t use, like the Spanish letter “ñ” (enye, pronounced like the “ny” in “canyon”), or letters that we used to use and should bring back, like “ð” and “þ” (eth and thorn, pronounced like the “th” in “the” and the “th” in “think,” respectively), and most of them seem to be Stable Diffusion’s weird translation of Unicode. I know for sure (please don’t ask me how) that “ðŁIJ” is the emoji for “horse,” and I think “âĿ¤ï¸ıâĿ¤ï¸ıâĿ¤ï¸ıâĿ¤ï¸ı” summons an Elder God, but I can never remember which one. If you can figure out any of the rest, let me know in the comments!