Using Interrogation to Get Better Results


In my last blog I talked about character consistency through the development of a character sheet in Stable Diffusion. That blog, my first on the topic, surprisingly generated a few thousand views and tons of great questions. Are there ways to refine results? What about the body? Expressions? How to limit variations? Why not just use ROOP? Why not just individual generated images?

The feedback highlighted that there is a broad range of experience levels when it comes to using AI for image generation, from absolute beginners to the technically advanced. Since I didn’t want that blog to be too long, I made a lot of assumptions and cut a few corners, figuring folks would experiment and ‘fill in the gaps’ on their own. At the same time, the community interaction helped me refine my process, which was great! Thank you.

Needless to say, with such a broad range of experience levels, instead of jumping right into an advanced topic like ‘training a LoRA’, I thought I’d create an interim topic that I think is super helpful: using interrogation to get the results you want.

What is… and Why Interrogation?

Interrogation, or captioning, helps us refine the prompts we use by showing us how the AI system tags and classifies an image, and which terms it uses to do so. By looking at those terms we can steer our images closer to the concept we have in mind, or strip unwanted traits out via negative prompts.

The main reason we need this is ‘bias’, an inherent challenge of AI models. When training AI models we want to minimize bias, but it creeps in through the subjectivity of the terms we use. When we put a term like ‘ugly’ in the negative prompt, for example, we’re injecting our own bias and potentially limiting the variability of the ‘rugged character’ we want. There are terms that weight heavily toward female results vs. male ones. It’s also the reason a lot of AI generated images have similar faces. The more detail we can provide, driving the bias in our favor, the better and more unique the results.

We know that through prompting we can tell Stable Diffusion ā€œI want more of this, and less of thatā€, but how do we define what ‘this’ is? This is where interrogation tools come in (BLIP, CLIP, WD14, DeepBooru); in Stable Diffusion you can find two of these on the img2img tab.
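If you’re curious what these tools are doing under the hood, here’s a minimal sketch that runs BLIP captioning (the kind of model that produces the natural-language part of an interrogation result) outside the web UI, using the Hugging Face transformers port; the filename is a placeholder:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration  # pip install transformers

# Load the base BLIP captioning model from the Hugging Face hub.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("mercenary.png").convert("RGB")  # placeholder filename
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))
# e.g. "a man with a gun and a beard wearing a uniform"
```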


For this walkthrough I’d also recommend installing the ‘clip-interrogator-ext’ extension from the Stable Diffusion extensions tab, as it adds some enhanced features that will be super helpful, and I’m going to use it a bit below.
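Under the hood that extension wraps the open-source clip-interrogator package, which you can also run standalone if you prefer scripting; a minimal sketch, assuming the package’s default ViT-L CLIP model (the right match for SD 1.x checkpoints):

```python
from PIL import Image
from clip_interrogator import Config, Interrogator  # pip install clip-interrogator

# BLIP generates the base caption; CLIP then ranks artist/style "flavor" terms to append.
ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))

image = Image.open("mercenary.png").convert("RGB")  # placeholder filename
print(ci.interrogate(image))  # caption plus comma-separated flavor terms
```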

Let’s Get Started

In this walkthrough I’m going to create a Ghibli-styled ‘mercenary’ character, and the workflow is going to look like this, where we use interrogation at each stage to get the output as close as possible to the concept.

Steps 1 and 2 are actually 90% of the work, and that’s where we’ll focus our efforts today.

Defining a Character Archetype

For this step, I’m going to use RevAnimated as my base model. My initial prompt will be simple:
Prompt: Portrait of a mercenary, facing camera

I’ll skip the negative prompt for now and start with all default settings with a batch count of 10:


My results:


Fortunately it’s starting off on the right foot, and I’m headed in the general direction of my ‘ideal’ mercenary. The closest I see in this set is the last image in the 1st row, so let’s interrogate that. By interrogating we’ll get a sense of the terms often used to describe it. To do this, click on the image and ‘send to img2img’. Then on the img2img tab click ‘Interrogate CLIP’; for my image it comes up with this prompt (highlights are what I see as positive vs. negative traits):

Prompt: a man with a gun and a beard wearing a uniform and a fur collar and a beard with a gun in his hand, Andor Basch, cushart krenz, a character portrait, serial art

Now click on ‘Interrogate DeepBooru’; my results:

Prompt: 1boy, ak-47, assault rifle, beard, blonde hair, bullet, bullpup, coat, facial hair, finger on trigger, fur-trimmed jacket, fur collar, fur trim, gun, handgun, heckler & koch, holding gun, kalashnikov rifle, m4 carbine, machine gun, magazine (weapon), male focus, manly, pistol, revolver, rifle, scope, sling, sniper rifle, solo, stubble, submachine gun, suppressor, weapon, weapon on back
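That’s a lot of tags to sort through by eye. When I’m triaging tags across several images, a quick throwaway filter speeds things up; a sketch, where the drop-word list is just my assumption for this character:

```python
# Split a DeepBooru tag string into 'look' tags to keep and weapon/gear tags to set aside.
tags = ("1boy, ak-47, assault rifle, beard, blonde hair, coat, facial hair, "
        "fur collar, fur trim, gun, male focus, manly, solo, stubble").split(", ")  # abbreviated

drop_words = {"gun", "rifle", "pistol", "weapon", "bullet", "ak-47"}  # assumed gear terms
look = [t for t in tags if not any(w in t for w in drop_words)]
gear = [t for t in tags if t not in look]

print("keep:", ", ".join(look))       # beard, blonde hair, facial hair, ...
print("set aside:", ", ".join(gear))  # ak-47, assault rifle, gun, ...
```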

The other traits are useful, but I’m mostly focusing on the ‘look’ rather than the guns and other details. Let’s now move to the interrogator tab (as mentioned in the opening, it’s an add-on extension installable via the extensions panel: clip-interrogator-ext). To do this, right-click on the image in txt2img, save it to your desktop, and open it in that tab. You’ll notice you can select from a bunch of different CLIP models, but we’ll use the default here.


Click on ‘Generate’; the prompt I get back is this:

Prompt: a painting of a man with a gun in his hand, charlie bowater character art, blonde shaggy hair, aged shaggy ex military soldier, detailed anime soft face, test subject supersoldier, blond furr, dating app icon, bloodborn, name of the character is chad, quint, behance. polished, heterochromia, joongwon jeong, caracal cyborg

This prompt response has some funny terms; who knew ‘heterochromia’ was a usable prompt term? šŸ™‚ I don’t want my character to have that, so I’ll nix it. You’ll also notice I split ‘detailed anime’ from ‘soft face’: I know I want ‘detailed anime’, but I’m uncertain about ‘soft face’; I’d prefer something more rugged.

Lastly, we’ll stay on this interrogator tab, but click on its ‘Analyze’ sub-tab, upload your image again here, and then click ‘Analyze’ at the bottom. Here are the general results I get:


Great, so let’s use this to glean a few more terms:

Positives:
* Rugged male ranger
* Detailed character portrait
* Artstation
* serial art

Negatives:
* furry art

Compiling it all together, we now have the following traits to add to our prompt and negative prompt:

So let’s test it out, modifying our original prompt to now be:

New Prompt: Portrait of an aged ex military officer, mercenary with a gun and beard, black short hair, brown eyes, facing camera, facial hair, male focus, manly, detailed anime, rugged male ranger, detailed character portrait, artstation, serial art

Negative Prompt: fur collar, blonde hair, fur-trimmed jacket, fur trim, blonde shaggy hair, soft face, heterochromia, furry art

Above, I also added in ‘black short hair’ and ‘brown eyes’, since that’s what I want and I didn’t define it in the original prompt. So let’s give it a go; I’ll run 20, again just using the base settings.


OK, this is getting very close to what I was thinking. From here, if the model still isn’t what you want, you can put in a few prompt terms of your own (maybe add ‘facial scars, scars’ to the negative prompt), or run it through interrogation again and do some more fine-tuning. Play with it.

One thing for me is that I want a bit more of a ‘Studio Ghibli’ vibe, so I’m going to add that into my prompt. Which reminds me: art styles really influence output. I’d 100% recommend downloading the Stable Diffusion Cheat Sheet from GitHub to help guide your outputs toward the art styles you like. You can even mix art styles; be as creative as you want.

So here’s my final output:


My final prompt for this was: Portrait of an ex military officer, mercenary with a gun and beard, black short hair, brown eyes, facing camera, facial hair, male focus, manly, detailed anime, rugged male ranger, detailed character portrait, artstation, serial art, (by studio Ghibli:1.3)

Final Negative Prompt: fur collar, blonde hair, fur-trimmed jacket, fur trim, blonde shaggy hair, soft face, heterochromia, furry art, blurry

For this set I’m going with the bottom left corner (Seed: 1178841349)


At this point in my workflow, what I like to do is recycle the seed and run another 20 using a small variation strength, like 0.15. To do that we’ll modify the lower part of the top panel.
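As an aside, if you’d rather script this step, the same knobs are exposed through the A1111 web UI API (launch the UI with --api). A minimal sketch, assuming a default local install; the prompts are abbreviated from above and the output filenames are mine:

```python
import base64
import requests

payload = {
    "prompt": "Portrait of an ex military officer, mercenary with a gun and beard, ...",
    "negative_prompt": "fur collar, blonde hair, fur-trimmed jacket, ...",
    "seed": 1178841349,        # the recycled seed from the chosen image
    "subseed": -1,             # -1 picks a fresh variation seed per image
    "subseed_strength": 0.15,  # the 'Variation strength' slider
    "n_iter": 20,              # batch count of 20
}
r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
for i, img in enumerate(r.json()["images"]):  # base64-encoded PNGs
    with open(f"variation_{i}.png", "wb") as f:
        f.write(base64.b64decode(img.split(",", 1)[-1]))
```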



Results:


I might be being nit-picky, but the last mercenary seems to be the closest ‘archetype’ to what I was thinking; his variation is (Seed: 1700595277). Now save the image to your desktop.


Creating The Panels:

Why create panels? Well, if your intention is a one-shot ‘I got my mercenary headshot’ and that’s all you need, then you’re done. But if your intention is ‘I want a character that I can pose, work around, and put into different environments, outfit styles, etc.’, then creating a character sheet (from the previous blog) is a step to get you there.

We’ll start by creating a 3-panel sheet using these two assets:


We’re going to stay in txt2img for the moment, but we’re going to use ControlNet pretty extensively, similar to how we did in the previous blog.

First we’re going to alter the prompt slightly to give guidance to the AI. My last used prompt was: Portrait of an ex military officer, mercenary with a gun and beard, black short hair, brown eyes, facing camera, facial hair, male focus, manly, detailed anime, rugged male ranger, detailed character portrait, artstation, serial art, (by studio Ghibli:1.3)

We’re simply going to change it to this: A series of portraits of an ex military officer, mercenary with a gun and beard, black short hair, brown eyes, facing camera, identical, facial hair, male focus, manly, detailed anime, rugged male ranger, detailed character portrait, artstation, serial art, (by studio Ghibli:1.3)

Use the following settings:


In ControlNet we’re going to use 3 units to get our output: Pose, Lineart, and Reference. If you only see 1 ControlNet unit, that’s OK; hop over to Settings -> ControlNet and change ‘Multi ControlNet: Max models amount (requires restart)’ to however many you want. Mine is set to 4.
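(Side note for scripters: the sd-webui-controlnet extension also exposes these units through the API payload. A rough sketch of the three-unit setup, where the preprocessor modules are my assumptions for pose/lineart/reference and the model filenames are placeholders for whatever you have installed:)

```python
import base64
import requests

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

payload = {
    "prompt": "A series of portraits of an ex military officer, ...",  # abbreviated
    "alwayson_scripts": {
        "controlnet": {
            "args": [
                {"input_image": b64("pose_sheet.png"), "module": "openpose",
                 "model": "control_v11p_sd15_openpose"},   # placeholder model name
                {"input_image": b64("lineart_sheet.png"), "module": "lineart_realistic",
                 "model": "control_v11p_sd15_lineart"},    # placeholder model name
                {"input_image": b64("reference.png"), "module": "reference_only",
                 "model": "None"},                         # reference needs no model
            ]
        }
    },
}
requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
```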

Panel 1:


Panel 2:


Panel 3:


Here we most likely won’t need to use interrogation, but if your panels start showing something too funky to describe in the negative prompt, use the interrogation tools from above. One thing to note: since each portrait panel is smaller than 512×512, detail and accuracy will be lower. That’s actually OK; we don’t need to be exact here, just close enough to be a ‘fair representation’. Think of things like head shape, eye placement, etc., very general characteristics. Once we have the right base, we’ll take it to img2img for the final steps.


Here’s one example; two of the panels are pretty close. What I’ll usually do is save this, then crop and save the 1st image and interrogate it. In this case it came up with ‘1girl, young woman’, so I added those to the negative prompt.
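Cropping panels by hand gets old quickly; here’s a minimal PIL sketch of that crop step, assuming an equal-width three-panel sheet (adjust the math for your layout):

```python
from PIL import Image

sheet = Image.open("panel_sheet.png")   # assumed filename for the saved 3-panel output
panel_w = sheet.width // 3              # assumes three equal-width panels
for i in range(3):
    panel = sheet.crop((i * panel_w, 0, (i + 1) * panel_w, sheet.height))
    panel.save(f"panel_{i + 1}.png")    # crop each panel out, then interrogate it
```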

Some other things from looking at the output: my character has a bit more of a ‘heavy’ and ‘round’ or ‘wide’ face, and this one is coming out a bit too narrow, so I added (heavy and round face:1.1). I also realized I never explicitly said ‘man’ in the positive prompt (a great example of bias: the model inferred heavily that I meant male), so I added that there. One other change was adding ‘squint, squinting’ to the negative prompt. Then I ran another set of 20.

Final Prompt: A series of portraits of an ex military officer, mercenary with a gun and beard, (man:1.3), black short hair, brown eyes, open eyes, facing camera, (identical:1.3), (wide and round face:1.2), (grey background:1.3), full cheeks, facial hair, male focus, manly, detailed anime, rugged male ranger, detailed character portrait, artstation, serial art, (by studio Ghibli:1.3)

Negative Prompt: fur collar, blonde hair, fur-trimmed jacket, fur trim, blonde shaggy hair, soft face, heterochromia, furry art, blurry, child, boy, girl, woman, youth, (young man), (young woman), (1girl), squint, squinting, gaunt, ghailan!, caleb from critical role, ponytail hair

Final result at this stage:


MOVING TO IMG2IMG:

PRO TIP: Here’s something I’ve found super valuable, and I’m not sure people are generally aware of it. If you read my previous blog, you’ll notice I described a grey background. The reason is that the background can influence the outcome of your images; it can bias the AI toward a particular style and take away from what you’re trying to achieve. Because of this, there are 2 things I do at this point: (1) use the remove-background extension that’s available for Stable Diffusion via extensions, and (2) use Photoshop, GIMP, or any other app that lets you fill the now-empty background with a neutral gray color (I find that works best), and then add the borders back in, 8 pixels wide, again as guardrails. I sometimes add a little noise to the background as well, which I did to this base image for the next step. I’ll save that part of the conversation for a future blog, basically ‘how you can use noise to modify results’.
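For those who want to script this prep instead of using Photoshop/GIMP, here’s a minimal sketch of the same idea using the open-source rembg package plus PIL. The gray value, noise range, and bar positions are my assumptions for a three-panel sheet; tune to taste:

```python
import numpy as np
from PIL import Image, ImageDraw
from rembg import remove  # pip install rembg

img = Image.open("panel_sheet.png").convert("RGBA")  # assumed filename
cutout = remove(img)  # character with the background made transparent

# Build a neutral gray background with a little noise sprinkled in.
rng = np.random.default_rng()
bg = np.full((img.height, img.width, 3), 128, dtype=np.int16)  # mid-gray
bg += rng.integers(-12, 13, bg.shape, dtype=np.int16)          # light noise
bg = np.clip(bg, 0, 255).astype(np.uint8)

canvas = Image.fromarray(bg).convert("RGBA")
canvas.alpha_composite(cutout)
canvas = canvas.convert("RGB")

# Re-draw the 8-pixel separator bars between the three panels (guardrails).
draw = ImageDraw.Draw(canvas)
panel_w = canvas.width // 3
for i in (1, 2):
    x = i * panel_w
    draw.rectangle([x - 4, 0, x + 3, canvas.height - 1], fill=(0, 0, 0))

canvas.save("prepped_for_img2img.png")
```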

That said, if you want to skip that part, just click ā€œsend to img2imgā€ and use the following settings. YMMV on the settings; some images require more, some less, when it comes to CFG & Denoising.
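(And if you’re scripting, the img2img endpoint mirrors the txt2img call from earlier, with the init image and denoising strength as the new knobs; a sketch under the same local-API assumptions:)

```python
import base64
import requests

with open("prepped_for_img2img.png", "rb") as f:  # the prepped sheet from the pro tip
    init_image = base64.b64encode(f.read()).decode()

payload = {
    "init_images": [init_image],
    "prompt": "A series of portraits of an ex military officer by (Studio Ghibli:1.3), ...",
    "denoising_strength": 0.5,  # how far img2img may drift from the init image; tune per image
    "cfg_scale": 7,             # tune per image, as noted above
}
requests.post("http://127.0.0.1:7860/sdapi/v1/img2img", json=payload)
```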

My final prompt: A series of portraits of an ex military officer by (Studio Ghibli:1.3), mercenary with a gun and beard, (man:1.1), (wearing a green shirt:1.2), black short hair, (brown eyes:1.4), open eyes, (identical:1.3), (wide and round face:1.2), (grey background:1.3), animation, facial hair, male focus, manly, detailed anime, rugged male ranger, artstation, serial art
Negative prompt: fur collar, blonde hair, fur-trimmed jacket, fur trim, blonde shaggy hair, soft face, heterochromia, furry art, blurry, child, boy, girl, woman, youth, (young man), (young woman), (1girl), squint, squinting, gaunt, ghailan!, caleb from critical role, ponytail hair

Settings:

Notice I removed the background, replaced it with some noisy grey, and added the bars back in.


I played around with the settings a tiny bit after a few samples, mostly with CFG. My final result:

Here’s the output if I hadn’t put any lines in or added noise:


You’ll notice in the above that the second image, where I added some noise, came out with more of the overall Ghibli effect.

CONCLUSION:

So that gets us through this tutorial on using interrogation to get close-to-concept results in Stable Diffusion. I plan on my next post being about adding bodies to the images we’ve already generated. That will require post-processing the images with a Python script, which will ease a lot of the manual work. I’m in the process of putting that together and should be done right after the 4th of July.

Until then, enjoy!


5 Responses

  1. Nicolaas Grobler says:

    The posts just keep getting better! Thank you for creating such detailed guides. For us beginners it is really a great help!

  2. Cainezen says:

One thing of note on performance: in my testing, a batch of (10 x 1) is much slower than, say, a batch of (2 x 5): 10 seconds versus 14. Also, doing a batch of (1 x 8) takes half the time at 7 seconds. I know Auto1111 limits the Batch size to 8, but if you’re looking to do this in batches, increasing the Batch Count while leaving the Batch size at (1) results in poor performance.

    • Dave Packer says:

Definitely, I often run 10×4 depending on what I’m doing. The only caveat here is that batch size is a concurrent process on the GPU; the larger the batch size, the more VRAM is consumed. For a lot of 8GB boards, an increased batch size can lead to CUDA OOM errors.

  3. Nikhil Kumar says:

I am working with bodies these days for my character. Will share any insights if I get any. Thanks for this work man!

  4. Nikhil Kumar says:

I’m looking forward to the next one