Monday, April 6, 2026

Image Generation with ComfyUI From Bash

The best tool I've found to run AI image and video-generation models is ComfyUI. It supports a lot of different models and has a dataflow programming architecture that allows fairly sophisticated use cases. The problem for me, however, is that it's awkward to use it just to generate an image or two. It's also hard to use the web interface to iterate over different prompts or parameters. So this post describes a workflow for calling ComfyUI from Bash and previewing images directly from a terminal.

ComfyUI has a REST interface that can be accessed from the command line via curl. This is used by the web interface, so you get the full power of the program without having to click on little boxes to change parameters.

I won't cover installing ComfyUI. You'll have to figure that out yourself. But let's assume you have ComfyUI installed and running, and you've downloaded the model files for Z-Image Turbo and can generate images in your browser. Now you want to do the same from a command line.

First, you'll need a JSON file for your workflow. Load up the ComfyUI web interface and open up your workflow. As of this writing, you can right-click on the workflow tab at the top of the window and select "Export (API)" to create this file. So for example, if you do this with the default Z-Image Turbo workflow, you'll get something like this:

{
  "9": {
    "inputs": {
      "filename_prefix": "z-image-turbo",
      "images": [
        "57:8",
        0
      ]
    },
    "class_type": "SaveImage",
    "_meta": {
      "title": "Save Image"
    }
  },
  "57:30": {
    "inputs": {
      "clip_name": "qwen_3_4b.safetensors",
      "type": "lumina2",
      "device": "default"
    },
    "class_type": "CLIPLoader",
    "_meta": {
      "title": "Load CLIP"
    }
  },
  "57:33": {
    "inputs": {
      "conditioning": [
        "57:27",
        0
      ]
    },
    "class_type": "ConditioningZeroOut",
    "_meta": {
      "title": "ConditioningZeroOut"
    }
  },
  "57:8": {
    "inputs": {
      "samples": [
        "57:3",
        0
      ],
      "vae": [
        "57:29",
        0
      ]
    },
    "class_type": "VAEDecode",
    "_meta": {
      "title": "VAE Decode"
    }
  },
  "57:28": {
    "inputs": {
      "unet_name": "z_image_turbo_bf16.safetensors",
      "weight_dtype": "default"
    },
    "class_type": "UNETLoader",
    "_meta": {
      "title": "Load Diffusion Model"
    }
  },
  "57:27": {
    "inputs": {
      "text": "A sea lion on a beach, holding a sign that says, \"Command-Line Interfaces Rock!\"",
      "clip": [
        "57:30",
        0
      ]
    },
    "class_type": "CLIPTextEncode",
    "_meta": {
      "title": "CLIP Text Encode (Prompt)"
    }
  },
  "57:13": {
    "inputs": {
      "width": 1024,
      "height": 1024,
      "batch_size": 1
    },
    "class_type": "EmptySD3LatentImage",
    "_meta": {
      "title": "EmptySD3LatentImage"
    }
  },
  "57:11": {
    "inputs": {
      "shift": 3,
      "model": [
        "57:28",
        0
      ]
    },
    "class_type": "ModelSamplingAuraFlow",
    "_meta": {
      "title": "ModelSamplingAuraFlow"
    }
  },
  "57:3": {
    "inputs": {
      "seed": 277911290314474,
      "steps": 8,
      "cfg": 1,
      "sampler_name": "res_multistep",
      "scheduler": "simple",
      "denoise": 1,
      "model": [
        "57:11",
        0
      ],
      "positive": [
        "57:27",
        0
      ],
      "negative": [
        "57:33",
        0
      ],
      "latent_image": [
        "57:13",
        0
      ]
    },
    "class_type": "KSampler",
    "_meta": {
      "title": "KSampler"
    }
  },
  "57:29": {
    "inputs": {
      "vae_name": "ae.safetensors"
    },
    "class_type": "VAELoader",
    "_meta": {
      "title": "Load VAE"
    }
  }
}

This JSON file describes the graph that's used to generate the image. You'll need to wrap this in a larger JSON object and send that to ComfyUI to get an image back. So create a file named zimageturbo.json and put the following in it:

{
  "client_id": "e70c5721-5e9f-43b4-bcb3-23bf05fec938",
  "prompt_id": "5719972c-88d3-45b7-b441-ef06f0b1a011",
  "prompt":
    /* insert your workflow JSON from above here */
}

The long hex strings are UUIDs I generated with uuidgen. You'll need to generate a new prompt_id for each prompt you submit and a new client_id for each computer you connect from.
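
If you're scripting this, you can let jq assemble the wrapper around the exported workflow instead of pasting JSON by hand. Here's a sketch; it assumes your "Export (API)" file is saved as workflow.json (my filename choice), and the stand-in line just lets the sketch run even without a real export:

```shell
#!/bin/sh
# Wrap an "Export (API)" workflow in the request object ComfyUI expects.
[ -f workflow.json ] || echo '{}' > workflow.json  # stand-in so the sketch runs anywhere
client_id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
prompt_id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
jq -n --arg cid "$client_id" --arg pid "$prompt_id" \
   --slurpfile wf workflow.json \
   '{client_id: $cid, prompt_id: $pid, prompt: $wf[0]}' > zimageturbo.json
echo "$prompt_id"  # keep this around for polling /history later
```

Printing the prompt_id at the end lets a wrapper script capture it for the polling step.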

Now you can send this to the ComfyUI instance with curl. Assuming your server is at the default http://127.0.0.1:8188, you can do:

$ curl -s --url "http://127.0.0.1:8188/prompt" --json @zimageturbo.json

Now you wait for the image to be generated. You can poll the history/[prompt_id] endpoint (filling in the prompt_id to match your JSON file) to see when the image has completed and where to find it:

$ curl -s --url "http://127.0.0.1:8188/history/5719972c-88d3-45b7-b441-ef06f0b1a011"

Before the image is completed, this will return nothing. After the image is completed, this will return some JSON:

{
  "5719972c-88d3-45b7-b441-ef06f0b1a011": {
    "prompt": [
      7,
      "5719972c-88d3-45b7-b441-ef06f0b1a011",
      {
        "9": {
          "inputs": {
            "filename_prefix": "z-image-turbo",
            "images": [
              "57:8",
              0
            ]
          },
          "class_type": "SaveImage",
          "_meta": {
            "title": "Save Image"
          }
        },
        "57:30": {
          "inputs": {
            "clip_name": "qwen_3_4b.safetensors",
            "type": "lumina2",
            "device": "default"
          },
          "class_type": "CLIPLoader",
          "_meta": {
            "title": "Load CLIP"
          }
        },
        "57:33": {
          "inputs": {
            "conditioning": [
              "57:27",
              0
            ]
          },
          "class_type": "ConditioningZeroOut",
          "_meta": {
            "title": "ConditioningZeroOut"
          }
        },
        "57:8": {
          "inputs": {
            "samples": [
              "57:3",
              0
            ],
            "vae": [
              "57:29",
              0
            ]
          },
          "class_type": "VAEDecode",
          "_meta": {
            "title": "VAE Decode"
          }
        },
        "57:28": {
          "inputs": {
            "unet_name": "z_image_turbo_bf16.safetensors",
            "weight_dtype": "default"
          },
          "class_type": "UNETLoader",
          "_meta": {
            "title": "Load Diffusion Model"
          }
        },
        "57:27": {
          "inputs": {
            "text": "A sea lion on a beach, holding a sign that says, \"Command-Line Interfaces Rock!\"",
            "clip": [
              "57:30",
              0
            ]
          },
          "class_type": "CLIPTextEncode",
          "_meta": {
            "title": "CLIP Text Encode (Prompt)"
          }
        },
        "57:13": {
          "inputs": {
            "width": 1024,
            "height": 1024,
            "batch_size": 1
          },
          "class_type": "EmptySD3LatentImage",
          "_meta": {
            "title": "EmptySD3LatentImage"
          }
        },
        "57:11": {
          "inputs": {
            "shift": 3.0,
            "model": [
              "57:28",
              0
            ]
          },
          "class_type": "ModelSamplingAuraFlow",
          "_meta": {
            "title": "ModelSamplingAuraFlow"
          }
        },
        "57:3": {
          "inputs": {
            "seed": 277911290314474,
            "steps": 8,
            "cfg": 1.0,
            "sampler_name": "res_multistep",
            "scheduler": "simple",
            "denoise": 1.0,
            "model": [
              "57:11",
              0
            ],
            "positive": [
              "57:27",
              0
            ],
            "negative": [
              "57:33",
              0
            ],
            "latent_image": [
              "57:13",
              0
            ]
          },
          "class_type": "KSampler",
          "_meta": {
            "title": "KSampler"
          }
        },
        "57:29": {
          "inputs": {
            "vae_name": "ae.safetensors"
          },
          "class_type": "VAELoader",
          "_meta": {
            "title": "Load VAE"
          }
        }
      },
      {
        "client_id": "e70c5721-5e9f-43b4-bcb3-23bf05fec938",
        "create_time": 1775495893332
      },
      [
        "9"
      ]
    ],
    "outputs": {
      "9": {
        "images": [
          {
            "filename": "z-image-turbo_00000_.png",
            "subfolder": "",
            "type": "output"
          }
        ]
      }
    },
    "status": {
      "status_str": "success",
      "completed": true,
      "messages": [
        [
          "execution_start",
          {
            "prompt_id": "5719972c-88d3-45b7-b441-ef06f0b1a011",
            "timestamp": 1775495893332
          }
        ],
        [
          "execution_cached",
          {
            "nodes": [],
            "prompt_id": "5719972c-88d3-45b7-b441-ef06f0b1a011",
            "timestamp": 1775495893333
          }
        ],
        [
          "execution_success",
          {
            "prompt_id": "5719972c-88d3-45b7-b441-ef06f0b1a011",
            "timestamp": 1775495906934
          }
        ]
      ]
    },
    "meta": {
      "9": {
        "node_id": "9",
        "display_node": "9",
        "parent_node": null,
        "real_node_id": "9"
      }
    }
  }
}
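
Rather than re-running that curl command by hand, you can poll until status.completed turns true. Here's a sketch; the is_done helper is my own naming, not part of ComfyUI, and the canned JSON at the end just demonstrates it without a live server:

```shell
#!/bin/sh
# is_done: reads a /history response on stdin, succeeds once the prompt completed.
is_done() {
  jq -e --arg id "$1" '.[$id].status.completed == true' > /dev/null
}

# Against a live server you'd loop like this (not run here):
#   until curl -s "http://127.0.0.1:8188/history/$PROMPT_ID" | is_done "$PROMPT_ID"; do
#     sleep 2
#   done

# Quick check against a canned response:
echo '{"abc": {"status": {"completed": true}}}' | is_done abc && echo done
```

The jq -e flag makes the exit status follow the expression's truth value, which is what lets the shell loop on it.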

You need to look at the contents of [prompt_id]["outputs"]["9"], where "9" is the id of the SaveImage node in the workflow. You can extract this with the jq tool:

$ curl -s --url "http://127.0.0.1:8188/history/5719972c-88d3-45b7-b441-ef06f0b1a011" |
jq '.["5719972c-88d3-45b7-b441-ef06f0b1a011"].outputs["9"]'

Which will display:

{
  "images": [
    {
      "filename": "z-image-turbo_00000_.png",
      "subfolder": "",
      "type": "output"
    }
  ]
}
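
From there, jq -r can pull the fields into shell variables for the download step. A sketch, with the response above inlined as a string for illustration (in practice you'd pipe the curl output instead):

```shell
#!/bin/sh
# Extract filename/subfolder/type from the "outputs" object shown above.
outputs='{"images":[{"filename":"z-image-turbo_00000_.png","subfolder":"","type":"output"}]}'
filename=$(printf '%s' "$outputs" | jq -r '.images[0].filename')
subfolder=$(printf '%s' "$outputs" | jq -r '.images[0].subfolder')
ftype=$(printf '%s' "$outputs" | jq -r '.images[0].type')
echo "$filename|$subfolder|$ftype"
```

These three values are exactly what the view endpoint wants as its filename, subfolder, and type parameters.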

With that information, we can use the view endpoint to fetch the image:

$ curl -s --get --url "http://127.0.0.1:8188/view" -d 'filename=z-image-turbo_00000_.png' \
-d 'subfolder=' -d 'type=output' -o z-image-turbo_00000_.png

You can view the resulting image with a tool like timg or chafa. If you have a new enough terminal emulator, this can even display the full-resolution image.

Sea lions can't read or write, so this might be the best you can hope for.

And that's all there is to it. You can modify the JSON file to edit the prompt, image width and height, random seed, diffusion steps, or anything else you want to change. You can also export other workflows and use any ComfyUI-supported model from the command line this way.

Friday, March 27, 2026

Generating Formatted Long Division Practice Problems

Here's a Python script I wrote to generate a bunch of random long division problems with solutions, showing work.

#!/usr/bin/env python3

import random

# Print 100 random 4-by-1 digit long division problems,
# showing the complete solution process.

def long_division(dividend, divisor):
    quotient = dividend // divisor
    remainder = dividend - quotient * divisor

    indent = " "*6

    print(indent + "      %4d R %d" % (quotient, remainder))
    print(indent + "    ------")
    print(indent + "%3d ) %4d" % (divisor, dividend))

    q = ((quotient // 1) % 10,
         (quotient // 10) % 10,
         (quotient // 100) % 10,
         (quotient // 1000) % 10)

    d = ((dividend // 1) % 10,
         (dividend // 10) % 10,
         (dividend // 100) % 10,
         (dividend // 1000) % 10)

    indent += " "*4

    rem = 0
    digit = 3

    # find the first non-zero quotient digit
    while digit > -1:
        rem = rem * 10 + d[digit]
        if q[digit] > 0:
            break
        digit -= 1
    if digit < 0:
        return

    while digit >= 0:

        # subtract the product from the remainder
        print(indent + "- %*d" % (4 - digit, q[digit] * divisor))
        print(indent + "-------")
        rem = rem - (q[digit] * divisor)
        print(indent + "  %*d" % (4 - digit, rem), end="")

        digit -= 1

        # find the next non-zero quotient digit, bringing more quotient
        # digits down

        while digit >= 0:
            rem = rem * 10 + d[digit]
            print("%d" % d[digit], end="")
            if q[digit] > 0:
                break
            digit -= 1

        print("")

for i in range(100):

    print("")
    print("-"*20)
    print("#%d" % (i+1))

    dividend = random.randint(1000,9999)
    divisor = random.randint(2,9)

    long_division(dividend, divisor)

This just prints the problems to stdout, and you can grep for the important lines if you want to remove the solutions:

$ python3 division.py > solutions
$ grep -E "#|\)| ------$" solutions > problems

Here's a sample output:

--------------------
#1
             207 R 1
          ------
        6 ) 1243
          - 12
          -------
             043
          -   42
          -------
               1
--------------------
#2
             555 R 4
          ------
        9 ) 4999
          - 45
          -------
             49
          -  45
          -------
              49
          -   45
          -------
               4
--------------------
#3
             609 R 2
          ------
        6 ) 3656
          - 36
          -------
             056
          -   54
          -------
               2
--------------------
#4
            1014 R 3
          ------
        6 ) 6087
          - 6
          -------
            008
          -   6
          -------
              27
          -   24
          -------
               3
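
As a sanity check, here's the grep filter applied to problem #1 from the sample above (written to a scratch file named sample, my choice):

```shell
#!/bin/sh
# The pattern keeps the problem number, the 6-dash ruler above the answer,
# and the "divisor ) dividend" line; it drops the worked solution, whose
# rulers are 7 dashes preceded by a dash rather than a space.
cat > sample <<'EOF'
--------------------
#1
             207 R 1
          ------
        6 ) 1243
          - 12
          -------
             043
          -   42
          -------
               1
EOF
grep -E "#|\)| ------$" sample
```

This prints just the three-line problem skeleton (the 20-dash separators are dropped too, since no space precedes their final dashes).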

Monday, March 9, 2026

Image Generation with stable-diffusion.cpp

This post documents how I set up stable-diffusion.cpp to generate images on my PC, running on Linux Mint 22.2, with an Nvidia GeForce RTX 5070 GPU.

Prerequisites

The nvidia-cuda-toolkit package in Linux Mint 22.2 is too old for the RTX 50x0 series, so I had to go to Nvidia's CUDA web site to download a newer version. This is a little tricky because you have to know which version of Ubuntu your Linux Mint release is based on. Check /etc/upstream-release/lsb-release to find out, then download and install the appropriate deb packages from Nvidia's site. Then put the CUDA bin directory in your PATH.

You'll also need a handful of packages to build the software. I can't remember them all, but at least cmake and git are required.

If you go to the stable-diffusion.cpp Git repository, there's a build guide that should walk you through the process of downloading and building the software. However, if you have multiple versions of the nvidia-cuda-toolkit package installed, you need to tell CMake where to find the correct nvcc. Most of the instructions I found to do this were incorrect for my situation (maybe because I had previously installed the system nvidia-cuda-toolkit package). I had to use the command:

$ cmake .. -DSD_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc

in place of the command

$ cmake .. -DSD_CUDA=ON

from the stable-diffusion.cpp/build/ directory in order to detect the Nvidia-provided version of nvcc.

If you don't have a GPU but have at least 16 GB of RAM, you can compile stable-diffusion.cpp to run on your CPU instead and still get decent images, but generation will be dramatically slower. My Ryzen 5700X CPU is about 130x slower than my RTX 5070 for this. With less RAM, you may still be able to get away with heavily quantized models but might not like the results.

After you've built stable-diffusion.cpp, you need some model files. There are links to these in the Markdown files in stable-diffusion.cpp/docs/. I recommend looking at z_image.md for Z-Image Turbo. This is probably the best all-around local image generation model as of early March 2026. It is notable for running in a relatively small footprint, following prompts very well, and avoiding "body horror" like giving people extra limbs.

You'll need a GGUF file for Z-Image Turbo. The different options are various sizes, differing in the amount of quantization. You'll want the largest one that fits in your GPU's VRAM, though you have to leave some space for the work buffers. The Q8_0 version should work for GPUs with 8 GB of VRAM or more.
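
You can ballpark whether a given quantization fits: the weight footprint is roughly parameter count times bits per weight. Z-Image Turbo is around a 6B-parameter model and Q8_0 costs about 8.5 bits per weight (both figures are my approximations, not from the stable-diffusion.cpp docs):

```shell
# Rough weight footprint: params * bits-per-weight / 8 bits-per-byte.
# 6e9 params and 8.5 bits/weight for Q8_0 are approximations.
awk -v params=6e9 -v bpw=8.5 'BEGIN { printf "%.1f GB\n", params * bpw / 8 / 1e9 }'
```

That leaves a bit of headroom on an 8 GB card for work buffers, which is why Q8_0 is about the practical limit there.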

You also need a VAE safetensors file and the Qwen3 4B text encoder GGUF file linked in the Markdown file. Once again, the Q8_0 version of Qwen3 4B is probably fine if you have 8 GB of VRAM.

Generating Images

There are some serious limitations with the way stable-diffusion.cpp uses VRAM, so the best way to do all of this is to run headless. Kill any GUI programs running on your Linux box, log out of the desktop GUI, and SSH in from another computer.

Once you've built stable-diffusion.cpp and downloaded all the model files, you can run stable-diffusion.cpp/build/bin/sd-cli. I put my models in the ~/ai/models directory, generated a GGUF version of the VAE, and renamed the model files to simplify them, so my invocation looks like:

$ ./sd-cli --mmap --diffusion-model ~/ai/models/z_image_turbo.gguf \
--vae ~/ai/models/flux_ae.gguf --llm ~/ai/models/qwen_3_4b.gguf \
--cfg-scale 1.0 --offload-to-cpu --diffusion-fa -H 1024 -W 1024 \
--steps 9 -s 42 -p "Cute illustration of a sea lion on a rocky beach, \
holding a sign that says \"Will generate images for fish\""

You can look at sd-cli's help text to see what the options do. Some notable ones are the --offload-to-cpu and --diffusion-fa flags, which reduce VRAM usage.

This command will generate the image file output.png, which you can view from your terminal with a tool like timg or chafa if you have a compatible terminal emulator (Ghostty, for example). Here's what mine looked like:

Notice that the text doesn't match the prompt. That's pretty common with small image generators (especially quantized versions), and you just have to keep trying with different seed values (the -s parameter above) or slightly tweaked prompts until you get what you want.

What's the Point?

Why bother with all this to run Z-Image Turbo locally when you can just use Google or ChatGPT to generate images? I think this is most useful for integrating image generation into a larger workflow. For example, you can generate hundreds of random images overnight to help generate ideas for a project. Or you could use an edit model to change a whole directory of family photos into cartoon-style drawings. And it all runs locally on your machine, which lets you deploy it to places without fast network access. With a fast enough GPU, you could build a camera that automatically cartoonifies live images.

Tips

  • Z-Image Turbo prefers long, detailed prompts. I think the easiest way to work with it is to use a large language model (LLM) to generate long prompts from your shorter ones. I've had good results with Ministral 3 8B.
  • Z-Image Turbo tends to produce similar images even across different seeds. To get more diversity, have an LLM rewrite your prompts with its sampling temperature turned up.
  • Unlike PyTorch-based projects like ComfyUI that can load models incrementally in whatever VRAM is available, stable-diffusion.cpp requires enough VRAM to load the entire diffusion model with extra space for working memory. Until that's fixed, this approach is only useful for smaller models (unless you have a giant GPU or don't mind running much slower on the CPU).
  • Once you have generation working, you can try an edit model like Flux.2 Klein 4B. It's a little tricky to use but opens up new fun applications.

What About the Ethics of AI Art?

Yes, it is true that AI image models are trained on huge quantities of other people's art without permission or attribution. Feel free to avoid this technology completely if you want.

On the other hand, these models are very fun to goof around with! After you generate a few hundred images, however, you'll notice a bland sameness to the outputs and a casual disregard of physics that shows the limits of current tech.

The two best uses I've found are generating random images to help brainstorm and generating filler content that nobody is going to scrutinize. For anything else, you'll want a real artist to at least refine the output and tell you all the things you're not seeing.

Friday, January 19, 2024

Coroutines on the NES

When you're writing event-driven code, you often end up implementing a lot of state machines that run once per event. For example, in a game, you have a bunch of state machines that run once per frame to draw graphics, play music, check for input, etc. I always find these tedious and error prone to write. For me, it's much easier to write them as coroutines for simple cooperative multitasking.

The Concept

The basic idea is writing functions that can be suspended and resumed in the middle. Then, you write the state machine as normal branching and looping code. For example, to play a sound effect with an ADSR envelope, the state machine approach would look something like:

nextState = sound->state;
switch(sound->state) {
  case STATE_ATTACK:
    sound->vol += sound->attackRate;
    if (sound->vol >= maxVolume) nextState = STATE_DECAY;
    break;
  case STATE_DECAY:
    sound->vol -= sound->decayRate;
    if (sound->vol == sound->sustainLevel) {
        sound->sustainTimer = sound->sustainTime;
        nextState = STATE_SUSTAIN;
    }
    break;
  case STATE_SUSTAIN:
    if (--sound->sustainTimer == 0) {
        nextState = STATE_RELEASE;
    }
    break;
  case STATE_RELEASE:
    sound->vol -= sound->releaseRate;
    if (sound->vol <= 0) {
        sound->vol = 0;
        nextState = STATE_OFF;
    }
    break;
  default:
    sound->vol = 0;
    break;
}
updateSound(sound);
sound->state = nextState;

Or you could write it as a coroutine:

do {
    sound->vol += sound->attackRate;
    updateSound(sound);
    YIELD();
} while(sound->vol < maxVolume);
do {
    sound->vol -= sound->decayRate;
    updateSound(sound);
    YIELD();
} while(sound->vol > sound->sustainLevel);
sound->sustainTimer = sound->sustainTime;
do {
    sound->sustainTimer--;
    updateSound(sound);
    YIELD();
} while(sound->sustainTimer);
do {
    sound->vol -= sound->releaseRate;
    updateSound(sound);
    YIELD();
} while(sound->vol > 0);
sound->vol = 0;
updateSound(sound);
YIELD();

The sound handler is run once per frame. In the coroutine version, the function runs from one YIELD() to the next on each frame. I think this is a little easier to work with than the explicit state machine version. The state machine is implicit in the control flow instead of being explicitly written out with state transitions.

So how do you implement this on a 6502-like CPU? You need to write the YIELD() function so that it saves registers to the stack, picks another coroutine to run, switches to that routine's stack, then restores the registers and resumes.

The 6502 has a very limited stack pointer, though. Its high byte is fixed at 0x01, so the stack can only hold 256 bytes. This could be partitioned between the coroutines, but the approach I took was to save and restore the entire stack on each YIELD(). Since the YIELD() normally happens when there's not much on the stack, this isn't too expensive.

The other thing that makes this cheap is that you don't need to automatically save processor registers. Since YIELD() is implemented as a function call, there's no expectation that registers will be preserved. So the calling code can save them if it wants, but it isn't done automatically. And conveniently, calling YIELD() saves the PC of the calling site to the stack, so the task switching logic can just use a normal rts instruction to resume.

The Code

Here's my implementation in two files. First, tasks.inc:

; tasks.inc - include this in your asm file.
task_count .set 0

.macro addMainTask name, stack
        lda #<stack
        sta name+1
        lda #>stack
        sta name+2
        lda #<name
        sta TASKS
        lda #>name
        sta TASKS+1
task_count .set task_count + 1
.endmacro

.macro addTask name, stack, entry
        lda #2
        sta name
        lda #<stack
        sta name+1
        lda #>stack
        sta name+2
        lda #<(entry-1)
        sta stack
        lda #>(entry-1)
        sta stack+1
        lda #<name
        sta TASKS+(task_count*2)
        lda #>name
        sta TASKS+(task_count*2)+1
task_count .set task_count + 1
.endmacro

.macro initTaskList main_task
        lda #0
        sta CURRENT_TASK
        lda #task_count
        sta NUM_TASKS
        lda #<main_task
        sta TASK
        lda #>main_task
        sta TASK+1
.endmacro

.import TASKS
.importzp TASK
.importzp CURRENT_TASK
.importzp NUM_TASKS
.import yield, switchTask, sleep

And here's the actual implementation in tasks.asm:

; tasks.asm - coroutine yield implementation file
; Define MAX_TASKS in your build system.
; Requires a 2-byte zero-page variable called PTR defined elsewhere

.segment "ZEROPAGE"
TASK: .res 2  ; points to current task
CURRENT_TASK: .res 1 ; index of running task
NUM_TASKS: .res 1 ; number of tasks

.segment "DATA"
TASKS: .res 2*MAX_TASKS ; one 2-byte pointer per task

.exportzp TASK, CURRENT_TASK, NUM_TASKS
.importzp PTR
.export yield, switchTask, TASKS, sleep

.segment "CODE"
; Call this subroutine to suspend current task and switch to the next
.proc yield
        ; task struct address is in zero-page reg TASK
        ldy #1
        lda (TASK), y
        sta PTR
        iny
        lda (TASK), y
        sta PTR+1

        ldy #0

        ; pop from the stack and store in stack array
        tsx
        inx
        beq copyDone
copyLoop:
        pla
        sta (PTR), y
        iny
        inx
        bne copyLoop
copyDone:
        tya
        ; store the size of the stack in the task state
        ldy #0
        sta (TASK), y
nextTask:
        ldy CURRENT_TASK
        iny
        cpy NUM_TASKS
        bcc :+
        ldy #0
:       sty CURRENT_TASK
        tya
        asl
        tay
        lda TASKS, y
        sta TASK
        lda TASKS+1, y
        sta TASK+1
        jmp switchTask
.endproc

; helper function, no need to call it directly
.proc switchTask
        ; stack is supposed to be empty here
        ; task struct address is in zero-page reg TASK
        ldy #1
        lda (TASK), y
        sta PTR
        iny
        lda (TASK), y
        sta PTR+1
        ldy #0
        lda (TASK), y  ; stackBytes
        tay
        dey
copyLoop:
        lda (PTR), y
        pha
        dey
        bpl copyLoop
copyDone:
        rts
.endproc

.proc sleep
; call this subroutine to sleep for number of frames in A
loop:
        pha
        jsr yield
        pla
        clc
        adc #$ff
        bpl loop
        rts
.endproc

How to use it

To use it, you need to declare all your tasks, then call yield. For example:


.segment "DATA"
; task state structure -> 1-byte stack size followed by 2-byte pointer to stack array
main_task: .res 3
music_task: .res 3
.align 256
main_task_stack: .res 128   ; this can be way smaller
music_task_stack: .res 128  ; this can also be way smaller

        ; ...
        ; in the initialization of the main task
        addMainTask main_task, main_task_stack
        addTask music_task, music_task_stack, music_main
        initTaskList main_task
        ; Run all tasks to their first yield
        jsr yield
        
        ; ... do whatever here
        
mainLoop:
        ; ... do your beginning-of-frame stuff here ...
        
        ; run all tasks until the next yield
        jsr yield
        
        ; ... do your end-of-frame stuff here ...
        
        clc
        bcc mainLoop

Note that this doesn't handle returning from a coroutine. All your coroutines should be infinite loops. The main task can add or remove coroutines from the task list TASKS dynamically.

Conclusion

This is certainly not production quality, and my 6502 coding skills are probably not great, but it shows that implementing coroutines on a 6502-like CPU is not that complicated. If you keep the stacks small when you jsr yield so only the PC and a few state variables are saved, you could use this for a bunch of routines. It wouldn't be too difficult to launch coroutines for each enemy, sound effect, or any other thing that needs a frame-triggered state machine. Even better, this technique could potentially use fewer cycles than the explicit state machine style because it maps better onto the underlying hardware.

More Testing of Mismatched RAM Modules

Here's a simple program to test whether dual-channel mode is being used for RAM.

Using this, I found that read speeds on my system increase even when I mismatch DIMM capacities in the A and B memory channels, but write speeds do not.

The code

// test.c
// compile with gcc -o test -O0 test.c
// then run with ./test
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

#define GB 8.0
#define SZ ((size_t) (GB*1073741824UL))

static double getTime(void)
{
  struct timeval t;
  gettimeofday(&t, NULL);
  return t.tv_sec + (double) t.tv_usec/1e6;
}

int main(int argc, char *argv[])
{
  uint8_t *p;
  double t1, t2;

  p = malloc(SZ);
  if (p == NULL) {
    fprintf(stderr, "malloc of %.1f GB failed\n", GB);
    return 1;
  }
  memset(p, 'a', SZ); // must write non-zero data first so the OS will actually map the pages to RAM

  // test write
  t1 = getTime();
  memset(p, 0, SZ);
  t2 = getTime();
  printf("Wrote %.1f GB in %f seconds -> %3.2f GB/s\n", GB, t2-t1, ((double)GB)/(t2-t1));

  // test read
  t1 = getTime();
  void *n = memchr(p, 'a', SZ);
  t2 = getTime();
  printf("Read %.1f GB in %f seconds -> %3.2f GB/s\n", GB, t2-t1, ((double)GB)/(t2-t1));
  printf("%p\n", n); // must use n for something to avoid memchr() being optimized out

  return 0;
}

The memset() and memchr() standard library functions are presumed to be heavily optimized for writing and reading from RAM, respectively. And in fact, memset() runs just a little slower than the theoretical limit, while memchr() is faster than rep lodsq, so I assume whoever wrote it is better at optimizing than I am.

With 16GB + 8GB of DDR4-3200 installed in memory channels A and B on my system, this prints:

Wrote 8.0 GB in 0.337975 seconds -> 23.67 GB/s
Read 8.0 GB in 0.249023 seconds -> 32.13 GB/s

The maximum possible bandwidth per channel is 3200e6 transfers/s times 8 bytes, or about 23.8 GB/s in the binary-GB units the program reports. It turns out that on a Ryzen 5700 processor, writes are bottlenecked somewhere between the CPU and the RAM controller, so we don't get the benefit of dual-channel writes. We'd expect that the reads would be up to twice as fast, but they're not. In fact, if I run the test for a 1-GB block instead of 8, reads and writes are the same speed. Presumably I'm getting memory that's mapped to a single DIMM in that scenario. I only get higher speeds for larger allocations that are more likely to use both DIMMs. At 16 GB, reads reach almost 33 GB/s.
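
That per-channel figure is just the transfer rate times the 8-byte bus width, converted into the same binary-GB units the test program prints:

```shell
# DDR4-3200: 3200e6 transfers/s * 8 bytes per transfer, in GiB/s (2^30 bytes).
awk 'BEGIN { printf "%.2f GB/s\n", 3200e6 * 8 / 1073741824 }'
```

Measured numbers above that ceiling imply both channels are being used.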

So in conclusion, mismatched DIMM capacities in dual-channel mode on a Ryzen 5700 give some benefit, but read speeds don't reliably double.

By the way, benchmarks of this sort keep getting harder to write. The OS does not actually map pages to physical RAM until you write to them. If you try to use calloc() to initialize to zero, the OS maps all the pages to a read-only zero-filled page and still only maps them to physical RAM when you write. If you use malloc() followed by bzero() instead of calloc(), gcc's optimizer can replace that with a call to calloc()! And calls to memchr() are optimized out even at -O0 if you don't use the result somewhere. On the other hand, hand-written assembly is not reliably fast on today's CPUs. So be careful doing these kinds of tests with a modern compiler.

Thursday, January 18, 2024

Testing RAM Performance with Mismatched Modules

I upgraded the RAM in one of my computers and ended up with some extra 8 GB DDR4 sticks. My computers are mostly mini-ITX systems with only two RAM slots, so I normally buy matched pairs of RAM sticks, but for various reasons my main workstation had only a single slot filled with a 16 GB module. So I thought I'd see whether performance increased if I added an 8 GB module for 24 GB of RAM total.

A little research suggested that "dual-channel" mode would be a bit faster with matched sizes, but there's not widespread agreement on whether this would also work with mismatched sizes. For example, Socket 939 processors supported a "ganged" mode (see page 16) that combined two 64-bit channels into a 128-bit channel if DIMM sizes were matched across the two channels. Some people online spoke of a "flex mode" for partial dual-channel operation with mismatched modules, but others claimed this was an Intel-only technology.

It looks like AMD's Socket 939 had "A" and "B" pins for memory control signals, but the above document says they are identical signals, with an "A" and "B" pin provided for each to reduce loading when 4 DIMMs are connected. With the AMD 10h family, they changed this to be two independent memory channels but still supported "ganged" mode as a BIOS option for years.

So the question is, if the two DIMMs are accessed using independent memory controller channels, are the addresses still interleaved between DIMMs? If yes, is it still true if sizes are mismatched? I couldn't find any documentation of the Zen3 RAM controller that might help answer the question for current CPUs, so I just had to test it and see for myself.

The conclusion? Yes, mixing different-sized modules on different channels does increase RAM bandwidth for sequential writes, but not as much as using matched modules.

Test Details

I didn't know the standard way to test memory speeds on Linux, so I used a suggestion from here: https://serverfault.com/questions/372020/what-are-the-best-possible-ways-to-benchmark-ram-no-ecc-under-linux-arm

The suggestion amounted to running these commands:

$ sudo mkdir -p /mnt/test1
$ sudo mount -t tmpfs tmpfs /mnt/test1
$ sudo dd if=/dev/zero of=/mnt/test1/test bs=1M

This just creates a tmpfs RAM disk (half the size of your RAM by default) and fills it with zeros until dd runs out of space, at which point dd reports the average data rate. I tried it with three configurations:

  • A single 8-GB DDR4-3200 module: 5.7 GB/s
  • An 8-GB DDR4-3200 module plus a 16-GB DDR4-3200 module: 5.9 GB/s (+3.5%)
  • Two 8-GB DDR4-3200 modules: 6.2 GB/s (+8.7%)

What's going on here? Well, naively, you'd expect that filling RAM with zeroes would get close to the maximum speed of 3200e6 * 8 = 23.8 GB/s. But it's a little more complicated, since we're actually reading from /dev/zero, then copying the resulting buffer. The buffer is small enough to fit into my Zen3 CPU's L3 cache, so most of this is not limited by the RAM speed. In particular, the RAM controller doesn't have to switch between reading and writing, which could slow things down a lot. But for each byte, we're writing twice and reading once. On top of that, we have system call overhead and filesystem overhead from using tmpfs. With a single channel, this crude benchmark is getting about 25% of the maximum speed.

According to the unofficial AM4 pinout, each RAM channel has dedicated data, address, and control signals, so they should be able to operate independently. This would potentially double the maximum bandwidth to 47.6 GB/s if writes were interleaved between the two channels. My understanding is that Zen3's CPU cores can read and write 32 bytes (256 bits) from L3 on each ~4000-MHz clock cycle, so the CPU should have no problem saturating two 64-bit RAM channels at 3200 MT/s.

If we interpret the single-channel result as meaning that 25% of the time is spent writing to RAM at the maximum transfer rate, then doubling the RAM bandwidth would give a 14% speedup (1/(1 - 0.25/2) = 1.14). Instead, we get an 8.7% speedup.

The 24-GB case with mixed sizes is more interesting. Here, we get a 3.5% speedup. So presumably the memory controller is able to use both channels some of the time.

What's not clear is how the memory interleaving works. Is the memory controller able to interleave addresses with 8-byte granularity in both dual-channel cases? The 3.5% speed improvement versus 8.7% for matched sizes suggests that less than half of accesses are interleaved, even though we'd expect 2/3 of addresses to be interleavable.

I'd like to redo the testing with a much simpler rep movsb loop that has very little overhead. Unfortunately, after doing all this testing in my cramped mini-ITX case, the 16 GB RAM stick quit working, so I can't repeat it with a better benchmark!

Monday, January 8, 2024

Using sim65 to Test NES Code

I needed a period-63 LFSR implementation in 6502 assembly for some NES code I was writing. Normally, I like to unit test code, but it wasn't obvious how to do that with an NES. Thankfully, the cc65 project already has a great simulator with semi-hosted C stdio support that you can use for this sort of thing. This post explains how I tested my LFSR implementation.

I started with the LFSR code itself, in lfsr.asm:

.export _lfsr_state
.export _lfsr_update

.segment "DATA"
_lfsr_state: .res 1

.segment "CODE"

.proc _lfsr_update
        ; period 63 lfsr with polynomial x^6 + x^5 + 1, computed in top 6 bits
        ; Bottom 2 bits must always be 0.
        lda _lfsr_state
        asl
        bpl :+
        eor #4
:       bcc :+
        eor #4
:       sta _lfsr_state
        rts
.endproc

I'm still new to 6502 coding, so I'm sure this is suboptimal, but it seemed like a good start. The whole point of the exercise is to make it easier to optimize small routines, after all.

Then I made main.c:

#include <stdio.h>
#include <stdint.h>

extern void lfsr_update(void);
extern uint8_t lfsr_state;

int main(void)
{
  uint8_t i = 0;
  uint8_t first;
  
  lfsr_state = 0x4;
  first = lfsr_state;
  do {
    printf("%d: %02X\n", i++, lfsr_state);
    lfsr_update();
  } while (first != lfsr_state);
  return 0;
}

This initializes the LFSR (in the top 6 bits of lfsr_state) to a non-zero value, then updates it until it sees the same value again, printing the intermediate values.

Assuming you installed cc65 already, you can compile this with the commands:

$ cl65 --target sim6502 -o test main.c lfsr.asm

Then you can simulate it with the command:

$ sim65 test

This should give the output:

0: 04
1: 08
2: 10
3: 20
4: 40
5: 84
6: 0C
7: 18
8: 30
9: 60
10: C4
11: 88
12: 14
13: 28
14: 50
15: A4
16: 4C
17: 9C
18: 3C
19: 78
20: F4
21: E8
22: D0
23: A0
24: 44
25: 8C
26: 1C
27: 38
28: 70
29: E4
30: C8
31: 90
32: 24
33: 48
34: 94
35: 2C
36: 58
37: B4
38: 6C
39: DC
40: B8
41: 74
42: EC
43: D8
44: B0
45: 64
46: CC
47: 98
48: 34
49: 68
50: D4
51: A8
52: 54
53: AC
54: 5C
55: BC
56: 7C
57: FC
58: F8
59: F0
60: E0
61: C0
62: 80

This shows that the LFSR cycles after 63 steps, as intended. It also always leaves bits 0-1 clear.

The major limitation of this approach is that if your routines don't use cc65's normal calling convention, you may need to write a wrapper function before you can call them from C.

The sim65 tool can also report cycle counts to help with optimization, but I don't think it can produce any sort of instruction trace, which would make it more useful for debugging.