softglow's notebook

Dispatches from the Depths of a Super Nintendo

Libco: Sync via Calling Without Endless Recursion

Continuing our discussion of higan from yesterday, let’s take a moment to look deeper into the abyss.

SMP::Threaded

This isn’t pthreads or anything; it’s libco threads. libco appears to be named for coroutines, and it focuses on user-space or “green” threading rather than OS threads. In fact, libco ships with generic implementations for Windows using fibers, and for GCC using a GNU Pth-inspired setjmp/longjmp technique, including a little trickery with sigaltstack to actually set up the alternate allocation as a stack.

AFAICT, everything is set up with Threaded=true, and the emulator would actually break if it were otherwise: everything would keep calling thing.enter() until the emulator eventually ran out of stack space and crashed.

Tracing cpu.add_clocks

From our previous post, we know that add_clocks is called when the CPU needs to perform bus operations, so our adventure today really begins with the implementation of this function, inside sfc/cpu/timing/timing.cpp.

The function starts off by unlocking IRQs, then computing a number of ticks (which is just clocks/2) and calling tick() for each one. tick itself merely updates the horizontal and vertical counters, essentially keeping track of where the raster scanout is happening in the PPU. (It was hard to find; it’s in sfc/ppu/counter/counter-inline.hpp, which CPU inherits, and it’s pulled in via sfc/sfc.hpp and sfc/cpu/cpu.cpp.)

After ticking, the step function is called, which subtracts the elapsed clocks from the counter of each active subsystem (SMP, PPU, and any coprocessors present) before synchronizing the controllers. Next, the joysticks are polled every 256 clocks, DRAM is refreshed when needed (at a cost of 40 more cycles), and when DEBUGGER is defined, all the other chips are fully synchronized again.

clocks, then…

It appears that the clocks are tracking “cycles ahead/behind in emulation”: 0 means perfectly synchronized with the current CPU state, negative means the given part is behind, and of course, positive means ahead. When the CPU runs, it decrements the smp.clock and dsp.clock values, and when the SMP gets a chance to run, it increments its smp.clock value. When that value crosses 0, the SMP is synchronized, and if the synchronization mode calls for it, emulation will sync up the DSP (by the same mechanism: the SMP has been decrementing its clock, so when sync happens, the DSP increments its clock value), then return to the CPU.

Synchronization points

The SMP is actually invoked to re-sync itself with the main CPU whenever the chips need to communicate (i.e. if the CPU wants to read or write the SMP IO ports), on each new scanline (when tick brings the hcounter back to 0), and, in a debug build, every time the CPU is updated. I presume that last one keeps everything in perfect sync from the debugger’s point of view.

The CPU calls to synchronize_smp() in sfc/cpu/mmio/mmio.cpp (for port access) and sfc/cpu/timing/timing.cpp for the other cases.

Meanwhile, the SMP subsystem returns the favor by calling synchronize_cpu from SMP::add_clocks in sfc/smp/timing.cpp—either every sample produced in debug builds, or after 24 samples—and from sfc/smp/memory.cpp on port IO.

Wait, your post’s title was libco, wtf dude?

The synchronization points listed above bring us back to libco. If the CPU calls into the SMP to synchronize it, and the SMP calls back into the CPU afterward, then the call stack is going to get infinitely deep. In this contrived case where I ignore a whole lot of stuff (both the other hardware, and the rest of the call stack involved in actually emulating the SMP/CPU between calls to enter), the stack would get filled up looking like:

inside smp.enter
inside cpu.enter
inside smp.enter
inside cpu.enter
inside smp.enter
inside cpu.enter
…

When it eventually exhausted the stack limit, the process would receive a segfault and, most likely, die unceremoniously after emulating only a finite number of cycles. That’s where libco comes in.

Because Threaded is always true, direct calls aren’t made; they’re sent through co_switch instead, which does some trickery to swap out the “thread” that is running the active processor, and replace it with the thread of the target.

Instead of pushing a frame onto the current stack and starting the target function anew, libco allows for suspending the current thread+stack entirely, (re)loading the other one, and resuming it—exactly as if one side had called and the other returned, except that in C++ both directions of the hand-off appear as ordinary calls.

This gives byuu a lot of freedom to just call synchronize_thing all over the place without worrying, because it’s not consuming a limited amount of call depth to do so. But it makes it a little harder to just yank out the code and call it from a Qt program.