softglow's notebook

Dispatches from the Depths of a Super Nintendo

Libco: Sync via Calling Without Endless Recursion

Continuing our discussion of higan from yesterday, let’s take a moment to look deeper into the abyss.

SMP::Threaded

This isn’t pthreads or anything; it’s libco threads. libco appears to be named for coroutines and focuses on user-space or “green” threading rather than OS threads. In fact, libco ships with generic implementations for Windows using fibers, and for GCC using a GNU Pth inspired setjmp/longjmp technique, including a little trickery with sigaltstack to actually set up the alternate allocation as a stack.

AFAICT, everything is set up with Threaded=true and the emulator would actually break if it were otherwise. Everything would keep calling thing.enter() until eventually the emulator ran out of stack space and crashed.

Tracing cpu.add_clocks

From our previous post, we know that add_clocks is called when the CPU needs to perform bus operations, so our adventure today really begins with the implementation of this function, inside sfc/cpu/timing/timing.cpp.

The function starts off by unlocking IRQs, then computing a number of ticks (which is just clocks/2), and calling tick() for each tick. tick itself merely updates the horizontal and vertical counters, essentially keeping track of where the raster scanout is happening in the PPU. (It was hard to find; it’s in sfc/ppu/counter/counter-inline.hpp which CPU inherits, and it’s loaded in via sfc/sfc.hpp and sfc/cpu/cpu.cpp.)

After ticking, the step function is called, which reduces the clock counter of each active subsystem (SMP, PPU, and any coprocessors present) before synchronizing the controllers. Next, the joysticks are polled every 256 clocks, DRAM is refreshed when needed (at a cost of 40 more cycles), and when DEBUGGER is defined, all the other chips are fully synchronized again.

clocks, then…

It appears that the clocks are tracking “cycles ahead/behind in emulation”: “0” means perfectly synchronized with the current CPU state, negative means the given part is behind, and of course, positive means ahead. When the CPU runs, it reduces the smp.clock and ppu.clock values, and when the SMP gets a chance to run, it increases its smp.clock value. When it crosses 0, the SMP is synchronized, and if the synchronization mode calls for it, emulation will sync up the DSP by the same mechanism (the SMP has been decrementing dsp.clock, so when sync happens, the DSP increments its own clock value), then return to the CPU.

Synchronization points

The SMP is actually invoked to re-sync itself with the main CPU whenever the chips need to communicate (i.e. if the CPU wants to read or write the SMP IO ports), on each new scanline (when tick brings the hcounter back to 0), and on a debug build, every time the CPU is updated. I presume that last one keeps everything in perfect sync from the debugger’s point of view.

The CPU calls to synchronize_smp() in sfc/cpu/mmio/mmio.cpp (for port access) and sfc/cpu/timing/timing.cpp for the other cases.

Meanwhile, the SMP subsystem returns the favor by calling synchronize_cpu from SMP::add_clocks in sfc/smp/timing.cpp—either every sample produced in debug builds, or after 24 samples—and from sfc/smp/memory.cpp on port IO.

Wait, your post’s title was libco, wtf dude?

The synchronization points listed above bring us back to libco. If the CPU calls into the SMP to synchronize it, and it calls into the CPU afterward, then the call stack is going to get infinitely deep. In this contrived case where I ignore a whole lot of stuff (both the other hardware, and the rest of the call stack involved in actually emulating the SMP/CPU between calls to enter), the stack would get filled up looking like:

inside smp.enter
inside cpu.enter
inside smp.enter
inside cpu.enter
inside smp.enter
inside cpu.enter
…

When it eventually exhausted the stack limit, the process would receive a segfault and, most likely, die unceremoniously after emulating a finite number of cycles. That’s where libco comes in.

Because Threaded is always true, direct calls aren’t made; they’re sent through co_switch instead, which does some trickery to swap out the “thread” that is running the active processor, and replace it with the thread of the target.

Instead of pushing a frame onto the current stack and starting the target function anew, libco allows for suspending the current thread+stack entirely, (re)loading the other one, and resuming it. The effect is exactly as if one side had called and the other had returned, except that in the C++ source both the way in and the way back are written as calls.

This gives byuu a lot of freedom to just call synchronize_thing all over the place without worrying, because it’s not consuming a limited amount of call depth to do so. But it makes it a little harder to just yank out the code and call it from a Qt program.

Synchronization in Higan (SMP/DSP)

Background Matter

In case you missed my master plan (e.g., you’re reading my GitHub and were never on metconst), I’m trying to build a music editor for, at the very least, SPC files from Super Metroid.

It seems like the fastest way for such an editor to be able to play the music being edited, is to have a real SPC player built-in. But not one that dumps straight to the platform audio output device—one that buffers into RAM to make play/pause and seeking possible.

To that end, I’m presently trying to rip the SMP/DSP subsystem out of higan (née bsnes) and pull it into a library that runs in a subthread and lets the main thread request blocks of audio from it.

Oh yes, I’m going to be building in Qt. I may, or may not, eventually borrow some source code from qtractor (certainly not the audio output chain, though; AFAIK, if you don’t really want audio to play, then JACK is an excellent backend for you.) Qt appears to be the one cross-platform library that’s monolithic enough to be an easy (if fat) install on Windows, that also provides everything I absolutely need: threads, audio, GUI widgets, and a C++ FFI for binding to higan’s SMP core.

Higan v093

So with that out of the way, let’s take a closer look at higan. I’m specifically using v093 here, which is, as of late September 2013, the latest version published. I hear it even builds on OS X now.

Why higan? Accuracy. There doesn’t seem to be any point in going for anything less, when I won’t have competition for host time from the CPU and PPU. I’ve also heard (but been unable to find any trace of the actual code involved) that the SMW community came up with some music editor(s) once upon a time, but they were based on the ZSNES emulation, and failed hard on the real hardware because ZSNES cheated on echo buffering. Under emulation, the echo buffer never got written back to PSRAM, so it never destroyed the code/instruments the way it does on hardware.

That’s a mistake that’s worth not repeating.

Synchronization

higan does threading (when threaded) by using its built-in public domain libco (coroutine, I presume) library. If you just look at sfc/smp/smp.cpp, you will find some synchronization functions but no callers… right there. And the obvious ones like step() don’t seem to do much.

There are some more functions in sfc/smp/timing/timing.cpp, but that turns out to be a very small file, because it’s just providing some more sync functions to be called. The real synchronization is driven from the memory files, sfc/smp/memory/memory.cpp. Quoting my notes:

step(clocks): adjusts clocks only, no sync
cycle_edge: tick SMP timers
add_clocks: step(); sync DSP; may sync CPU
...
add_clocks called in op_{io|read|write}

It appears that higan’s philosophy is to synchronize the bus state of the system. That is, when a CPU (the S-SMP in the case of audio) makes some sort of move like loading the next instruction from memory, it triggers a synchronization with the rest of the system. However many cycles the DSP should have taken while that one SMP instruction was running becomes accounted for, and the DSP state is “caught up” so that its current bus state will be visible to the emulation of the SMP when the thread returns.

That’s pretty much how the main CPU and the SMP are sync’ed, as well: whenever either side accesses an IO port connecting them, at end-of-scanline, and any time they get more than 24 samples of audio output out-of-sync. Or, at every bus transaction, if the debugger is built (or maybe enabled? I forget which.)

I didn’t look, but I’d guess that’s how the CPU and PPU are wired, as well.

struct Everything

Apparently in the C++ world, a struct is a class with default-public instead of default-private. byuu uses a lot of struct and very few, if any, class keywords.

Why would you use C-compatible keywords to build non-C features? I’m baffled. It’s just how Bjarne rolls, I guess.

The Mystery of Privilege

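There’s apparently a “privileged” access modifier. How this may differ from private or protected, I’m not really certain, because so far I’ve been unable to find a description of it online. It’s just too generic a word: every discussion of public vs. private tends to use it to talk about access control. Since standard C++ only defines public, protected, and private, my best guess is that it’s a preprocessor macro defined somewhere in higan’s own headers.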

Unfortunately, there’s a good chance I may need to understand it before finishing this library.