Continuing our discussion of higan from yesterday, let’s take a moment to look deeper into the abyss.
SMP::Threaded
This isn’t pthreads or anything; it’s libco threads. libco appears to be named for coroutines, and it focuses on user-space or “green” threading rather than OS threads. In fact, libco ships with generic implementations for Windows using fibers, and for GCC using a GNU Pth-inspired setjmp/longjmp technique, including a little trickery with sigaltstack to actually set up the alternate allocation as a stack.
AFAICT, everything is set up with Threaded=true, and the emulator would actually break if it were otherwise: everything would keep calling thing.enter() until eventually the emulator ran out of stack space and crashed.
Tracing cpu.add_clocks
From our previous post, we know that add_clocks is called when the CPU needs to perform bus operations, so our adventure today really begins with the implementation of this function, inside sfc/cpu/timing/timing.cpp.
The function starts off by unlocking IRQs, then computing a number of ticks (which is just clocks/2), and calling tick() for each tick. tick itself merely updates the horizontal and vertical counters, essentially keeping track of where the raster scanout is happening in the PPU. (It was hard to find; it’s in sfc/ppu/counter/counter-inline.hpp, which CPU inherits, and it’s loaded in via sfc/sfc.hpp and sfc/cpu/cpu.cpp.)
After ticking, the step function is called, which reduces the clock cycle of each active subsystem (the SMP, the PPU, and any coprocessors present) before synchronizing the controllers. Next, the joysticks are polled every 256 clocks, DRAM is refreshed when needed (at a cost of 40 more cycles), and when DEBUGGER is defined, all the other chips are fully synchronized again.
clocks, then…
It appears that the clocks are tracking “cycles ahead/behind in emulation”: where 0 means perfectly synchronized with the current CPU state, negative means the given part is behind, and, of course, positive means ahead. When the CPU runs, it reduces the smp.clock and dsp.clock values, and when the SMP gets a chance to run, it increases its smp.clock value. When it crosses 0, the SMP is synchronized, and if the synchronization mode calls for it, emulation will sync up the DSP (by the same mechanism: the SMP has been decrementing clocks, so when sync happens, the DSP increments its clock value), then return to the CPU.
Synchronization points
The SMP is actually invoked to re-sync itself with the main CPU whenever the chips need to communicate (i.e. if the CPU wants to read or write the SMP IO ports), on each new scanline (when tick brings the hcounter back to 0), and on a debug build, every time the CPU is updated. I presume that last one keeps everything in perfect sync from the debugger’s point of view.
The CPU calls synchronize_smp() in sfc/cpu/mmio/mmio.cpp (for port access) and in sfc/cpu/timing/timing.cpp for the other cases.
Meanwhile, the SMP subsystem returns the favor by calling synchronize_cpu from SMP::add_clocks in sfc/smp/timing.cpp (either every sample produced in debug builds, or after 24 samples) and from sfc/smp/memory.cpp on port IO.
Wait, your post’s title was libco, wtf dude?
The synchronization points listed above bring us back to libco. If the CPU calls into the SMP to synchronize it, and the SMP then calls back into the CPU, the call stack is going to get infinitely deep. In this contrived case, where I ignore a whole lot of stuff (both the other hardware, and the rest of the call stack involved in actually emulating the SMP/CPU between calls to enter), the stack would fill up looking like:
inside smp.enter
inside cpu.enter
inside smp.enter
inside cpu.enter
inside smp.enter
inside cpu.enter
…
When it eventually exhausted the stack limit, the process would receive a segfault and, most likely, unceremoniously die after emulating a finite number of cycles. That’s where libco comes in.
Because Threaded is always true, direct calls aren’t made; they’re sent through co_switch instead, which does some trickery to swap out the “thread” that is running the active processor, and replace it with the thread of the target.
Instead of pushing a frame onto the current stack and starting the target function anew, libco allows for suspending the current thread+stack entirely, (re)loading the other one, and resuming it, exactly as if the first had called and the second had returned, except that in the C++ source both directions look like ordinary calls (a call and a call, rather than a call and a return).
This gives byuu a lot of freedom to just call synchronize_thing all over the place without worrying, because doing so isn’t consuming limited call-stack depth. But it makes it a little harder to just yank out the code and call it from a Qt program.