I'm guessing whoever designed the system wasn't sure whether the two parameters would ever need to differ, and designed it so that they could. It turned out that they didn't need to, but either it was more work than it was worth to change (considering that simply passing the same parameter twice is trivial), or they wanted to leave the flexibility in the system in case it's needed in the future.
I've definitely had APIs like this in a few places in my code before.
E.g. in the example in the link above for deferred rendering (figure 4), the multiple G-buffers won't actually need to leave the on-chip tile buffer - unless there's a partial render before the final shading shader is run.
For a partial render all samples must be written out, but for the final one you can resolve (average) them before writing them out.
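To make that concrete, here's a rough sketch in Vulkan terms (not Metal, which is what the article uses) of how a G-buffer attachment can be kept tile-only; the names and format choices are illustrative, not taken from the article:

    #include <vulkan/vulkan.h>

    /* Sketch: a G-buffer attachment written in the geometry subpass and read as an
     * input attachment in the shading subpass can be marked transient, so on a tiler
     * it ideally never leaves the on-chip tile buffer (unless a partial render forces
     * a flush). "gbuf_albedo" is an illustrative name. */
    VkImageCreateInfo gbuf_albedo = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .usage = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT
               | VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT       /* read by the shading subpass */
               | VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT,  /* hint: tile memory only */
        /* format, extent, etc. omitted; back it with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT
         * memory where available, and give its VkAttachmentDescription a storeOp of
         * VK_ATTACHMENT_STORE_OP_DONT_CARE so it is never written back to RAM. */
    };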
In OpenGL, the driver would have to scan the following commands to see if it can discard the depth data. If it doesn't see the depth buffer get cleared, it has to be conservative and save the data. I assume mobile GPU drivers in general do make the effort to do this optimization, as the bandwidth savings are significant.
In Vulkan, the application explicitly specifies which attachments (i.e. the stencil, depth, and color buffers) must be persisted at the end of a render pass and which need not be. So that maps nicely to the "final render flush program".
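As a concrete illustration (a minimal Vulkan sketch, assuming a depth buffer the application knows it won't need after the pass; only the load/store-relevant fields are shown):

    #include <vulkan/vulkan.h>

    /* The application declares up front that the depth data can be thrown away at the
     * end of the render pass, so a tiler never has to write it back to memory. */
    VkAttachmentDescription depth_att = {
        .format         = VK_FORMAT_D32_SFLOAT,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,       /* nothing read from RAM  */
        .storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE,  /* nothing written to RAM */
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
        .finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
    };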
The quote is about Metal, though, which I'm not familiar with, but a sibling comment points out it's similar to Vulkan in this aspect.
So that leaves me wondering: did Rosenzweig happen to only try Metal apps that always use MTLStoreAction.store in passes that overflow the TVB, or is the Metal driver skipping a useful optimization, or neither? E.g. because the hardware has another control for this?
PowerVR has its roots in a desktop video card with somewhat limited release and impact. It really took off when it was used in the Sega Dreamcast home console and the Sega Naomi arcade board. It was only later that people put them in phones.
In practice, you might think of TMEM as a cache - it's just a cache that you have to manage manually. You can use as much RAM as you like for textures.
TMEM, like the RSP's DMEM and IMEM, is also not part of main RAM.
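Conceptually it works something like this (a made-up sketch; the helper names below are hypothetical, not the real libultra/RDP API, but the 4 KB TMEM size is real):

    #include <stddef.h>

    #define TMEM_SIZE 4096  /* the RDP's small on-chip texture memory */

    typedef struct {
        void  *rdram_addr;  /* texture data lives in main RAM (RDRAM)...      */
        size_t bytes;       /* ...and only a <= 4 KB working set fits in TMEM */
    } Texture;

    /* hypothetical stand-ins for the real RDP load/draw commands */
    extern void rdp_load_tile(const void *rdram_addr, size_t bytes);
    extern void rdp_draw_quad(void);

    void draw_textured_quad(const Texture *tex)
    {
        if (tex->bytes > TMEM_SIZE)
            return;                                  /* doesn't fit: split it into tiles first */
        rdp_load_tile(tex->rdram_addr, tex->bytes);  /* "fill the cache": copy RDRAM -> TMEM   */
        rdp_draw_quad();                             /* the RDP only ever samples from TMEM    */
    }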
NV seems to rasterize primitives in small batches (i.e. more or less immediately) but buffers the rasterizer output on-die in tiles. There can still be significant overlap between vertex generation and rasterization. Those tiles are flushed to the framebuffer, potentially before they are fully rendered, and potentially multiple times per draw call depending on the vertex ordering. They do some primitive reordering to try to avoid flushing as much, but it's not a full deferred architecture.
It is amazing to me how complicated these systems have become. I am looking over the source for the single triangle demo. Most of this is just about getting information from point A to point B in memory. Over 500 lines worth of GPU protocol overhead... Granted, this is a one-time cost once you get it working, but it's still a lot to think about and manage over time.
I've written software rasterizers that fit neatly within 200 lines and provide very flexible pixel shading techniques. Certainly not capable of running a Cyberpunk 2077 scene, but interactive framerates otherwise. In the good case, I can go from a dead stop to a final frame buffer in <5 milliseconds. Can you even get the GPU to wake up in that amount of time?
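Roughly, the core of such a rasterizer looks like this (a minimal, unoptimized sketch - not the actual code referred to above - using the standard edge-function/barycentric approach with a caller-supplied shading callback):

    #include <stdint.h>

    typedef struct { float x, y; } Vec2;

    /* Signed area of the parallelogram spanned by (b - a) and (p - a); its sign says
     * which side of the edge a->b the point p lies on. */
    static float edge(Vec2 a, Vec2 b, Vec2 p) {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    /* Fills one triangle into a w*h 32-bit framebuffer. No bounding box, no SIMD;
     * "shade" receives the barycentric weights, which is where the flexible
     * per-pixel shading hook lives. */
    void draw_triangle(uint32_t *fb, int w, int h, Vec2 v0, Vec2 v1, Vec2 v2,
                       uint32_t (*shade)(float b0, float b1, float b2)) {
        float area = edge(v0, v1, v2);
        if (area <= 0.0f) return;                   /* back-facing or degenerate */
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                Vec2 p = { x + 0.5f, y + 0.5f };    /* sample at the pixel centre */
                float w0 = edge(v1, v2, p);
                float w1 = edge(v2, v0, p);
                float w2 = edge(v0, v1, p);
                if (w0 >= 0.0f && w1 >= 0.0f && w2 >= 0.0f)
                    fb[y * w + x] = shade(w0 / area, w1 / area, w2 / area);
            }
        }
    }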
Considering that there are 240 Hz monitors nowadays - which means an entire frame must be rendered in about 1000/240 ≈ 4.2 ms - it has to be possible.
That makes me wonder whether the other GPUs with position-only shading - Intel and Adreno - do the same.
As for PowerVR, I've never seen them described as doing position-only shading - I think they've always done full vertex processing upfront.
edit: slides are at
I suppose the difference is whether the render target lives in the "SM" and is explicitly loaded and flushed (by a shader, no less!) or whether it lives in a separate hardware block that acts as a cache.
The big difference is the end of the pipe, as mentioned; whether you have ROPs or whether your shader cores load/store from a framebuffer segment. Basically, whether or not framebuffer clears are expensive (assuming no fast-clear cheats), or free.
There probably are tools these days for debugging shaders - potentially commercial packages if Nsight doesn't have it - but yeah, that sort of thing isn't easy.
Patent lawyers love this one silly trick.
They have since scrubbed the internet of all such claims and to this day pay for an architecture license. I think it's similar to an ARM architecture license - a license for any derived technology and patents, rather than actually being given the RTL for PowerVR-designed cores.
I worked at PowerVR during that time (I have Opinions, but will try to keep them to myself), and my understanding was that Apple hadn't actually taken new PowerVR RTL for a number of years and had significant internal redesigns of large units (e.g. the shader ISA was rather different from the PowerVR designs of the time), but presumably they still use enough of the derived tech and ideas that paying the architecture license is necessary. This transfer was only one way - we never saw anything internal about Apple's designs, so reverse engineering efforts like this are still interesting.
And as someone who worked on the PowerVR cores (not the Apple derivatives), I can assure you that everything discussed in the original post is extremely familiar.
Let's just say that the legal shenanigans of the time caused me to lose my job (part of the sale of Imagination Technologies required closing some countries' offices to avoid more interference from various regulatory bodies). Judge my bias accordingly.
And all their noise about a "ground-up redesign using no PowerVR tech" kinda conflicts with them still paying for an architecture license to this day - the very thing they claimed they would be dropping in the press release that caused the Imagination Technologies share crash and corresponding sale. And this is without even going to court - they issued a press release, then immediately relented (and have continued to relent for over 5 years now) at the slightest question. And then they scrubbed all mention of that press release.
My general suspicion is that Apple intended to game the market by intentionally dropping the share price and simply purchasing PowerVR at a discount - but in the process they pissed off enough people that the offer was rejected, even if it was "better" in terms of value. Or they'd just let Imagination go under and pick everything they wanted off the resulting fire sale. I heard rumors that Apple had already put in an offer to purchase the company that was rejected, and under UK regulation a failed takeover attempt can't be re-attempted for some time - and much of this happened within that window (again, according to fuzzy scuttlebutt, nothing definite).
That, or the legal/C-suite side of Apple doesn't actually speak to the engineers anymore - they honestly thought it was a completely ground-up design that didn't derive anything from PowerVR tech, and just sent out the press release thinking "why are we paying for this??" - then the engineers shuffled in saying that actually they couldn't put together anything better that wasn't a direct derivative, and their noise about a completely internally designed-from-scratch Apple GPU was a bit of a stretch.
Also, if you read Hector Martin's tweets (he's doing the reverse-engineering), Apple replacing the actual logic while maintaining the "API" of sorts is not unheard of. It's what they do with ARM itself - using their own core designs instead of the stock Cortex ones while maintaining ARM compatibility.*
*Thus, Apple has a right to the name "Apple Silicon" because the chip is designed by Apple and just happens to be ARM-compatible. Chips from almost everyone else use stock ARM designs from ARM themselves. Otherwise, by the same logic, we might as well call AMD's chips an "Intel design" because they're x86.
They did this with ADB: early PowerPC systems contained a controller chip with the same API that had been implemented in software on the 6502 IOP coprocessor in the IIfx/Q900/Q950.
> arm64 is the Apple ISA, it was designed to enable Apple’s microarchitecture plans. There’s a reason Apple’s first 64 bit core (Cyclone) was years ahead of everyone else, and it isn’t just caches
> Arm64 didn’t appear out of nowhere, Apple contracted ARM to design a new ISA for its purposes. When Apple began selling iPhones containing arm64 chips, ARM hadn’t even finished their own core design to license to others.
> ARM designed a standard that serves its clients and gets feedback from them on ISA evolution. In 2010 few cared about a 64-bit ARM core. Samsung & Qualcomm, the biggest mobile vendors, were certainly caught unaware by it when Apple shipped in 2013.
> > Samsung was the fab, but at that point they were already completely out of the design part. They likely found out that it was a 64 bit core from the diagnostics output. SEC and QCOM were aware of arm64 by then, but they hadn’t anticipated it entering the mobile market that soon.
> Apple planned to go super-wide with low clocks, highly OoO, highly speculative. They needed an ISA to enable that, which ARM provided.
> M1 performance is not so because of the ARM ISA, the ARM ISA is so because of Apple core performance plans a decade ago.
> > ARMv8 is not arm64 (AArch64). The advantages over arm (AArch32) are huge. Arm is a nightmare of dependencies, almost every instruction can affect flow control, and must be executed and then dumped if its precondition is not met. Arm64 is made for reordering.
This is such an interesting counterpoint to the occasional “Just ship it” screed (just one yesterday I think?) we see on HN.
I have to say, I find this long-form delivery of tech to be enlightening. That kind of foresight has to mean some level of technical savviness at high decision-making levels. Whereas many of us are stuck at companies with short-sighted, tech-naive leadership who clamor to just ship it so they can start making money and recoup what they're losing on these expensive tech-type developers.
Therefore, there is no basis for saying that AArch64 is a cleaned-up MIPS-like ISA. Only RISC-V is a MIPS-like ISA.
One of the few features of AArch64 that can be said to be similar to MIPS was also its main mistake.
In the initial ARMv8.0 version, the only means provided for implementing atomic operations was a load-exclusive/store-exclusive instruction pair (LDXR/STXR).
This kind of instruction pair was popularized by MIPS II, but it was not invented by MIPS - it was introduced by Jensen et al. (November 1987) for the S-1 AAP multiprocessor.
While this instruction pair allows the implementation of lock-free/wait-free data structures, it can be extremely inefficient for implementing locks in systems with many cores (because forward progress is not guaranteed), so in ARMv8.1 the initial mistake was corrected by adding atomic instructions of the fetch-and-op type, alongside the MIPS-like LL/SC pair.
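A small C illustration of the difference, using the standard GCC/Clang atomic builtin (the code itself is portable; the interesting part is what it compiles to on each architecture revision):

    #include <stdint.h>

    /* On ARMv8.0 (-march=armv8-a) the compiler must emit an LDXR/STXR retry loop:
     * under heavy contention a core can keep losing its exclusive reservation, so
     * forward progress is not guaranteed. With the v8.1 LSE atomics
     * (-march=armv8.1-a) the same builtin becomes a single LDADD instruction,
     * i.e. a true fetch-and-add. */
    uint64_t increment(uint64_t *counter)
    {
        return __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST);
    }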
It (a register hardwired to zero) is a good feature, which can substantially reduce the number of instructions that must be implemented, because many single-operand operations are just special cases of double-operand operations with one null operand.
This is why it was used in many early computers, which had to be simple due to the limitations of their technology, and then again in most RISC CPUs, which were simplified intentionally (and not only in MIPS; among the more successful RISC ISAs, IBM POWER also has it; only 32-bit ARM does not, due to its unusually low number of general-purpose registers in comparison with the other RISC ISAs).
My StrongARM-powered RiscPC was amazing for the time. It was strange that the contemporaneous Newton was powered by the same (and in some ways better) processor.
The connection between ARM processors being used in desktop and mobile devices is in ARM's early DNA.
IMO the Kryo/820 wasn't a major failure; it turned out a lot better than the 810, which had A53/A57 cores.
And then they decided they needed a mobile CPU team again and bought Nuvia for ~US$1 Billion.
As for Apple, they've designed their own cores since the Apple A6 which used the Swift core. If you go to the Wikipedia page, you can actually see the names of their core designs, which they improve every year. For the M1 and A14, they use Firestorm High-Performance Cores and Icestorm Efficiency Cores. The A15 uses Avalanche and Blizzard. If you visit AnandTech, they have deep-dives on the technical details of many of Apple's core designs and how they differ from other core designs including stock ARM.
The Apple A5 and earlier were stock ARM cores, the last one they used being Cortex A9.
For this reason, Apple is about as much an ARM chip as AMD is an Intel chip. Technically compatible, implementation almost completely different. It's also why Apple calls it "Apple Silicon" and it is not just marketing, but actually justified just as much as AMD not calling their chips Intel derivatives.
Before that, they had Scorpion and Krait, which were both quite successful 32 bit ARM compatible cores at the time.
Kryo started as an attempt to quickly launch a custom 64 bit ARM core and the attempt failed badly enough that Qualcomm abandoned designing their own cores and turned to licensing semi-custom cores from ARM instead.
> NVIDIA and Samsung, up to this point, have gone the processor license route. They take ARM designed cores (e.g. Cortex A9, Cortex A15, Cortex A7) and integrate them into custom SoCs. In NVIDIA’s case the CPU cores are paired with NVIDIA’s own GPU, while Samsung licenses GPU designs from ARM and Imagination Technologies. Apple previously leveraged its ARM processor license as well. Until last year’s A6 SoC, all Apple SoCs leveraged CPU cores designed by and licensed from ARM.
> With the A6 SoC however, Apple joined the ranks of Qualcomm with leveraging an ARM architecture license. At the heart of the A6 were a pair of Apple designed CPU cores that implemented the ARMv7-A ISA. I came to know these cores by their leaked codename: Swift.
Yes, Apple has been designing and using non-reference cores since the A6 era, and they were one of the first to the table with ARMv8 (Apple engineers claim it was designed for them under contract, to their specifications, but this is difficult to verify with anything more than statements from individual engineers).
I expect that Apple has said as much in their presentations somewhere, but if you're that keen on finding such an incredibly specific attribution, then knock yourself out. It'll be in an Apple conference somewhere, like WWDC. They've probably said "Apple-designed silicon" or "custom core" at some point, and that would be your citation - but they sell products, not hardware specs, and they don't talk extensively about their architectures since those aren't really the product, so you probably won't find an AnandTech-style deep dive from Apple where they say "we have 8-wide decode, a 16-deep pipeline..." and so on.
What amazing work and great writing that takes an absolute graphics layman (me) on a very technical journey yet it is still largely understandable.
> And with that, we get our bunny.
So what was the configuration that needed to change? Don't leave us hanging!!!
(Yes she tells you how to figure it out yourself)
I can't give too many details unfortunately. But, there's a specific step I took in my career, which was completely random at the time. I was still a student, and I decided not to work somewhere. I resigned two weeks in. Had I not done that, I wouldn't be where I am today. My situation would be totally different.
Yes, some people are very talented. But it does take quite a lot of work and dedication. And yes, sometimes you cannot afford to dedicate your time to learning something because life happens.
When I see talent wasted on things like Scientology, I get depressed.
Here’s her CV: https://rosenzweig.io/resume.pdf