Gast
2003-09-20, 14:22:37
;)
In my view a very good "article" by DaveH on Beyond3D. He explains (speculatively) very nicely why Nvidia made this or that design decision:
http://www.beyond3d.com/forum/viewtopic.php?t=7974&sid=163ce3977ae789f6f9d242c3fe5bd47c
WaltC wrote:
The DX9 feature set layout began just a bit over three years ago, actually. It began immediately after the first DX8 release. I even recall statements by 3dfx prior to its going out of business about DX9 and M$.
Exactly. DX9 discussions surely began around three years ago, and the final decision to keep the pixel shaders to FP24 with an FP16 option was likely made at least two years ago, thus about 15 months ahead of the release of the API. I just don't think you have an understanding of the timeframes involved in designing, simulating, validating and manufacturing a GPU. Were it not for its process problems with TSMC, NV30 would have been released around a year ago. Serious work on it, then, would have begun around three years prior.
As I'm sure you know, ATI and Nvidia keep two major teams working in parallel on different GPU architectures (and assorted respins); that way they can manage to more-or-less stick to an 18 month release schedule when a part takes over three years from conception to shipping. This would indicate that serious design work on NV3x began around the time GeForce1 shipped, in Q3 1999. (Actually, high-level design of NV3x likely began as soon as high-level design of the GF2 was finished, probably earlier in 1999.) A more-or-less full team would have been assigned to the project from the time GF2 shipped, in Q1 2000. Which is around the point when it would have been too late for a major redesign of the fragment pipeline without potentially missing the entire product generation altogether.
NV40 will be the first Nvidia product to have any hope of being designed after the broad outlines of the DX9 spec were known. Of course at that time Nvidia may have thought that their strategy of circumventing those DX9 specs through the use of runtime-compiled Cg would be successful, in which case NV40 might not reflect the spec well either.
Quote:
Secondly, your presumption would also mean that merely by sheer chance ATi hit everything on the head by *accident* instead of design, for DX9. Extremely unlikely--just as unlikely as nVidia getting it all so wrong by accident.
Of course it's not by accident. When choosing the specs for the next version of DX, MS consults a great deal both with the IHVs and the software developers, and constructs a spec based around what the IHVs will have ready for the timeframe in question, what the developers most want, and what MS thinks will best advance the state of 3d Windows apps.
Both MS and the ARB agreed on a spec that is much closer to what ATI had planned for the R3x0 than what Nvidia had planned for NV3x. I don't think that's a coincidence. For one thing, the R3x0 pipeline offers a much more reasonable "lowest common denominator" compromise between the two architectures than something based more on NV3x would. For another, there are plenty of good reasons why mixing precisions in the fragment pipeline is not a great idea; sireric (IIRC) had an excellent post on the subject some months ago, and I wouldn't be surprised if the arguments he gave were exactly the ones that carried the day with MS and the ARB.
Third, FP24 is a better fit than FP32 for realtime performance with current process nodes and memory performance. IIRC, an FP32 multiplier will tend to require ~2.5x as many transistors as an FP24 multiplier designed using the same algorithm. Of course the other silicon costs for supporting FP32 over FP24 tend to be more in line with the 1.33x greater width: larger registers and caches, wider buses, etc. Still, the point is that while it was an impressive feat of engineering for ATI to manage a .15u core with enough calculation resources to reach a very nice balance with the available memory technology of the day (i.e. 8 vec4 ALUs to match a 256-bit bus to similarly clocked DDR), on a .13u transistor budget FP24 would seem the sweet spot for a good calculation/bandwidth ratio. Meanwhile the extra transistors required for FP32 ALUs are presumably the primary reason NV3x parts tend to feature half the pixel pipelines of their R3x0 competitors. (NV34 is a 2x2 in pixel shader situations; AFAICT it's not quite clear what exactly NV31 is doing.) And of course FP16 doesn't have the precision necessary for a great many calculations, texture addressing being a prime example.
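As a rough sanity check on the ~2.5x multiplier and 1.33x storage figures above, here is a minimal Python sketch (mine, not from the original post), assuming an array multiplier whose transistor count scales roughly with the square of the significand width while registers, caches and buses scale linearly with the total format width. The quadratic toy model lands nearer 2x than 2.5x; the exact factor depends on the multiplier algorithm.

    # Rough sanity check of the FP32 vs FP24 cost figures quoted above.
    # Assumption (mine): multiplier transistor count ~ significand width squared;
    # registers, caches and buses ~ total format width.

    def widths(total_bits, exponent_bits):
        """Return (significand width incl. implicit bit, total width) for a 1-sign-bit format."""
        mantissa = total_bits - 1 - exponent_bits
        return mantissa + 1, total_bits

    fp32_sig, fp32_total = widths(32, 8)   # s1 e8 m23 (IEEE-style)
    fp24_sig, fp24_total = widths(24, 7)   # s1 e7 m16 (R3x0-style FP24)

    print("multiplier ratio ~", round((fp32_sig / fp24_sig) ** 2, 2))  # ~2.0x
    print("storage/bus ratio ~", round(fp32_total / fp24_total, 2))    # ~1.33x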
So a good case can be made that the PS 2.0 and ARB_fragment_program specs made the clearly better decision. So what the hell was Nvidia thinking when they designed the CineFX pipeline?
IMO the answer can be found in the name. Carmack made a post on Slashdot a bit over a year ago noting that a certain unnamed GPU vendor planned to aim its next consumer product at taking away the low end of the non-realtime rendering market. Actually, going by what Carmack wrote, "CineFX" was something of a misnomer; he expected most of the early adopters to be in television, where time and budget constraints are tight enough, and the expected output quality low enough, that a consumer-level board capable of rendering a TV-resolution scene with fairly complex shaders at perhaps a frame every five seconds could steal a great deal of marketshare from workstations doing the same thing more slowly.
AFAIK this hasn't yet come to pass in any significant way, but Carmack's post along with much of the thrust of Nvidia's original marketing indicates they really did intend NV3x derivatives to fill this role. Plus it helps explain nearly all of the otherwise idiotic design decisions in the NV3x fragment pipeline. FP32--a bad choice for realtime performance in today's process nodes as discussed above--appears to have been viewed by Nvidia as necessary to play in the non-realtime space. The decision to support shader lengths into the thousands of instructions--bizarre and inexplicable if you think the target of the design is realtime interactive rendering (after all, the damn thing can't even hit 30fps running shaders a couple dozen instructions long)--makes a great deal of sense if the target isn't realtime after all. And then there's this:
Colourless wrote:
I honestly don't think that Nvidia, or anyone else, would have needed the DX9 specs, or any specs at all, to be able to realise that the register count performance issues would be a real problem. IMO, that is the only real flaw in the architecture.
While the register usage limitations are not the only flaw in the NV3x fragment pipeline architecture, they are clearly the most significant. (If NV3x chips, like R3x0, could use all the registers provided for by the PS 2.0 spec without suffering a performance penalty, their comparative deficit in calculation resources would still likely leave them ~15-25% behind comparable ATI cards in PS 2.0 performance. But that is nothing like the 40-65% we're seeing now.) The question is why on earth Nvidia allowed these register limitations to exist in the first place. Clearly the answer is not "sheer incompetence". Then what were they thinking?
One possibility is that it's a bug--or rather, the result of a workaround. Some other functionality in the fragment pipeline wasn't working properly, and so registers that would otherwise be free to store temps are instead used as part of the workaround. This seemed pretty likely to me at first, but the fact that NV35 has the same limitations as NV30 and the rest does seem to indicate that if this is indeed the result of a bug, it is not one that can be fixed with only a trivial reworking of the architecture. It will be interesting to see if the extra time they've had to work on NV38 has allowed Nvidia to come up with a fix; if not, perhaps the problem is too deeply rooted to really describe it as a bug after all.
And even if a bugfix exacerbated the problem, it seems unlikely to be its main cause. I had a discussion a few months ago with someone here (Arjan or Luminescent IIRC) who pointed out that, unlike a CPU pipeline, which only needs to store one state shared amongst all instructions in flight (belonging, as they do, to a single thread), a fragment pipeline needs to store a separate set of state data for each pixel in flight. The CPU equivalent is fine-grained multithreading, in which state for N threads is stored in the processor at once, and each thread takes its turn executing for one cycle, with a full rotation after N cycles.
The benefits of this sort of arrangement are control simplicity and latency hiding. The penalty for a long-latency operation--in the context of GPUs, a texture read, particularly one that misses the cache--is effectively cut by a factor of N. Meanwhile, the performance costs and complexity of managing context switches as on a traditional CPU pipeline are avoided.
The drawback is in the transistor cost dedicated to storing all that state. GPU registers need to be very highly ported, considering a typical operation is an arbitrary vec4 MAC, and thus the transistor cost rises very steeply with the depth of the pipeline. Pretty soon you get into a direct tradeoff between the degree of latency hiding and the number of registers you can have.
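A minimal Python sketch of that tradeoff, using entirely made-up latencies and register counts, just to show how the factor-of-N latency reduction and the per-pixel register storage pull against each other:

    # Toy model (my own numbers) of the latency-hiding vs. register-storage tradeoff.
    # With N pixels interleaved round-robin, a pixel only gets an issue slot every
    # N cycles, so a texture miss of L cycles costs it about ceil(L / N) of its own
    # slots -- but every architecturally visible register must exist once per pixel.
    import math

    def stalled_slots(tex_latency, pixels_in_flight):
        return math.ceil(tex_latency / pixels_in_flight)

    def register_storage_bits(pixels_in_flight, regs_per_pixel, bits_per_reg=128):
        # bits_per_reg = one vec4 of FP32 components
        return pixels_in_flight * regs_per_pixel * bits_per_reg

    for n in (16, 64, 256):  # hypothetical pipeline depths
        print(n, "pixels in flight:",
              stalled_slots(200, n), "stalled slots on a 200-cycle miss,",
              register_storage_bits(n, 12) // 8, "bytes for 12 vec4 temps each")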
Where R3x0 and NV3x come into this is in performance for dependent texture reads. I'd be interested in having more specifics, but as I understand it the R3x0 pipeline is deep enough to do an alright job minimizing the latency from one level of dependent texture reads (particularly if you fetch the texture as much in advance of when you need it as possible), but woe be unto you if you're looking for two or three levels. Meanwhile, NV3x is reputed to do just fine mixing several levels of texture reads into the normal instruction flow--meaning that it must do a gangbusters job at hiding latency. Meaning it has a hell of a deep pipeline.
Meaning every exposed full-speed register is replicated many times, meaning that in order to fit in a given transistor budget, the number of full-speed registers might have to be cut pretty low. As low as only 256 bits per pixel in flight? I dunno about that--leaving just 4 FP16 or 2 FP32 registers is so awful that it seems there must be some erratum involved. But significantly lower than on R3x0, yes, definitely.
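Spelling out the arithmetic behind that 256-bit figure, assuming vec4 registers as above:

    # The per-pixel register budget above, spelled out: a vec4 register is four
    # components wide, so 256 bits per pixel holds either 4 FP16 or 2 FP32 registers.
    budget_bits = 256
    print(budget_bits // (4 * 16), "full-speed FP16 vec4 registers")  # -> 4
    print(budget_bits // (4 * 32), "full-speed FP32 vec4 registers")  # -> 2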
So what are the advantages and disadvantages of privileging multiple levels of dependent texture reads? It depends on the complexity of the shaders you want your architecture to target. If you're trying to achieve realtime framerates on the sort of hardware available in the current timeframe, it's doubtful you'll be running shaders complex enough to need more than one level of texture indirection. But if you want to accelerate the sorts of tasks that have previously been handled at the low end of offline rendering, you might indeed want to privilege flexible texture reads by hiding as much latency as possible, even at a potential cost to general shader performance.
So the theory is this: Nvidia tried to address two different markets with a single product, and came up with something that does neither particularly well. Meanwhile, MS and the ARB, being focused primarily on realtime rendering, chose specs better targeted to how that goal can be best achieved in today's timeframe.
Nvidia can probably be properly accused of hubris in thinking that they could tailor their product to address a new and rather different market segment (low-end production rendering) while still maintaining product superiority in the consumer market. Or of arrogance in assuming they were the only IHV worth paying attention to, and thus could influence future specs to reflect their new architecture instead of one that better targeted realtime performance.
Obviously one can correctly accuse their marketing of all sorts of nasty things.
But I don't think one can really accuse Nvidia of incompetence, or stupidity, or laziness, or whatnot. NV3x is not really a bad design. It's unquestionably a decent design when it comes to performance on DX7 and DX8 workloads. I can't entirely judge, but I would guess it's about as good as could be expected in this timeframe as an attempt to replace offline rendering in the low-end of video production; I just don't think that's quite good enough yet to actually capture any real part of the market.
The only thing it's truly bad at is rendering simple DX9-style workloads (and yes, HL2 is very much on the simple end of the possibilities DX9 represents) at realtime interactive framerates. And--except with the benefit of hindsight--it doesn't seem obvious to me that Nvidia should have expected any serious use of DX9 workloads in the games of the NV3x timeframe. This prediction turns out to have been very, very wrong. (What I mean by "the NV3x timeframe" does not end when NV40 ships, but rather around a year after NV3x derivatives are dropped from the mainstream of Nvidia's product lineup. After all, the average consumer buying a discrete video card expects it to hold up decently for at least a while after his purchase.)
It turns out that DX9 gaming is arriving as a major force quite a bit ahead of DX9 special effects production. And Nvidia will rightly pay for betting the opposite. But, viewed in the context of such a bet, their design decisions don't seem that nonsensical after all.