Mr. Lolman
2007-04-24, 17:30:35
Current translation status from Graham Penny (freelance translator from the 3dfxzone forum):
[Page 1]
15 December 2000 signalled the end of an era: on that date over six years ago, 3dfx, the pioneers of filtered polygon acceleration on PCs, closed its doors. nVidia, their main rival, bought them out for 112 million dollars - patents, technologies, the lot. It was a black day for the 3dfx community, which until then had been considerable in size. One question remained, however: could they have signed off with a flourish?
This article is dedicated to answering precisely that question by looking at the fastest of the Voodoo 5 graphic cards, an artifact of days long past, namely the Voodoo5 6000 AGP, the masterpiece that never officially made it out of 3dfx's doors.
Background
Ah, but we remember it well: the graphics card made quite an impact when it was first presented to the public at Comdex in early 1999, astonishing everyone with its four graphics processors. What we didn't know then was that the card would never make it to retail.
The model that was displayed then was based on the "original design", which still used a 2x2 chip arrangement - and was nowhere near functioning. The development process revealed major issues with that layout, and it was eventually discarded in favour of the four-in-a-row arrangement that has since become famous. Even that caused problems right up to the end, primarily with instability under heavy load, a bug caused by the PCI subsystem that continues to plague all existing Voodoo5 6000 prototypes. It is primarily this bug that prevented the Voodoo5 6000 from appearing on store shelves.
Subsequent efforts to remedy the problem, such as underclocking newer revisions from the anticipated 183 MHz to 166 MHz, were also in vain. Then the axe dropped, brought on by the massive development costs and repeated disappointing quarterly results. On 13 November 2000 3dfx reported that it had transferred all of its SLI patents to its Quantum3D subsidiary (which still exists today) and had pulled out of the graphics card business. All the Voodoo5 6000 prototypes that had been made up to that point also went to Quantum3D. Well, almost all of them...
It is estimated that some 200 such prototypes exist, but not all of these are working cards - word on the Internet is that that figure is more like 100. Worldwide. Who it was that smuggled some of these out from under Quantum3D's noses will likely forever remain a mystery, and the number of V56k cards that have ceased to function in the meantime is similarly unknown. Most prototypes have found their way into the hands of genuine fans, but even today cards appear for sale on online auction sites from time to time.
All of these 6-year-old graphics cards have one thing in common: they cost a pretty penny. A defective card will set you back at least 500 euros, while a working Voodoo5 6000 clocked at 183 MHz can easily cost two or three times as much. However, that figure becomes much more palatable when considered against the background of the card's "population density" and the cost of current high-end graphics cards, as unlike the latest products to come off ATI and nVidia's respective production lines, the Voodoo5 6000 has maintained its value astonishingly well over the years - much to the disappointment of those of us with not quite so much cash to splash.
This is where we come in. It took us almost 6 years, but today we are able to proudly present what we were all so cruelly denied back then - a review of the Voodoo5 6000. Sure, we know we're not the first, as the odd test has appeared on the Net in the intervening period (not to mention in German print magazine PC Games Hardware last year), but we can guarantee that we have found the answers to all the questions about 3dfx's last functioning product - those questions that we asked ourselves when we read the other tests (and more besides).
The test cards
Two identical Voodoo5 6000 cards were used for this article, both of which were examples of the best known and most stable revision built, the "Final Revision 3700-A". The numbers indicate that they were produced in the 37th calendar week of 2000. Prototypes like these can almost be described as fully functional graphics cards - and that caveat is only because they, too, display the PCI bug described above. The kicker is that this problem was also solved, but unfortunately only after 3dfx had invested copious resources in it and, ultimately, collapsed.
Hank Semenec, the "Godfather" of the Voodoo5 6000, who now plies his trade at Quantum3D, came up with the so-called "PCI Rework" in his free time, which removed the instabilities. The bug fix manifests in two ways, one internal and one external, each of which was in evidence on one of our test cards. With the fix, both are fully useable and the revolutionary AA modes, which play a significant role in the tests to follow, are completely stable. We must also thank Hank Semenec for the repair that allowed one of our 2 Voodoo5 6000 cards to even function. Our thanks once again!
[The Voodoo5 6000 AGP Rev. 3700-A is exactly 31 centimetres in length. At one end you can see the upside-down HiNT bridge, to the left of the GPUs, together with the power supply soldered directly to the PCB. The original "Voodoo Volts" power supply concept is not really necessary with current PSUs (and is very rare anyhow), but at the time it was the only way to make some computers Voodoo5 6000 compatible (Image: © PC Games Hardware).]
[On the rear can be seen masses of SMDs (Surface Mounted Devices) laid out in what was at that time a very high density (Image: © PC Games Hardware).]
[The external PCI rework on our test card. Note the "Not for Sale" sticker just next to it, evidence that the card was clearly a prototype (Image: © PC Games Hardware).]
Before delving into framerates and the like, we're first going to take a look at the underlying technology, the resulting picture quality and, of course, the test settings.
[Page 2]
The VSA-100 chip
The last generation of cards from Messrs. 3dfx was still based on the original Voodoo Graphics design dating back to 1996. This is also why "Rampage", the true successor to Voodoo, never reached the public, despite several fresh attempts. In short, this was caused - as with so many of the negative aspects of 3dfx - by insufficient R&D resources.
Unfortunately, modern features such as environment and per-pixel lighting and hardware-accelerated transformations didn't find their way onto the VSA-100. In reality that was an irrelevancy for this generation of cards, as there were next to no games that used such features. 3dfx went in a different direction with the VSA-100, implementing some of the ideas from Rampage that could be used in every game and noticeably improved picture quality - in particular the famous "T-Buffer" (the "T" was from the surname of Gary Tarolli, 3dfx's co-founder and chief designer). It enabled:
* scaleable SG-FSAA (sparse grid full-scene antialiasing - more about this later)
* depth of field (depth blurring, to increase realism in games)
* motion blur (a type of temporal antialiasing in which the frame rate is effectively divided in order to merge several frames into one; the amount of temporal data in the end image therefore increases with each frame used. It should not be confused with motion trail, which simply blurs between consecutive frames; see the accumulation sketch after this list.)
* soft shadows (smooth graduation of shadows)
* soft reflections (smooth graduation of reflections)
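All of these effects rest on the same mechanism, so a minimal sketch may help to make it concrete (in Python, purely our own illustration and not anything 3dfx ever shipped): the scene is rendered several times, each pass with a small offset - a subpixel jitter for antialiasing, a time offset for motion blur, a lens offset for depth of field - and the passes are averaged into the final frame.

# T-Buffer principle: average N complete renderings of the scene, one per offset.
# render_scene is a stand-in callback for the GPU's work on a single pass.

def t_buffer_accumulate(render_scene, offsets):
    frames = [render_scene(offset) for offset in offsets]
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for frame in frames:
        for y in range(height):
            for x in range(width):
                out[y][x] += frame[y][x] / len(frames)
    return out

# Illustrative (not 3dfx's actual) subpixel offsets for a 4-sample sparse grid;
# swapping these for per-pass time or lens offsets turns the same buffer into
# motion blur or depth of field.
jitter_4x = [(0.125, 0.375), (0.375, 0.875), (0.625, 0.125), (0.875, 0.625)]

def dummy_render(offset):
    dx, dy = offset
    return [[dx + dy, dx], [dy, 0.0]]      # a trivial 2x2 "image"

blended = t_buffer_accumulate(dummy_render, jitter_4x)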
The transistor budget for the VSA-100 was clearly invested in speed and scalability rather than on a checklist of features. Incidentally, "VSA" is an abbreviation of Voodoo Scalable Architecture, and an eponymous one at that: by scaling the number of chips on the graphics card it was possible to accommodate every segment of the graphics card market at once, resulting in massive savings on R&D resources, because only one GPU had to be developed and completed.
This was made possible by the use of scanline interleaving (SLI - not to be confused with nVidia's SLI). The frame to be rendered was split into lines, and each GPU was responsible for rendering its own section. More SLI'd GPUs meant less work for each of the VSA-100 chips to perform. The RAMDAC would then assemble the fully rendered "line combs" from the GPUs. This has the advantage over the now standard AFR/SFR approach that there is no need for CPU-intensive load balancing or the precalculation of several images, as all of the GPUs are working on the same one.
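As a rough illustration of how the work is distributed (our own sketch; the real hardware's exact line allocation is not documented here, so a simple round-robin assignment is assumed):

# Scanline interleaving across N GPUs: lines are dealt out round-robin, rendered
# independently, and the "line combs" are merged again for scan-out.

def assign_scanlines(height, num_gpus):
    return [list(range(gpu, height, num_gpus)) for gpu in range(num_gpus)]

def assemble(partial_frames, height, num_gpus):
    frame = [None] * height
    for gpu, lines in enumerate(assign_scanlines(height, num_gpus)):
        for local_index, y in enumerate(lines):
            frame[y] = partial_frames[gpu][local_index]
    return frame

# With 4 VSA-100 chips and a 768-line frame each chip renders only 192 lines.
print([len(lines) for lines in assign_scanlines(768, 4)])   # [192, 192, 192, 192]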
Unlike with AFR (Alternate Frame Rendering), the video memory is also involved in scaling, at least in part. Although there is redundant retention of textures due to the GPUs not having access to the texture memory of each of the other GPUs (which would also have been extremely difficult to achieve), the frame buffer required for each GPU actually ends up being smaller, because the image being rendered by each of them is smaller.
The bandwidth scaling is also excellent, with each VSA-100 having its own memory interface. On a Voodoo5 5500 this meant two independent 128-bit SDR SDRAM interfaces. This should be considered far more effective than a single 128-bit DDR SDRAM interface, because it means that there is less wasted bandwidth and there are fewer stalls in the graphics pipeline caused by memory reads. It's for precisely this reason that ATI and nVidia subsequently split their memory interfaces into several smaller ones.
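Some quick arithmetic illustrates the point (our own calculation, using the clock speeds from the table further down): raw bandwidth is bus width times clock, and it is identical for two 128-bit SDR channels and one 128-bit DDR channel at the same clock - the difference lies in having two independent channels instead of one.

# Bandwidth of an SDR SDRAM interface in GB/s: bus width (bytes) x clock.
def sdr_bandwidth_gb_s(bus_bits, clock_mhz):
    return bus_bits / 8 * clock_mhz * 1e6 / 1e9

per_chip_5500 = sdr_bandwidth_gb_s(128, 166)       # ~2.66 GB/s per VSA-100
voodoo5_5500  = 2 * per_chip_5500                  # ~5.3 GB/s across two channels
single_ddr    = 2 * sdr_bandwidth_gb_s(128, 166)   # same raw ~5.3 GB/s, but one channel
voodoo5_6000  = 4 * sdr_bandwidth_gb_s(128, 183)   # ~11.7 GB/s across four channels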
All of these advantages mean that the SLI process gives an unrivaled level of efficiency when scaling. It was possible - and worthwhile - to link together up to 32 such GPUs, but this remained the preserve of the professional military simulators produced by 3dfx subsidiary Quantum3D. For the consumer market, if we include the Voodoo5 6000 this would have been limited to 4 GPUs per graphics card.
The reason this approach died a death following 3dfx's demise is simple: it suffers from one major disadvantage once hardware-accelerated transformation enters the picture. All of the GPUs in an SLI setup have to calculate the same geometry, meaning that geometry performance does not scale accordingly, although the TnL-less VSA-100 chip was only mildly affected by this (presumably the Rampage was intended to have an external geometry unit). In addition, the texture bandwidth does not scale linearly in an SLI setup, since there was no way of determining which texture parts were required and which were not, only whether textures needed to be available in their entirety. Again, however, the VSA-100 apparently had enough bandwidth per texel to ensure that this did not become a more significant disadvantage.
In terms of the standards in 2000, the VSA-100 incorporated 32-bit framebuffer support and increased the maximum texture size from the antiquated 256x256 pixels to 2048x2048, making hi-res textures a reality on Voodoo cards. To implement this feature sensibly and prevent it from overloading the texture bandwidth 3dfx - like its competitors - used texture compression.
While the Voodoo3 supported NCC (narrow channel compression), this proprietary system was never fully used by games. The VSA-100, by contrast, also made use of S3 Texture Compression (S3TC), which was widely adopted. Because using S3TC in OpenGL games meant paying a licence fee to S3 Graphics, however, 3dfx developed a similar system called FXT. Metabyte's well-known OpenGL Glide wrapper "WickedGL" allows users to convert every texture in OpenGL games to the FXT format.
The VSA-100 was the first Voodoo chip to have 2 independent pixel pipelines, meaning it could process 2 different pixels per cycle, which was often more efficient than the Voodoo3's multi-texturing pipeline. The design was therefore widened, but shortened at the same time. Rival companies were already using quad pipelines to texture polygons, which sounds progressive, but in this instance that was not the case:
Broadly speaking, a quad pipeline is a single pipeline repeated four times, which can calculate one pixel quad, i.e. a 2x2 pixel array (shown in blue on the diagram) per cycle. This approach saves considerably on the number of transistors needed for operating logic, but it also means that all of the pixel pipelines are limited to performing the same operation, which particularly with small triangles is not always optimal (as shown by the white pixel within the blue line on the diagram). The VSA-100, like all pre-GeForce renderers, does not suffer from this.
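A small sketch may help to illustrate the point about small triangles (our own toy model, not vendor code): a quad-based rasteriser always processes whole 2x2 blocks, so every touched block costs four pipeline slots even when only one of its pixels actually lies inside the triangle, whereas independent pipelines only spend slots on covered pixels.

# Pixel slots spent on a small triangle: independent pipelines vs. a 2x2 quad
# pipeline (True marks a pixel covered by the triangle).

def slots_independent(mask):
    return sum(pixel for row in mask for pixel in row)

def slots_quad(mask):
    height, width, slots = len(mask), len(mask[0]), 0
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            block = [mask[y + dy][x + dx]
                     for dy in range(2) for dx in range(2)
                     if y + dy < height and x + dx < width]
            if any(block):
                slots += 4          # the whole quad is issued even for one pixel
    return slots

# A thin diagonal sliver covering 3 pixels of a 4x4 tile:
sliver = [[True,  False, False, False],
          [False, True,  False, False],
          [False, False, True,  False],
          [False, False, False, False]]
print(slots_independent(sliver), slots_quad(sliver))    # 3 vs. 8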
Although the Voodoo5 6000 carries the number "5", it in actual fact belongs to the fourth generation of Voodoo products. One can only assume that 3dfx wanted to differentiate clearly between the single chip variants (the Voodoo4 4500) and the multi-chip cards so as to emphasise their performance to buyers.
                      Voodoo4 4500   Voodoo5 5500   Voodoo5 6000
Pixel pipelines       2              4              8
TMUs per pipeline     1              1              1
Production process    220nm          220nm          220nm
Chip speed            166 MHz        166 MHz        183 MHz
Memory speed          166 MHz        166 MHz        183 MHz
Antialiasing          2x SG-SSAA     4x SG-SSAA     8x SG-SSAA
Fill rate (MTex/sec)  333            666            1464
Bandwidth (GB/sec)    2.6            5.3            11.7
Maximum texture size  2048x2048      2048x2048      2048x2048
Graphics memory       32 MB          2x32 MB        4x32 MB
3dfx promoted the VSA-100 range using the slogan "Fill rate is King", but the delay meant they were unable to pit the range against its intended competition, nVidia's GeForce 256, a confrontation that it could have walked away from with little more than a ruffled collar. Sometimes things just don't work out as planned. Still, our Voodoo5 6000 cards are ample proof of this slogan: combined with the efficiency gains from SLI, the fill rate of 1.46 gigapixels per second and the bandwidth of 11.7 GB/s are unequivocal. None of the Voodoo5 6000's high-end, single-chip contemporaries would have been able to keep up with it. We will go into this more in the benchmarks.
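These headline figures can be reproduced from the table above with a line of arithmetic (our own calculation): fill rate is chips x pipelines x clock, bandwidth the sum of the per-chip 128-bit SDR interfaces.

# Voodoo5 6000: 4 chips, 2 pipelines each, 183 MHz, one 128-bit SDR channel per chip.
chips, pipelines, clock_mhz = 4, 2, 183

fill_rate_mtex = chips * pipelines * clock_mhz              # 1464 MTexel/s (~1.46 GTex/s)
bandwidth_gb_s = chips * 128 / 8 * clock_mhz * 1e6 / 1e9    # ~11.7 GB/s

print(fill_rate_mtex, round(bandwidth_gb_s, 1))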
[Page 3]
Texture filtering
What can we expect from a graphics card in terms of picture quality when its feature set is already seen as outdated at launch? The picture quality is only comparable in a limited sense with that of other graphics cards of the same generation - though this is not necessarily meant negatively.
Trilinear filtering is not a standard fixture of the Voodoo5 with the official drivers, even though the graphics card would of course have been able to do it. Granted, the VSA-100 can only achieve this in a single cycle with single texturing, but the chip was at least able to soften the abrupt LOD transitions on textured surfaces. To that end the drivers have a feature called "mipmap dithering", which works by combining adjacent mipmap levels to generate a sub-image.
The obvious disadvantage of this mode is the dithering itself, which shows up as on-screen granularity that is generally quite apparent and cannot be entirely compensated for even with high supersampling modes. On the plus side, this mode gives an increase in picture quality with next to no resultant loss in performance (depending on where it's used, e.g. in Quake III Arena).
By far the most interesting mode is a type of bilinear filter that's achieved by running two VSA-100s in supersampling mode, with one chip offsetting the mipmap LOD bias by -0.5. This only works because, unlike with normal supersampling, on Voodoo5 cards several images can be merged within the T-Buffer with equal weighting. The end result is an image with a 1-bit LOD fraction, which by definition passes for a trilinear filter. That said, it should be mentioned that while textured surfaces with a 1-bit LOD fraction show a noticeable improvement over pure bilinear filtering, this is still not sufficient for consistent linear mipmap interpolation.
This partial LOD-shift is also only possible with SLI antialiasing, so a Voodoo4 4500 is lacking the one thing it needs to achieve trilinear filtering, since it only has one VSA-100 chip and consequently only one LOD bias register. By the same token, however, it also means that a Voodoo5 6000 can cope with 4 images with different LOD biases instead of 2, so instead of one additional mipmap transition you have three, i.e. a 2-bit LOD fraction.
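A simplified sketch of the principle (our own model of what the driver effectively achieves; the bilinear_sample helper and the evenly spaced biases for the four-chip case are assumptions for illustration): each chip picks its mipmap level from floor(lambda + bias), and averaging the chips' outputs in the T-Buffer produces the intermediate blend steps.

import math

def bilinear_sample(mipmaps, level, u, v):
    # Hypothetical stand-in: return the constant "colour" of the chosen mip level.
    return mipmaps[min(level, len(mipmaps) - 1)]

def sli_pseudo_trilinear(mipmaps, lam, u, v, chip_biases):
    # One bilinear sample per chip, each with its own LOD bias, averaged equally.
    samples = [bilinear_sample(mipmaps, max(0, math.floor(lam + bias)), u, v)
               for bias in chip_biases]
    return sum(samples) / len(samples)

mips = [1.0, 0.5, 0.25, 0.125]                 # one value per mipmap level
two_chips  = sli_pseudo_trilinear(mips, 1.3, 0, 0, [0.0, -0.5])                  # 1-bit fraction
four_chips = sli_pseudo_trilinear(mips, 1.3, 0, 0, [0.0, -0.25, -0.5, -0.75])    # 2-bit fraction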
The drivers do, however, get in the way of one use of this feature: while it's possible with active antialiasing to achieve a performance-friendly, if optically minimal, form of trilinear filtering using the original drivers, it only works correctly with a LOD bias of 0.0 or -1, because changes to the LOD bias are unfortunately only applied to the first graphics chip, while the second chip always uses an LOD of -0.5. Thankfully there are modified Glide drivers that also shift the LOD bias on the second graphics chip correctly.
Some of you may be wondering at this point why the LOD bias needs to be shifted at all, since 0.0 is optimal. The reason is simple: whereas rival companies achieved supersampling by increasing the resolution internally (oversampling) and then downsampling afterwards, which automatically gives an increase in texture quality, 3dfx's method did not automatically produce sharper textures, just oversampled texels. Shifting the mipmap LOD bias compensates for this amply. The theoretical maximum shift possible for four SSAA samples is -1.
In practical terms this comes down to a matter of personal taste, as the ideal balance of maximum sharpness and maximum texture stability differs from game to game. Tests have shown that LOD shifts of -0.5 for 2xAA and -1.5 for 4xAA are reliable. At 4xAA this gives a first mipmap blend at around -2 to -2.5, which is roughly comparable to 4xAF sharpness (albeit without achieving the same texture stability of the latter). This relatively large shift of -1.5 may not be in line with the Nyquist-Shannon sampling theorem, but the theorem applies to the theoretical worst case anyhow in the shape of highest frequency textures (such as a pixel-sized black/white chessboard pattern).
With four supersamples, therefore, if we play it safe, the maximum sharpness achievable without running the risk of underfiltering is 2x AF [more generally, √n-fold AF for n samples]. This only holds for ordered grid antialiasing, however; a less conservative approach would, in addition to the actual sample positions, also take into account the amplitudes of the mipmaps, which are necessarily lower than those of the base map and automatically permit a greater LOD shift, at least insofar as the mipmaps are already being sampled. While this did not allow a fourfold "textbook" AF to be calculated, that was rarely needed given the average texture sharpness in games at the time.
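Put as a rule of thumb (our own restatement of the reasoning above): n supersamples raise the worst-case-safe resolution per axis by √n, so the conservative, Nyquist-safe LOD shift is -ld(√n), while the values that proved reliable in practice are somewhat more aggressive.

import math

def conservative_lod_shift(samples):
    # Worst-case-safe shift: resolution per axis rises by sqrt(n) -> -ld(sqrt(n)).
    return -math.log2(math.sqrt(samples))

practical_shift = {2: -0.5, 4: -1.5}     # values found reliable in the tests above

for n in (2, 4):
    print(n, conservative_lod_shift(n), practical_shift[n])
# 4 samples: conservatively -1.0 (roughly 2x AF sharpness), -1.5 in practice.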
The quest for sharp textures was not triggered by the emergence of anisotropic filters. In fact, a variety of tricks had already been used in Unreal to try and circumvent the limitations of hardware of the day. In addition to detail texture mapping (additional texture overlays used to improve the appearance of surfaces at close range), which is hardly needed nowadays, and macro texture mapping (additional texture overlays used to improve the appearance of surfaces at a distance), which is more likely to still be used today, the Unreal Engine 1.x automatically applied a mipmap LOD bias of -1.5 under Glide.
This means that with bilinear filtering without supersampling, a single screen pixel had to be shared by up to 16 of the base texture's texels. A 1:1 ratio is the ideal, so in effect this amounts to 16-fold underfiltering. By today's standards that would of course be unacceptable, but in 1999 3dfx were praised for the good picture quality in Unreal Tournament, and ironically, the 3dfx accelerators outdid themselves in the performance stakes as well, topping all the benchmarks despite the increase in load from the shifted mipmap LOD bias (although in all fairness it should be noted that the Unreal Engine's design was pretty much perfect for Glide and the Voodoo cards).
Antialiasing
What is it that's so special about the Voodoo5 6000's antialiasing that its quality across the board has still not been bettered by any other manufacturer some 6 years after the demise of 3dfx?
By overlaying several images in the multisample buffer known as the "T-Buffer" it is possible to freely define the AA sample positions within a raster, something that can't be done with simple ordered grid supersampling (OGSS), or oversampling. With this method, the antialiasing is created by a slight displacement of the scenery. Using this "sparse grid", even with just 2 subpixels (i.e. 2xSGSSAA) both the X and Y axes are each sampled twice as accurately, whereas with a square subpixel arrangement (OGSSAA) this requires 2x2 = 4 subpixels (and therefore twice as much load).
A "sparse grid" is a cut-down version, so to speak, of an "ordered grid". The cuts are sensible, mind: while there is a negligible loss of quality in the antialiasing on the most important edges, the corresponding performance gain is considerable. In principle, a 2x OGSSAA can, only upsample one axis and accordingly smooth edges either horizontally or vertically. To achieve an edge equivalent resolution of 8x SGSSA using an ordered grid you have to use 8x8 (= 64x) OGSSAA, which is a good indication that as far as consumers were concerned OGSSAA was more of a token gesture than anything and was only implemented so as to offset technical deficiencies. nVidia matched the texture quality of 64x OGSSAA with the 8x AF on the GeForce3. One of our earlier articles goes into this subject in more detail, covering not just the basics of antialiasing but also, amongst other things, the differences between the different masks.
3dfx is now no longer the only exponent of this method of antialiasing. The R200 (Radeon 8500) was also originally supposed to support rotated grid supersampling, but could actually only do this when no fog was used. S3's 9x SGSSAA mode is the only one that can in fact improve visually on 3dfx's 8x SGSSAA, but this is nigh on unusable as the sample distribution in fullscreen mode appears to be arbitrary, resulting in poor picture quality. Those people who can be persuaded to run games in windowed mode, however, can enjoy the only antialiasing that at the very least matches that found on the Voodoo5 6000.
Of course, both ATI and nVidia have since produced more efficient AA modes, but it is not all that difficult to come up with situations where AAA/TSAA (OpenGL) don't work - quite apart from the fact that the only smoothing these processes provide is on alpha samples (textures with binary alpha values, i.e. each pixel is either completely solid or completely transparent). The G80's new 8x MSAA mode offers a higher 8x8 axis resolution for polygon edges, but other parts of the image are not processed at all, whereas 3dfx's 8x AA still uses 8 supersamples.
The next alternative to 3dfx's 8xAA was nVidia's 16xS mode using TSSAA, which was introduced with the NV40 (and is currently not available on the G80). Like 3dfx's 8xAA, this resulted in an eightfold increase in axis resolution for alpha samples and polygon edges, but it differed from 3dfx's in that it only gave four texture samples rather than eight (AF can of course be used to make up for this). It should also be noted that the "normal" 16xS mode alone (without TSSAA) could not entirely outstrip 3dfx's 4xAA, as it only displays alpha sample edges in the OGSSAA portion at 2x rather than 4x - and that at the expense of a higher CPU load!
Until the introduction of the G70 (and the concomitant introduction of TSAA, which also applied retroactively to NV4x chips) the only option for picture quality fanatics was the highly inefficient 16x OGSSAA, which like 3dfx's 4xAA applied an EER of 4x4 to the whole image. The circumstances of its implementation meant that there were already 16 texture samples in use rather than 4, which in terms of acceptable texture filtering under the high-quality driver settings would have been wasteful. Moreover, of course, this mode required roughly 94% of the available fill rate just for antialiasing, which reduced the performance level of a GeForce 6800 GT down to approximately that of a Voodoo4 4500 (350 to 333 MTexel/s). Interestingly, a Voodoo5 6000 (clocked at 166 MHz) using 4x SGSSAA has precisely the same raw performance as a Voodoo4 4500 without antialiasing as well.
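The numbers in that comparison follow directly from the sample counts (our own arithmetic; the GeForce 6800 GT's base fill rate of 16 pipelines x 350 MHz is an assumption on our part):

# 16x OGSSAA renders 16 samples per output pixel, leaving 1/16 of the fill rate
# for visible pixels - i.e. roughly 94% is spent on antialiasing alone.
geforce_6800gt = 16 * 350                       # MTexel/s (assumed: 16 pipes at 350 MHz)
effective_16x_ogssaa = geforce_6800gt / 16      # 350 MTexel/s
aa_overhead = 1 - 1 / 16                        # 0.9375, i.e. ~94%

voodoo4_4500 = 2 * 166                          # ~333 MTexel/s, no AA
voodoo5_6000_166_with_4x_sgssaa = 8 * 166 / 4   # ~332 MTexel/s effective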
[Page 4]
The 22-bit postfilter
To understand today why the postfilter mattered, it must be borne in mind that even in 2000, 32-bit rendering was by no means a given. While all of the graphics cards of the era could manage 32-bit rendering, for the majority it resulted in a hefty drop in performance that was mostly in no reasonable proportion to the visual improvement. 32-bit rendering was of course promoted by the various manufacturers and desired by nearly all developers, but it was also clear that increasing graphical demands would eventually be too much for the well-established 16-bit rendering to cope with.
The development of Quake III Arena in 1998 essentially proved that particular point - and how. Q3A was one of the first games that looked markedly better in 32-bit than in 16-bit. Moreover, 3dfx did themselves no favours when their Voodoo3 graphics card hit the shelves in 1999, since it was limited to 16-bit rendering. However, from the outset the Voodoo chips had a postfilter mechanism that was fairly effective in eliminating the artifacts caused by 16-bit dithering.
Up to and including the Voodoo2 this was a 4x1 linear filter, which would, within specified threshold values, simply determine an average value based on four adjacent pixels. This did get rid of the irritating dithering artifacts, but in certain situations it created lines that were clearly visible, although this did not affect the picture quality quite so much as the dithering that was caused by rounding errors under 16-bit. With the Banshee, 3dfx had increased the size of the cache in the RAMDAC, which meant they could incorporate a second line from an image in the filter as well.
The result is the 2x2 box filter, which is usually what is being referred to when people talk about 3dfx's "22-bit" rendering. 3dfx themselves spoke at the time of approximately 4 million as the maximum number of colours a postfiltered image of this sort could contain, which roughly corresponds to a 22-bit colour depth. 3dfx's postfilter was by no means a catch-all solution, of course: while it could smooth out existing artifacts by interpolating four pixels, it could not prevent those artifacts from occurring, which becomes apparent with heavy alpha blending, where the large number of rounding errors results in visible dithering. This is because the threshold value within which the postfilter works is exceeded in such cases, meaning that the dithering artifacts are left untouched.
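To make the mechanism a little more tangible, here is a heavily simplified sketch of a threshold-limited 2x2 box filter (our own reconstruction of the principle only; 3dfx never published the exact RAMDAC algorithm or its threshold values):

# Threshold-limited 2x2 box filter on one colour channel: average each pixel with
# its 2x2 neighbourhood, but only where the values are close enough to look like
# dithering; larger differences (real texture detail, heavy alpha-blend errors)
# exceed the threshold and are left untouched.

def postfilter_22bit(channel, threshold=8):
    height, width = len(channel), len(channel[0])
    out = [row[:] for row in channel]
    for y in range(height - 1):
        for x in range(width - 1):
            block = [channel[y][x], channel[y][x + 1],
                     channel[y + 1][x], channel[y + 1][x + 1]]
            if max(block) - min(block) <= threshold:
                out[y][x] = sum(block) / 4
    return out

# The lightly dithered patch on the left is smoothed; the hard 30-to-0 edge on
# the right exceeds the threshold and stays sharp.
patch = [[30, 32, 30,  0],
         [32, 30,  0,  2],
         [30, 32,  2,  0]]
smoothed = postfilter_22bit(patch)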
Another flaw is that when smoothing, the filter is unable to differentiate between a dither pattern caused by 16-bit rendering and one that is part of a texture's intended structure. Accordingly, there are also instances where the box filter can have a negative impact on a texture's design. This effect might have been quite practical for mipmap dithering, but that dither pattern was often so intense that the threshold for postfiltering was exceeded and as a result it couldn't be smoothed either. In practice, however, the postfilter did its job so effectively that for a long while 3dfx users were unable to reproduce the described problems with 16-bit rendering, while the output from other cards clearly degenerated because of it.
Even with the introduction of native 32-bit rendering, the postfilter actually became more significant on the Voodoo5, as Voodoo5 users were for the most part willing to trade a setting in order to enjoy the unrivalled AA modes. The outcome was that 32-bit rendering would be passed over in favour of high performance at the highest possible resolutions, which in most games of the time cost only a minimal amount of visual quality.
So the performance loss that comes from activating the postfilter, while measurable, was in reality not noticeable for the most part. The 3dfx supersampling then has the effect of oversampling the on-screen pixel, which reduces the 16-bit dithering considerably before the postfilter is applied in the RAMDAC. This was in fact so effective that 3dfx deactivated the postfilter completely in 4xAA and still managed to conjure up a 22-bit on-screen image that was entirely devoid of artifacts.
Z-buffer accuracy
There is one downside of 16-bit rendering, however, that neither the postfilter nor supersampling could address: accuracy issues with the z-buffer. Ever-lengthening draw distances and increasingly detailed scenery meant that scene depth was placing growing demands on the accuracy of the z-buffer. While Glide, which used a non-linear quantisation of depth information not too dissimilar to the w-buffer, for the most part remained unaffected by precision problems, people playing newer games that used a 16-bit z-buffer had to contend with polygon shimmer, also known as "z-fighting". This, too, can be remedied, however - for example, w-buffering is available in the game Mafia, which distributes accuracy more evenly by using a modified interpolation, completely eliminating the z-fighting in some instances.
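A little arithmetic makes the precision imbalance obvious (our own sketch; the exact quantisation Glide uses is not reproduced here, the standard hyperbolic z mapping merely stands in for a conventional 16-bit z-buffer): the z-buffer crams almost all of its 65,536 steps into the region near the camera, while a w-buffer spreads them roughly evenly over the view distance.

# How many of a 16-bit buffer's 65,536 depth steps remain for the far half of a
# 1..10,000 unit view frustum?

def z_buffer_value(z, near=1.0, far=10_000.0):
    # Standard hyperbolic z mapping: 0 at the near plane, 1 at the far plane.
    return far * (z - near) / (z * (far - near))

def w_buffer_value(z, near=1.0, far=10_000.0):
    # Linear in view depth.
    return (z - near) / (far - near)

steps = 2 ** 16
far_half_z = steps * (1 - z_buffer_value(5_000.0))   # ~7 steps for everything beyond z = 5000
far_half_w = steps * (1 - w_buffer_value(5_000.0))   # ~32,771 steps

print(round(far_half_z), round(far_half_w))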
At this juncture we need to take a moment to mention "WickedGL", the OpenGL Glide wrapper developed by Metabyte, which we used ourselves for our benchmarks. In addition to enabling the user to force texture compression, the wrapper's sleek and speedy implementation (it weighs in at only 340 KB, barely more than the 320 KB of the last official Glide3x driver) passes OpenGL games through Glide, which under normal circumstances is quite capable of satisfactory 16-bit precision, thereby often removing the only remaining advantage of 32-bit.
Picture quality: the consequences
Overall it's evident that 3dfx didn't make life particularly easy for themselves. For a long time the end user just had to blindly accept the overly enthusiastic claims of 3dfx's marketing department or get an idea of the picture quality first-hand, as it was simply not possible to take screenshots that included the postfiltering. This meant that eager hardware sites such as 3DConcept were left to twiddle their thumbs until late changes to the HyperSnapDX software allowed them to go into the topic in more depth. By then, however, it was already too late to do anything about the glut of misleading screenshots that had been published for every Voodoo card up to and including the Voodoo3. Furthermore, the software was never adapted to work with the Voodoo5, which resulted in screenshots with active 2x antialiasing showing obvious colour banding that was not actually visible on screen.
There was also another problem, albeit one that only manifested on the Voodoo5 6000: it was virtually impossible to correctly capture any antialiasing mode in screenshots under Direct3D; this only worked in Glide and OpenGL games (either natively or with the WickedGL wrapper). The mode also suffered from another problem, namely the partial custom mipmap LOD applied to bilinear filtering with supersampling, meaning there was absolutely no way of capturing the true display quality of a Voodoo5 6000 in screenshots when using 8x SGSSAA and a LOD of -2.
Because of this, all the 8x SGSSAA screenshots in this article use a LOD of -1, which is roughly equivalent to 2xAF. In actual fact there is no reason why you can't set the LOD to -2 in the tested games at this high antialiasing mode, which thanks to the simple trick of shifting the LOD on the second chip, as explained above, approximates the sharpness level of 5xAF. This means that the Voodoo5 6000 can produce pictures of a quality that is still eye-catching even by today's standards - with an associated load that more often than not has a significant impact on end performance. Still, in extreme circumstances (alpha samples galore, say) even a Radeon X1950 XTX can lose up to a tenth of its overall performance when running at 6x AAA + ASBT, but a performance hit like that is still tolerable so long as things remain playable.
By contrast, nVidia's G80 chip comes with no such power-hungry AA modes. While the 16x OGSSAA mode that was still - unofficially - available on the G70 cards was a massive performance hit, leaving just one sixteenth of the original fill rate free for actual rendering, the G80's antialiasing repertoire is presently headed by its 16xQAA multisampling mode with TSSAA. While this is considerably more efficient in terms of the performance cost for image enhancement, not all of the screen's contents are processed.
In addition to polygon edges, the antialiasing samples only enhance alpha samples; textures and pixel shaders are free to flicker on their merry way. Because of this, the G70 must be considered as having the higher maximum possible picture quality overall at this time, even though this is only achieved in older games, or at framerates that are now less than acceptable. To put it another way, a single non-mipmapped scene (whether because of a design error or an oversight, as is so often the case with console ports) would be all that was needed to make today's state-of-the-art graphics cards look worse than 3dfx's 6-year-old card ;).
It is really the T-buffer that represents the focal point of this entire discussion of picture quality. Sadly its practical use never went beyond antialiasing, but if the T-buffer's capabilities had been exploited consistently, the overall cost of improved picture quality would in turn have been much lower, because soft shadows, soft reflections, motion blur and depth of field or antialiasing could all be calculated in a single pass (antialiasing and depth of field are fullscreen effects, so they each need a frame buffer of their own), meaning there would be no additional processor load (apart from the complex shifting of vertex coordinates). However, there was no API extension available at the time that would have allowed developers to access these features directly: the transformations had to be done manually. A modular T-buffer engine would therefore have had to be written for these features to find their way into commercial games.
[Page 1]
15 December 2000 signalled the end of an era: on that date over six years ago, 3dfx, the pioneers of filtered polygon acceleration on PCs, closed its doors. nVidia, their main rival, bought them out for 112 million dollars - patents, technologies, the lot. It was a black day for the 3dfx community, which until then had been considerable in size. One question remained, however: could they have signed off with a flourish?
This article is dedicated to answering precisely that question by looking at the fastest of the Voodoo 5 graphic cards, an artifact of days long past, namely the Voodoo5 6000 AGP, the masterpiece that never officially made it out of 3dfx's doors.
Background
Ah, but we remember it well: the graphics card made quite an impact when it was first presented to the public at Comdex in early 1999, astonishing everyone with its four graphics processors. What we didn't know then was that the card would never make it to retail.
The model that was displayed then was based on the "original design", which still used a 2x2 chip arrangement - and was nowhere near functioning. The development process revealed major issues with the card layout, and it was eventually discarded in favour of the four-in-a-row layout that has since become famous. Even that caused problems right up to the end, primarily with instability under heavy load, a bug that is caused from the PCI subsystem and continues to plague all existing Voodoo5 6000 prototypes. It is primarily this bug that prevented the Voodoo5 6000 from appearing on store shelves.
Subsequent efforts to remedy the problem, such as underclocking newer revisions from the anticipated 183 MHz to 166 MHz were also in vain. Then the axe dropped, caused by the massive development costs and repeated disappointing quarterly results. On 13 November 2000 3dfx reported that it had transferred all of its SLI patents to its Quantum3D subsidisary (which still exists today) and had pulled out of the graphics card business. All the Voodoo5 6000 prototypes that had been made up to that point also went to Quantum3D. Well, almost all of them...
It is estimated that some 200 such prototypes exist, but not all of these are working cards - word on the Internet is that that figure is more like 100. Worldwide. Who it was that smuggled some of these out from under Quantum3D's noses will likely forever remain a mystery, and the number of V56k cards that have ceased to function in the meantime is similarly unknown. Most prototypes have found their way into the hands of genuine fans, but even today cards appear for sale on online auction sites from time to time.
All of these 6-year-old graphics cards have one thing in common: they cost a pretty penny. A defect card will set you back at least 500 euros, while a working Voodoo5 6000 clocked at 183 MHz can easily cost two or three times as much. However, that figure becomes much more palatable when considered against the background of the card's "population density" and the cost of current high-end graphics cards, as unlike the latest products to come off ATI and nVidia's respective production lines, the Voodoo5 6000 has maintained its value astonishingly well over the years - much to the disappointment of those of us with not quite so much cash to splash.
This is where we come in. It took us almost 6 years, but today we are able to proudly present what we were all so cruelly denied back then - a review of the Voodoo5 6000. Sure, we know we're not the first, as the odd test has appeared on the Net in the intervening period (not to mention in German print magazine PC Games Hardware last year), but we can guarantee that we have found the answers to all the questions about 3dfx's last functioning product - those questions that we asked ourselves when we read the other tests (and more besides).
The test cards
Two identical Voodoo5 6000 cards were used for this article, both of which were examples of the best known and most stable revision built, the "Final Revision 3700-A". The numbers indicate that they were produced in the 37th calendar week of 2000. Prototypes like these can almost be described as fully functional graphics cards - and that caveat is only because they, too, display the PCI bug described above. The kicker is that this problem was also solved, but unfortunately only after 3dfx had invested copious resources on it and, ultimately, collapsed.
Hank Semenec, the "Godfather" of the Voodoo5 6000, who now plies his trade at Quantum3D, came up with the so-called "PCI Rework" in his free time, which removed the instabilities. The bug fix manifests in two ways, one internal and one external, each of which was in evidence on one of our test cards. With the fix, both are fully useable and the revolutionary AA modes, which play a significant role in the tests to follow, are completely stable. We must also thank Hank Semenec for the repair that allowed one of our 2 Voodoo5 6000 cards to even function. Our thanks once again!
[The Voodoo5 6000 AGP Rev. 3700-A is exactly 31 centimetres in length. At one end you can see the upside-down HiNT-Bridge, to the left of the GPUs, together with the power supply soldered directly to the PCB. The original "Voodoo Volts" power supply concept is not really necessary with current PSUs (and is very rare anyhowe), but at the time it was the only way to make some computers Voodoo5 6000 compatible (Image: © PC Games Hardware).]
[On the rear can be seen masses of SMDs (Surface Mounted Devices) laid out in what was at that time a very high density (Image: © PC Games Hardware).]
[The external PCI rework on our test card. Note the "Not for Sale" sticker just next to it, evidence that the card was clearly a prototype (Image: © PC Games Hardware).]
Before delving into framerates and such like, we're first going to take a loook at the underlying technology, the resultant picture quality and, of course, the test settings.
[Page 2]
The VSA-100 chip
The last generation of cards from Messrs. 3dfx were still based on the original Voodoo Graphics design dating back to 1996. It's because of this, "Rampage", the true successor to Voodoo, never reached the public, despite several attempts to remarket it. In short, this was caused - as with so many of the negative aspects of 3dfx - by insufficient R&D resources.
Unfortunately, modern features such as environment and per-pixel lighting and hardware-accelerated transformations didn't find their way onto the VSA-100. In reality that was an irrelevancy for this generation of cards, as there were next to no games that used such features. 3dfx went in a different direction with VSA-100, implementing some of the ideas from Rampage that could be used in every game and noticeable improved picture quality - in particular the famous "T-Buffer" (the "T" was from the surname of Gary Tarolli, 3dfx's co-founder and chief designer). It enabled:
* scaleable SG-FSAA (sparse grid full-scene antialiasing - more about this later)
* depth of field (depth blurring, to increase realism in games)
* motionblur (a type of temporal antialiasing where the actual framerates are split in order to merge several frames into one. The amount of temporal data for the end image therefore increases for each frame used. It should not be confused with Motion Trail, which simply blurs between consecutive frames.)
* soft shadows (smooth graduation of shadows)
* soft reflections (smooth graduation of reflections)
The transistor budget for the VSA-100 was clearly invested in speed and scalability rather than on a checklist of features. Incidentally, "VSA" is an abbreviation of Voodoo Scalable Architecture, and an eponymous one at that: by scaling the number of chips on the graphics card it was possible to accommodate every segment of the graphics card market at once, resulting in massive savings on R&D resources, because only one GPU had to be developed and completed.
This was made possible by the use of scanline interleaving (SLI - not to be confused with nVidia's SLI). The frame to be rendered was split into lines, and each GPU was responsible for rendering its own section. More SLI'd GPUs meant less work for each of the VSA-100 chips to perform. The RAMDAC would then assemble the fully rendered "line combs" from the GPUs. This has the advantage over the now standard AFR/SFR approach that there is no need for CPU-intensive loadbalancing or the precalculation of several images, as all of the GPUs are working on the same one.
Unlike with AFR (Alternate Frame Rendering), the video memory is also involved in scaling, at least in part. Although there is redundant retention of textures due to the GPUs not having access to the texture memory of each of the other GPUs (which would also have been extremely difficult to achieve), the frame buffer required for each GPU actually ends up being smaller, because the image being rendered by each of them is smaller.
The bandwidth scaling is also excellent, with each VSA-100 having its own memory interface. On a Voodoo5 5500 this would have been two independent 128-bit SDR SDRAM interfaces. This should be considered as far more effective than a single 128-bit wide DDR SDRAM interface, because it means that there is less wasted bandwidth and there are fewer stalls in the graphics pipeline caused by memory reads. It's for precisely this reason that ATI and nVidia subsequently also later split their memory interfaces into several smaller ones.
All of these advantages mean that the SLI process gives an unrivaled level of efficiency when scaling. It was possible - and worthwhile - to link together up to 32 such GPUs, but this remained the preserve of the professional military simulators produced by 3dfx subsidiary Quantum3D. For the consumer market, if we include the Voodoo5 6000 this would have been limited to 4 GPUs per graphics card.
The reason this approach died a death following 3dfx's demise is simple: the hardware-accelerated transformation suffers from one major disadvantage. All of the GPUs in an SLI setup have to calculate the same geometry, meaning that the performance in terms of geometry is not scaled accordingly, although the TnL-less VSA-100 chip was only mildly affected by this (presumably the Rampage was intended to have an external geometry unit). In addition, the texture bandwidth is not scaled linearly in an SLI setup, since there was no way of determining which texture parts were required and which were not, only whether textures needed to be available in their entirety. Again, however, the VSA-100 apparently had enough bandwidth per texel to ensure that this did not became a more significant disadvantage.
In terms of the standards in 2000, the VSA-100 incorporated 32-bit framebuffer support and increased the maximum texture size from the antiquated 256x256 pixels to 2048x2048, making hi-res textures a reality on Voodoo cards. To implement this feature sensibly and prevent it from overloading the texture bandwidth 3dfx - like its competitors - used texture compression.
While the Voodoo3 supported NCC (narrow channel compression), this proprietary system was never fully used by games. The VSA-100, by contrast, also made use of S3 Texture Compression (S3TC), which was widely adopted. Because using S3TC in OpenGL games meant paying a licence fee to S3 Graphics, however, 3dfx developed a similar system called FXT. Metabyte's well-known OpenGL Glide wrapper "WickedGL" allows users to convert every texture in OpenGL games to the FXT format.
The VSA-100 was the first Voodoo chip to have 2 independent pixel pipelines, meaning it could process 2 different pixels per cycle, which was often more efficient than the Voodoo3's multi-texturing pipeline. The design was therefore widened, but shortened at the same time. Rival companies were already using quad pipelines to texture polygons, which sounds progressive, but in this instance that not the case:
Broadly speaking, a quad pipeline is a single pipeline repeated four times, which can calculate one pixel quad, i.e. a 2x2 pixel array (shown in blue on the diagram) per cycle. This approach saves considerably on the number of transistors needed for operating logic, but it also means that all of the pixel pipelines are limited to performing the same operation, which particularly with small triangles is not always optimal (as shown by the white pixel within the blue line on the diagram). The VSA-100, like all pre-GeForce renderers, does not suffer from this.
Although the Voodoo5 6000 carries the number "5", it in actual fact belongs to the fourth generation of Voodoo products. One can only assume that 3dfx wanted to differentiate clearly between the single chip variants (the Voodoo4 4500) and the multi-chip cards so as to emphasise their performance to buyers.
Voodoo4 4500 Voodoo5 5500 Voodoo5 6000
Pixel pipelines 2 4 8
TMUs per pipeline 1 1 1
Production process 220nm 220nm 220nm
Chip speed 166 MHz 166 MHz 183 MHz
Memory speed 166 MHz 166 MHz 183 MHz
antialiasing 2x SG-SSAA 4x SG-SSAA 8x SG-SSAA
Fill rate MTex/sec 333 666 1464
Bandwidth GB/sec 2.6 5.3 11.7
Maximum texture size 2048x2048 2048x2048 2048x2048
Graphics memory 32 MB 2x32 MB 4x32 MB
3dfx promoted the VSA-100 range using the slogan "Fill rate is King", but the delay meant they were unable to pit the range against its intended competition, nVidia's GeForce 256, a confrontation that it could have walked away from with little more than a ruffled collar. Sometimes things just don't work out as planned. Still, our Voodoo5 6000 cards are ample proof of this slogan: combined with the efficiency gains from SLI, the fill rate of 1.46 gigapixels per second and the bandwidth of 11.7 GB/s are unequivocal. None of the Voodoo5 6000's high-end, single-chip contemporaries would have been able to keep up with it. We will go into this more in the benchmarks.
[Page 3]
Texture filtering
What can we expect from a graphics card in terms of picture quality when its feature set is already seen as outdated at launch? The picture quality is only comparable in a limited sense with that of other graphics cards of the same generation - though this is not necessarily meant negatively.
Trilinear filtering is not a standard fixture of the Voodoo5 using the official drivers, even though the graphics card would of course have been able to do it. Granted, the VSA-100 can only achieve this in one processor cycle with single texturing, but the chip was also able to soften the abrupt LOD transitions with textured surfaces. To that end the drivers have a feature called "mipmap dithering", which works by combining adjacent mipmap levels to generate a sub-image.
The obvious disadvantage to this mode is the dithering itself, which is evidenced by on-screen granularity that is generally quite apparent and cannot be entirely compensated for using high supersampling modes. On the plus side, this mode gives in increase in picture quality with next to no resultant loss in performance (depending on where it's used, e.g. in Quake III Arena).
By far the most interesting mode is a type of bilinear filter that's achieved by running two VSA-100s in supersampling mode, with one chip offsetting the mipmap LOD bias by -0.5. This only works because unlike normal supersampling, with Voodoo5 cards several images can be merged within the T-Buffer with the same balance. The end result is an image with a 1-bit LOD fraction, which by definition passes for a trilinear filter. That said, it should be mentioned that while textured surfaces with a 1-bit LOD fraction produce a noticeable improvement over pure bilinear filtering, it is still not sufficient for consistent linear mipmap interpolation.
This partial LOD-shift is also only possible with SLI antialiasing, so a Voodoo4 4500 is lacking the one thing it needs to achieve trilinear filtering, since it only has one VSA-100 chip and consequently only one LOD bias register. By the same token, however, it also means that a Voodoo5 6000 can cope with 4 images with different LOD biases instead of 2, so instead of one additional mipmap transition you have three, i.e. a 2-bit LOD fraction.
The drivers do prevent one resultant use of this feature: while it's possible with active antialiasing to achieve performance-friendly, optically minimalist trilinear filtering using the original drivers, it only works correctly with a LOD bias of 0.0 or -1, because unfortunately changes to the LOD bias are only applied to the first graphics chip, and the second chip always uses an LOD of -0.5. Thankfully there are modified Glide drivers that also shift the LOD bias on the second graphics chip correctly.
Some of you may be wondering at this point why the LOD bias needs to be shifted at all, since 0.0 is optimal. The reason is simple: whereas rival companies achieved supersampling by increasing the resolution internally (oversampling) and then downsampling afterwards, which automatically gives an increase in texture quality, 3dfx's method did not automatically produce sharper textures, just oversampled texels. Shifting the mipmap LOD bias compensates for this amply. The theoretical maximum shift possible for four SSAA samples is -1.
In practical terms this comes down to a matter of personal taste, as the ideal balance of maximum sharpness and maximum texture stability differs >from game to game. Tests have shown that LOD shifts of -0.5 for 2xAA and -1.5 for 4xAA are reliable. At 4xAA this gives a first mipmap blend at around -2 to -2.5, which is roughly comparable to 4xAF sharpness (albeit without achieving the same texture stability of the latter). This relatively large shift of -1.5 may not be in line with the Nyquist-Shannon sampling theorem, but the theorem applies to the theoretical worst case anyhow in the shape of highest frequency textures (such as a pixel-sized black/white chessboard pattern).
With four supersamples, therefore, if we play it safe, the actual maximum sharpness possible at which you don't have to run the risk of underfiltering is 2x AF [ld(n)=x]. This is only the case for ordered grid antialiasing, however; a less conservative approach would, in addition to the actual sample positions, also take into account the amplitudes of the mipmaps, which in comparison to the base map are necessarily lower and automatically permit the use of a greater LOD shift, at least insofar as the mipmaps are already being sampled. While this did not enable a fourfold "textbook" AF to be calculated, the same was rarely needed given the average texture sharpness in games at that time.
The quest for sharp textures was not triggered by the emergence of anisotropic filters. In fact, a variety of tricks had already been used in Unreal to try and circumvent the limitations of hardware of the day. In addition to detail texture mapping (additional texture overlays used to improve the appearance of surfaces at close range), which is hardly needed nowadays, and macro texture mapping (additional texture overlays used to improve the appearance of surfaces at a distance), which is more likely to still be used today, the Unreal Engine 1.x automatically applied a mipmap LOD bias of -1.5 under Glide.
This means that with bilinear filtering without supersampling, a single screen pixel had to be shared by up to 16 of the base texture's texels. A 1:1 ratio is the ideal, so in effect this amounts to 16-fold underfiltering. By today's standards that would of course be unacceptable, but in 1999 3dfx were praised for the good picture quality in Unreal Tournament, and ironically, the 3dfx accelerators outdid themselves in the performance stakes as well, topping all the benchmarks despite the increase in load from the shifted mipmap LOD bias (although in all fairness it should be noted that the Unreal Engine's design was pretty much perfect for Glide and the Voodoo cards).
Antialiasing
What is it that's so special about the Voodoo5 6000's antialiasing that its quality across the board has still not been bettered by any other manufacturer some 6 years after the demise of 3dfx?
By overlaying several images in the multisample buffer known as the "T-Buffer" it is possible to freely define the AA sample positions within a raster, something that can't be done with simple ordered grid supersampling (OGSS), or oversampling. With this method, the antialiasing is created by a slight displacement of the scenery. Using this "sparse grid", even with just 2 subpixels (i.e. 2xSGSSAA) both the X and Y axes are each sampled twice as accurately, whereas with a square subpixel arrangement (OGSSAA) this requires 2x2 = 4 subpixels (and therefore twice as much load).
A "sparse grid" is a cut-down version, so to speak, of an "ordered grid". The cuts are sensible, mind: while there is a negligible loss of quality in the antialiasing on the most important edges, the corresponding performance gain is considerable. In principle, a 2x OGSSAA can, only upsample one axis and accordingly smooth edges either horizontally or vertically. To achieve an edge equivalent resolution of 8x SGSSA using an ordered grid you have to use 8x8 (= 64x) OGSSAA, which is a good indication that as far as consumers were concerned OGSSAA was more of a token gesture than anything and was only implemented so as to offset technical deficiencies. nVidia matched the texture quality of 64x OGSSAA with the 8x AF on the GeForce3. One of our earlier articles goes into this subject in more detail, covering not just the basics of antialiasing but also, amongst other things, the differences between the different masks.
3dfx is now no longer the only exponent of this method of antialiasing. The R200 (Radeon 8500) was also originally supposed to support rotated grid supersampling, but could actually only do this when no fog was used. S3's 9x SGSSAA mode is the only one that can in fact improve visually on 3dfx's 8x SGSSAA, but this is nigh on unusable as the sample distribution in fullscreen mode appears to be arbitrary, resulting in poor picture quality. Those people who can be persuaded to run games in windowed mode, however, can enjoy the only antialiasing that at the very least matches that found on the Voodoo5 6000.
Of course, both ATI and nVidia have since produced more efficient AA modes, but it is not all that difficult to one up with situations where AAA/TSAA (OpenGL) don't work - once we look beyond the fact that the only smoothing these processes provide is with alpha samples (textures with binary alpha values, i.e. each pixel is either completely solid or completely transparent). The G80's new 8x MSAA mode offers a higher 8x8 axis resolution for polygon edges, but other parts of the image are not processed at all, whereas 3dfx's 8x AA still uses 8 supersamples.
The closest alternative to 3dfx's 8xAA was nVidia's 16xS mode combined with TSSAA, introduced with the NV40 (and currently not available on the G80). Like 3dfx's 8xAA, this gave an eightfold increase in axis resolution for alpha samples and polygon edges, but it differed in providing only four texture samples rather than eight (AF can of course be used to make up for this). It should also be noted that the "normal" 16xS mode on its own (without TSSAA) could not entirely outstrip 3dfx's 4xAA, as its OGSSAA portion smooths alpha-sample edges only at 2x rather than 4x - and that at the expense of a higher CPU load!
Until the introduction of the G70 (and with it TSAA, which was also made available retroactively on NV4x chips), the only option for picture-quality fanatics was the highly inefficient 16x OGSSAA, which like 3dfx's 4xAA applied an EER of 4x4 to the whole image. The way it was implemented meant that 16 texture samples were used rather than 4, which, given that the high-quality driver settings already deliver acceptable texture filtering, was simply wasteful. Moreover, this mode required roughly 94% of the available fill rate just for antialiasing, which reduced the performance level of a GeForce 6800 GT to approximately that of a Voodoo4 4500 (350 vs. 333 MTexel/s). Interestingly, a Voodoo5 6000 (clocked at 166 MHz) using 4x SGSSAA has precisely the same raw performance as a Voodoo4 4500 without antialiasing.
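The arithmetic behind these figures can be reproduced in a few lines; the clock rates and pipeline counts below are the commonly quoted specifications for these cards and are used here purely as assumptions:

# Rough fill-rate arithmetic behind the comparison above.

def mtexel_per_s(clock_mhz, texel_pipes):
    return clock_mhz * texel_pipes

gf6800gt = mtexel_per_s(350, 16)      # ~5600 MTexel/s
v4_4500  = mtexel_per_s(166, 2)       # ~333 MTexel/s
v5_6000  = mtexel_per_s(166, 2 * 4)   # four VSA-100 chips, ~1333 MTexel/s

print("6800 GT with 16x OGSSAA :", gf6800gt / 16, "MTexel/s per output pixel")  # 350.0
print("V5 6000 with 4x SGSSAA  :", v5_6000 / 4, "MTexel/s per output pixel")    # ~333
print("Voodoo4 4500, no AA     :", v4_4500, "MTexel/s")                         # ~333
print("AA share of fill rate at 16x OGSSAA:", f"{15 / 16:.0%}")                 # 94%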
[Page 4]
The 22-bit postfilter
To understand the reasoning behind the postfilter nowadays, it must be borne in mind that even in 2000, 32-bit rendering was by no means a given. While all of the graphics cards of the era could manage 32-bit rendering, for the majority it resulted in a hefty drop in performance that for the most part was in no reasonable sense proportionate to the visual improvement. 32-bit rendering was of course promoted by the various manufacturers and desired by nearly all developers, but it was also clear that the increasing graphical demands would eventually be too much for the well-established 16-bit rendering to cope with.
The development of Quake III Arena in 1998 essentially proved that particular point - and how. Q3A was one of the first games that looked markedly better in 32-bit than in 16-bit. Moreover, 3dfx did themselves no favours when their Voodoo3 graphics card hit the shelves in 1999, since it was limited to 16-bit rendering. However, from the outset the Voodoo chips had featured a postfilter mechanism that was fairly effective in eliminating the artifacts caused by 16-bit dithering.
Up to and including the Voodoo2 this was a 4x1 linear filter, which would, within specified threshold values, simply determine an average value based on four adjacent pixels. This did get rid of the irritating dithering artifacts, but in certain situations it created lines that were clearly visible, although this did not affect the picture quality quite so much as the dithering that was caused by rounding errors under 16-bit. With the Banshee, 3dfx had increased the size of the cache in the RAMDAC, which meant they could incorporate a second line from an image in the filter as well.
The result is the 2x2 box filter, which is usually what is meant when people talk about 3dfx's "22-bit" rendering. 3dfx themselves quoted approximately 4 million as the maximum number of colours such a postfiltered image could contain, which roughly corresponds to a 22-bit colour depth. The postfilter was by no means a catch-all solution, of course: while it could smooth out existing artifacts by interpolating four pixels, it could not prevent those artifacts from occurring in the first place, which becomes apparent with heavy alpha blending, where the large number of rounding errors produces visible dithering. In such cases the threshold value within which the postfilter operates is exceeded, so the dithering artifacts are left untouched.
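The following is a much-simplified model of such a thresholded 2x2 box filter (our own reconstruction operating on single-channel values, not 3dfx's actual RAMDAC logic); it shows how dither-sized differences are averaged away while larger differences, such as real edges or heavy alpha-blending artifacts, pass through untouched:

def postfilter_2x2(img, threshold=8):
    """Smooth 2x2 blocks whose values differ by no more than the threshold."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h - 1):
        for x in range(w - 1):
            block = [img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]]
            if max(block) - min(block) <= threshold:   # dither-sized difference
                out[y][x] = sum(block) // 4            # smooth it out
            # otherwise: likely a real edge (or too-strong dithering) -> keep
    return out

dithered = [[100, 104, 100, 104],
            [104, 100, 104, 100],
            [100, 104, 200, 204]]      # bottom-right: a real edge, not dither
print(postfilter_2x2(dithered))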
Another flaw is that, when smoothing, the filter cannot differentiate between a dither pattern caused by 16-bit rendering and one that is part of a texture's intended structure. Accordingly, there are also instances where the box filter can have a negative impact on a texture's design. This effect could actually be quite useful for mipmap dithering, but there the dither pattern was often so intense that the postfilter's threshold was exceeded and it couldn't be smoothed either. In practice, however, the postfilter did its job so effectively that for a long while 3dfx users were unable to reproduce the described problems with 16-bit rendering, while the output of other cards visibly degenerated because of them.
Even with the introduction of native 32-bit rendering on the Voodoo5, the postfilter actually gained in significance, as most Voodoo5 users were quite prepared to change a setting in order to enjoy the unrivalled AA modes. The outcome was that 32-bit rendering was passed over in favour of high performance at the highest possible resolutions, which in most games of the time cost only a minimal amount of visual quality.
So the performance loss that came from activating the postfilter, while measurable, was in reality not noticeable for the most part. 3dfx's supersampling then has the additional effect of oversampling each on-screen pixel, which considerably reduces the 16-bit dithering even before the postfilter is applied in the RAMDAC. This was so effective, in fact, that with 4xAA 3dfx deactivated the postfilter completely and still managed to conjure up a 22-bit on-screen image entirely devoid of artifacts.
Z-buffer accuracy
There is one downside of 16-bit rendering, however, that neither the postfilter nor supersampling could address: accuracy problems with the z-buffer. Ever-longer draw distances and increasingly detailed scenes placed growing demands on the accuracy of the z-buffer. While Glide, which used a non-linear quantisation of depth information not too dissimilar to a w-buffer, remained largely unaffected by precision problems, people playing newer games that relied on a 16-bit z-buffer had to contend with polygon shimmer, also known as "z-fighting". This, too, could be remedied in some cases - Mafia, for example, offers w-buffering, which distributes precision more evenly through a modified interpolation and in some instances eliminates the z-fighting completely.
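A small numerical sketch (our own; the standard perspective depth mapping is assumed for the non-Glide case, and the w-buffer is modelled as simply linear in eye space) makes the problem tangible: with 16 bits, the classic z-buffer spends most of its precision right next to the viewer, while a linearly quantised w-buffer spreads it evenly:

NEAR, FAR, STEPS = 1.0, 10_000.0, 2 ** 16

def z_to_depth(z_eye):
    """Standard perspective z-buffer value in [0, 1]: f/(f-n) * (1 - n/z_eye)."""
    return (FAR / (FAR - NEAR)) * (1.0 - NEAR / z_eye)

def eye_step_at(z_eye, eps=1e-3):
    """Eye-space distance covered by one 16-bit depth step around z_eye."""
    dz = (z_to_depth(z_eye + eps) - z_to_depth(z_eye)) / eps   # d(depth)/d(z_eye)
    return (1.0 / STEPS) / dz

for z in (2.0, 100.0, 5000.0):
    w_step = (FAR - NEAR) / STEPS        # a linear w-buffer spreads steps evenly
    print(f"z_eye={z:7.1f}:  z-buffer step = {eye_step_at(z):10.4f}  "
          f"w-buffer step = {w_step:.4f}")
# At z_eye=5000 one 16-bit z-buffer step spans roughly 380 eye-space units,
# which is exactly the regime in which z-fighting becomes visible.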
At this juncture we need to take a moment to mention "WickedGL", the OpenGL-to-Glide wrapper developed by Metabyte, which we ourselves used for our benchmarks. Besides letting the user force texture compression, this sleek and speedy wrapper (it weighs in at only 340 kB, compared to 320 kB for the last official Glide3x driver) translates OpenGL to Glide - which under normal circumstances delivers quite satisfactory 16-bit precision - and thereby provides perfectly adequate 16-bit precision in OpenGL games as well, often removing the last remaining advantage of 32-bit.
Picture quality: the consequences
Overall it's evident that 3dfx didn't make life particularly easy for themselves. For a long time end users simply had to take the overly enthusiastic claims of 3dfx's marketing department on faith or form an impression of the picture quality first-hand, as it was simply not possible to take screenshots that included the postfiltering. This meant that eager hardware sites such as 3DConcept were left twiddling their thumbs until late changes to the HyperSnapDX software finally allowed them to examine the topic in more depth. By then, however, it was already too late to do anything about the glut of misleading screenshots that had been published of every Voodoo card up to and including the Voodoo3. Furthermore, the software was never adapted to work with the Voodoo5, with the result that screenshots taken with 2x antialiasing active showed obvious colour banding that was not actually visible on screen.
There was also another problem, albeit one that only manifested itself on the Voodoo5 6000: under Direct3D it is virtually impossible to correctly capture any of its antialiasing modes in screenshots; this only worked in Glide and OpenGL games (either natively or via the WickedGL wrapper). On top of that, with supersampling part of the custom mipmap LOD shift is applied to the bilinear filtering, meaning there was absolutely no way of capturing the true display quality of a Voodoo5 6000 in screenshots when using 8x SGSSAA and a LOD of -2.
Because of this, all the 8x SGSSAA screenshots in this article use a LOD of -1, which is roughly equivalent to 2xAF. In actual fact there is nothing to stop you setting the LOD to -2 in the tested games at this high antialiasing mode, which, due to the simple trick of LOD shifting to the second chip as explained above, approximates the sharpness level of 5xAF. This means that the Voodoo5 6000 can produce pictures of a quality that is still eye-catching even by today's standards - with an associated processor load that more often than not has a significant impact on final performance. Still, in extreme circumstances (alpha samples galore, say) even a Radeon X1950 XTX can lose up to a tenth of its overall performance when running 6x AAA + ASBT, but a performance hit like that is still tolerable so long as things remain playable.
By contrast, nVidia's G80 chip comes with no such power-hungry AA modes. While the 16x OGSSAA mode that was still - unofficially - available on the G70 cards was a massive performance hit, leaving just one sixteenth of the original fill rate free for actual rendering, the G80's antialiasing repertoire is presently headed by its 16xQAA multisampling mode with TSSAA. While this is considerably more efficient in terms of the performance cost for image enhancement, not all of the screen's contents are processed.
In addition to polygon edges, the antialiasing samples only enhance alpha samples; textures and pixel shaders are free to flicker on their merry way. Because of this, the G70 must currently be considered to have the higher maximum possible picture quality overall, even though that is only achieved in older games, or at framerates that are no longer acceptable. To put it another way: a single non-mipmapped scene (whether through a design error or an oversight, as is so often the case with console ports) is all it would take to make today's state-of-the-art graphics cards look worse than 3dfx's 6-year-old card ;).
It is really the T-buffer that represents the focal point of this entire discussion of picture quality. Sadly, in practice its use never extended beyond antialiasing; had the T-buffer's capabilities been exploited consistently, the overall cost of improved picture quality would in turn have been much lower, because soft shadows, soft reflections, motion blur and depth of field or antialiasing could all have been calculated in a single pass (antialiasing and depth of field are fullscreen effects, so they each need a frame buffer of their own), meaning there would have been no additional processor load (apart from the complex shifting of vertex coordinates). However, there was no API extension available at the time that would have allowed developers to access these features directly: the transformations had to be done manually. A modular T-buffer engine would therefore have had to be written for these features to find their way into commercial games.
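Purely as a conceptual sketch (our own construction, not 3dfx's API; every function name here is invented), the principle might look something like this: each render pass is jittered in subpixel position and in time, and simply averaging the T-buffer contents then yields antialiasing and motion blur from the same set of passes:

import random

def render(scene, subpixel_offset, time_offset):
    """Stand-in for one render pass; returns a flat list of pixel values."""
    return [scene(x, subpixel_offset, time_offset) for x in range(8)]

def t_buffer_combine(scene, passes=4):
    # each pass gets a subpixel jitter (for AA) and a temporal jitter (for motion blur)
    jitter = [((i + 0.5) / passes, random.uniform(0.0, 1.0)) for i in range(passes)]
    buffers = [render(scene, sub, t) for sub, t in jitter]
    # average the T-buffer contents per pixel
    return [sum(px) / passes for px in zip(*buffers)]

# toy "scene": a moving edge whose position depends on time and subpixel jitter
moving_edge = lambda x, sub, t: 1.0 if x + sub < 3 + 2 * t else 0.0
print(t_buffer_combine(moving_edge))   # softened edge = AA + motion blur at once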