
x86 overhead - info from a former AMD developer


Gast
2017-06-21, 11:04:26
Perhaps the following info can be posted somewhere here in the forum. I think it is not uninteresting. The site on which cmaier posted has already been offline once before.


I've designed both PowerPC processors for Exponential Technology and x86/x86-64 processors for AMD, so I'd be happy to answer any questions people have, preferably in English.


no problem. That quote was copied (not by me) from something I wrote on macrumors, and I was directed by somebody else (not the person who seems to have quoted me) to this site. I have a Ph.D. in electrical engineering (my dissertation involved an ultra-reduced instruction set computer) and worked at Exponential Technology (a startup that Apple invested in in the 1990's) as one of two logic designers on the x704 chip (a PowerPC). I then worked at Sun briefly on a SPARC processor before I moved to AMD, where I worked for 10 years on x86 chips, including K6 and the initial K8 (where I was in charge of the integer execution and scheduling blocks, and helped define the AMD64 extensions to integer ops, among other things). I retired from the microprocessor design business a couple of years ago and became an attorney. I was directed here because someone told me there was some controversy about the CISC penalty, and since I've worked on three different RISC architectures as well as x86 (the only CISC architecture that matters), I know quite a bit about this topic. If you want to check whether I'm for real, I have been published in technical journals, for example:

Maier, Cliff A. et al. (1997). "A 533-MHz BiCMOS Superscalar RISC Microprocessor". IEEE Journal of Solid-State Circuits, Volume 32, Number 11, pp. 1625-1634.

Atul Garg, Y. L. Le Coz, Hans J. Greub, R. B. Iverson, Robert F. Philhower, Pete M. Campbell, Cliff A. Maier, Sam A. Steidl, Matthew W. Ernest, Russell P. Kraft, Steven R. Carlough, J. W. Perry, Thomas W. Krawczyk Jr., John F. McDonald: "Accurate high-speed performance prediction for full differential current-mode logic: the effect of dielectric anisotropy". IEEE Trans. on CAD of Integrated Circuits and Systems 18(2): 212-219 (1999).

Pete M. Campbell, Hans J. Greub, Atul Garg, A. Steidl, Steven R. Carlough, Matthew W. Ernest, Robert F. Philhower, Cliff A. Maier, Russell P. Kraft, John F. McDonald: "A very wide bandwidth digital VCO using quadrature frequency multiplication and division implemented in AlGaAs/GaAs HBT's". IEEE Trans. VLSI Syst. 6(1): 52-55 (1998), among others.


Okay, some facts. What is the x86 decode penalty? On K8 (Athlon 64, Opteron) we had a DE block for decode. We also had IF (instruction fetch), EX1 (superscalar issue and retire), EX2 (integer execute), FP (floating point), LS (load/store), IO (i/o's), L1 (cache), L2 (cache), etc. Each team had around the same number of people in it except for the caches which had more, and the register file block which had more. (I was in charge of EX1/EX2 and the register file, so I guess I count for two or three). Overall, about 1 in 15 people worked on x86 decode issues.

In terms of die area, decode was about the same size as EX1 + EX2. It was a fairly small sliver of the die. Maybe 2%. In terms of transistors, it was also around 1-2%. We had around 30-50M non-cache transistors (depending on whether you count buffers, and depending on which transistors you count as part of the core). I think someone threw around some numbers for how many transistors are required for x86 decode, and those numbers match my experience. It's around 1-2%.

Finally, keep in mind that PowerPC and other RISC processors do not decode for free. It's not true anymore that the instruction bits set all the gates and do all the decoding for you. Both PowerPC and SPARC instruction sets do have to go through a decode stage. So the penalty I mentioned above should have subtracted off the decode cost on whatever your alternative architecture is.

In short, the x86 penalty is pretty small in terms of design time/hardware cost/die area. The benefit of x86 is that compilers are heavily optimized for x86, and x86, while it contains a lot of garbage, also contains many features that are specifically tuned to modern operating systems and to the behavior of compilers. As a result, more of the transistors on an x86 are doing useful work in each cycle than in a RISC processor.

RISC, by leaving these things out, can result in smaller, higher-clock speed processors. But power = CV^2f, so higher clock speed is no longer considered as good as higher instructions-per-cycle. Most of the x86 decode work is dedicated to setting up the superscalar issue hardware with hints to maximize instructions-per-cycle.
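The trade-off cmaier describes follows directly from the dynamic-power relation P = C·V²·f. A minimal sketch, using invented capacitance, voltage, and frequency figures purely for illustration (higher clocks usually also require higher voltage, which is why frequency scaling is so expensive):

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Switched-capacitance dynamic power estimate: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz

# Hypothetical chip: 1 nF of switched capacitance.
base = dynamic_power(1e-9, 1.0, 2e9)  # 2 GHz at 1.0 V -> 2.0 W
fast = dynamic_power(1e-9, 1.2, 3e9)  # 3 GHz, but needing 1.2 V -> 4.32 W

# A 50% clock increase costs ~116% more power once the voltage bump
# is included -- which is why higher IPC beats higher frequency.
print(base)         # 2.0
print(fast / base)  # ~2.16
```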

Finally, there is no way the A4 is a PWRficient. I can't explain why I know that, but if you pay $99 to Apple, they will tell you what ISA the A4 uses.

Re: some of your questions. I spent the 2000's working on x86, so I was aware of PowerPC improvements only through reading the papers, and through following their engineering to make sure we were competitive. I don't know of anyone who believes PowerPC macs are faster than current Intel Macs. I think that is nostalgia on your part (or perhaps comparing old software on old machines to new software on new machines). Certainly doesn't match my experience. Battery time may or may not be less (my Intel MBP gets 7 hours on average. Seems pretty good to me).

As for ARM benefitting, that's mainly because people are willing to throw out tons of performance for much better battery life (a sensible trade-off for mobiles). But if you tried to scale an ARM up to compete with an i7, for example, you'd end up burning just as much, if not more, power, because you'd have to run at too high a clock frequency to make up for the lack of instruction parallelism.

What I'm saying is that there's nothing you can do with a RISC that you can't do with an x86 - it's just that RISC products, unable to compete with x86 for the meat of the market, are engineered for niches (low power portable and game boxes, super high speed workstations with exotic cooling requirements, etc.) No one, even Intel, has really started with a clean sheet of paper and seen what could be done in the handheld space with an x86, but it's likely they could get pretty close to ARM's performance/watt, though it would be a much more expensive die since it would have larger die area (it always costs more die area to increase instructions per cycle instead of frequency).

Source with further posts by cmaier: http://www.ppcnux.de/?q=tim-cook-und-das-iphon-sdk-zum-a4

I'm happy to quote the further posts as well, provided this post isn't discarded here.

Leonidas
2017-06-21, 12:11:09
Feel free to post it here.

If you don't mind, I'll then move it into the appropriate forum.

Also, would it be possible to make the title a bit more precise? What would be a precise title describing the topic?

Gast
2017-06-21, 12:22:04
And what is that supposed to tell us? AMD employees dishing out years-old insider stories?

Exxtreme
2017-06-21, 12:42:18
And what is that supposed to tell us? AMD employees dishing out years-old insider stories?
It's more that there are quite a few x86 haters here in the forum who consider x86 an inefficient, power-hungry, outdated technology that only survives because of its huge software base, and who think prettier, more elegant, more sophisticated technologies like RISC stand no chance only because they have to fight against such "unfair" starting conditions.

Apparently this ex-AMD engineer doesn't quite see it that way. :)

Gast
2017-06-21, 12:48:14
More information on x86 overhead. The posts are from 2010.


Your post is long, so I'll get to its points later, but I'd like to address the verification costs issue. Verification refers to software verification. x86 is very hard to verify because it has decades of software written for it, all of which must work. It is far easier to verify RISC chips, which have much smaller bodies of software and run only one or two operating systems that matter. At AMD we had hundreds of thousands of test vectors captured from thousands upon thousands of software programs that we tested to make sure our chip would be fully compatible with every piece of x86 software out there, and on all the OS's that run on x86. In other words, verification is hard on x86 because x86 is so popular, not because there is a problem with x86.

Re: Hannibal - I don't know why he's saying it. I don't know his background. And I don't know anything about P6. Do keep in mind, however, that the P6 was a very simple microarchitecture and had far fewer transistors than the machines I worked on - decode doesn't grow, so as you add transistors to the rest of the core decode takes a smaller percentage of the die. I guarantee you it never took 40%, though. That's just ridiculous.

Re: ipad sdk. You are looking in the wrong place.

Re: niches: performance per watt at the high end is not competitive for PowerPC. For 2% more performance you pay 10% more power.

Re: "Interestingly it seems, that I am not the only one having this experience." Ok. I'm sure people also have the opposite experience. I'm sure I can't convince you of anything. Just telling you what I know, as someone who has no axe to grind and no particular bias in favor of one architecture or another since I've worked hard making chips in all sorts of architectures, and don't currently work for any company (and hence have no skin in the game anymore).


[...]
p.s.: Even your blogger friend Hannibal doesn't stick by his 40% number. http://arstechnica.com/old/content/2005/11/5541.ars. He said, all the way back in 2005, that the penalty at that point was 4%-2%=2%. (In the response I was going to make prior to reading your final paragraph, I was going to point out how even RISC processors contain decoders with ROMs for use in patching for bugs and adding features without having to do an all-layers spin, how all modern RISC processors have decode stages, etc. Looks like Hannibal is aware of that.) Here's an excerpt:

I've talked a bit about point #1 in previous articles, especially my look back at the history of the Pentium line. The original Pentium spent about 30% of its transistors on hardware designed solely to decode the unwieldy x86 ISA. Those were transistors that competing RISC hardware like PowerPC and MIPS could spend on performance-enhancing cache and execution hardware. However, the amount of hardware that it took to decode the x86 ISA didn't grow that much over the ensuing years, while overall transistor budgets soared. On a modern processor, if x86 decode hardware takes up twice as many transistors as RISC decode hardware, then you're only talking about a difference of, say, 4% of the total die area vs. 2%. (I used to have the exact numbers for the amount of die area that x86 decode hardware uses on the Pentium 4, but I can't find them at the moment.) That's not a big enough difference to affect performance.

Gast
2017-06-21, 13:07:19
It's more that there are quite a few x86 haters here in the forum who consider x86 an inefficient, power-hungry, outdated technology that only survives because of its huge software base, and who think prettier, more elegant, more sophisticated technologies like RISC stand no chance only because they have to fight against such "unfair" starting conditions.

Apparently this ex-AMD engineer doesn't quite see it that way. :)
Ah, okay. Thanks.

RISC restricts itself to the necessary instructions. As compensation it has more registers, so that fast register operations can be executed more often in place of the slower memory operations. The small instruction set makes the processor cheaper to build, including hard-wiring the instructions in the decoder, so that execution proceeds with a unified data format and a uniform instruction length. The biggest drawback is the missing (support for) microcode (the IBM standard).

At the time people just wanted to save time and development costs, which is why x86 was bloated, and that persists to this day even though it looks more complex. Wherever IBM compatibility is not required, RISC can be used as well. So it wasn't Intel's doing; after all, they built Itanium processors and much more on a RISC basis. AMD will hardly see it differently.

Leonidas
2017-06-22, 03:52:22
Anything that serves to gather information I consider valuable. Judgments on it can follow later.

mczak
2017-06-22, 16:29:14
A supposed advantage of x86 is its compact instruction encoding, since common instructions use relatively few bytes (i.e. better for the instruction cache). But if you use AVX, for example, that's a bit of a joke: the average instruction there easily takes 5-6 bytes, and nothing at all comes in under 4 bytes (2-byte VEX prefix, 1-byte opcode, 1-byte ModR/M), and often it's more (up to a 3-byte VEX prefix, 2-byte opcode, 1-byte ModR/M, 1-byte SIB...).
I once analyzed this (on nearly AVX-only code), and the code (using AVX-128) actually ended up somewhat larger than the SSE version, even though it contained significantly fewer instructions (because with SSE you get roughly a 20% share of reg/reg moves due to the destructive 2-operand syntax, and with AVX those all disappear). Performance-wise it's pretty much a zero-sum game (on modern Core CPUs the register moves need no execution unit, since they are handled via register renaming).
The many bytes per instruction are not great, not only because of the L1 I-cache, but also because, on Intel in particular, the decoder processes at most 16 bytes per cycle, so on AVX code the decoder really is a bottleneck: on average you only get maybe 3 instructions per cycle through the decoder, with some luck. In the end it probably still doesn't matter (SIMD instructions can use at most 3 execution ports anyway, and nowadays there are also post-decode caches for small loops etc.), but there would definitely be more sensible encodings (even relatively complex variable-length ones) than x86...
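The byte counts above can be tallied in a toy sketch. The field sizes follow the post (VEX prefix, opcode, ModR/M, SIB); real encodings additionally carry immediates and displacements, so these are lower bounds, and the 16-bytes/cycle decode window is the figure quoted above:

```python
def avx_length(vex=2, opcode=1, modrm=1, sib=0):
    """Sum of the structural encoding fields of an AVX instruction, in bytes.

    Immediates and displacements are ignored, so this is a lower bound.
    """
    return vex + opcode + modrm + sib

shortest = avx_length()                        # 2+1+1   = 4 bytes minimum
longer = avx_length(vex=3, opcode=2, sib=1)    # 3+2+1+1 = 7 bytes

# With an average of ~5.5 bytes/instruction, a 16-byte/cycle decode
# window sustains only about 3 instructions per cycle.
per_cycle = 16 / 5.5

print(shortest, longer, round(per_cycle, 1))   # 4 7 2.9
```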