Will Intel 18A be a repeat of Intel 180nm? Intelligrated - Part II
On Dec 1, 2024, Pat “retired” from Intel. I don’t know all the boardroom drama that resulted in that decision, but it seems odd that the decision by either party came on a Sunday in December, when Dec 1, 2023 would have been better timing for Pat, or Dec 1, 2025 for Intel. Let me explain…
Back in January 2022, I argued that Intel should not be split (Intelligrated). This separation seems like a premature decision on either side. Intel is facing a 1998-2001 (dot.com) like situation: there was then, and there is now, an inflection point in the platform. At the core, there are 2 questions:
a) Will Intel’s 18A process node catch up with TSMC and exceed it on performance and yield?
b) Does it matter if it did?
Let’s go back in history and see what happened then, as I had a ringside seat and was on the receiving end of Intel’s wrath back then.
As you can see (and as I have said before), Intel came from behind and beat the competition (in that case TI and IBM) in all aspects: transistor design, metal stack, copper metallization. If you zoom into the red box… Here’s IEDM data and Intel’s own execution data from that period.
The bottom left chart is the critical one: the red circles highlight the shift in Ion from process node to process node. From 250nm to 180nm to 130nm, Intel went aggressive on transistor engineering, with successive generation gains that separated them from TI and IBM. We (Sun) and TI were focused on the copper metal transition at 180nm, while IBM was not as focused on transistor performance, as they had other basic challenges with their Power roadmap.
The 2 slides below are from Mark Bohr’s presentation from that era on their 90nm transistor engineering. The two notable items are the gate length and Tox scaling, and the slide above on Ion/Ioff. Look at the Ion going from 700 µA/µm to almost 1000 in one generation (roughly a 40% improvement). Until then, we in SPARC land were beating Intel on both chip architecture and process engineering, working with our partner TI to deliver the best price/performance. In 1998, I launched the industry’s lowest cost, highest performance Unix workstations and servers. By 2003 we had lost that lead, in part due to Intel and in part due to Sun’s management decisions (a separate blog on that…). Regardless, it was this transistor engineering and prudent choices (postponing the copper transition, introducing strain) that put Intel ahead. Sure, life was a lot simpler for process engineering back in 1998-2001 relative to now, but these were still risky decisions and risky engineering back then.
Source: Mark Bohr presentation
The perfect storm occurred for Intel between 1998 and 2001: improved transistor engineering, the emergence of Linux and a market for a new “open” distributed computing platform via the LAMP (1.0) stack, and finally getting their microprocessor design more in line with what the RISC guys (SPARC, MIPS, Power) were doing, instead of getting side-tracked with architectures like Itanium and P4, which was all about MHz marketing. So Process, Processor and Platform were starting to align to enable Intel to take the lead between 2001 and 2010 in the emergent cloud buildout (distributed computing 1.0). Distributed computing 1.0 is best defined as Cloud 1.0, where the basic computing building block was a 1U, dual-socket, multi-core CPU server with 10G Ethernet and 32GB of DRAM. It was a run unmatched. Not by design, except for pulling out all the stops on its core value: process manufacturing, to be on or ahead of the Moore’s law treadmill.
But since 2010, perhaps post-Nehalem, Intel has had changes in CEO (and the departure of Pat as well), from Craig Barrett to Paul Otellini and then to BK and Bob Swan, who were non-technical and ‘non-founder’ CEOs. Paul made two errors: one with mobile, which is famously documented, and the other under-investing in process tooling as well as engineering, which continued under BK and is now an issue, 10 years later. The semiconductor business is very unforgiving. It’s like a huge cannon: you have to take aim 5 years out and hit it out of the ballpark.
Coming to today… Like dot.com and 1998-2001, which launched distributed computing 1.0, we are at the cusp of distributed computing 2.0 starting in 2025, given that training scale and inference scale are bifurcating. Inference is more of a distributed computing 2.0 paradigm, with heterogeneous compute and memory, than training, which is more homogeneous compute (GPU only) at massive scale (30K-100K GPUs in large training clusters). Even at 2T parameters (which requires about 10TB of HBM), the basic building block is a sub-rack as the unit of compute (at 256GB of HBM per GPU, that is 40 GPUs for a 10TB model).
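To make that sizing arithmetic concrete, here is a back-of-envelope sketch in Python. It is only a sketch under stated assumptions: the 5 bytes per parameter is simply what “2T parameters needs 10TB” implies (what exactly that covers is my assumption, not a measured number), decimal TB/GB are used, and the function names are made up for illustration.

```python
import math

TB = 10**12  # decimal terabyte, in bytes
GB = 10**9   # decimal gigabyte, in bytes


def hbm_bytes_needed(params: int, bytes_per_param: int) -> int:
    """Total HBM footprint for a model, in bytes."""
    return params * bytes_per_param


def gpus_needed(total_bytes: int, hbm_per_gpu_bytes: int) -> int:
    """Smallest number of GPUs whose combined HBM holds the model."""
    return math.ceil(total_bytes / hbm_per_gpu_bytes)


# Assumption: 5 bytes/parameter, implied by 2T parameters -> 10TB.
model_bytes = hbm_bytes_needed(params=2 * 10**12, bytes_per_param=5)

# Assumption: 256GB of HBM per GPU, as in the text.
gpu_count = gpus_needed(model_bytes, 256 * GB)

print(f"model footprint: {model_bytes / TB:.0f} TB")
print(f"GPUs needed at 256GB each: {gpu_count}")
```

10TB divided by 256GB is a little over 39, so you round up to 40 GPUs, which is a sub-rack, not a 30K-100K GPU cluster.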
Going back to the key questions….
1) Will Intel get to parity or take the lead w.r.t. TSMC starting with 18A (like 180nm was the decider back in 1998, 25 years back)?
2) Will it matter?
Clearly (1) is needed for Intel to stem the tide away from it and towards AMD. It’s both a process and a processor design issue. It will also drive power/performance, be it for datacenter or for desktop/laptop silicon, which is the bulk of Intel’s revenue.
On “does it matter?”, the answer is more complicated. Intel has basically 4 classes of products:
1) Desktop/Laptop - this would need process leadership (esp. on the power/performance equation) with backside power to fight the onslaught of ARM-based CPUs from Qualcomm and others.
2) Datacenter CPU (x86) - this needs process leadership to match and exceed AMD’s roadmap, unless they merge, which then solves the problem as x86 would be a true monopoly.
3) Datacenter GPU - this market is lost to Nvidia (even for AMD) as it’s a scale/systems problem and training is locked up by Nvidia for a small set of frontier model customers. The mis-execution in the Gaudi roadmap, and a product and solution (full system) gap of 2+ years now going on 3, has basically left Intel with no choice but to acquiesce to Nvidia (AMD is not far behind on this, despite having a more credible MIXXX roadmap).
4) Custom accelerators for 3rd party CSPs - this needs strong process leadership *AND* design with a services mentality to win Google (TPU), Amazon (Trainium) and whatever Microsoft is planning.
So process leadership is a critical element of winning (1), (2) and (4). On (3), Intel is facing the same luck of the draw as it did in the dot.com days.
The luck here is the bifurcation of training scale platforms from inference scale platforms: training scale is an Nvidia monopoly, while inference scale is new and open to all. Just as distributed computing 1.0 was defined by LAMP 1.0, inference scale is defined by what I call LAMP 2.0, i.e. Llama, Accelerator, Memory, PyTorch. LAMP 2.0 defines distributed computing 2.0.
Pat’s decision on 5 process nodes in 4 years to bring back process leadership is spot on for regaining leadership with 3 out of the 4 products. Intel at its core is a process/tech/manufacturing company with great marketing and ecosystem building. It is not a great design shop, as it lagged for multiple decades on processor designs compared to MIPS, SPARC and even Power 4. As I have mentioned before, the design teams were hiding behind the process leadership, and when the tide fell, they were exposed. Doug O’Laughlin of Fabricated Knowledge makes an excellent case that the board was wrong for over a decade. Now, Pat is not without his faults or blind spots. Notably, Pat did not change the board to be reflective of the challenge at hand, except for bringing on Lip-Bu Tan, who left earlier this year.
On the product design side, Pat has made some poor decisions, such as keeping a bad or ill-defined and perhaps ill-designed architecture for 2+ years, notably on the GPU side. Pat has famously talked about Larrabee, but that is way in the past. As stated above, any bet is a 5 year bet, but it needs the right team, the right architecture and, in the case of GPU, the right software. Pat made a bet with Gaudi2/3 which was flawed from the beginning, as it had the wrong core architecture, incomplete software and a team that, while capable, was not world class.
It was a perfect storm of being sub-optimal, and the results have shown it. Here is a lesson I learnt from a former VC colleague of mine: when you have the wrong product, instead of continuing to hope to make something out of it, killing it, however painful, will lead to the right solution. Pat’s mistake was holding onto Gaudi2/3, as it presented a false choice for Intel during the rapid rise of GPUs since 2022/2023. If Pat had killed Gaudi2 back in 2023, it would have forced Intel to accelerate Falcon Shores (3) and perhaps (4), with the focus on winning custom silicon via packaging of various Intel components (better than Broadcom, which has a >$10B business today). Intel has more IP and capability to do custom silicon for hyperscalers. Its culture was the limiter. This falls under the could’ve, should’ve.
Coming back to the original assertion: is this the time for Pat and Intel to separate? It all depends on where Intel is on its execution of 18A. If all the metrics point to a successful outcome for 18A, a change of guard at this critical juncture could hurt Intel more than it can fix, regardless of whether there are better folks to shepherd the production engineering of 18A. If 18A is truly uncompetitive (and Intel has enough data from 50 years to determine that), then the decision for separation was the right one. It’s not clear the board has that data, which is unfortunate.
I quoted the 180nm example from 1998 because Intel has luck on its side, but no room for errors. Inference scale is going to create the opening for distributed computing 2.0, like LAMP 1.0 did for distributed computing 1.0. Distributed computing 1.0 disrupted Sun’s big SMP thesis back then. Distributed computing 2.0 will disrupt Nvidia’s roadmap (more on this in a separate blog).
While Intel has lost $150B in value as Nvidia has gained $3T (20x delta) in value in the past 2 years, Intel is facing a strategic opportunity to repeat (not by design, but by chance) the success of dot.com (LAMP 1.0) with LAMP 2.0 (more on this in an upcoming blog).
But at the center of all of this, what Intel has to execute on is 18A, so that the other parts of LAMP 2.0 inference scale platforms can be powered by Intel’s roadmap.
The board has perhaps made a mistake, or perhaps Pat was too quick in his decision to “retire”. Knowing Pat, I assume that, like in the past 14 years, the board has made the wrong call once again.
If 18A is competitive with TSMC, Dec 1, 2024 was bad timing for both the board and Pat to separate. If not, then it does not matter.
If Pat were to come back (as some speculate), he has to go into ‘Founder mode’ or ‘Andy Grove mode’ which is…
18A or bust.
“Execute or be executed” famously said by Andy Grove. This is for the Laptop and DC chip teams.
Focus on inference scale platforms. Give up any hopes for training silicon (that is an Nvidia market). Intel has lots of assets for this, including software.
Think outside the box for custom silicon design teams for hyperscalers, to grab market share from Broadcom and Marvell or get part of the training silicon TAM.