Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Puget Systems' Perspective on Intel CPU Instability Issues (pugetsystems.com)

181 points by layer8 433 days ago | 91 comments

w10-1 433 days ago [-]

Sorry, unable to believe: 2-4% failure rate for CPU's?

That's for detected/known failures: what about random, unable-to-reproduce, hardly noticed the data skip failures?

Have I been living in a fantasy bubble where CPU's do exactly what you asked of them (and errors come from not holding it right)?

layer8 433 days ago [-]

This is consistent with their report for 2019-2021, preceding the issues with the 13th/14th gen: https://www.pugetsystems.com/labs/articles/most-reliable-pc-...

Apparently it was much better in 2018: https://www.pugetsystems.com/labs/articles/most-reliable-pc-...

On the other hand, 2011 did show 1.5%: https://www.pugetsystems.com/labs/articles/most-reliable-pc-...

GPU failure rates also weren't great 10-15 years ago, in particular for AMD: https://www.pugetsystems.com/labs/articles/video-card-failur...

moffkalast 432 days ago [-]

Had three ATI Radeons back in the day, and that's only because the first two died under warranty one after the other lmao. it was even worse before AMD bought them.

Reason077 433 days ago [-]

> "Sorry, unable to believe: 2-4% failure rate for CPU's?"

It seems high to me too. I can't recall ever having a CPU fail and I must have used/owned hundreds of them in my lifetime. But presumably, failure rates in data centres where CPUs are run 24/7 at high temperatures etc are higher than in consumer applications?

Interesting that failure rates seem to peak in the summer months, too, and this didn't seem to be explained in the article. Perhaps the data center's cooling is working less effectively in summer?

sqeaky 433 days ago [-]

How many CPUs have you owned? Compare this to Rolling a die in a tabletop game. How often do you roll a natural one or snake eyes?

On a 20-sided die a natural one only comes up 5% of the time.

I don't think I'm quite at having owned 40 (desktop) CPUs, but I've had failed CPUs a number of times. On an old Athlon XP 3000+ it had bad CPU cache and was able to change it from crashing repeatedly to working by disabling a large swath of the L2 cache in a bio setting. I am just now retiring a workstation that had an AMD Ryzen 5950x that was generally unstable, that would pass memtest and all the diagnostics I know how to run would pass, but about once a month would print a message on all the consoles I had open about an MCE kernel exception detected on some CPU number at random. One time I had one of those budget TriCore CPUs that had the 4th core turned on in the BIOS by accident, and that generally caused the ton of issues when I figured out it was being detected as a quad core I went in and disabled that last core and it went back to working just fine.

I'm sure at least one or two of my Intel machines failed similarly, I've had a number of dead CPUs including a few Pentium 4s back when I used to fool around with overclocking settings, and at least a few dead server chips. And I know that I've had a few systems that simply wouldn't boot even when all the parts were fresh out of the box but would when I'd swap out one part or another, and sometimes that failed part was a CPU whether it be AMD or intel.

Oh, I also get to cheat no clue how many dead Solaris CPUs I have seen! Those big Mainframe imitation systems have the ability to hot swap CPUs and it was pretty rare to have a cabinet sized computer where every piece of hardware was fully operational, at least in my role as a legacy maintainer of such things. I bet they worked much better when they rolled off the factory line.

thereddaikon 433 days ago [-]

Outside of overclocking its very rare to see a failed CPU. Validation testing at the fab almost always catches the lemons and it usually takes special circumstances for one to degrade after fabrication. OC'ing with higher voltages is the most common culprit. The number of honestly bad CPUs I've seen in my IT career I can count on my hand. Intel's current issue is due to a manufacturing error and definitely qualifies as extraordinary circumstances.

whizzter 432 days ago [-]

At some point CPU manufacutrers started treating overclocking as a feature rather than somethings hobbyists do and then computer OEMs started to tune things for this, but due to the quick generation cycles without any kind of long term testing it's only been a matter of time until we started seeing these issues since Moors law hasn't helped much with single core performance for years.

My current laptop was getting uncomfortably hot when some random browser pages started pressing the cpu, after searching I noticed that the default setting was to enable some kind of "Boost Mode" (that's basically overclocking in the classic sense), disabling that made a world of difference and looking at the failure rates of the Ryzen 5000 series in the article I'm not a single bit surprised about it.

Googling the laptop family you get tons of Reddit hits, https://www.reddit.com/r/ZephyrusG14/comments/gho535/importa...

sqeaky 432 days ago [-]

Oh yeah, not trying to say this level of failure is normal. Puget getting 5% failures, or thereabouts, is a typical historical failure rate and they are getting it by being more conservative than others. Ending else is running more aggressive defaults.

I was just trying to provide a few examples of real first hand failures. And most OCing doesn't break anything, but every once in a while you set some voltage and one part never works again, hard not to conclude it was the OC when the failure perfectly coincides. I suppose it could be coincidental, but that stretches credulity.

jerf 432 days ago [-]

"How many CPUs have you owned?"

Another issue as an end user is I don't have the resources to prove that it was the CPU a lot of the time. I've had some laptop failures that could have been the CPU, could have been the motherboard, could have been the power supply, dunno, all I've really got is that it doesn't boot and I don't have the capacity to diagnose it due to the level of integration and inability to get replacement parts to even try to diagnose the problem. And while as a poor college student, I had the time and desire to carefully replace parts and exert maximum diagnostic effort to figure out exactly what is wrong because I can't just buy a new complete setup, not everyone goes to this level of effort, and they may not be correct.

End users really can't pick up on these trends. The data set is so noisy. Sure, the "end users" may have had suspicions about this but I've also seen communities come to consensus about certain things being broken that I had very good reason to believe were wrong and were just internet forums amplifying random loud voices being confidently and loudly wrong about things until it become "common consensus" through the power of nothing but the confident and loud error.

AHTERIX5000 433 days ago [-]

I experienced problems with 5950X as well and nobody (including myself) seemed to believe it was the CPU before I got another 5950X which just worked with the same setup.

The issue wasn't easy to reproduce, all standard checks and torture tests ran just fine but much more random workload crashed the unit when all cores were maxed for few seconds at a time instead (eg. during compiling). Sometimes it happened twice a day sometimes once a week.

alyandon 433 days ago [-]

I'm in the process of troubleshooting my son's desktop and have swapped out every bit of hardware except for the motherboard and the ~2 year old 5600X itself. I just assumed the motherboard had failed at that point and RMA'd the thing but the OEM tested the board and said it checked out. At this point, the CPU is the only thing left. :-/

telgareith 432 days ago [-]

Try Disabling C states in BIOS. And verify they're disabled within the OS.

alyandon 432 days ago [-]

Unfortunately, the system was running fine for 2 years and then deteriorated to not-quite-fully-dead in the span of about 20 minutes.

It would only attempt to POST 1 time out of about 10 power on attempts and won't finish enough of the POST to even make it into the BIOS/UEFI setup. When it would fail to even attempt to POST it wouldn't even initialize the USB peripherals (keyboard doesn't light up).

I swapped everything (GPU, ram, power supply, etc) before getting to the point of suspecting the motherboard was faulty because it would not Q-FLASH via a USB thumb drive so I sent it in for warranty service. Since the manufacturer said the motherboard passed their quality checks there is nothing else that could be left besides the CPU.

Even given all that, I refused to believe the CPU could be the culprit until I saw the failure rate graph for AMD 5000 series processors in the article that far exceed what I figure would have been in the 0.5% range.

Live and learn I guess but it certainly adds a new fun dimension to troubleshooting because I don't exactly keep spare CPUs laying around "just in case" like I do with a spare power supply.

telgareith 432 days ago [-]

oof. Still, hopefully other people read this and try disabling C States. I was about to spend the $25 for a spare CPU when the C state issue came up and...

For me it was simple: disable C states - Stable for a week (at first it'd be weeks between crashes, then days, then daily, and at this point a couple hours) Enabled them again - crashed within an hour Disable again - Stable for a week. Flop back and fourth a handful of times in the same day.... same as above.

I can only guess that this might be something that was fixed in the mysterious "frequency voltage curve change" stepping.

telgareith 432 days ago [-]

On the 5950x, try disabling C states (not P, C; the ones where it 'sleeps' CPU cores). And verify they're disabled within the OS.

It took most of a year to nail that one down.

AMD didn't ask for anything besides proof of ownership once I told them disabling C states fixed it. And, 5950x's have a 5 year warranty...

PS: if they were Northwood P4's, see: https://www.overclockers.com/forums/threads/the-official-sud... (I didn't name it. I agree, terribly insensitive name)

pmalynin 433 days ago [-]

Oh glad to know I’m not the only the with the 5950x MCE errors. I probably haven’t seen one in a year though, so maybe it was finally fixed

easygenes 433 days ago [-]

I had a 5950x with similar issues to other posters here. It failed outright after just over two years.

mm0lqf 433 days ago [-]

Was common with AMD CPUs in the past, the Ryzen 1000 range had a widespread problem where many made in 2017 would randomly segfault from time to time under Linux, it was a whole drama, and you had to RMA them until you got lucky.

zeven7 432 days ago [-]

> I can't recall ever having a CPU fail

How do you know? I thought one of the main ways these were failing resulted blue screens. I'm sure you've had a bunch of blue screens in your lifetime.

gjsman-1000 433 days ago [-]

4% seems very high to me; but CPU errors happen with relative frequency, and design mistakes are common.

If you ever run “cat /proc/cpuinfo” on, say, Skylake - Linux will happily tell you it has 5-6 workarounds active for hardware mistakes.

CPUs are still pretty darn reliable. Think about how many GHz your CPU runs at, multiplied many instructions per cycle there are, and then calculate the failure rate if there was just 1 mistake per minute. Nothing on earth would compare.

upon_drumhead 433 days ago [-]

Mechanical hard drives are absolutely on the same, or higher, level of reliability. It's mind boggling what we can achieve when we really focus on quality outcomes.

433 days ago [-]

veqq 433 days ago [-]

Wow, you weren't joking:

> apic_c1e spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit srbds mmio_stale_data retbleed eibrs_pbrsb gds bhi

asveikau 433 days ago [-]

I mean, a bunch of those are timing issues in speculative execution, you could make an argument that it's working as designed but people didn't anticipate the existence of timing exploits. I'd call that different from computation errors.

gjsman-1000 433 days ago [-]

As the original comment suggested, about 5-6 of these are not related to timing exploits (or at least, not the Meltdown/Spectre variants which claim so many patches to their name). Summaries from LKML and Kernel.org:

> apic_c1e

Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID.

The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map.

There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings.

The only way to cure this is to sanitize the physical package id after the CPUID enumeration and yell when the APIC ids are different. If the physical package IDs differ use the package information from the ACPI/MP tables so the existing logical package map just works.

> taa

TAA is a hardware vulnerability that allows unprivileged speculative access to data which is available in various CPU internal buffers by using asynchronous aborts within an Intel TSX transactional region.

> itlb_multihit

iTLB multihit is an erratum where some processors may incur a machine check error, possibly resulting in an unrecoverable CPU lockup, when an instruction fetch hits multiple entries in the instruction TLB. This can occur when the page size is changed along with either the physical address or cache type. A malicious guest running on a virtualized system can exploit this erratum to perform a denial of service attack.

> srbds

SRBDS is a hardware vulnerability that allows MDS techniques to infer values returned from special register accesses. Special register accesses are accesses to off core registers. According to Intel's evaluation, the special register reads that have a security expectation of privacy are RDRAND, RDSEED and SGX EGETKEY.

> mmio_stale_data

Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O (MMIO) vulnerabilities that can expose data. The sequences of operations for exposing data range from simple to very complex. Because most of the vulnerabilities require the attacker to have access to MMIO, many environments are not affected. System environments using virtualization where MMIO access is provided to untrusted guests may need mitigation. These vulnerabilities are not transient execution attacks. However, these vulnerabilities may propagate stale data into core fill buffers where the data can subsequently be inferred by an unmitigated transient execution attack. Mitigation for these vulnerabilities includes a combination of microcode update and software changes, depending on the platform and usage model. Some of these mitigations are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or those used to mitigate Special Register Buffer Data Sampling (SRBDS).

userbinator 433 days ago [-]

itlb_multihit is the only one that sounds like an actual bug, just like F00F and FDIV were on the original Pentium. Timing and other data side-channels are arguably not bugs as Intel has long maintained the stance that CPU protection rings are not security boundaries but only meant to protect against accidents instead of deliberate maliciousness.

FireBeyond 433 days ago [-]

> There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings.

You can argue that the system shouldn't crash (although at that low a level, not sure what else can happen)...

but beyond that, how is "VirtualBox and Xen implementations violate the spec" a failing of a CPU?

cwbriscoe 433 days ago [-]

Here is for my Ryzen 7700X: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso

amluto 433 days ago [-]

FWIW, “sysret_ss_attrs” is a workaround for a design error in AMD’s x86_64 implementation. One might argue that AMD is right because they designed AMD64 in the first place, but IMO this is silly, and AMD’s design is unjustifiable.

(I’m the one who characterized this issue on Linux and wrote the test case and the workaround.)

philjohn 433 days ago [-]

It's pretty widely publicised that even on older Intel CPU's there is a non zero number of servers giving unexpected results at scale - https://arxiv.org/pdf/2102.11245

Panzer04 433 days ago [-]

I wonder if this buckets things like motherboard problems under the same causes. Those numbers do seem very high in general.

I guess it would be a pretty useless comparison if they weren't carefully filtering for CPU-only failure though..

whizzter 432 days ago [-]

I broke the first CPU I've bought for my own money, it was a PPro 180mhz back in early 1997 (2 days before Intel introduced MMX). I ran it overclocked at 200mhz for a good while until I started getting stability issues, had to downclock to 133mhz to use it after that (even 180 was unstable) and bought a new computer once I started my first "real" job.

Seen other HW issues, memory or motherboard(soldered memory) on my ex's laptop that affected only certain adress ranges, memtest86 has been my go-to to check computer health when random crap starts happening since then and I've replaced at least one memory stick on another machine thanks to it.

high_na_euv 433 days ago [-]

https://research.google/pubs/cores-that-dont-count/

Aurornis 433 days ago [-]

> Have I been living in a fantasy bubble where CPU's do exactly what you asked of them (and errors come from not holding it right)?

These are gaming CPUs clocked right up to the threshold of stability (and in some cases, past it)

Server CPUs with ECC RAM are significantly better.

However, if you haven’t experienced much CPU instability, you may not have operated at scale where it appears. Get into the scale where operations occur across 100,000s of CPUs and weird things happen constantly.

userbinator 433 days ago [-]

One of the previous discussions of this instability noted that it was happening to the server equivalents (which are after all the same die) at stock settings as well.

sirn 433 days ago [-]

I feel like the entire fiasco has been multiple issues being lumped together as one, and muddied to the point that even a bluescreen out of an attempt to run XMP at extremely high MT/s are now being claimed as degradation. From what I can make out of this mud, there seems to be (1) a failure caused by high current due to some boards unlocking IccMax/PL1/PL2 by default, and (2) high voltage during a single-core boost (TVB). The former is caused by overclocking, and the latter seems to be Intel's failure to validate the CPUs at low load/long period of single-core boost, where IccMax/PL no longer matters as much (since single-core boost never exceeds PL1 anyway).

Most Raptor Lake "server boards" right now are W680 with client CPUs because the C266/Xeon E-2400 took a long time to come out. The one intended for workstations typically has overclockable settings or is even overclocked by default, which means it's likely to get hit with the failure (1). The one intended for servers do have more conservative settings, but can still be hit with failure (2) under some conditions.

Buildzoid released a video on the Supermicro W680 blade a bit ago that were having issues after running a single-core load 24x7, which is essentially 24x7 boost[1] (aka issue (2)). Xeon E-2400 _could_ be affected in this scenario, although even the highest clock E-2400 SKU (E-2488) is only running at 5.6GHz without Thermal Velocity Boost, and most others are ranging from 4.5 to 5.2 GHz boost (rather than the 5.8 to 6 GHz boost some client SKUs do). I feel like the actual B0 Xeon E-2400 would be a lot less prone to both failures (1) and (2) due to this (but it could happen, though there's no reports of such).

But then the conversation gets muddied enough that "even servers and Xeons are affected" becomes the common narrative (while the former is true, the circumstances needs to be noted; and for Xeons, it's a _maybe_ at most, since right now there's no report of Xeon E-2400 failing).

[1]: https://www.youtube.com/watch?v=yYfBxmBfq7k

userbinator 432 days ago [-]

Looking around, I'm seeing reports of 1.4-1.5V core voltages using Intel's stock profiles, with some even going to 1.7V. That's insanely high for a 10nm process and I'm not surprised about the degradation. For comparison, in the 45/32nm days 1.2-1.3V was the norm, with some extreme overclockers (who don't expect CPUs to survive for more than a few minutes, using liquid nitrogen etc.) hitting ~1.5V, and 1.4V was a commonly quoted safe upper limit for 24x7 operation.

sirn 430 days ago [-]

This is why I think it's going to be much harder for Xeon to be affected by this, as they're normally running in a more conservative voltage settings. I don't have Xeon E-2400 to look at, but Raptor Lake should be able to do 5.6 GHz at 1.3-1.4V-ish, which should be within a safe voltage range. (even the "power hungry" w9-3495X only runs at ~1.25V during 4.8 GHz TVB, and ~1.15V at non-TVB 4.6 GHz boost.)

Dalewyn 433 days ago [-]

I remember hearing that server motherboards also played a role in overclocking out of the box, which is frankly fucking stupid. I don't recall anything about Raptor Lake-based Xeons suffering from degradation.

userbinator 433 days ago [-]

Unless they disable Turbo Boost (which is horrible for performance, but great if you want benchmark consistency), the CPU will automatically overclock until it reaches the limits, adjusting both voltage and frequency.

All the evidence I've seen points to electromigration as a cause of this degradation, and IMHO excessively aggressive automatic overvolting by Intel's microcode is to blame.

There is actually a simple experiment which can determine whether that is true --- remove the fan from the heatsink, or even let the CPU run without a heatsink. As the CPU will automatically throttle once it reaches its designed maximum temperature (and AFAIK that is a hardcoded limit), it will lower its frequency and voltage to maintain that temperature. If this results in a stable CPU, while the one that has great cooling becomes more unstable, it confirms the hypothesis.

There are numerous stories of machines where the heatsink was not in contact with the CPU for some reason, yet they remained perfectly stable (but slow) for many years. I can also say that I've had an 8th-gen i7 running at 100% 24x7 with all power and turbo limits disabled, with its temperature constantly at the design limit of 100C, and it has also remained stable for over 5 years.

cesarb 433 days ago [-]

> There are numerous stories of machines where the heatsink was not in contact with the CPU for some reason, yet they remained perfectly stable (but slow) for many years.

I once had a laptop, which came from the factory with the four screws which hold the heatsink to the CPU missing. It was very slow, and shut down after a few minutes (the reason being thermal shutdown in the BIOS event log helped diagnose the issue). After the four screws were replaced (each screw came in its own large individual cardboard box), it worked fine for many years, BUT after a couple of years (still under warranty), the motherboard failed with a short in the power input. I suspect that all the extra heat from when the CPU was without a working heatsink went to the power supply components through the motherboard ground plane, and cooked them, significantly shortening their useful life.

Dalewyn 433 days ago [-]

>Unless they disable Turbo Boost (which is horrible for performance, but great if you want benchmark consistency), the CPU will automatically overclock until it reaches the limits, adjusting both voltage and frequency.

Turbo Boost (and Thermal Velocity Boost if applicable) frequencies are according to specifications, it's not an overclock.

gradschool 433 days ago [-]

That's a lot of cpu time. Maybe they just don't make them like they used to. If there's some crazy complicated numerical or combinatorial problem you've been trying to crack, do tell.

iforgotpassword 433 days ago [-]

Hm, what's interesting here is that now the blame does somewhat lie on the mainboard manufacturers and their stock overclock, mostly. Wendell in his video pointed out that even cloud gaming providers running server boards have the same high failure rates, while applying much more conservative settings.

J_Shelby_J 433 days ago [-]

Here are some instructions I've been sharing that has led me to stability.

Download OCCT. https://www.ocbase.com/

disable xmp

stress test with OCCT

if it crashes here, you have to downclock your cpu

enable xmp

stress test with OCCT

if it crashes here you need to downclock your memory speed

repeat 5 and 6 in the smallest interval of ram speeds in your bios until it's stable for ~5m

Now, I'm sharing this here with because it's good to know but also to make the point: do you think supermicro is doing this for every server that leaves their factory? Not to say there isn't an issue with Intel, but based on what the article says about failure rates, and what I see from friends with AMD, there is an issue with system stability in general that extends beyond this specific issue. My guess BIOS settings being ran at redline didn't do well when we rolled out the new DDR5.

Dalewyn 433 days ago [-]

XMP (and AMD's EXPO) is an overclock, you should not and can not expect stability out of the box where overclocks are concerned. This is regardless whatever the RAM vendor might tell you; an overclock is an overclock, it is literally running hardware out of specification.

Both Intel and AMD publish their memory controller specs and you should thoroughly understand them if you do want to overclock (read: use XMP/EXPO), anything that goes above the specs is not guaranteed to work.

Incidentally, for all the flak the motherboard vendors rightfully got with their out of the box overclocking defaults, their default configuration for RAM is in fact to stick to Intel/AMD specifications like superglue.

Mathnerd314 433 days ago [-]

I don't think it is as cut and dried as "you should not and can not expect stability out of the box". If it doesn't overclock, you RMA it, if the new one doesn't overclock either, either you are bad at overclocking or you got really unlucky in the silicon lottery. It just doesn't fit the facts to say that you shouldn't expect 5400 Mhz ram to work at 5400 Mhz, even if Intel says the spec is 4800 or whatever. Now 7200 Mhz, that is pushing it and is a lot more like snake oil.

Dalewyn 432 days ago [-]

>If it doesn't overclock, you RMA it

Overclocks are explicitly not covered by warranty.

Yes, Intel and AMD and mobo vendors all say overclocks "may" void warranty and in general they have honored warranties for overclocked hardware, but the official and legal position is that overclocking is not covered by warranty.

>It just doesn't fit the facts to say that you shouldn't expect 5400 Mhz ram to work at 5400 Mhz

You can certainly try RMAing the RAM with the RAM vendor since they sold it for whatever frequency they marketed it at. But as far as Intel and the mobo vendor would be concerned, an overclock is beyond the purview of their warranties.

Mathnerd314 431 days ago [-]

It would certainly be interesting to see if Intel can get sued for their "don't ask, don't tell" policy, but practically I view warranties as a joke anyway. What I meant was really stuff like Amazon's 30-day return policy. Those are no questions asked and 30 days is plenty of time to see if overclocking works.

stn8188 433 days ago [-]

I read this article yesterday and thought it was interesting but would like a bit more data. Most importantly, the first plot shows similar failures per month of 11th and 14th Gen, but the final plot shows that failure rates of 11th Gen was far higher (about double). Does this mean there are about double the number of systems built by Puget with the 14th Gen than they had of 11th Gen? I'd also love to see the first two plots with AMD data.

Dalewyn 433 days ago [-]

I've had the impression that while this problem is definitely real, it's also suffering from very bad media sensationalism (both mainstream and social) and some very emotional chest thumping a la Boeing.

Puget's numbers kind of vindicate that by showing 11th gen was even worse and AMD clearly benefitted from their "underdog", "cult favorite" status.

It would be nice if we could be rid of most of this noise so we could get down to what truly matters.

cyanydeez 433 days ago [-]

The problem isn't just the propaganda fight, it's that it's basically a heisenbug. Intel, a few months ago, blamed these very failures on motherboard manufacturers and claimed they were overlooking and the failures at that time were their fault.

Unless you do a very thorough timeline, you might be confused. But go dig into the last 6 months and you'll see that Intel has either no idea or are absolutely muddying the waters. Neither of these cognitive conclusions should result in Intel looking anything but half rate purveyors of silicon.

stqism 433 days ago [-]

In a sense, they were partially right, while being wrong. Based on Puget’s data, it’s apparent that motherboard vendors overly aggressive default settings helped contribute to the issue being so prominent, when reasonable settings would fail at a lower rate than comparable zen CPUs.

Obviously Intel messed up badly, and those settings shouldn’t result in this behavior, but maybe this will convince system integrators to have more reasonable defaults in the future.

In a top end system, we’re already sitting in territory where our GPU is our benchmark, do we really need to default to giving the cpu so much power?

adrian_b 433 days ago [-]

Even Puget's data, which due to their conservative MB configurations have much less Raptor Lake defects than others with aggressive settings, show an essential difference between the defects of Zen 4 and the defects of Raptor Lake.

The defects of Zen 4 are random manufacturing defects, so most of them are detected by Puget after assembling and testing their systems, before selling them to customers.

On the other hand, most of the Raptor Lake defects happen after some time after selling them to the customers, which implies some kind of wearing mechanism, which either can affect any Raptor Lake CPU or perhaps only CPUs that have some kind of latent defects.

Because the Raptor Lake defects happen after some time, it is likely that their number will continue to raise among the already sold systems and the same statistics recomputed after some months might show a higher number of Raptor Lake defects than now.

jtriangle 433 days ago [-]

They didn't pull those numbers out of thin air, those were intel's specs when those boards were designed. They are, obviously, dangerous specs to run a chip at, hindsight being 2020 and all.

Intel trying to pass the buck is as much of a problem as the CPU's themselves really, because now you can't trust them.

Dalewyn 433 days ago [-]

There is nothing "on spec" about 4096W power limits and using the single-core clock multiplier for multi-core boost, among other deviations.

Intel programming the voltage curves wrong is on them, but that doesn't matter if the motherboards aren't going to run the CPUs according to specification out of the box. Intel calling out mobo vendors for their stupid defaults was justified and very much needed.

krige 433 days ago [-]

The issue is intel guidelines are basically nonsensical and contradictory. What they claim is "recommended" settings is basically three separate sets of options, with no clear indication which is the actual so-called baseline. Which was probably done entirely on purpose as to faciliate blame slinging.

Dalewyn 433 days ago [-]

Intel's specifications are readily available[1][2] to the public. If you can't understand them that's your problem, not Intel's.

Incidentally, there is no such thing as a "baseline". Intel separately specifies an "Extreme Config" for applicable SKUs (the i9s), but otherwise there is only the one set of specifications.

The fact you are talking about "baseline" suggests you did not actually consult the specifications published by Intel, just like the mobo vendors who put out so-called "Intel Baseline Profiles" before they got chastised again for not actually reading and obeying the specs (and arguably they still don't).

[1]: https://edc.intel.com/content/www/us/en/design/products/plat...

[2]: https://edc.intel.com/content/www/us/en/design/products/plat...

krige 433 days ago [-]

This it not what I am referring to. I am referring to the chart posted in their official community post, most recently in June [1]. The chart is labelled "Intel Recommendations: 'Intel Default Settings'" (sic). Notice how "Baseline" is incomplete, and so is "Extreme". Also notice a bunch of notes saying "Intel does not recommend baseline" included on their "recommendations" chart. There's more of little gotchas like that if you pay attention. Also note that this chart has been quietly revised at least once as I have a version from back in April that was less stringent and less guarded with notes than it is now.

[1]https://community.intel.com/t5/Processors/June-2024-Guidance...

Dalewyn 433 days ago [-]

>Notice how "Baseline" is incomplete, and so is "Extreme".

Yeah, you still haven't read the specifications.

Please read the fucking specifications if you are going to partake in discussions concerning specifications.

Extreme is "incomplete" because those specifications apply and only apply to Raptor Lake i9 SKUs. "Baseline" is incomplete and not recommended because "baseline" does not exist in the specifications.

What's more, "performance" also does not exist in the specifications per se. Most of it is actually the specifications copied verbatim, except for PL1 which is 125W for the concerned SKUs according to specification and actually noted as such by Intel in that chart.

The chart also excludes other important information, such as the PL2 time limits (56 seconds for the SKUs in the chart), the max core voltage rating of 1.72V, and AC/DC load lines and associated calibration.

Again: Please read the fucking specifications. You are contributing to the media sensationalism and emotional chest thumping, which is all worthless noise.

433 days ago [-]

michaelmrose 433 days ago [-]

Do I read the graph correctly that 14th gen had a ~10% per month failure rate each each month of May–July 2024 for a cumulative failure rate of almost 30% even with much more conservative than industry average power settings/clocking?

Have I misunderstood the graph or is it actually that awful?

upon_drumhead 433 days ago [-]

The chart is raw counts, not a percentage

> Even though failure rates (as a percentage) are the most consequential, I think showing the absolute number of failures illustrates our experience best

michaelmrose 433 days ago [-]

How would absolute numbers mean anything at all without the percentages? Did they spend their time making a useless graph?

upon_drumhead 433 days ago [-]

They cover that down the page under the "Failure Rates in Context" section

> Everything I’ve shown you so far is our raw number of failures, but what matters most is failure rate percentages. Let’s look at total failure rates in the context of multiple generations and with comparison to AMD Ryzen CPUs.

shrubble 433 days ago [-]

One of the alarming things about Intel's reaction is how tone-deaf they seem to have been during this entire process in terms of reassuring customers, especially gamer/enthusiast buyers.

The Costco desktops in stock at the local Costco are all 13th/14th gen systems priced reasonably enough with Nvidia 4060 or 4070 cards... but my perception is they are not selling well because of the concerns raised about these CPUs.

Dalewyn 433 days ago [-]

If they aren't selling well, that's probably because 14th gen (Raptor Lake Refresh) and especially 13th gen (Raptor Lake) are actually last gen old stock products. This is made even more obvious because Costco usually has all those "Core Ultra 1" (15th gen, Meteor Lake) branded laptops nearby.

We've been on 15th gen (Meteor Lake) for quite a while now, with 16th gen (Arrow and Lunar Lakes) presumably coming later this year. I sympathize with Costco wanting to move all that old stock out now rather than later.

shrubble 433 days ago [-]

Microcenter.com (US retailer) doesn't have anything newer than 14th gen CPUs for desktops, on their site. Did Intel release a newer desktop chip already?

layer8 433 days ago [-]

No, the 15th gen desktop will be Arrow Lake-S aka Core Ultra 200, and is on track for release in Q4 2024.

Dalewyn 433 days ago [-]

No, Arrow Lake is 16th gen. 15th gen (Meteor Lake) was supposed to also have a desktop segment, but that got axed because Intel(tm).

No, I am not going to entertain the desktop/mobile generation split bullshit that AMD also has. Screw that noise, disjointed market segmentation is anti-consumer.

(No, I am not going to entertain Intel's new Core XX branding either.)

hnuser123456 432 days ago [-]

Going by your reasoning, 13th and 14th gen should've been the same generation because 14th was just a "refresh" of 13th gen raptor lake.

Going by the more common theme of "generation goes up by 1 each year", Arrow lake would been 15th gen.

Meteor lake came out the same year as 14th gen, but the "14" branded parts are raptor lake refresh instead of meteor lake due to the deadline miss, they couldn't get the new architecture ported to all product segments so they just did consumer laptops since they don't pull much power and don't push the silicon too hard.

imtringued 433 days ago [-]

The Ryzen 8700G and 8945H are basically identical. AMD is seemingly simply repackaging their laptop SoCs to be sold as desktop APUs in a different form factor. There is not much segmentation going on here.

The only real problem with AMD is that they are gouging on APUs. Who exactly is in the market for a $330 (from memory) APU for their office PC or gaming PC? Of course, they already dropped the price, but that $330 launch price felt really awkward.

wmf 433 days ago [-]

For one thing, Meteor Lake is not available for desktops so Raptor Lake is still the current generation there.

hnuser123456 432 days ago [-]

Meteor lake is only "15th gen" for midrange laptops, and only if you add a blank "14th gen" by skipping raptor lake refresh.

The 185H followed the 1365U, so the 185H would've otherwise been the 1465U.

It was supposed to also be the 14th gen of desktop parts but they couldn't finish it in time, so we got raptor lake refresh instead. There is still the line of e.g. 14700HX (Q1 24), which is a raptor lake laptop part that came out after meteor lake (Core Ultra 9 185H) did (Q4 23), but it beats meteor lake by being used in 45W+ CPU TDP gaming laptops paired with a discrete GPU.

cyanydeez 433 days ago [-]

Love Puget. Keep up the great work.

CamperBob2 433 days ago [-]

I bought my 13900K box from them after an HN story a couple of years ago in which they openly talked about Samsung's failure to live up to their expectations, and basically apologized for advocating the Samsung 990 Pro drives in the past ( https://www.pugetsystems.com/blog/2023/02/02/update-on-samsu... ).

Not one vendor in 100 would stand up and say something like that in public. The system I bought from them has been great, and if/when that changes, I don't doubt that they'll have my back when it comes to haggling with Intel.

frognumber 432 days ago [-]

I've only bought from them rarely, a long long time ago, but I was impressed.

If money is no object -- or developer time is expensive -- or failure is expensive -- it'd be my go to source.

For what I'm doing now (education), pricing is quite a bit too high.

ljoshua 433 days ago [-]

I second that. I've only had a chance to buy one PC from them (I usually run Macs), but the level of attention to detail and customer service from Puget was bar none the best I've had from any tech outfit.

layer8 433 days ago [-]

The subsection “Failure Rates in Context” struck me as the most interesting.

flyinghamster 432 days ago [-]

> With Intel Core CPUs in particular, we pay close attention to voltage levels and time durations at which those levels are sustained. This has been especially challenging when those guidelines are difficult to find and when motherboard makers brand features with their own unique naming. (emphasis mine)

That last little bit is a thing that has infuriated me for decades, and it's all over the place in anything that tech touches, not just motherboards. Can't we call things by their proper names?

1oooqooq 432 days ago [-]

what a surprise!

CPU and GPU learned from decades of RAM business.

Sell garbage. People buy, get blue screen, blame software. ECC? pffft.

Let's just see how long it will take for NVMEs to fall to the same level of reliability as usb pen drives and everyone will just buy 4 of them to RAID1 and assume that's just "normal".

o11c 433 days ago [-]

Hm, once again this lacks numbers on variation within a generation. How much applies to K vs non-K processors? How much between i3, i5, i7, and i9? Does it affect low-end 13th-generation-branded Alder Lake?

forbiddenlake 433 days ago [-]

> Starting with 10th Gen, we have only sold the top 2 SKUs (XX700K and XX900K) in volume, which gives us a nice clean set of data.

cyanydeez 433 days ago [-]

Should Intel be telling you these things?

wmf 433 days ago [-]

Nobody believes Intel at this point.

tacticus 433 days ago [-]

though, given intel is willing to say this impacts everything from the 13400 up and 14400 up we can at least accept a minimum range of impacted chips.

JBiserkov 433 days ago [-]

They got 1 pun in their comedic relief budget and they spent it brilliantly:

> We have enough data to know that we don’t have an acute problem on the horizon with 13th Gen — it is more of a slow burn.

433 days ago [-]

johnobrien1010 433 days ago [-]

Which part of that is a pun?

HankB99 433 days ago [-]

"slow burn", like the processors I suppose.

taspeotis 433 days ago [-]

> slow burn

layer8 433 days ago [-]

"slow burn"

432 days ago [-]

scionescio 432 days ago [-]

[flagged]

layer8 432 days ago [-]

Their statistics does include field failures, meaning failures reported by their customers, and they do note that the degrading problem may yet worsen the statistics in the coming months. They also note that they are applying (as they always have) more conservative settings than the often more aggressive motherboard defaults that other users tend to be using. They aren’t excusing Intels failures, they are presenting the data they have on hand, which is more than I have seen elsewhere.

scionescio 432 days ago [-]

its not about excusing intel failures. BUT imho it is misleading in the context of the current discussion, if you also include statistics data of initial failed cpu, youre kinda polluting the pool. Thats what im criticizing here. it may not be intentional, but its definitely not helping. it distracts from the degrading issue if you also add data from DOA cases.

in opposite other (not yet public?) data, where inital tests have been passed but subsequent tests not.

ps. and it should be crystal clear that Intel has a massive interest in anything that distracts from the degrading issue. be it letting some youtuber blaming (already debunked) motherboardsvendors or possibly (tinfoil hat mode) by 'innocently' throwing statistics to he public that include other failures and pull competitors also into the spotlight, you know how detailed public looks at statistics. they see big bar labled with amd and say hey, amd fails too - that its a total different case, most will ignore.

432 days ago [-]

Rendered at 01:17:44 GMT+0000 (Coordinated Universal Time) with Vercel.