Thoughts on CachyOS?

@[email protected] · 2 months ago

Thoughts on CachyOS?

@d3Xt3r · edit-2 2 months ago

That depends on your CPU, hardware and workloads.

You’re probably thinking of Intel and AVX512 (x86-64-v4), in which case, yes, it’s pointless because Intel screwed up the implementation; that’s not the case for AMD, though. Of course, that assumes your program actually makes use of AVX512. v3 is worth it though.

In any case, the usual places where you’d see improvements are compiling stuff, compression, encryption and audio/video encoding (of course, if your codec is hardware-accelerated, that’s a moot point). Sometimes the improvements are not apparent in normal benchmarks, but would have an overall impact - for instance, if you use filesystem compression, the optimisations mean you now have lower I/O latency, and so on.

More importantly, if you’re a laptop user, this could mean better battery life: since you’re using more efficient instructions, certain stuff that might’ve taken 4 CPU cycles could be done in 2, etc.

In my own experience on both my Zen 2 and Zen 4 machines, v3/v4 packages made a visible difference. And that’s not really surprising, because if you take a look at the instructions you’re missing out on with baseline x86-64 packages, you’d be like ‘wtf’:

CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4_1, SSE4_2, SSSE3, AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, OSXSAVE.

And this is not counting any of the AVX512 instructions in v4, or all the CPU-specific instructions, e.g. in znver4.
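To check which of these levels a given machine actually supports, here's a minimal sketch in Python, assuming Linux on x86 (it reads `/proc/cpuinfo` and trusts the first CPU's flags line). The feature sets follow the psABI's x86-64-v2/v3/v4 levels, translated to the kernel's flag names; `LEVELS`, `cpu_flags` and `highest_level` are names made up for this example.

```python
# x86-64 psABI feature levels, spelled the way Linux reports them in /proc/cpuinfo.
# Kernel naming quirks: SSE3 -> "pni", LAHF/SAHF -> "lahf_lm", CMPXCHG16B -> "cx16",
# LZCNT -> "abm". OSXSAVE isn't shown as a flag, so "xsave" is used as an approximation.
LEVELS = {
    "x86-64-v2": {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"},
    "x86-64-v3": {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "abm", "movbe", "xsave"},
    "x86-64-v4": {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}


def cpu_flags(path="/proc/cpuinfo"):
    """Return the flag set from the first 'flags' line (Linux/x86 only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


def highest_level(flags):
    """Return the highest level whose features, and those of every lower level, are present."""
    best = "x86-64"  # baseline: every x86-64 CPU qualifies
    for name, required in LEVELS.items():  # dicts keep insertion order: v2, v3, v4
        if not required <= flags:
            break
        best = name
    return best
```

On a Linux box, `highest_level(cpu_flags())` returns something like `"x86-64-v3"`. The levels are checked cumulatively because each one requires everything in the level below it.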

It really doesn’t make sense that you’re spending so much money buying a fancy CPU, but not making use of half of its features…

Atemu · 2 months ago

v3 is worth it though

[citation needed]

Sometimes the improvements are not apparent in normal benchmarks, but would have an overall impact - for instance, if you use filesystem compression, the optimisations mean you now have lower I/O latency, and so on.

Those would show up in any benchmark that is sensitive to I/O latency.

Also, again, [citation needed] that march optimisations measurably lower I/O latency for compressed I/O. For that to happen, compression must be a significant component of I/O latency to begin with. If 99% of the time was spent waiting for the device to write the data, optimising the 1% of time spent on compression by even as much as 20% would not gain you anything of significance. This is obviously an exaggerated example but, given how absolutely dog slow most I/O devices are compared to how fast CPUs are these days, not entirely unrealistic.
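The 99%/1% argument is just Amdahl's law. A quick Python sketch (the function name `latency_after` is made up for illustration) shows how little a 20% faster compressor buys you when compression is a small share of total latency:

```python
def latency_after(p_compress, reduction):
    """Amdahl's law: fraction of the original latency left after speeding up one part.

    p_compress: share of total I/O latency spent compressing (0..1)
    reduction:  fraction of that compression time saved, e.g. 0.2 for "20% faster"
    """
    return (1 - p_compress) + p_compress * (1 - reduction)


for p in (0.01, 0.10, 0.50):
    saved = 1 - latency_after(p, 0.20)
    print(f"compression = {p:.0%} of latency -> total latency drops by {saved:.1%}")
```

With the numbers above (compression at 1% of latency, 20% less time spent compressing), total latency drops by only about 0.2%; even if compression were half of the latency, the same speedup would only cut it by 10%.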

Generally, the effect of such esoteric “optimisations” is so small that the length of your unix username has a greater effect on real-world performance. I wish I was kidding.
You have to account for a lot of variables and measurement biases if you want to make factual claims about them. You can observe performance differences on the order of 5-10% just due to slight memory layout changes with different compile flags, without any actual performance improvement from the change in code generation.

That’s not my opinion, that’s a rather well-established fact. Read here:

I have yet to see data showing a significant performance increase from march optimisations that either controlled for measurement bias or showed an effect that couldn’t be explained by measurement bias alone.
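One way to control for that kind of layout bias is to randomise the experimental setup instead of timing a single configuration. Here's a rough Python sketch, assuming you're timing a command-line program: each run gets an environment padded with a dummy variable of random length (`BENCH_PAD` is a made-up name). The environment is copied onto the new process's stack, so this shifts where the stack starts, which is one way your username's length can leak into performance (it ends up in variables like `$USER` and `$HOME`). `timed_runs`, `summarize` and `looks_real` are hypothetical helpers, and the final check is a crude heuristic, not a proper significance test.

```python
import os
import random
import statistics
import subprocess
import sys
import time


def timed_runs(cmd, runs=20, max_pad=4096, seed=0):
    """Time `cmd` `runs` times, each with a randomly sized environment."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        # BENCH_PAD is a made-up variable; only its length matters.
        env = dict(os.environ, BENCH_PAD="x" * rng.randint(0, max_pad))
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, relative spread), where spread = stdev / mean."""
    mean = statistics.mean(samples)
    return mean, statistics.stdev(samples) / mean


def looks_real(baseline, candidate):
    """Crude check: is the gap between the means larger than the combined spread?"""
    b_mean, b_spread = summarize(baseline)
    c_mean, c_spread = summarize(candidate)
    gap = abs(b_mean - c_mean) / b_mean
    return gap > (b_spread + c_spread)


if __name__ == "__main__":
    # Times a trivial Python start-up; swap in the two builds you actually want to compare.
    mean, spread = summarize(timed_runs([sys.executable, "-c", "pass"]))
    print(f"mean {mean * 1000:.1f} ms, spread {spread:.1%}")
```

You'd run `timed_runs` once per build (e.g. generic vs. march-optimised) and only take a difference seriously if `looks_real` holds across several seeds. This randomises just one source of layout bias; it does nothing about layout shifts caused by the compile flags themselves.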

There might be an improvement, and my personal hypothesis is that there is at least a small one, but so far we don’t actually know.

More importantly, if you’re a laptop user, this could mean better battery life: since you’re using more efficient instructions, certain stuff that might’ve taken 4 CPU cycles could be done in 2, etc.

The more realistic case is that an execution that would have taken 4 CPU cycles on average would then take 3.9 CPU cycles, i.e. a 2.5% reduction rather than 50%.

I don’t have data on how power scales with varying cycles/task at a constant task/time, but I doubt it’s linear, especially with all the complexities surrounding speculative execution.

In my own experience on both my Zen 2 and Zen 4 machines, v3/v4 packages made a visible difference.

“visible” in what way? March optimisations are hardly visible in controlled synthetic tests…

It really doesn’t make sense that you’re spending so much money buying a fancy CPU, but not making use of half of its features…

These features cater towards specialised workloads, not general purpose computing.

Applications which facilitate such specialised workloads and are performance-critical usually have hand-made assembly for the critical paths where these specialised instructions can make a difference. Generic compiler optimisations will do precisely nothing for performance in that case.
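Part of why -march matters so little for such code is that many of these projects detect CPU features at runtime and pick the matching hand-written kernel, so even a baseline x86-64 build takes the AVX2 path on a CPU that has it (glibc, for example, does this for functions like `memcpy` via IFUNC resolvers). A minimal Python sketch of that dispatch pattern, where the kernel names are hypothetical and the "AVX2" variant is only a stand-in for real assembly:

```python
def checksum_generic(data: bytes) -> int:
    """Portable fallback that runs on any CPU."""
    return sum(data) & 0xFFFFFFFF


def checksum_avx2(data: bytes) -> int:
    """Stand-in for a hand-written AVX2 kernel; must return the same result as the fallback."""
    return sum(data) & 0xFFFFFFFF


# Ordered best-first: (required CPU flags, implementation).
CANDIDATES = [
    ({"avx2"}, checksum_avx2),
    (set(), checksum_generic),
]


def pick_kernel(flags):
    """Return the first implementation whose required flags are all present."""
    for required, impl in CANDIDATES:
        if required <= flags:
            return impl
    raise RuntimeError("no usable implementation")  # not reached: the fallback needs no flags
```

In a real project the flags would come from CPUID (or something like `/proc/cpuinfo`), and the choice is typically made once at startup rather than on every call.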

I’d worry more about your applications not making any use of all the cores you’ve paid good money for. Spoiler alert: Compiler optimisations don’t help with that problem one bit.