psanchez 17 hours ago [-]
This reminds me of a story from 15 years ago, where I was developing a technology to download games on demand by hooking into the OS calls.
There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.
When I started digging into it, I realized the reason was the game was using something like
fread(data, 1, 65536, fptr);
instead of
fread(data, 65536, 1, fptr);
Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API. Since my code was hooking on ReadFile system call, and my call was heavier than ReadFile, the game loading felt really slow. Unusable. It would have not been fun for players.
The easy fix was to swap arguments for certain calls. The long fix required to use an internal cache to account for these cases so that the hooked ReadFile was faster when data was already on disk.
Funny thing is that as we started rolling out the tech and applying it to more and more games we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have loaded all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.
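A minimal sketch of what a cache in front of a hooked read might look like. Everything here is hypothetical: slow_read() only stands in for the expensive hooked ReadFile path and serves bytes from an in-memory array, and the single-block cache is an assumption for illustration, not the design described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096u
#define FILE_SIZE  (3u * BLOCK_SIZE + 123u)

static unsigned char fake_file[FILE_SIZE]; /* stands in for the file on disk */
static unsigned long backend_calls;         /* how often the slow path ran */

/* Hypothetical expensive read, standing in for the hooked ReadFile.
   Returns the number of bytes copied, which is short at end of file. */
static size_t slow_read(size_t offset, void *dst, size_t len)
{
    backend_calls++;
    if (offset >= FILE_SIZE)
        return 0;
    if (len > FILE_SIZE - offset)
        len = FILE_SIZE - offset;
    memcpy(dst, fake_file + offset, len);
    return len;
}

struct block_cache {
    unsigned char data[BLOCK_SIZE];
    size_t start;   /* file offset of data[0] */
    size_t valid;   /* bytes of data[] that are filled; 0 means empty */
};

/* Serves reads of any size from one cached block, refilling it from
   slow_read() only when the requested offset falls outside the block. */
static size_t cached_read(struct block_cache *c, size_t offset,
                          void *dst, size_t len)
{
    unsigned char *out = dst;
    size_t done = 0;
    while (done < len) {
        size_t pos = offset + done;
        if (c->valid == 0 || pos < c->start || pos >= c->start + c->valid) {
            c->start = pos - pos % BLOCK_SIZE;
            c->valid = slow_read(c->start, c->data, BLOCK_SIZE);
            if (pos >= c->start + c->valid)
                break; /* end of file */
        }
        size_t avail = c->start + c->valid - pos;
        size_t n = (len - done < avail) ? len - done : avail;
        memcpy(out + done, c->data + (pos - c->start), n);
        done += n;
    }
    return done;
}

/* Reads the whole fake file one byte at a time through the cache and
   returns how many times the slow backend was actually called. */
static unsigned long backend_calls_for_bytewise_read(void)
{
    static struct block_cache cache;
    unsigned char byte;
    size_t offset = 0;
    cache.valid = 0;   /* start with an empty cache */
    backend_calls = 0;
    while (cached_read(&cache, offset, &byte, 1) == 1)
        offset++;
    assert(offset == FILE_SIZE);
    return backend_calls;
}
```

In this toy setup, 12411 one-byte reads reach the slow path only a handful of times (once per 4 KiB block, plus one probe at end of file), which is the effect the cache fix relies on.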
Taniwha 17 hours ago [-]
I used to be a graphics card/chip architect for macs in the early/mid 90s - our chips were the fastest, but some programs were resistant because they did stupid stuff: pagemaker invalidated the font cache every time it went thru its main loop, quark with ATM did an n*2 thing every time it wrote text etc etc. We had special hardware to accelerate text drawing and it did nothing because the software pissed it away. We considered creating a plugin that fixed all these things, it would have been hard to maintain, in the end we travelled around to the people who made these apps and talked them through their problems
To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels, we made that very fast! we didn't tell them :-)
At the time 24-bit framebuffers were so slow that before we built graphics acceleration hardware people would switch back to 8-bit to get stuff done, making 24-bit/true colour your daily driver was a big step forward.
jjuran 2 hours ago [-]
In my 68K Mac emulator running on modern (or even decade-old) hardware, performance in the traditional sense is less of a concern, but other issues arise. The big ones include CPU-burning loops that wait for a length of time or for an interrupt-decremented counter to reach zero, as well as invalid memory accesses (which I've made crash — no NULL deref for you).
> We considered creating a plugin that fixed all these things, it would have been hard to maintain, in the end we travelled around to the people who made these apps and talked them through their problems
Since talking to developers is no longer an option, I actually do write "Such-and-such Tune-up" extensions that patch applications dynamically to make them run better (or at all) in Advanced Mac Substitute, or even Mac OS itself.
Taniwha 1 hours ago [-]
Yeah those low core global system variables (including a readable/writeable 0) at fixed addresses were very much a thing, they were a bad design decision made for the original Macs with almost no memory, and made running more than one app (switcher/multifinder) a difficult transition back in the day. Someone wasn't planning ahead
I also worked on the original A/UX port for the Mac II, some hardware (like the IWM) required tiny buzzy loops, we ran into one bug where using the floppy caused ADB to freeze, but only on the release machines, not the prototypes all our engineers had, turned out there was hardware that made access to the VIA faster by pulling the clock in for 1 cycle, if you sat in a loop reading the timer in the via to measure a sector time for the IWM in too tight a loop it upped the output clock from the VIA to the ADB chip and over clocked it ....
nxobject 14 hours ago [-]
Does that make you the first in a long tradition of GPU developers going to blockbuster app devs to say "hey, you should be doing this instead?"
PS – I am looking through the NuBus cards that I have... did you work for SuperMac or RasterOps?
Taniwha 12 hours ago [-]
I was probably not the first to have to do that, we knew what apps our customers used, making them better was the whole point of the operation
I did the architectural design for the SuperMac cards. I figured out what needed to be accelerated, dropping code into people's machines to see where the cycles were going. Others did the physical design for the first 2 cards, I did the design of the chip in the Thunder and later cards (designed the data paths and state machines and a full simulation, someone else actually laid the gates)
If your card has a SQD01 on it it's my work. It peaks at 1.5Gb/s on solid fills
urbandw311er 16 hours ago [-]
This is a horrible and yet not unexpected insight into the internals of Excel
Taniwha 15 hours ago [-]
To be fair this was Excel 25 years ago, may no longer be true.
One of the other bugs (the Quark/ATM one) was also because the programmers were worried about writing over stuff that hadn't been completely erased, the Quark guys wrote a string with 2 spaces at the end through a box that masked the end of the string, the ATM font renderer saw it couldn't fit the text so it split it in half and tried again so it drew N/2 N/4 N/8 ... strings. It spent all its time in the 68k's multiply instructions figuring out how wide the strings (and substrings) were, our fancy 24-bit character rendering hardware was an afterthought
Xirdus 9 hours ago [-]
There's a good chance it was Excel's workaround for some other GPU's buggy behavior.
bathtub365 15 hours ago [-]
In all of the software you’ve written, are you aware of how many on-screen pixels you’ve overdrawn?
trelbutate 14 hours ago [-]
> To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels
I feel like I'm having a stroke trying to read this, what does it mean??
Taniwha 12 hours ago [-]
Well all they needed to do was erase the screen with white and draw on it, but their app's internal logic meant that they erased it more than once.
I was capturing QuickDraw library calls - the low level graphics primitives, to figure out where the graphics time in apps was going and found out sometimes excel did it 9 times
Of course users didn't see it more than once, but our hardware made all that wasted time run faster
canucker2016 5 hours ago [-]
It's more likely that one dev wrote the draw-cell code.
Another dev who's fixing a bug, realizes if they call a certain function either directly or indirectly, their particular bug gets fixed.
Oh, and as a side effect, the cell gets erased (again).
A few more fixes/new features added like this and the code is inadvertently erasing the same cell multiple times.
It takes a certain type of dev to step through in a debugger and notice the app is doing way too much work and then to untangle the mess of code without causing regressions.
HansHamster 9 hours ago [-]
Maybe their CRTs had horrible burn-in and they had to erase everything 9 times before it was gone...
NSUserDefaults 14 hours ago [-]
Several layers of white is what makes the black really pop. (Just kidding).
layer8 9 hours ago [-]
It’s necessary for erasing cat pixels.
_shantaram 11 hours ago [-]
I think it could call (their equivalent of) clearRect up to 9 times on an already cleared region before drawing there?
b112 14 hours ago [-]
It means they were time travellers! Secretly, they came from an alternate future where everyone used e-ink displays, and wanted Excel to be ready!
sixeyes 14 hours ago [-]
before writing to some area, it would erase it (clearing with white) up to 9 times
PaulHoule 12 hours ago [-]
I remember when 24 bit color was exotic and aspirational and you had to settle for 16.
saltcured 6 hours ago [-]
Yeah, even in Linux we were doing these things with X Windows bit depths.
8 bit pseudo color, so the color palette switched with every focus-follows-mouse window boundary crossing. 16 bit direct color with banding but no more palette psychedelia.
This was equal parts to make it faster and to allow for higher framebuffer resolutions with limited VRAM.
projektfu 11 hours ago [-]
I got the extra vram in my LC to allow for 24-bit color but it was dog slow. The 16 bit data path didn't help. If I wanted it, I'd get things done in 8 bit or mono until it was ready, then switch to 24 bit for the final look.
spauldo 9 hours ago [-]
I swear if Sun Microsystems was still around, their machines would still ship with 8-bit pseudocolor and you'd have to pay an extra $3k for 24-bit.
dhosek 10 hours ago [-]
16 bits? Luxury. I had 6 colors when I was a kid and was happy to have them.
PaulHoule 9 hours ago [-]
My first computer was a TRS-80 Color Computer which had a tiny set of badly chosen colors!
Back then you did what you could with graphics and it wasn't a lot. After I got a PC I had indexed color for a long time and working with indexed color was pretty rough because anything physics-based like rendering or raytracing was going to be difficult. You could render a photo pretty well with 256 carefully chosen colors and dithering but if you wanted to, say, composite two photos and do general sorts of things you'd need to convert to "true color", do the math there, then re-quantize for display.
spogbiper 9 hours ago [-]
6?!? We had only 4 colors in low res mode and 2 in high res
floxy 8 hours ago [-]
Well my Hercules graphics card was only monochrome, but it was relatively high resolution.
PaulHoule 7 hours ago [-]
I had a herc clone on the 286 machine I bought around 1987 and later added a Super VGA card. One cool thing about the IBM PC was that the monochrome and color graphic systems were sufficiently different in terms of memory map and ports so you could plug in two graphics cards and two monitors and that's what I had.
badc0ffee 3 hours ago [-]
I think this was true for MDA (text only), but the Hercules had 64k of video memory at B0000, the latter half of which would overlap with the CGA card, and also VGA's text mode.
xattt 12 hours ago [-]
What would have been the purpose of stupid code like that?
Was it a workaround for things that didn’t fully complete on one iteration, so the devs kept hammering away at it until it worked?
phire 12 hours ago [-]
They were most likely just bugs. Quite possibly really stupid bugs.
Not every bug results in the program doing the wrong thing, they often just make the program do the right thing very slowly.
And nobody notices, since it still produces the right result.
Taniwha 12 hours ago [-]
Yes, they were bugs, I think programmers (and their marketing people) were more focused on new features than performance
dhosek 10 hours ago [-]
Thankfully we’ve moved past that era.
Now the bugs that get ignored for new features cause bad results AND bad performance.
kazinator 6 hours ago [-]
It's not necessarily stupid code in the game, but something the C library is doing that it probably shouldn't.
If the stream is buffered, then all operations, including fread, are supposed to go through the buffer.
All three of these should issue buffer-sized reads to the operating system:
1. A loop which calls getc(stream) 65536 times.
2. fread(buf, 1, 65536, stream)
3. fread(buf, 65536, 1, stream)
The more direct behavior of fread should only kick in if the stream is configured as unbuffered.
I would say that the way low-level reads are issued to the host operating system is a "visible effect" of the program, so I suspect this may actually be a matter of conformance. I.e. it's not okay to issue those reads however the stream library wants as long as the data is read.
Xirdus 9 hours ago [-]
Reminds me of the "community patch" to GTA Online from a few years ago. The game was plagued by 10+ minute loading times. The situation remained for years and only got worse with time. Some hacker figured out that the game spent 80% of loading time reading the in-game store listing file. The file was tens of megabytes IIRC, and it literally used the Schlemiel the Painter's Algorithm - for each entry, start reading from the beginning byte after byte. The hacker made a tiny patch that made it remember where it found the last entry. This cut the total loading time by 80%, from over 10 minutes to less than 3.
Edit: removed incorrect information.
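A toy model of the pattern described above, not the game's actual code: entries are newline-separated, and the two helper functions (hypothetical names) count how many bytes get examined when every lookup restarts at byte 0 versus when the scan resumes from the previous entry.

```c
#include <assert.h>
#include <stddef.h>

/* Finds the start of entry number `index` (0-based) in newline-separated
   `text` by scanning from `from`, which must be the start of entry
   `from_index`. Adds the number of bytes examined to *steps. */
static size_t find_entry(const char *text, size_t len, size_t from,
                         size_t from_index, size_t index, unsigned long *steps)
{
    size_t pos = from;
    size_t current = from_index;
    assert(from_index <= index);
    while (current < index && pos < len) {
        if (text[pos] == '\n')
            current++;
        pos++;
        (*steps)++;
    }
    return pos;
}

/* Visits every entry, restarting the scan from byte 0 each time. */
static unsigned long steps_restarting(const char *text, size_t len,
                                      size_t entries)
{
    unsigned long steps = 0;
    for (size_t i = 0; i < entries; i++)
        (void)find_entry(text, len, 0, 0, i, &steps);
    return steps;
}

/* Visits every entry, resuming from where the previous one was found. */
static unsigned long steps_resuming(const char *text, size_t len,
                                    size_t entries)
{
    unsigned long steps = 0;
    size_t pos = 0;
    for (size_t i = 0; i < entries; i++)
        pos = find_entry(text, len, pos, i == 0 ? 0 : i - 1, i, &steps);
    return steps;
}
```

With four 3-byte entries the restarting version examines 18 bytes and the resuming one 9. The gap grows quadratically with the number of entries, which is why it only hurts once the file gets large.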
exrook 8 hours ago [-]
This is not quite an accurate telling of rockstar's reaction, they were actually receptive to it and paid out $10k for the discovery. Though it's an understandable mistake given rockstar's hostile history with the gta modding scene.
See the original post and discussion for the whole story:
That's not how I remember these events when they were playing out. I distinctly remember social media posts warning about the dangers of modifying game files, plus refusal to acknowledge the issue. Note there were 2 full weeks between the blog post and the update mentioning the bounty. I'm pretty sure the massive community outrage in between has played a role in it. But I don't have any sources and I was wrong about at least one thing (lack of attribution), so I'm okay assuming I'm wrong about everything else too.
jayd16 5 hours ago [-]
Wowee two full weeks? You mean like a single sprint to discover, verify, and post PR about a perf patch that was good among the sea of rumors and reports a billion dollar game usually gets?
Xirdus 5 hours ago [-]
I mean like enough time to check the pulse with the community and walk back the initial confrontational response. I don't have a problem with when they fixed it. I don't have a problem with when they paid out. I wouldn't have a problem if they didn't pay out at all (why would they?). I have a problem with their initial reaction, which was full of the usual fearmongering against modders. (And a smaller problem with the fact that it took an external contributor to finally make them implement a trivial fix for a massive usability issue that's been there for at least 6 years. It shows how much they don't care about their customers or the product they're selling unless the media get involved.)
Someone 15 hours ago [-]
> Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API
What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?
Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes, the new one only can read zero or 65,536 bytes.
(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
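To make the difference in return values concrete, here is a small sketch using only standard C. The helper names are made up for the example, and tmpfile() is assumed to work in the environment. fread returns the number of complete elements read, so the two argument orders report a short file very differently.

```c
#include <assert.h>
#include <stdio.h>

/* Writes `len` bytes to a temporary file, rewinds it, and returns the stream
   (NULL on failure). The file is removed automatically on fclose. */
static FILE *make_temp_file(size_t len)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        if (fputc('x', f) == EOF) {
            fclose(f);
            return NULL;
        }
    }
    rewind(f);
    return f;
}

/* Returns what fread(buf, size, nmemb, f) reports for a file that holds
   file_len bytes, or (size_t)-1 if the temporary file could not be set up. */
static size_t fread_result(size_t file_len, size_t size, size_t nmemb)
{
    static unsigned char buf[65536];
    assert(size * nmemb <= sizeof buf);
    FILE *f = make_temp_file(file_len);
    if (f == NULL)
        return (size_t)-1;
    size_t got = fread(buf, size, nmemb, f);
    fclose(f);
    return got;
}
```

On a 100-byte file, fread_result(100, 1, 65536) gives 100 while fread_result(100, 65536, 1) gives 0, so code that parses "return value" bytes after the call only works with the first form.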
tom_ 9 hours ago [-]
The standard says that fread calls fgetc multiple times for each object:
> For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object
(wording unchanged since C99)
If the file is unbuffered, depending on how the implementation handles buffering, and how it interprets the standard, then perhaps it does end up hitting a path where there's 1 ReadFile call per byte...
I don't know how most implementations get around this. Presumably it's valid to interpret "calls are made" as "behaving as if calls are made", meaning fread can copy data out of the FILE's buffer directly, or make calls directly to whatever routine fgetc defers to, rather than calling fgetc N times literally. Looks like glibc's fread does this.
klodolph 8 hours ago [-]
I think it’s pretty rare for files to be unbuffered like that. AFAIK it’s mostly stderr that ends up unbuffered, at least on Unix-like systems.
tom_ 8 hours ago [-]
You can call setbuf(fp,NULL) after opening, and now the stream is unbuffered. What this means is apparently implementation-dependent.
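For reference, a sketch of switching a stream's buffering mode with setvbuf, which is the general form of setbuf. The helper name is made up, and tmpfile() is assumed to be usable. The data read back is the same in every mode; how many OS-level reads happen underneath is up to the implementation and would have to be observed with a tracing tool.

```c
#include <assert.h>
#include <stdio.h>

/* Creates a temporary file with the given buffering mode (_IOFBF, _IOLBF or
   _IONBF), writes `len` bytes, and reads them back with fread(buf, 1, len, f).
   Returns the number of bytes read back, or (size_t)-1 on setup failure. */
static size_t roundtrip_with_mode(int mode, size_t len)
{
    static unsigned char buf[4096];
    assert(len <= sizeof buf);

    FILE *f = tmpfile();
    if (f == NULL)
        return (size_t)-1;

    /* Must come before any other operation on the stream. With _IONBF this
       has the same effect as setbuf(f, NULL). */
    if (setvbuf(f, NULL, mode, mode == _IONBF ? 0 : BUFSIZ) != 0) {
        fclose(f);
        return (size_t)-1;
    }

    for (size_t i = 0; i < len; i++)
        fputc((int)(i & 0xff), f);
    rewind(f);

    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return got;
}
```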
As to why you'd do that? - well, who knows the exact circumstances in this case. Perhaps this was faster in some meaningful case that was relevant to some other project (and then maybe the fread doesn't call fgetc after all!). I'm just speculating. Well-reused code often ends up with stuff that needs rethinking, that, even if noticed, nobody has the time or inclination to attempt to fix.
micampe 15 hours ago [-]
A long time ago I worked with someone who read 1 byte at a time from a socket because they insisted data was cached so the kernel was going to batch it magically somehow. It took me days to convince them to measure it.
vidarh 11 hours ago [-]
I used to make it a general rule to start all my optimisation of any network code by running strace and looking for excessive reads and writes, because you'd be shocked how many did stuff like that if they didn't know the length of a string, or to read the length first, instead of reading into a buffer.
I had to convince people with benchmarks regularly that, yes, you could write the handful of lines to do proper user-space buffering and trivially run rings around any code that did extra context switches, because a lot of people didn't realise the cost difference between system calls and calling their own functions.
This included, by the way, the MySQL client library, at one point, which would do small reads for length fields instead of larger non-blocking reads into a buffer all the time.
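A rough sketch of the user-space buffering idea, with a simulated source instead of a real socket so the call counts are easy to see. source_read() is a hypothetical stand-in for read(), and the one-byte length prefix is an assumption made for the example, not any particular wire protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Three length-prefixed messages: "hi", "abc", "z". */
static const unsigned char wire[] = { 2,'h','i', 3,'a','b','c', 1,'z' };

static size_t messages_parsed;   /* set by calls_to_parse() */

struct source {                  /* hypothetical stand-in for a socket */
    size_t pos;
    unsigned long calls;         /* number of simulated system calls */
};

/* Simulated read(): copies up to len bytes and counts the call. */
static size_t source_read(struct source *s, void *dst, size_t len)
{
    size_t left = sizeof wire - s->pos;
    s->calls++;
    if (len > left)
        len = left;
    memcpy(dst, wire + s->pos, len);
    s->pos += len;
    return len;
}

struct reader {
    struct source *src;
    unsigned char buf[64];
    size_t head, tail;           /* unread bytes are buf[head..tail) */
};

/* Copies exactly n bytes out of the buffer, refilling it with one large
   source_read() whenever it runs dry. Returns 0 on end of stream. */
static int reader_get(struct reader *r, void *dst, size_t n)
{
    unsigned char *out = dst;
    while (n > 0) {
        if (r->head == r->tail) {
            r->head = 0;
            r->tail = source_read(r->src, r->buf, sizeof r->buf);
            if (r->tail == 0)
                return 0;
        }
        size_t avail = r->tail - r->head;
        size_t take = n < avail ? n : avail;
        memcpy(out, r->buf + r->head, take);
        r->head += take;
        out += take;
        n -= take;
    }
    return 1;
}

/* Parses all messages and returns the number of simulated system calls.
   With buffered == 0, every length byte and payload is its own call. */
static unsigned long calls_to_parse(int buffered)
{
    struct source src = { 0, 0 };
    struct reader rd = { &src, {0}, 0, 0 };
    unsigned char len, payload[255];
    messages_parsed = 0;
    for (;;) {
        if (buffered) {
            if (!reader_get(&rd, &len, 1) || !reader_get(&rd, payload, len))
                break;
        } else {
            if (source_read(&src, &len, 1) != 1 ||
                source_read(&src, payload, len) != len)
                break;
        }
        messages_parsed++;
    }
    return src.calls;
}
```

For these three messages the unbuffered path makes 7 simulated calls and the buffered one makes 2, and that ratio only gets worse as messages get smaller and more numerous.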
quietbritishjim 14 hours ago [-]
That's different: you're talking about the application code, like OP.
But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.
b112 14 hours ago [-]
Is it? I get what you're saying, but asking for 1 byte 65536 times, is indeed different than asking for 65536 bytes, 1 time. There may be reasons, such as when you pull off the end of a buffer, it shifts. And the buffer size is 1 byte. Or 10. Or whatever.
No, I'm not saying that's why. I'm simply saying there is a difference between asking for 1 byte or 65k bytes of something. Even dd runs the same under Linux.
dd bs=10k count=1 is faster than bs=1 count=10k
I remember trying to recover some data from a spinning disk, and trying to slowly creep up on the data. So I wanted 1 byte per, I wanted it to nibble, until it hit whatever the errored part was. If I just grabbed the lot, it'd error out from the whole read.
The latter (as usual when comparing OpenBSD and Linux) is more complex, but both multiply count by size and then go their way.
Also, the API contract allows fread to read fewer bytes than requested. I would expect any implementation to do that.
But maybe, somebody interpreted the contract differently than major OSes, in the sense that a call isn’t allowed to write partial size-sized chunks to user memory and/or advance the file position further than its return value advocates (that, I think, is something that the implementations above can do, and might be considered a bug)
quietbritishjim 12 hours ago [-]
> asking for 1 byte 65536 times, is indeed different than asking for 65536 bytes, 1 time.
Yes it's different. As others have noted, the difference is what is returned if less than 65536 are available to read in the file: total failure vs partial read.
There is, unsurprisingly, no requirement that it has an unnecessarily inefficient implementation to meet this behavioral requirement. (The C standard doesn't talk about such things as syscalls but, even if it did, it surely wouldn't require such a thing.)
The irony is that that partial read is actually the default on both Windows and Posix (i.e. both ReadFile and read() will read up to the number of bytes specified). So a one-syscall implementation for fread would have been easier than multiple calls, and certainly would be standard compliant.
The dd example isn't comparable because dd is much lower level, and you really are specifying how the syscalls should be made.
sumtechguy 4 hours ago [-]
Also you need to be careful what you read/write. In some cases.
As many examples out there use int/char etc to show how to use the thing. But if you switch to structs that fwrite can totally burn you if you use the sizeof call. As the sizeof a struct can vary between platforms and compilers. Depending on packing. Then endianness can sometimes mess you up. If you are reading/writing for yourself you can get away with a lot. But if you are trying to interop then you have to be wildly careful what you do.
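One common way around the sizeof/packing/endianness trap is to encode each field explicitly into a byte array and write that. A sketch with a made-up struct record and a little-endian layout chosen arbitrarily for the example; tmpfile() is assumed to be available for the round trip.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record. sizeof(struct record) is often 8 because of padding
   after `kind`, but that is up to the compiler and platform. */
struct record {
    uint8_t  kind;
    uint32_t value;
};

enum { RECORD_WIRE_SIZE = 5 }; /* 1 byte kind + 4 bytes value, no padding */

/* Encodes r into exactly RECORD_WIRE_SIZE bytes, value in little-endian order. */
static void record_encode(const struct record *r,
                          unsigned char out[RECORD_WIRE_SIZE])
{
    assert(r != NULL && out != NULL);
    out[0] = r->kind;
    out[1] = (unsigned char)(r->value & 0xff);
    out[2] = (unsigned char)((r->value >> 8) & 0xff);
    out[3] = (unsigned char)((r->value >> 16) & 0xff);
    out[4] = (unsigned char)((r->value >> 24) & 0xff);
}

/* Inverse of record_encode; works the same on any host byte order. */
static void record_decode(const unsigned char in[RECORD_WIRE_SIZE],
                          struct record *r)
{
    r->kind = in[0];
    r->value = (uint32_t)in[1] | ((uint32_t)in[2] << 8) |
               ((uint32_t)in[3] << 16) | ((uint32_t)in[4] << 24);
}

/* Writes one record with a single fwrite of the encoded bytes. */
static int record_write(FILE *f, const struct record *r)
{
    unsigned char bytes[RECORD_WIRE_SIZE];
    record_encode(r, bytes);
    return fwrite(bytes, sizeof bytes, 1, f) == 1;
}

/* Writes a record to a temporary file, reads it back, and returns 1 if the
   decoded fields match the originals. */
static int record_roundtrips(uint8_t kind, uint32_t value)
{
    struct record in = { kind, value }, out = { 0, 0 };
    unsigned char bytes[RECORD_WIRE_SIZE];
    FILE *f = tmpfile();
    if (f == NULL)
        return 0;
    int ok = record_write(f, &in);
    rewind(f);
    ok = ok && fread(bytes, sizeof bytes, 1, f) == 1;
    fclose(f);
    if (!ok)
        return 0;
    record_decode(bytes, &out);
    return out.kind == kind && out.value == value;
}
```

The file layout is now 5 bytes per record regardless of compiler padding or host endianness, at the cost of a little shifting code per field.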
fwrite is another one where people will do one byte at a time (same up to for the windows version). Bash out a loop, use the sizeof for the input to the for loop. copy and paste just doing 1 byte and you can easily end up here. One program I added a cache in front of the thing so it would always write on disk block boundaries and then come back for more. I started off with just packed struct sizes but the perf was just 'ok'. The file block boundary thing really made it fast. Not all OS's have a readahead/write buffer behind that call so perf can vary.
It is honestly such an easy mistake to make. As many of the examples/docs do not really show you why/how to use both of those calls in the way needed. You sort of have to stumble into it and work it out.
Once you see it you know. But until then you do not really notice if it is 'working'.
dspillett 13 hours ago [-]
Another possibility for why it needs to be done that way is dealing with error conditions.
I've not looked at the code (or even the man pages) and it is a long time since I touched anything that low level, so this might be completely wrong, but if there is an error before the next 64KiB (including just hitting EOF) then the semantics could be different. Asking for 1x64KiB I would expect to just error as there aren't the requested number of bytes. Asking for 64Ki lots of 1 byte might simply error just the same, or it might at least populate the buffer with what it can read, or if the meaning of 1,65536 is actually “up to 64Ki lots of 1B” then it would populate the buffer as far as possible and return the amount read rather than an error condition.
If the per-byte option is slow but still fast enough, and dealing with the semantics is less faf, then people will go for that because the tiny time loss is worth the larger effort reduction. Of course this assumes the underlying system doesn't change, as with the “making local code to run as on-demand networked code” example higher in the thread which changes the relative performance characteristics of the two calling methods significantly.
chadgpt3 13 hours ago [-]
dd is designed to request a certain block size from the kernel. fread is not and should just multiply the two arguments and read that many bytes, just like calloc.
macintux 12 hours ago [-]
I assumed it was a simple mistake: easy to forget what order the two integers are sent.
mort96 14 hours ago [-]
Wait, is that wrong? I always call fread as:
fread(data, 1, sizeof(buffer), f);
with the rationale that I'm interested in reading sizeof(buffer) individual bytes. The buffer size is incidental, not the size of the items I'm trying to read from the file; "read one item whose size is sizeof(buffer)" seems semantically wrong.
Is this just the case of Windows having a bad stdlib fread implementation 15 years ago or is my thinking here actually wrong?
chadgpt3 13 hours ago [-]
It's not wrong. Guy just wrote a bad implementation of fread and blamed everyone else.
DarkUranium 13 hours ago [-]
He didn't write it.
The C runtime authors did (presumably Microsoft, if it's MSVCRT).
He's hooking into ReadFile, a layer below the stdlib. By the time it reaches the hook, it's already split.
projektfu 11 hours ago [-]
fread should be buffered, but different values may cause buffering at different rates. Perhaps it didn't generate 65535 calls to ReadFile but it generated 16 or 64.
fsfod 13 hours ago [-]
Part of Windows Explorer actually does tons of tiny 4 byte ReadFile calls into its tracking-database-like file when you delete a file. If you're deleting lots of files this quickly adds up.
pbhjpbhj 9 hours ago [-]
Is this why Windows takes so long to delete things?? Presumably those reads aren't done when using del from a console as that always seems a bit faster.
jonathanlydall 9 hours ago [-]
Its slowness is also a function of which security software or other file system "filters" (I believe they're called) are installed.
For example, I run TortoiseGit which has a caching feature which is supposed to make it faster at showing what to commit. Disabling it increases the number of items I can delete per second in my Windows Explorer from about 1000 to about 3000 while not making TortoiseGit operations meaningfully slower (that I can tell).
This is a Dev Drive [0] on my machine, it would probably be slower on my C: drive which has full Windows Defender real time file scanning.
But it doesn't seem to explain why it's so much slower at regular extraction.
Dwedit 3 hours ago [-]
Is this actually real? I thought fread just multiplied the two numbers together to compute a total size. Meanwhile, the Win32 API call ReadFile actually does do a separate system call if you call it multiple times.
somenameforme 16 hours ago [-]
Doesn't that break anything relying on the return value? fread gives you the number of objects read as a return. So I think a pretty typical thing would be to fread and then parse that number of characters, and that'd just break?
jcul 15 hours ago [-]
I've seen a lot of code that just assumes fread / fwrite succeeded without bothering to check the return value...
But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!
Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.
account42 15 hours ago [-]
There are no small freads in the story, whatever implements those freads supposedly split them up into many calls. But that sounds more like a problem of that implementation than the fread callers as size == 1 is correct when you are reading a bag of bytes.
jcul 1 hours ago [-]
Ah you're right, I misread it.
koolala 15 hours ago [-]
I think they turned it from a tiny file read to a tiny ram read.
DonHopkins 15 hours ago [-]
The type of programmer who swaps the args to fread tends to be the type of programmer who doesn't bother to check the return value, fortunately.
Edit: mort96: So did you check the return value or not?
mort96 14 hours ago [-]
If I have a buffer of bytes, and I intend to treat the content of that buffer as individual bytes, what is semantically wrong with "read 65k 1-byte-sized items into this buffer"? Wouldn't it be a bit unnatural to express it as "read one item whose size is 65k"?
account42 15 hours ago [-]
But the args aren't necessarily swapped just because they end up in a slow case in some implementation.
gwbas1c 5 hours ago [-]
> The long fix required to use an internal cache to account for these cases
That's because the OS does the same thing too. It's the right fix, when I implemented something similar, we implemented caching right away.
lukan 15 hours ago [-]
"I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff"
I really hope that was not the case and rather think incompetence or to deal with obscure legacy problems, but the gamer in me gets enraged at the thought someone would artificially increase loading times.
dfox 11 hours ago [-]
The most important fix in SP1 for Office 2007 was fixing exactly that in Excel. Doing ridiculous amount of 4 byte reads made it basically unusable on network filesystems.
chadgpt3 13 hours ago [-]
Why does your fread do anything other than multiplying the two arguments?
Sesse__ 12 hours ago [-]
The idea of having two arguments to fread() is presumably to be able to do something else than all-or-nothing when there's a short read.
chadgpt3 12 hours ago [-]
Yes, it divides the bytes read by the element size to get the return value.
Which is the obvious reason you'd pass an element size of 1: you want to know how many bytes were read.
dlcarrier 18 hours ago [-]
SimCity had a read-after-free bug that Microsoft patched in Windows 95. That was a lot easier for customers than having Maxis fix it, which could have required exchanging copies of the game.
oceansky 13 hours ago [-]
There's also the opposite effect, a windows security update broke GTA San Andreas because it relied on undefined behavior.
in this dark age of agents writing code that gets debugged by other agents, i love reading stuff like this: stories of human intuition fixing human mistakes. thanks for a fascinating read.
Cthulhu_ 16 hours ago [-]
It feels like graphics drivers do / did this a lot too. At the very least they make specific optimizations for specific games, probably by tweaking settings and features that the game developers didn't optimize properly themselves.
That's a case of the driver cheating but there are also lots of cases where the game is just full of bugs that the driver has to work around in order to not be blamed for them.
smallstepforman 8 hours ago [-]
The driver switched to lower mipmapped texture and got caught. There is a ton of that out there for popular benchmark ready games. Run pro drivers instead of adrenaline to run generic baseline real driver.
Gibbon1 14 hours ago [-]
I've said over the years a few times, this isn't our fault but it's our problem.
SyzygyRhythm 15 hours ago [-]
There are many, many, cases like this, including correctness fixes. One recent example I remember had a shader that computed:
x = a / b * b
The optimizer was allowed, but not obligated, to transform that into:
x = a
However, in this case, b was sometimes 0. And if so, the unoptimized version computed:
x = a / 0 * 0 = Inf * 0 = NaN
So badness ensued if that particular path didn't get optimized, which could happen under various circumstances. We had to add some code to ensure that transformation always happened on that game.
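The unoptimized behaviour is easy to reproduce outside a shader. A C stand-in, assuming IEEE 754 float arithmetic and no fast-math style compiler flags; the function name is made up, and the volatile is only there to keep the compiler from simplifying the expression.

```c
#include <assert.h>
#include <math.h>

/* Evaluates a / b * b literally, one operation at a time. The volatile
   intermediate stops the compiler from folding the expression back to a. */
static float divide_then_multiply(float a, float b)
{
    volatile float quotient = a / b;   /* b == 0 gives +Inf or -Inf for a != 0 */
    return quotient * b;               /* Inf * 0 is NaN */
}
```

divide_then_multiply(6.0f, 3.0f) comes back as 6, but with b == 0 the result is NaN rather than a, which is the divergence between the optimized and unoptimized paths described above.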
DarkUranium 12 hours ago [-]
I'm curious, what's the ratio of:
- deciding to inform the game developer & wait for reply vs not waiting for reply vs just fixing it yourself without informing the developer; and
- if informed: developer actually fixing it vs only saying they would fix it vs no reply whatsoever (not counting automated "thank you for your inquiry" replies, in cases where you don't already have more direct channels to the dev than email)
I've always kind of wondered this because in a way, it's kind of weird that it's fixed for them, at least for new releases / games actively being developed.
(Full disclosure: I'm a game developer myself, with a very high interest in engine plumbing & dev [including graphics], though finding a job for the latter is easier said than done.)
SyzygyRhythm 1 hours ago [-]
We always try to inform game devs about correctness issues, but generally we can push out a driver fix before the devs can fix things on their side, so that pretty much always happens. Many things can be fixed quickly by app profile (detecting executable name). And we have a pretty good relationship with most game devs and usually get some feedback. Of course, we don't have infinite resources, so bigger game devs get more attention than tiny ones.
I'm not sure what fraction of devs actually fix things on their side, though. Once there's a driver workaround, and we've informed the devs, it's off our plate.
Performance is more of a gray area. We contact devs if there's something we can't work around, of course. And if there's something truly breaking. But for things that aren't exactly bugs, just things that could be improved, and we can improve on our own... well, we'll probably keep that for the competitive advantage.
rbits 7 hours ago [-]
Yep. I know the Minecraft optimisation mod Sodium has encountered some issues because Nvidia drivers try to optimise the game in ways that can cause issues for them
easyThrowaway 16 hours ago [-]
The most interesting part is that IIRC they shipped the entire Windows 3.11 memory allocator to make it work.
I have very little understanding on how allocation works at OS level, but I'm surprised there are no wrappers like dgVoodoo or dxWrapper specifically for this kind of issues. There are quite a bunch of old Windows games (Need for Speed 1-4 for a start) that refuse to run on modern OSes due to rather...bold memory management strategies.
rincebrain 16 hours ago [-]
Apparently the recollection of the fix was that they deferred actually freeing memory for a while if they detected it was SimCity running. [1]
A story I heard at Sun, which may be apocryphal but was fucking hilarious enough to be a repeatable rumor, was that a release of an early operating system in BETA was determined to be solid and tested and ready to release and ship to customers, so they simply changed the version string from something like "SunOS2.1BETA" to "SunOS2.1FCS" (First Customer Ship), and recompiled. But the change from a 12 character version to an 11 character version threw off the alignment of some important data structures somewhere in the kernel, and the entire OS ran MUCH SLOWER because of 68k unaligned memory accesses!
hodgehog11 18 hours ago [-]
I think we're starting to see more of this sort of thing happening now with Proton and Wine gaining prominence in the Linux community. Some games (Elden Ring comes to mind) have bad enough PC ports when they come out that the compatibility layer can incorporate a hotfix to improve performance, while users of the software on the original platform still had to suffer.
Gigachad 17 hours ago [-]
Fairly sure GPU drivers do the same thing where they include a ton of per game tweaks to make them run faster. It does feel like a fragile way of doing things where an external component that should be agnostic to the software running ends up including a handful of junk trying to fix stuff that should have been fixed by the consumer of the driver.
zoenolan 16 hours ago [-]
The big one I remember was many applications, not just games, assuming the buffer swap was performed by a blit into the display buffer, not a framebuffer pointer update. They relied on the previous frame's data still being in the back buffer. For those applications you were forced to blit the buffer, not swap the pointer, and take a performance hit.
I also remember a media player being called out by name in the code for doing invalid operations, needing a work around and code to detect it was running just to function.
Guvante 17 hours ago [-]
It goes the other way too, sometimes you trigger some optimization silliness in the driver and the game needs to adapt to avoid it.
rickdeckard 16 hours ago [-]
then the driver gets updated and the game either continues to optimize (wrong) or branches out into code that was written before that driver came out and generally wasn't that well tested, and the circle continues...
It's the life of a (game) developer...
anilakar 17 hours ago [-]
GPU driver packages are already a huge collection of workarounds for bad game engine coding.
An Nvidia employee once told me that one of the easiest ways to squeeze out a few extra frames on your old machine is to rename the game executable to hl2.exe.
st_goliath 16 hours ago [-]
> GPU driver packages are already a huge collection of workarounds for bad game engine coding.
And of course, browser engines also do the same things for certain websites:
I can see how it can modify GPU driver behavior, but I cannot see how it would get you better performance with everything else the same?
What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
Unless they are doing some silly things like dropping quality, but that's the "everything else the same" point.
If not, why not have this enabled as default behavior instead?
sfink 9 hours ago [-]
In general, because it's a flag that says to do things in an incorrect but faster way. It's like -ffast-math. The applications for which it's intended don't do anything where the incorrectness matters. Some random application falsely labeled hl2.exe may or may not.
> What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
I can't quite parse this. Yes, there is no guarantee that the optimizations will work for another game, which is precisely why you can expect an improvement with hl2. With non-hl2, you may get an improvement, you may not, and you may get incorrect behavior.
Everything else is not the same, but hl2 doesn't use the stuff that's different.
dlcarrier 16 hours ago [-]
I wouldn't be surprised if it made other games on the Source engine faster, but everything else slower.
limflick 17 hours ago [-]
> to rename the game executable to hl2.exe
This seems genuinely unbelievable. Does anyone have a technical explanation for this?
hurtigioll 17 hours ago [-]
gpu drivers detect games, among other things by looking at executable names
then driver "optimizes" behavior, sometimes dishonestly (reducing precision), sometimes honestly (working around game engine stupidity)
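A rough sketch of what executable-name detection could look like, in C. The profile table, flag names, and functions here are all hypothetical and not taken from any real driver:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical per-application profile entry. */
typedef struct {
    const char *exe_name;
    unsigned flags;
} AppProfile;

enum { PROFILE_NONE = 0, PROFILE_FAST_PATH = 1 };

static const AppProfile profiles[] = {
    { "hl2.exe",    PROFILE_FAST_PATH },
    { "quake3.exe", PROFILE_FAST_PATH },
};

/* Returns the part of the path after the last '/' or '\\'. */
static const char *base_name(const char *path) {
    const char *base = path;
    for (const char *p = path; *p; ++p)
        if (*p == '/' || *p == '\\') base = p + 1;
    return base;
}

/* Case-insensitive string equality. */
static int equal_ignore_case(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        ++a;
        ++b;
    }
    return *a == *b;
}

/* Looks up the flags for an executable path; unknown names get none. */
static unsigned profile_flags_for(const char *exe_path) {
    const char *name = base_name(exe_path);
    for (size_t i = 0; i < sizeof profiles / sizeof profiles[0]; ++i)
        if (equal_ignore_case(name, profiles[i].exe_name))
            return profiles[i].flags;
    return PROFILE_NONE;
}
```

With a lookup like this, renaming an executable changes which profile it gets, which is why the rename trick can change behavior at all.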
limflick 16 hours ago [-]
Couldn't that also cause glitches since optimizations meant for HL2 might not work for, say San Andreas? I understand some optimizations might be universal but I can't help but think about unexpected behavior.
ChocolateGod 16 hours ago [-]
Yes.
A lot of people use Nvidia profile inspector to enable reBar on all games and claim that Nvidia is purposely holding back performance, but doing this causes many games to crash.
tester756 16 hours ago [-]
Whose problem is this?
Nvidia probably doesn't officially say anything about this and 99.9% of people do not rename the process
account42 15 hours ago [-]
It's definitely Nvidia's problem if this breaks something. Nothing in the D3D/OpenGL specs says that you can (not) use certain executable names.
redsocksfan45 14 hours ago [-]
[dead]
hurtigioll 16 hours ago [-]
of course they do.
nvidia even has an official api for a game to identify itself so they dont need to look at executable name
limflick 16 hours ago [-]
Phrasing, I wasn't blaming anyone, just curious about the technicalities.
proton_9 17 hours ago [-]
This sounds like a really interesting story, would like to read more on why half life 2 specifically? the game itself was pretty well optimized and ran on really low end hardware even back in the day.
db48x 17 hours ago [-]
Because everyone reported performance metrics using it as a benchmark. Higher number = more sales.
murderfs 17 hours ago [-]
If you go back 5 years, everyone was using Quake 3 Arena as the benchmark. ATI got in some hot water because if you renamed quake3.exe to quack3.exe, your FPS would drop by 15%, because they were silently reducing quality to juice their benchmark numbers.
jkrejcha 16 hours ago [-]
Apparently people did this with the DirectX "3D Tunnel" demo as well[1] back over 20 years ago.
Also there was one "that checked if you were printing a specific string used by a popular benchmark program. If so, then it only drew the string a quarter of the time and merely returned without doing anything the other three quarters of the time".
A big portion of GPU driver updates are actually just that, same with Windows updates.
Windows 95 patched a bug in SimCity just to get it to work.
kazinator 17 hours ago [-]
> Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, then subtracting 65536 from the stack pointer, and then initializing the memory in a small, tight loop.
Actually, the standard way of allocating 64 kB of memory on the stack is to just assume you can do it, subtract 64k from the stack pointer, and hope for the best.
Most stack allocations in the wild are not checked.
i_don_t_know 15 hours ago [-]
IIRC you have to probe every page of the stack on Windows. You cannot just subtract a value from ESP/RSP. If you don't probe every page in order, you get a page fault or some other exception (I don't remember which one).
NobodyNada 4 hours ago [-]
The reason for this is to ensure stack overflows are detected. The OS places a guard page above the top of the stack, which will cause a segfault if accessed. That way stack overflows are guaranteed to crash rather than stomping on valid memory that belongs to something else. However, if a stack frame is larger than a page (say, because it includes a large buffer), then it is possible for the program to "jump over" the guard page and access memory beyond.
In order to protect against this, the compiler inserts some dummy reads or writes as needed to ensure every page is touched in order from bottom to top. This ensures the guard page is hit before the application has a chance to write to memory beyond it.
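As an illustration only, here is the probing idea simulated in C on an ordinary buffer. Real compilers emit this in the function prologue against the actual stack; the function names and the 4096-byte page size are assumptions:

```c
#include <stddef.h>

/* Touches one byte in every page of [base, base + size), starting at the
   highest address and walking down, the way a downward-growing stack is
   extended. Returns the number of pages touched. */
static size_t probe_pages(volatile unsigned char *base, size_t size,
                          size_t page_size) {
    size_t touched = 0;
    size_t offset = size;
    if (page_size == 0) return 0;
    while (offset > 0) {
        offset = (offset > page_size) ? offset - page_size : 0;
        base[offset] = 0;  /* the "probe": any access to the page will do */
        ++touched;
    }
    return touched;
}

/* Probes a 64 KB frame with an assumed 4 KB page size. */
static size_t probe_demo(void) {
    static unsigned char frame[65536];
    return probe_pages(frame, sizeof frame, 4096);
}
```

Because no page is skipped, a guard page sitting anywhere in that range would be hit before any access beyond it.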
How else would the OS know your read/write 16 pages away from the current stack pointer is in fact an attempt to increase the stack and not just really bad pointer arithmetic and a bug? How many pages should the runtime let you skip before its just a segfault?
andikleen2 5 hours ago [-]
Dave Jones used to have a series of "Why user space sucks" Linux kernel conference talks with many such examples, usually with dumb and redundant system calls.
However as someone who looks a lot at instruction traces I could probably write one on why Linux kernel code sucks too. One of my current pet peeves is the way Linux walks bitmasks of CPU bits, which is a reasonably common operation. Due to a chain of unfortunate changes and decisions it currently needs 16+ instructions to find the next bit for something which the x86 instruction set has a single instruction. Of course that is so big that it is even outlined, adding even more overhead.
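For reference, a small C sketch of "find the next set bit" for a single 64-bit word. This is not the kernel's code; the function name is made up, and `__builtin_ctzll` is a GCC/Clang builtin that typically compiles to a single bit-scan instruction on x86:

```c
#include <stdint.h>

/* Returns the index of the first set bit at or after `start`,
   or -1 if there is none. */
static int next_set_bit(uint64_t mask, unsigned start) {
    if (start >= 64) return -1;
    mask &= ~UINT64_C(0) << start;  /* clear the bits below start */
    if (mask == 0) return -1;       /* ctz is undefined for zero */
    return __builtin_ctzll(mask);
}
```

Walking a mask is then a loop of `bit = next_set_bit(mask, bit + 1)` until it returns -1.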
selcuka 18 hours ago [-]
To be fair it is possible that the developer enabled a special "unroll all loops, no matter what" optimisation flag during compilation.
I agree it would be stupid for a compiler to even support such a flag, but those were the 1980s/90s.
ack_complete 8 hours ago [-]
Doesn't require any special flags, just hitting optimizer limits can do it with MSVC.
At least these actually make things faster usually.
14 hours ago [-]
lozf 3 hours ago [-]
Heh, "funrollloops" reminds me of recompiling FreeBSD 4 on my thinkpad back in the early aughts. The word made me imagine some sort of processed breakfast cereal with too many additives.
ashdnazg 15 hours ago [-]
I worked on a transpiler from Nand2tetris assembly to WebAssembly, and had some really annoying memory corruption bug that I just couldn't solve.
That is, until I checked the program I used for testing (which I didn't write), and found the following code:
dealloc(this)
return this->field
With the original allocator, this worked fine, since the deallocation didn't touch the memory.
My allocator, however, overwrote the field during the deallocation with bookkeeping stuff, which meant the returned value was not what the programmer intended and after a short while the program crashed.
Unlike TFA, I had the luxury of just fixing the test program.
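A toy C sketch of how that kind of bug shows up. This is not the transpiler's allocator, just a made-up fixed-size pool whose free() stores a free-list link inside the released block:

```c
#include <stddef.h>
#include <string.h>

enum { BLOCK_SIZE = 16, BLOCK_COUNT = 4 };

static unsigned char pool[BLOCK_COUNT][BLOCK_SIZE];
static int free_head = -1;  /* index of the first free block, -1 if none */

/* Chains all blocks into a free list; the link lives inside each block. */
static void pool_init(void) {
    for (int i = 0; i < BLOCK_COUNT; ++i) {
        int next = (i + 1 < BLOCK_COUNT) ? i + 1 : -1;
        memcpy(pool[i], &next, sizeof next);
    }
    free_head = 0;
}

static void *pool_alloc(void) {
    if (free_head < 0) return NULL;
    unsigned char *block = pool[free_head];
    memcpy(&free_head, block, sizeof free_head);  /* pop the list head */
    return block;
}

/* Freeing overwrites the start of the block with bookkeeping. */
static void pool_free(void *p) {
    memcpy(p, &free_head, sizeof free_head);
    free_head = (int)((unsigned char (*)[BLOCK_SIZE])p - pool);
}

/* Mimics "dealloc(this); return this->field". */
static int read_field_after_free(int value) {
    pool_init();
    void *obj = pool_alloc();
    memcpy(obj, &value, sizeof value);  /* this->field = value */
    pool_free(obj);
    int field;
    memcpy(&field, obj, sizeof field);  /* read after free */
    return field;
}
```

With an allocator whose free() leaves the memory alone, the same sequence returns the stored value, which is why the bug can stay hidden for so long.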
wazoox 15 hours ago [-]
IIRC, one of the similar old stories from Raymond Chen is about SimCity 2000, which did a similar trick (free memory, then immediately start using it) that worked just fine under DOS, but was a big no-no starting with Windows 95. The game was so common that Windows had to include a special rule to make it run...
zimmund 8 hours ago [-]
I can't stop thinking about all the unoptimized code we have around. As processors (and memory) over the last 2-3 decades improved faster than we needed to fix the inefficiencies we created, we silently accepted that we don't need efficiency everywhere. So maybe a compiler, an emulator or some critical piece of code was created with this in mind, but the average app or website just wastes resources left and right and prays for the best.
With more and more code being written with AI (which has notoriously inefficient solutions to simple problems), I expect this issue to become more prevalent. I just hope we optimize at the source of the problem (AI and humans using it) and not on platforms (compiler and engine/kernel heuristics)
smallstepforman 8 hours ago [-]
Halve the compute and reduce memory by a factor of 4, and in a decade we’ll have double the performance we have now.
I do old school embedded, the amount of desktop bloat is insane. Any function I really need to refactor, I can reduce size and improve performance. And there are better engineers out there that are more efficient than me.
cranx 11 hours ago [-]
Loop unrolling is a basic compiler optimization and depending on the machine language and processor instruction set should be faster taking into account all the house keeping required to execute a conditional, jump, move register values etc. This article is missing the analysis of why. If someone didn’t “like” it and was offended then that seems like an equally silly reason. On the surface 256k to init less does seem silly, but what if it was faster?
ryukoposting 8 hours ago [-]
A few things to consider.
In this case we're talking about a tight initialization loop with probably a single instruction in the body. The HW optimizations necessary to make a loop like this perform equally to the unrolled form are so rudimentary that they're taken for granted on basically any CPU, even 30 years ago. Seriously, we're talking about optimizations I made in an "intro to Verilog" class as an undergrad, and I'm not even a HW engineer.
It also depends how often this code is being hit. Does the code run once while the program loads? Nobody will notice a 2 microsecond improvement in loading times. Does the code run in a timing-sensitive hot path, like a game loop or a GUI rendering thread? Well now optimization matters. But again, consider the HW argument above.
Also remember that, back then, storage wasn't cheap. 256K of code is 18% of a 1.44MB floppy, and 35% of a 720K floppy.
classichasclass 18 hours ago [-]
Betting Alpha was the native architecture in question. It seemed to have the best support.
projektfu 5 hours ago [-]
Yeah, but I thought DEC wrote the FX!32 translator for the Alpha. Perhaps Raymond was talking about those people and didn't want to mention that they weren't Microsoft people.
0xdecrypt 10 hours ago [-]
256 KB of code to zero 64 KB of memory is the kind of optimization that makes you question every life choice that led to it.
rasz 2 hours ago [-]
I blame Intel. It took them 33 years (ERMSB) to finally standardize REP MOVSB as _the_ fast path. Another 10 years passed and someone discovered https://lock.cmpxchg8b.com/reptar.html
jeffbee 18 hours ago [-]
People from Transmeta told me stories about how their translators were full of special case optimizations to fix horrors they discovered in Microsoft Windows itself.
wolfi1 16 hours ago [-]
speaking of which, what became of it?
hbbio 15 hours ago [-]
Acquired by a patent monetization business...
electroglyph 18 hours ago [-]
heh, when Raymond Chen dunks on the MSVC team =)
mkl 10 hours ago [-]
There's no indication it was MSVC, and there are lots of compilers (and used to be more).
ant6n 17 hours ago [-]
Arguably more of an optimization, rather than a fix. Looks like un-unrolling a loop, or better, rolling a loop. Or rolling straight line code?
senfiaj 14 hours ago [-]
Yeah, but after a certain point the win is negligible. Huge code can also increase cache misses which will slow down things.
m1r 18 hours ago [-]
Couldn't they just turn the optimization off for this loop?
MadnessASAP 18 hours ago [-]
They didn't have the code for the offensive program, they were creating the emulator to run it on a different architecture.
McGlockenshire 18 hours ago [-]
> offensive program
Agreed.
notorandit 18 hours ago [-]
Which optimizer replaces a 64k loop with 64k instructions?
Ah, yes. Microsoft's!
selcuka 18 hours ago [-]
There is no indication that the compiler that produced the code was Microsoft's. Actually the article hints otherwise ("[...] whatever compiler was used to compile this code").
notorandit 13 hours ago [-]
Who has been validating that approach to solve their own optimization target?
notorandit 18 hours ago [-]
> they fixed it during emulation
It means the fix was applied to run during the emulation loop execution, not that the fix was found and applied while the emulation loop was running.
Which would have made it an emulation code escape.
pantulis 10 hours ago [-]
I was just curious and checked The Old New Thing archive... yes I've been reading Raymond Chen's stories for as long as I remember but hey, it's been 23 years of delivering consistently solid stories about Windows.
canucker2016 7 hours ago [-]
I was looking through the compiler docs about memory allocation and I found the section about the debug version of the CRT which could fill the allocated memory with a non-zero canary value to help detect uninitialized memory (assuming you weren't calling calloc - which zero-init's allocated memory).
But there wasn't any similar programmatic debugging aid for detecting uninitialized stack memory.
Going further down the rabbit hole, I discovered the _chkstk function.
The MS C compiler would emit a call to _chkstk on function entry to ensure that stack memory had been paged in. But further reading noted that _chkstk was only emitted if the function allocated a lot of stack memory. And there was source code! MS included the assembly language source code for _chkstk in the CRT source code, installed with compiler.
I needed _chkstk to be emitted for every function not only for functions that allocated >= 4KB of stack variables.
Curses, foiled again.
Then, while perusing the list of compiler command line switches, I see "/Ge".
/Ge (Enable Stack Probes)
Activates stack probes for every function call that requires storage for local variables.
Ahhhhh! The grey storm clouds parted and the sun's rays shone down on me in their warmth.
I had all the pieces I needed to fill uninitialized stack memory with a non-zero canary value so I could make detection of uninitialized stack variables more reliable.
_stkfil was born
Modifying _chkstk was easy. I needed to write to every byte of stack in a stack page instead of reading only 4 bytes and skipping to the next page of stack.
While I was mucking in the bowels of modifying _chkstk, I added a 4-byte global variable to hold my canary value. Let the app override what value to use.
In debug builds, _stkfil helped find a couple of bugs, but soon all the stray uninited stack vars were gone and the code was forgotten.
InitAll - Automatic Initialization
In addition to the previously mentioned approaches, Microsoft is now using a feature known as InitAll which performs automatic compile-time initialization of stack variables.
This section documents how Windows is using this technology and the rationale for why.
Current Windows Settings
The following types are automatically initialized:
- Scalars (arrays, pointers, floats)
- Arrays of pointers
- Structures (plain-old-data structures)
The following are not automatically initialized:
- Volatile variables
- Arrays of anything other than pointers (i.e. array of int, array of structures, etc.)
- Classes that are not plain-old-data
For optimized retail builds, the fill pattern is zero. For floats the fill pattern is 0.0.
For CHK builds or developer builds (i.e. unoptimized retail builds), the fill pattern is 0xE2. For floats the fill pattern is 1.0.
There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.
When I started digging into it, I realized the reason was the game was using something like
fread(data, 1, 65536, fptr);
instead of
fread(data, 65536, 1, fptr);
Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API. Since my code was hooking on ReadFile system call, and my call was heavier than ReadFile, the game loading felt really slow. Unusable. It would have not been fun for players.
The easy fix was to swap arguments for certain calls. The long fix required to use an internal cache to account for these cases so that the hooked ReadFile was faster when data was already in disk.
Funny thing is that as we started rolling out the tech and applying it to more and more games we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have load all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.
To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels, we made that very fast! we didn't tell them :-)
At the time 24-bit framebuffers were so slow that before we built graphics acceleration hardware people would switch back to 8-bit to get stuff done, making 24-bit/true colour your daily driver was a big step forward.
> We considered creating a plugin that fixed all these things, it would have been hard to maintain, in the end we travelled around to the people who made these apps and talked them through their problems
Since talking to developers is no longer an option, I actually do write "Such-and-such Tune-up" extensions that patch applications dynamically to make them run better (or at all) in Advanced Mac Substitute, or even Mac OS itself.
I also worked on the original A/UX port for the Mac II, some hardware (like the IWM) required tiny buzzy loops, we ran into one bug where using the floppy caused ADB to freeze, but only on the release machines, not the prototypes all our engineers had, turned out there was hardware that made access to the VIA faster by pulling the clock in for 1 cycle, if you sat in a loop reading the timer in the via to measure a sector time for the IWM in too tight a loop it upped the output clock from the VIA to the ADB chip and over clocked it ....
PS – I am looking through the NuBus cards that I have... did you work for SuperMac or RasterOps?
I did the architectural design for the SuperMac cards. I figured out what needed to be accelerated, dropping code into people's machines to see where the cycles were going. Others did the physical design for the first 2 cards, I did the design of the chip in the Thunder and later cards (designed the data paths and state machines and a full simulation, someone else actually laid the gates)
If your card has a SQD01 on it it's my work. It peaks at 1.5Gb/s on solid fills
One of the other bugs (the Quark/ATM one) was also because the programmers were worried about writing over stuff that hadn't been completely erased. The Quark guys wrote a string with 2 spaces at the end through a box that masked the end of the string, the ATM font renderer saw it couldn't fit the text so it split it in half and tried again, so it drew N/2 N/4 N/8 ... strings. It spent all its time in the 68k's multiply instructions figuring out how wide the strings (and substrings) were; our fancy 24-bit character rendering hardware was an afterthought
I feel like I'm having a stroke trying to read this, what does it mean??
I was capturing QuickDraw library calls - the low level graphics primitives, to figure out where the graphics time in apps was going and found out sometimes excel did it 9 times
Of course users didn't see it more than once, but our hardware made all that wasted time run faster
Another dev who's fixing a bug, realizes if they call a certain function either directly or indirectly, their particular bug gets fixed.
Oh, and as a side effect, the cell gets erased (again).
A few more fixes/new features added like this and the code is inadvertently erasing the same cell multiple times.
It takes a certain type of dev to step through in a debugger and notice the app is doing way too much work and then to untangle the mess of code without causing regressions.
8 bit pseudo color, so the color palette switched with every focus-follows-mouse window boundary crossing. 16 bit direct color with banding but no more palette psychedelia.
This was equal parts to make it faster and to allow for higher framebuffer resolutions with limited VRAM.
Back then you did what you could with graphics and it wasn't a lot. After I got a PC I had indexed color for a long time and working with indexed color was pretty rough because anything physics-based like rendering or raytracing was going to be difficult. You could render a photo pretty well with 256 carefully chosen colors and dithering but if you wanted to, say, composite two photos and do general sorts of things you'd need to convert to "true color", do the math there, then re-quantize for display.
Was it a workaround for things that didn’t fully complete on one iteration, so the devs kept hammering away at it until it worked?
Not every bug results in the program doing the wrong thing, they often just make the program do the right thing very slowly.
And nobody notices, since it still produces the right result.
Now the bugs that get ignored for new features cause bad results AND bad performance.
If the stream is buffered, then all operations, including fread, are supposed to go through the buffer.
All three of these should issue buffer-sized reads to the operating system:
1. A loop which calls getc(stream) 65536 times.
2. fread(buf, 1, 65536, stream)
3. fread(buf, 65536, 1, stream)
The more direct behavior of fread should only kick in if the stream is configured as unbuffered.
I would say that the way low-level reads are issued to the host operating system is a "visible effect" of the program, so I suspect this may actually be a matter of conformance. I.e. it's not okay to issue those reads however the stream library wants as long as the data is read.
Edit: removed incorrect information.
See the original post and discussion for the whole story:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times... https://news.ycombinator.com/item?id=26296339
What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?
Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes, the new one only can read zero or 65,536 bytes.
(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
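To make the return-value difference concrete, here is a small C sketch. The helper name is made up; it writes a few bytes to a temporary file and reports what fread returns for a given size/count pair:

```c
#include <stdio.h>
#include <string.h>

/* Writes `available` bytes to a temporary file, rewinds it, and calls
   fread(buf, size, count, f). Returns fread's result, or (size_t)-1 if
   the setup fails. Limited to 64 bytes to keep the sketch small. */
static size_t fread_result(size_t available, size_t size, size_t count) {
    unsigned char data[64];
    unsigned char buf[64];
    if (available > sizeof data || size * count > sizeof buf)
        return (size_t)-1;
    FILE *f = tmpfile();
    if (!f) return (size_t)-1;
    memset(data, 'x', sizeof data);
    if (fwrite(data, 1, available, f) != available) {
        fclose(f);
        return (size_t)-1;
    }
    rewind(f);
    size_t n = fread(buf, size, count, f);  /* number of complete items */
    fclose(f);
    return n;
}
```

With 10 bytes in the file, asking for 16 items of 1 byte reports 10 items, while asking for 1 item of 16 bytes reports 0, so swapping the arguments is only safe when the caller never relies on a short count.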
> For each object, size calls are made to the fgetc function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object
(wording unchanged since C99)
If the file is unbuffered, depending on how the implementation handles buffering, and how it interprets the standard, then perhaps it does end up hitting a path where there's 1 ReadFile call per byte...
I don't know how most implementations get around this. Presumably it's valid to interpret "calls are made" as "behaving as if calls are made", meaning fread can copy data out of the FILE's buffer directly, or make calls directly to whatever routine fgetc defers to, rather than calling fgetc N times literally. Looks like glibc's fread does this.
As to why you'd do that? - well, who knows the exact circumstances in this case. Perhaps this was faster in some meaningful case that was relevant to some other project (and then maybe the fread doesn't call fgetc after all!). I'm just speculating. Well-reused code often ends up with stuff that needs rethinking, that, even if noticed, nobody has the time or inclination to attempt to fix.
I had to convince people with benchmarks regularly that, yes, you could write the handful of lines to do proper user-space buffering and trivially run rings around any code that did extra context switches, because a lot of people didn't realise the cost difference between system calls and calling their own functions.
This included, by the way, the MySQL client library, at one point, which would do small reads for length fields instead of larger non-blocking reads into a buffer all the time.
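A minimal sketch of that kind of user-space buffering in C. The `Source` type stands in for an expensive underlying read (a syscall, or a hooked ReadFile) and simply counts how often it is called; all names and the 4096-byte buffer size are assumptions:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for an expensive low-level read; counts its calls. */
typedef struct {
    const unsigned char *data;
    size_t len, pos, calls;
} Source;

static size_t source_read(Source *s, void *dst, size_t n) {
    size_t left = s->len - s->pos;
    if (n > left) n = left;
    memcpy(dst, s->data + s->pos, n);
    s->pos += n;
    s->calls++;
    return n;
}

/* Buffered reader: refills in 4096-byte chunks, serves small reads
   out of the buffer. */
typedef struct {
    Source *src;
    unsigned char buf[4096];
    size_t start, end;
} BufReader;

static size_t buf_read(BufReader *r, void *dst, size_t n) {
    unsigned char *out = dst;
    size_t done = 0;
    while (done < n) {
        if (r->start == r->end) {  /* buffer empty: one big read */
            r->end = source_read(r->src, r->buf, sizeof r->buf);
            r->start = 0;
            if (r->end == 0) break;  /* end of data */
        }
        size_t chunk = r->end - r->start;
        if (chunk > n - done) chunk = n - done;
        memcpy(out + done, r->buf + r->start, chunk);
        r->start += chunk;
        done += chunk;
    }
    return done;
}

/* Reads `total` bytes one at a time and returns how many low-level
   reads that took, including the final one that reports end of data. */
static size_t raw_reads_for_bytewise(size_t total) {
    static unsigned char data[16384];
    if (total > sizeof data) total = sizeof data;
    Source s = { data, total, 0, 0 };
    BufReader r = { &s, {0}, 0, 0 };
    unsigned char byte;
    while (buf_read(&r, &byte, 1) == 1) {}
    return s.calls;
}
```

Reading 10000 bytes one at a time through this reader costs 4 low-level reads instead of 10000, which is the whole point of the cache fix described earlier in the thread.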
But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.
No, I'm not saying that's why. I'm simply saying there is a difference between asking for 1 byte or 65k bytes of something. Even dd runs the same under Linux.
dd bs=10k count=1 is faster than bs=1 count=10k
I remember trying to recover some data from a spinning disk, and trying to slowly creep up on the data. So I wanted 1 byte per, I wanted it to nibble, until it hit whatever the errored part was. If I just grabbed the lot, it'd error out from the whole read.
The latter (as usual when comparing OpenBSD and Linux) is more complex, but both multiply count by size and then go their way.
Also, the API contract allows fread to read fewer bytes than requested. I would expect any implementation to do that.
But maybe, somebody interpreted the contract differently than major OSes, in the sense that a call isn’t allowed to write partial size-sized chunks to user memory and/or advance the file position further than its return value advocates (that, I think, is something that the implementations above can do, and might be considered a bug)
Yes it's different. As others have noted, the difference is what is returned if less than 65536 are available to read in the file: total failure vs partial read.
There is, unsurprisingly, no requirement that it has an unnecessarily inefficient implementation to meet this behavioral requirement. (The C standard doesn't talk about such things as syscalls but, even if it did, it surely wouldn't require such a thing.)
The irony is that that partial read is actually the default on both Windows and Posix (i.e. both ReadFile and read() will read up to the number of bytes specified). So a one-syscall implementation for fread would have been easier than multiple calls, and certainly would be standard compliant.
The dd example isn't comparable because dd is much lower level, and you really are specifying how the syscalls should be made.
As many examples out there use int/char etc to show how to use the thing. But if you switch to structs that fwrite can totally burn you if you use the sizeof call. As the sizeof a struct can vary between platforms and compilers. Depending on packing. Then endianness can sometimes mess you up. If you are reading/writing for yourself you can get away with a lot. But if you are trying to interop then you have to be wildly careful what you do.
fwrite is another one where people will do one byte at a time (the same goes for the Windows version). Bash out a loop, use sizeof as the bound of the for loop, copy and paste something that writes just 1 byte, and you can easily end up here. In one program I added a cache in front of the thing so it would always write on disk block boundaries and then come back for more. I started off with just packed struct sizes but the perf was just 'ok'. The file block boundary thing really made it fast. Not all OSes have a readahead/write buffer behind that call, so perf can vary.
It is honestly such an easy mistake to make, as many of the examples/docs do not really show you why/how to use both of those calls in the way needed. You sort of have to stumble into it and work it out.
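A rough sketch of that kind of write cache, not the commenter's actual code. It assumes a 4096-byte block size and a hypothetical sink callback standing in for the real write call; counting_sink exists only so the effect can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_SIZE = 4096 };  /* assumed disk block size */

/* Hypothetical destination for buffered data; returns 0 on success. */
typedef int (*sink_fn)(void *ctx, const uint8_t *data, size_t len);

struct block_writer {
    uint8_t buf[BLOCK_SIZE];
    size_t  used;
    sink_fn sink;
    void   *ctx;
};

void bw_init(struct block_writer *w, sink_fn sink, void *ctx)
{
    assert(sink != NULL);
    w->used = 0;
    w->sink = sink;
    w->ctx  = ctx;
}

/* Hands any pending bytes (possibly a short final block) to the sink. */
int bw_flush(struct block_writer *w)
{
    if (w->used == 0)
        return 0;
    int rc = w->sink(w->ctx, w->buf, w->used);
    if (rc == 0)
        w->used = 0;
    return rc;
}

/* Accepts writes of any size; the sink is called once per full block. */
int bw_write(struct block_writer *w, const void *data, size_t len)
{
    const uint8_t *src = data;
    while (len > 0) {
        size_t room = BLOCK_SIZE - w->used;
        size_t n = len < room ? len : room;
        memcpy(w->buf + w->used, src, n);
        w->used += n;
        src += n;
        len -= n;
        if (w->used == BLOCK_SIZE) {
            int rc = bw_flush(w);
            if (rc != 0)
                return rc;
        }
    }
    return 0;
}

/* Example sink that only counts calls and bytes. */
struct counts { size_t calls, bytes; };

int counting_sink(void *ctx, const uint8_t *data, size_t len)
{
    struct counts *c = ctx;
    (void)data;
    c->calls++;
    c->bytes += len;
    return 0;
}
```

Feeding 10000 one-byte writes through bw_write reaches the sink twice with full blocks, plus once more for the 1808-byte tail on bw_flush. The writes only land on real block boundaries if the output starts at a block-aligned offset.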
Once you see it you know. But until then you do not really notice if it is 'working'.
I've not looked at the code (or even the man pages) and it is a long time since I touched anything that low level, so this might be completely wrong, but if there is an error before the next 64KiB (including just hitting EOF) then the semantics could be different. Asking for 1x64KiB I would expect to just error as there aren't the requested number of bytes. Asking for 64Ki lots of 1 byte might simply error just the same, or it might at least populate the buffer with what it can read, or if the meaning of 1,65536 is actually “up to 64Ki lots of 1B” then it would populate the buffer as far as possible and return the amount read rather than an error condition.
If the per-byte option is slow but still fast enough, and dealing with the semantics is less faff, then people will go for that because the tiny time loss is worth the larger effort reduction. Of course this assumes the underlying system doesn't change, as with the “making local code run as on-demand networked code” example higher in the thread, which changes the relative performance characteristics of the two calling methods significantly.
Is this just the case of Windows having a bad stdlib fread implementation 15 years ago or is my thinking here actually wrong?
The C runtime authors did (presumably Microsoft, if it's MSVCRT).
He's hooking into ReadFile, a layer below the stdlib. By the time it reaches the hook, it's already split.
For example, I run TortoiseGit, which has a caching feature that is supposed to make it faster at showing what to commit. Disabling it increases the number of items I can delete per second in Windows Explorer from about 1000 to about 3000 while not making TortoiseGit operations meaningfully slower (that I can tell).
This is a Dev Drive [0] on my machine, it would probably be slower on my C: drive which has full Windows Defender real time file scanning.
[0]: https://learn.microsoft.com/windows/dev-drive/
Nah, it's equally slow on a system with everything ripped out (Defender, filters, even logging).
This is a great article on why it's so unreasonably slow to modify these archives: https://textslashplain.com/2021/06/02/leaky-abstractions/
But it doesn't seem to explain why it's so much slower at regular extraction.
But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!
Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.
Edit: mort96: So did you check the return value or not?
That's because the OS does the same thing too. It's the right fix; when I implemented something similar, we implemented caching right away.
I really hope that was not the case and would rather think it was incompetence or a way to deal with obscure legacy problems, but the gamer in me gets enraged at the thought that someone would artificially increase loading times.
Which is the obvious reason you'd pass an element size of 1: you want to know how many bytes were read.
https://silentsblog.com/2025/04/23/gta-san-andreas-win11-24h...
The optimizer was allowed, but not obligated, to transform that into: x = a
However, in this case, b was sometimes 0. And if so, the unoptimized version computed: x = a / 0 * 0 = Inf * 0 = NaN
So badness ensued if that particular path didn't get optimized, which could happen under various circumstances. We had to add some code to ensure that transformation always happened on that game.
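The unoptimized arithmetic is easy to reproduce on the CPU. This is only a sketch, assuming IEEE 754 floats and a compiler not running in a fast-math mode; the function names are invented, and the original was presumably shader code rather than C.

```c
/* Computes a / b * b exactly as written, with no algebraic simplification
 * (a standards-conforming C compiler may not fold this to `a` by default). */
float divide_then_multiply(float a, float b)
{
    return a / b * b;
}

/* NaN is the only floating-point value that compares unequal to itself. */
int is_nan(float x)
{
    return x != x;
}
```

With b = 2 the result is just a again, but with b = 0 and a nonzero a the division yields an infinity and multiplying that by zero yields NaN, which matches the failure described above.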
- deciding to inform the game developer & wait for reply vs not waiting for reply vs just fixing it yourself without informing the developer; and
- if informed: developer actually fixing it vs only saying they would fix it vs no reply whatsoever (not counting automated "thank you for your inquiry" replies, in cases where you don't already have more direct channels to the dev than email)
I've always kind of wondered this because in a way, it's kind of weird that it's fixed for them, at least for new releases / games actively being developed.
(Full disclosure: I'm a game developer myself, with a very high interest in engine plumbing & dev [including graphics], though finding a job for the latter is easier said than done.)
I'm not sure what fraction of devs actually fix things on their side, though. Once there's a driver workaround, and we've informed the devs, it's off our plate.
Performance is more of a gray area. We contact devs if there's something we can't work around, of course. And if there's something truly breaking. But for things that aren't exactly bugs, just things that could be improved, and we can improve on our own... well, we'll probably keep that for the competitive advantage.
I have very little understanding of how allocation works at the OS level, but I'm surprised there are no wrappers like dgVoodoo or dxWrapper specifically for this kind of issue. There are quite a few old Windows games (Need for Speed 1-4 for a start) that refuse to run on modern OSes due to rather... bold memory management strategies.
[1] - https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...
I also remember a media player being called out by name in the code for doing invalid operations, needing a work around and code to detect it was running just to function.
It's the life of a (game) developer...
An Nvidia employee once told me that one of the easiest ways to squeeze out a few extra frames on your old machine is to rename the game executable to hl2.exe.
And of course, browser engines also do the same things for certain websites:
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
Unless they are doing some silly things like dropping quality, but that's the "everything else the same" point.
If not, why not have this enabled as default behavior instead?
> What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
I can't quite parse this. Yes, there is no guarantee that the optimizations will work for another game, which is precisely why you can expect an improvement with hl2. With non-hl2, you may get an improvement, you may not, and you may get incorrect behavior.
Everything else is not the same, but hl2 doesn't use the stuff that's different.
This seems genuinely unbelievable. Does anyone have a technical explanation for this?
then driver "optimizes" behavior, sometimes dishonestly (reducing precision), sometimes honestly (working around game engine stupidity)
A lot of people use Nvidia profile inspector to enable reBar on all games and claim that Nvidia is purposely holding back performance, but doing this causes many games to crash.
Nvidia probably doesn't officially say anything about this, and 99.9% of people do not rename the process.
Nvidia even has an official API for a game to identify itself, so they don't need to look at the executable name.
Also there was one "that checked if you were printing a specific string used by a popular benchmark program. If so, then it only drew the string a quarter of the time and merely returned without doing anything the other three quarters of the time".
[1]: https://devblogs.microsoft.com/oldnewthing/20040305-00/?p=40...
Windows 95 patched a bug in SimCity just to get it to work.
Actually, the standard way of allocating 64 kB of memory on the stack is to just assume you can do it, subtract 64k from the stack pointer, and hope for the best.
Most stack allocations in the wild are not checked.
In order to protect against this, the compiler inserts some dummy reads or writes as needed to ensure every page is touched in order from bottom to top. This ensures the guard page is hit before the application has a chance to write to memory beyond it.
Here's an example: https://godbolt.org/z/oTbzTczM6
However, as someone who looks a lot at instruction traces, I could probably write one on why Linux kernel code sucks too. One of my current pet peeves is the way Linux walks bitmasks of CPU bits, which is a reasonably common operation. Due to a chain of unfortunate changes and decisions it currently needs 16+ instructions to find the next bit, something for which the x86 instruction set has a single instruction. Of course that is so big that it is even outlined, adding even more overhead.
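For a single 64-bit word, the "find the next set bit" operation can be sketched as below. This is not the kernel's code; next_set_bit is a made-up name, and __builtin_ctzll is a GCC/Clang builtin that normally compiles down to one bit-scan instruction on x86.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the index of the lowest set bit at position >= start,
 * or -1 if there is none. Only handles a single 64-bit word. */
int next_set_bit(uint64_t mask, int start)
{
    assert(start >= 0);
    if (start >= 64)
        return -1;

    /* Clear every bit below `start`, then count trailing zeros. */
    mask &= ~UINT64_C(0) << start;
    if (mask == 0)
        return -1;  /* __builtin_ctzll(0) is undefined, so check first */
    return __builtin_ctzll(mask);
}
```

Walking a mask is then a loop such as for (int b = next_set_bit(m, 0); b >= 0; b = next_set_bit(m, b + 1)), which visits the set bits in ascending order.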
I agree it would be stupid for a compiler to even support such a flag, but those were the 1980s/90s.
https://www.reddit.com/r/cpp/comments/1i36ahd/is_this_an_msv...
https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...
That is, until I checked the program I used for testing (which I didn't write), and found the following code:
With the original allocator, this worked fine, since the deallocation didn't touch the memory. My allocator, however, overwrote the field during the deallocation with bookkeeping stuff, which meant the returned value was not what the programmer intended, and after a short while the program crashed.
Unlike TFA, I had the luxury of just fixing the test program.
With more and more code being written with AI (which has notoriously inefficient solutions to simple problems), I expect this issue to become more prevalent. I just hope we optimize at the source of the problem (AI and humans using it) and not on platforms (compiler and engine/kernel heuristics)
I do old school embedded, the amount of desktop bloat is insane. Any function I really need to refactor, I can reduce size and improve performance. And there are better engineers out there that are more efficient than me.
In this case we're talking about a tight initialization loop with probably a single instruction in the body. The HW optimizations necessary to make a loop like this perform equally to the unrolled form are so rudimentary that they're taken for granted on basically any CPU, even 30 years ago. Seriously, we're talking about optimizations I made in an "intro to Verilog" class as an undergrad, and I'm not even a HW engineer.
It also depends how often this code is being hit. Does the code run once while the program loads? Nobody will notice a 2 microsecond improvement in loading times. Does the code run in a timing-sensitive hot path, like a game loop or a GUI rendering thread? Well now optimization matters. But again, consider the HW argument above.
Also remember that, back then, storage wasn't cheap. 256K of code is 18% of a 1.44MB floppy, and 35% of a 720K floppy.
Agreed.
Ah, yes. Microsoft's!
It means the fix was applied to run during the emulation loop execution, not that the fix was found and applied while the emulation loop was running.
Which would have made it an emulation code escape.
But there wasn't any similar programmatic debugging aid for detecting uninitialized stack memory.
Going further down the rabbit hole, I discovered the _chkstk function.
The MS C compiler would emit a call to _chkstk on function entry to ensure that stack memory had been paged in. But further reading noted that _chkstk was only emitted if the function allocated a lot of stack memory. And there was source code! MS included the assembly language source code for _chkstk in the CRT source code, installed with the compiler.
I needed _chkstk to be emitted for every function, not only for functions that allocated >= 4KB of stack variables.
Curses, foiled again.
Then, while perusing the list of compiler command line switches, I see "/Ge".
Ahhhhh! The grey storm clouds parted and the sun's rays shone down on me in their warmth. I had all the pieces I needed to fill uninitialized stack memory with a non-zero canary value so I could make detection of uninitialized stack variables more reliable.
_stkfil was born
Modifying _chkstk was easy. I needed to write to every byte of stack in a stack page instead of reading only 4 bytes and skipping to the next page of stack.
While I was mucking in the bowels of modifying _chkstk, I added a 4-byte global variable to hold my canary value. Let the app override what value to use.
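The fill-and-detect idea can be illustrated in portable C on an ordinary buffer. This is only an analogy, since the real _stkfil was a modified assembly _chkstk operating on the actual stack; fill_canary, fill_region and slot_untouched are hypothetical names, and 0xCCCCCCCC is an assumed default.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the overridable 4-byte global canary. */
uint32_t fill_canary = 0xCCCCCCCCu;

/* Writes the canary pattern over every byte of the region,
 * rather than touching only one word per page. */
void fill_region(void *region, size_t len)
{
    uint8_t *p = region;
    uint8_t pattern[sizeof fill_canary];
    memcpy(pattern, &fill_canary, sizeof pattern);
    for (size_t i = 0; i < len; i++)
        p[i] = pattern[i % sizeof pattern];
}

/* Returns 1 if the 32-bit slot at offset `off` (a multiple of 4)
 * still holds the canary, i.e. nothing has written to it yet. */
int slot_untouched(const void *region, size_t off)
{
    uint32_t v;
    memcpy(&v, (const uint8_t *)region + off, sizeof v);
    return v == fill_canary;
}
```

A slot that still reads back as the canary was very likely never initialized. Any value that legitimately equals the canary is a false positive, which is one reason to let the app choose the value.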
In debug builds, _stkfil helped find a couple of bugs, but soon all the stray uninited stack vars were gone and the code was forgotten.
Then I read about InitAll in https://www.microsoft.com/en-us/msrc/blog/2020/05/solving-un...
solidity sweating profusely