Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tellg returns -1 when using clang++ to get the size of a large file (around 6GB) on Windows #462

Open
gevikhn opened this issue Oct 24, 2024 · 6 comments

Comments

@gevikhn
Copy link

gevikhn commented Oct 24, 2024

When compiling C++ code with clang++ on Windows, the tellg() function returns -1 when attempting to get the size of a large file (around 6GB). However, when compiling the same code with MSVC or g++, the file size is retrieved correctly.

#include <fstream>
#include <iostream>
int main() {
    std::ifstream inFIle("test.file", std::ios::binary|std::ios::in);
    if(!inFIle.is_open() || !inFIle.good()) {
        std::cout << "fail" << std::endl;
        return 0;
    }
    inFIle.seekg(0, std::ios::end);
    auto size = inFIle.tellg();
    inFIle.seekg(0, std::ios::beg);

    std::cout << size << std::endl;

    return 0;
}
@mstorsjo
Copy link
Owner

There's a combination of two issues here.

With the mingw headers, off_t defaults to a 32 bit size, unless building with -D_FILE_OFFSET_BITS=64.

Secondly, the code that uses off_t and ftello and fseeko in libcxx is in the headers, so it doesn't help to rebuild libcxx with this option; the user that includes libcxx headers will need to set _FILE_OFFSET_BITS=64 before including those haeders.

When setting up the base ABI for UCRT, it's a shame that we didn't think of defaulting off_t to 64 bit there, as we already did deviate a bit from the existing status quo, by making time_t default to 64 bit. Or what would you think of changing the default for off_t for UCRT configurations to 64 bit? It would be an ABI break, but I wonder how much of an issue it would be. CC @lhmouse @cjacek @mati865 @lazka @jeremyd2019

FWIW, if comparing with MSVC, they don't really have anything like off_t, so with MSVC headers, users are expected to manually call e.g. _ftelli64 instead of ftello. Building libcxx in MSVC mode also runs into this same bug, and there, setting _FILE_OFFSET_BITS=64 doesn't help at all (as it's not trying to use anything like ftello). The MS STL probably directly calls _ftelli64 instead.

@mstorsjo
Copy link
Owner

I remembered there's another thing that we could do on the libcxx side, though - we can configure libcxx with -DLIBCXX_EXTRA_SITE_DEFINES="_FILE_OFFSET_BITS=64", which adds that define into libcxx's __config_site which all libcxx headers include. So that way, libcxx mostly would work as expected, with the small unexpected side effect that including standard C++ headers affects the user's a little. Although if the user already has included e.g. stdio.h before libcxx headers, it won't help either?

(GCC's libstdc++ does something similar for __USE_MINGW_ANSI_STDIO=1, with similar caveats. For msvcrt environments, that's mostly always set by default these days anyways, but for UCRT we still tend to default to having it disabled.)

@lazka
Copy link

lazka commented Nov 19, 2024

When setting up the base ABI for UCRT, it's a shame that we didn't think of defaulting off_t to 64 bit there, as we already did deviate a bit from the existing status quo, by making time_t default to 64 bit. Or what would you think of changing the default for off_t for UCRT configurations to 64 bit? It would be an ABI break, but I wonder how much of an issue it would be. CC @lhmouse @cjacek @mati865 @lazka @jeremyd2019

time_t is 64bit in MSVC now but off_t is still 32bit, so it doesn't seem too wrong what we have now when trying to reduce the difference there. I'd prefer MSVC compat if possible. Any cross-platform code which tries to support MSVC and >2GB files will have to avoid off_t anyway.

I remembered there's another thing that we could do on the libcxx side, though - we can configure libcxx with -DLIBCXX_EXTRA_SITE_DEFINES="_FILE_OFFSET_BITS=64", which adds that define into libcxx's __config_site which all libcxx headers include.

In MSYS2 I remember one case where it broke things since this leads to mingw-w64 defining "stat" https://github.com/mingw-w64/mingw-w64/blob/fab7cfe34e5388c920230d0237e3663804d91458/mingw-w64-headers/crt/sys/stat.h#L276 which clashed with a C++ class called stat in libtorrent (arvidn/libtorrent@44da281). Just an warning that there might be fallout.

@cjacek
Copy link

cjacek commented Nov 19, 2024

Or what would you think of changing the default for off_t for UCRT configurations to 64 bit? It would be an ABI break, but I wonder how much of an issue it would be.

This seems justified to me.

@lhmouse
Copy link

lhmouse commented Nov 20, 2024

time_t is 64bit in MSVC now but off_t is still 32bit, so it doesn't seem too wrong what we have now when trying to reduce the difference there. I'd prefer MSVC compat if possible. Any cross-platform code which tries to support MSVC and >2GB files will have to avoid off_t anyway.

Microsoft defines off_t as a 'long integer', equivalently, a 32-bit integer: https://learn.microsoft.com/en-us/cpp/c-runtime-library/standard-types?view=msvc-170

It has been long since Platform SDK for Windows 2003. I am bit curious why there is such a POSIX type; it is an 'old name' and is deprecated anyway; why don't users prefer int64_t instead?

That being said, whether libc++ defines _FILE_OFFSET_BITS=64 is questionable; but despite that, it really should call ftello64() and feeko64() explicitly.

@mstorsjo
Copy link
Owner

time_t is 64bit in MSVC now but off_t is still 32bit, so it doesn't seem too wrong what we have now when trying to reduce the difference there. I'd prefer MSVC compat if possible. Any cross-platform code which tries to support MSVC and >2GB files will have to avoid off_t anyway.

That's true, but on the other hand, MSVC headers don't really use off_t so for them it essentially doesn't matter. For them it's only used for the st_size field in the 32 bit size versions of the stat structs (in the form _off_t though, but they're linked).

Switching the type would be beneficial for the unix originated projects which may be using ftello/fseeko (if available, otherwise just ftell/fseek) and not wanting to manually do something else.

I remembered there's another thing that we could do on the libcxx side, though - we can configure libcxx with -DLIBCXX_EXTRA_SITE_DEFINES="_FILE_OFFSET_BITS=64", which adds that define into libcxx's __config_site which all libcxx headers include.

In MSYS2 I remember one case where it broke things since this leads to mingw-w64 defining "stat" https://github.com/mingw-w64/mingw-w64/blob/fab7cfe34e5388c920230d0237e3663804d91458/mingw-w64-headers/crt/sys/stat.h#L276 which clashed with a C++ class called stat in libtorrent (arvidn/libtorrent@44da281). Just an warning that there might be fallout.

Right, indeed - yeah there may be fallout.

(That's a good reference though - I was wondering if defining _FILE_OFFSET_BITS=64 wouldn't break the use of stat, but that redirect seems to make it work as intended anyway.)

Or what would you think of changing the default for off_t for UCRT configurations to 64 bit? It would be an ABI break, but I wonder how much of an issue it would be.

This seems justified to me.

Also regarding uses of both stat and ftello - as each individual object file references different stat functions and ftell functions depending on the header configuration, an ABI break like this wouldn't mostly be visible at all. It would only be visible if you have data types that include off_t (or stat structs) on an ABI boundary (e.g. combining an old object file built with old headers, with a new one built with new headers, touching a struct containing off_t, or using an API across DLL boundaries, if that API uses structs with off_t).

time_t is 64bit in MSVC now but off_t is still 32bit, so it doesn't seem too wrong what we have now when trying to reduce the difference there. I'd prefer MSVC compat if possible. Any cross-platform code which tries to support MSVC and >2GB files will have to avoid off_t anyway.

Microsoft defines off_t as a 'long integer', equivalently, a 32-bit integer: https://learn.microsoft.com/en-us/cpp/c-runtime-library/standard-types?view=msvc-170

It has been long since Platform SDK for Windows 2003. I am bit curious why there is such a POSIX type; it is an 'old name' and is deprecated anyway; why don't users prefer int64_t instead?

It's probably meant for transitionary uses, for cases where you earlier might have been ok with having 32 bit file sizes/offsets, but wanted the same API to upgrade to 64 bit file sizes later. In modern days, there's indeed not much reason to bother with it...

That being said, whether libc++ defines _FILE_OFFSET_BITS=64 is questionable; but despite that, it really should call ftello64() and feeko64() explicitly.

That's indeed right; if libcxx wants to have working >2 GB file sizes in MSVC build configurations, it would need to switch to such an API anyway. And if it does that, it doesn't matter how we define off_t or ftello by default.

mstorsjo added a commit to mstorsjo/llvm-project that referenced this issue Jan 11, 2025
This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462.
mstorsjo added a commit to mstorsjo/llvm-project that referenced this issue Jan 11, 2025
This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462.
mstorsjo added a commit to mstorsjo/llvm-project that referenced this issue Jan 11, 2025
This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462; this is known
to fail on Windows right now, where the fseek/ftell calls end
up truncated to 32 bits.
mstorsjo added a commit to mstorsjo/llvm-project that referenced this issue Jan 13, 2025
This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462; this is known
to fail on Windows right now, where the fseek/ftell calls end
up truncated to 32 bits.

The test for this, unfortunately, requires temporarily creating
a 4 GB file.
mstorsjo added a commit to llvm/llvm-project that referenced this issue Jan 15, 2025
This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462; this is known to fail
on Windows right now, where the fseek/ftell calls end up truncated to 32
bits.

The test for this, unfortunately, requires temporarily creating a 4 GB
file.
github-actions bot pushed a commit to arm/arm-toolchain that referenced this issue Jan 15, 2025
…#122798)

This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462; this is known to fail
on Windows right now, where the fseek/ftell calls end up truncated to 32
bits.

The test for this, unfortunately, requires temporarily creating a 4 GB
file.
mstorsjo added a commit to mstorsjo/llvm-project that referenced this issue Jan 15, 2025
This allows using the full 64 bit range for file offsets.

This should fix the issue reported downstream at
mstorsjo/llvm-mingw#462.
paulhuggett pushed a commit to paulhuggett/llvm-project that referenced this issue Jan 16, 2025
This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462; this is known to fail
on Windows right now, where the fseek/ftell calls end up truncated to 32
bits.

The test for this, unfortunately, requires temporarily creating a 4 GB
file.
DKLoehr pushed a commit to DKLoehr/llvm-project that referenced this issue Jan 17, 2025
This adds a test for an issue reported downstream at
mstorsjo/llvm-mingw#462; this is known to fail
on Windows right now, where the fseek/ftell calls end up truncated to 32
bits.

The test for this, unfortunately, requires temporarily creating a 4 GB
file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants