Hello all,
I'm encountering a problem when using shared folders where files that are written to by a single process are being corrupted if another thread in that same process merely reads from the same file under a separately opened file handle.
This is occurring with VMware Fusion 4.1.4 on Mac OS X 10.7.5, using CentOS 5.8 as the guest OS. The corrupted files end up with blocks of zero bytes overwriting a portion of their data. The size of the blocks does not generally match the length of the missing data.
I have attached the source code to a simple C++ test program that exhibits the problem fairly consistently. A Makefile is included to build the executable.
Usually it hits the bad case on the second or third attempt when it writes out to a shared folder. It doesn't hit the bad case at all when it's outputting to the local file-system. The test program returns non-zero if it reproduces the error, so a simple shell loop can be used to continually run the program until the bad case is hit.
The program has two threads. Each thread has a separately opened file handle to the same file. The first thread opens both handles: it first opens the file for writing, and then opens a second handle for reading back from the file being written.
It then sets up the second thread, which is given a file handle for writing, and this second thread writes ever-increasing consecutive integers to the file until it is signalled to stop. After writing out a single unsigned integer, it flushes the file.
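For anyone who would rather not open the attachment, the writer's loop amounts to roughly the following. This is a simplified sketch rather than the attached code verbatim: the name write_consecutive and the max_count cap are made up here for illustration (the real program runs until it is signalled), and thread creation is left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>

// Writes 0, 1, 2, ... to `out` as raw unsigned ints, flushing after each one.
// Stops when `stop` becomes true or after `max_count` values. `max_count` is a
// hypothetical cap added so this sketch terminates on its own.
// Returns the number of integers written and flushed successfully.
unsigned write_consecutive(std::FILE* out, const std::atomic<bool>& stop,
                           unsigned max_count) {
    assert(out != nullptr);
    unsigned next = 0;
    while (!stop.load() && next < max_count) {
        if (std::fwrite(&next, sizeof next, 1, out) != 1) break;
        if (std::fflush(out) != 0) break;  // flush after every single integer
        ++next;
    }
    return next;
}
```

The stop flag is only checked between integers, so a value that has been started is always written and flushed before the loop exits.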
The first thread continually uses its read file handle to seek to randomly chosen 64 KB boundaries in the file and attempts to read 64 KB from each. This mimics behaviour in our development system, where we first encountered the corruption. The first thread performs no writes of its own to the write file handle after calling fopen(); it only calls fseek() and fread() on the read file handle, as well as stat() on the filename. It's unknown whether the 64 KB boundary reads are significant to reproducing the problem.
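The reader's access pattern can be sketched along these lines. Again, this is an illustration and not the attached source: read_random_block, kBlockSize and the choice of std::mt19937 are assumptions, and the caller is assumed to pass in the current file size (obtained from stat() in the real program).

```cpp
#include <cassert>
#include <cstdio>
#include <random>
#include <vector>

constexpr long kBlockSize = 64 * 1024;  // 64 KB

// Seeks `in` to a randomly chosen 64 KB boundary at or below `file_size` and
// attempts to read one 64 KB block into `buf`. Returns the number of bytes
// actually read, which is short (or zero) when the block touches end of file.
std::size_t read_random_block(std::FILE* in, long file_size,
                              std::mt19937& rng, std::vector<char>& buf) {
    assert(in != nullptr && file_size >= 0);
    const long blocks = file_size / kBlockSize + 1;  // include the tail block
    std::uniform_int_distribution<long> pick(0, blocks - 1);
    const long offset = pick(rng) * kBlockSize;
    if (std::fseek(in, offset, SEEK_SET) != 0) return 0;
    buf.resize(kBlockSize);
    return std::fread(buf.data(), 1, buf.size(), in);
}
```

Note that nothing here ever writes: the handle is only ever passed to fseek() and fread().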
If SIGINT is received, or a time limit given on the command line expires, the second thread is signalled to stop after writing out any current integer and a subsequent fflush(). The first thread waits on the second thread, and then attempts to verify the output file, reading unsigned integers and checking that they are consecutively numbered from 0. If there's a mismatch, and the value is 0, then more integers are read from the file until we have the full block of zeros. The file position and the number of zeroes (and the number of bytes) are then output to stdout.
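The verification pass described above boils down to something like this. It is a minimal sketch under my own naming (ZeroRun and verify_consecutive are hypothetical and do not appear in the attachment), and it reports the first bad block through a struct instead of printing to stdout.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical result type: where the first mismatch starts and how many
// zero-valued integers follow from there.
struct ZeroRun {
    long offset;          // byte offset of the first mismatching integer
    unsigned long count;  // number of consecutive zero integers found there
};

// Reads unsigned ints from the start of `in` and checks they run 0, 1, 2, ...
// Returns true if the whole file is consistent. On the first mismatch, fills
// `bad` and returns false; if the mismatching value is 0, keeps reading to
// measure the full block of zeros.
bool verify_consecutive(std::FILE* in, ZeroRun* bad) {
    assert(in != nullptr && bad != nullptr);
    std::rewind(in);
    unsigned expected = 0, value = 0;
    while (std::fread(&value, sizeof value, 1, in) == 1) {
        if (value == expected) { ++expected; continue; }
        bad->offset = std::ftell(in) - static_cast<long>(sizeof value);
        bad->count = 0;
        while (value == 0) {
            ++bad->count;
            if (std::fread(&value, sizeof value, 1, in) != 1) break;
        }
        return false;
    }
    return true;
}
```

The size of the zero block in bytes is then bad.count * sizeof(unsigned).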
Any assistance, or even just confirmation of the bug, would be greatly appreciated. It would be nice to know whether similar problems have been encountered in the past, and whether it's likely to be fixed in the near future. If you have any further questions, or things for me to try, please feel free to ask.
Cheers and thanks,
Dominic