[Unionfs] Strange behavior when adding a branch
Yoav Weiss
union342 at unpatched.net
Mon Jul 9 18:55:01 EDT 2007
On Sun, 8 Jul 2007, Erez Zadok wrote:
[...]
>
> BTW, from your script, I see you're using unionctl and probably the older
> unionfs 1.x; branch management has changed significantly in 2.0 (for the
> better). What I describe below is for unionfs 2.x, which may behave
> differently for you. (Can you try unionfs2 please ?)
>
Unfortunately, unionfs 2.0 crashes in my test environment and it happens
too early to debug. I need to set up a non-live-cd environment with unionfs
2.0 so I can debug it. (I built a debian-live image based on 2.0 and the
kernel crashes from the context of unionfs before I have a working rootfs,
so it's tricky to debug - I have no filesystem to dmesg into). I'll try to
debug this crash later.
In any case, 2.0 doesn't fix the situation, according to
Documentation/filesystems/unionfs/concepts.txt:
"...a process that continually re-reads the same file's data will see the
NEW data as soon as the lower file had changed, upon the next read(2)
syscall (even if the file is still open!)..."
So I understand this is actually the intended behavior in 2.0 ?
> Yoav/Shaya, it depends what users really want.
>
As Shaya Potter and Jefferson Ogata already noted, the semantics I
suggested would be the natural behavior in a unix environment. In any case,
the current behavior, where a file is replaced in the middle of a read
without the process being aware of it, is risky and never safe to rely on.
> First, understand that there's little precedent to what Unionfs does. In
> traditional posix OSs, a file or directory has a clear, unmistakable,
> single, unique location -- a one-to-one mapping of files to their locations.
Not necessarily. The one-to-one mapping is with the inode rather than the
filename. Once a file descriptor has been associated with a file, it
always stays associated with the inode this file had at the time of
opening (rather than with the pathname). The canonical representation
of a file is the inode rather than a pathname. If the file is unlinked,
replaced, covered by another mount, or lazy-umounted, the pathname may
resolve to a new inode (or to nothing), while the existing descriptor
keeps pointing to the old file. In the case of unlink, the space is not
freed until the process closes the file descriptor because the inode has
a non-zero refcount.
I'll demonstrate by another script:
--------------------------------------------------------------------------
#!/usr/bin/env python
import os
# Prepare an environment with two same-named files.
os.system("rm -rf /tmp/uniontest")
os.mkdir("/tmp/uniontest"); os.chdir("/tmp/uniontest")
os.mkdir("dir1"); os.mkdir("dir2"); os.mkdir("dir3"); os.mkdir("mnt")
file("dir1/file","w").write("pre-branch\n"+"\r"*4096)
file("dir2/file","w").write("\r"*4096+"post-branch\n")
# Mount the filesystem.
os.system("mount --bind ./dir1 ./mnt")
# Get a descriptor and read the first page.
f=file("mnt/file")
print f.read(4096)
# Cover the mount by a new version of the filesystem.
os.system("mount --bind ./dir2 ./mnt")
print f.read(4096)
# Point proven. Cleanup.
f.close()
os.system("umount ./mnt")
os.system("umount ./mnt")
--------------------------------------------------------------------------
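The results of this script show the standard behavior on a posix system:
the second read() still returns data from dir1/file (the trailing
carriage returns) and not "post-branch" from dir2/file, even though the
pathname now resolves to dir2/file. I think the script I pasted in my
first mail should produce the same results.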
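For anyone who wants to try this without root, here is a rough
unprivileged analogue. It is only a sketch: instead of covering a mount,
it swaps the directory behind the pathname with rename(2) while a
descriptor is open. It is written for Python 3 (unlike the scripts in this
mail), and the directory and function names are made up for the
illustration.

```python
#!/usr/bin/env python3
# Unprivileged analogue of the bind-mount script: swap the directory that a
# pathname resolves through while a descriptor is still open.
# All names below are hypothetical and exist only for this sketch.
import os
import shutil
import tempfile


def swap_directory_under_open_file():
    base = tempfile.mkdtemp(prefix="uniontest-")
    mnt = os.path.join(base, "mnt")
    dir2 = os.path.join(base, "dir2")
    os.mkdir(mnt)
    os.mkdir(dir2)
    with open(os.path.join(mnt, "file"), "w") as f:
        f.write("pre-branch\n")
    with open(os.path.join(dir2, "file"), "w") as f:
        f.write("post-branch\n")

    path = os.path.join(mnt, "file")
    with open(path) as held:
        ino_before = os.fstat(held.fileno()).st_ino
        # Move the old directory aside and put the new one in its place.
        os.rename(mnt, os.path.join(base, "mnt.old"))
        os.rename(dir2, mnt)
        ino_after = os.fstat(held.fileno()).st_ino
        held_data = held.read()

    # A fresh open() of the same pathname now reaches the new file.
    with open(path) as fresh:
        fresh_ino = os.fstat(fresh.fileno()).st_ino
        fresh_data = fresh.read()

    shutil.rmtree(base)
    return ino_before, ino_after, fresh_ino, held_data, fresh_data


if __name__ == "__main__":
    before, after, fresh_ino, held_data, fresh_data = (
        swap_directory_under_open_file())
    print("held descriptor: inode %d -> %d, data %r"
          % (before, after, held_data))
    print("fresh open:      inode %d, data %r" % (fresh_ino, fresh_data))
```

The held descriptor keeps its inode and its content, while a new open() of
the same pathname gets the other file.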
> With Unionfs, however, you've got yourself a fan-out situation -- a
> one-to-MANY mapping of files to their locations. Any time you have "MANY"
> to choose from, you have to answer the question "which one do I pick"?
>
I think the situation is quite similar to the above, so the path has
already been laid for us. We should pick whatever preserves the
one-to-one mapping of inodes, and make sure a changed file always results
in a changed inode.
> Worse, branch management is a nasty action to take, no matter what. Again,
> there's no precedent to having a file system whose content changes mid way.
> Inserting/removing branches is like replacing one drive of a RAID5 array and
> expecting the array to consistently reconfigure itself and present a
> coherent new view of files and directories.
>
It's more like mounting another drive over the same mountpoint, like the
script above does. mount(2) takes care of it and produces consistent
results.
> If you remember throughout the history of unionfs 1.x and now 2.x, we've had
> various modes of operations (e.g., "delete all" and "delete first"), but we
> deprecated modes of operation which didn't appear to be used by users too
> much. We wanted to keep the code as simple and functional as possible.
> Simple, clear, well understood semantics are very important for code
> maintainability (and user sanity :-)
>
I definitely agree with you and appreciate it as a user as well as a
developer. I'm just saying that the default (and only) mode should be the
one that preserves existing descriptors because that's the expected
behavior in a posix environment.
> The situation you mention above, I believe, is one in which you insert a new
> top-level branch, which contains files with the same name, but different
> content. And you have files with the same name open on the older branches.
>
Correct.
> Right now the semantics are simple: a file's content is going to point to
> the left-most version of that file (once you go through the file/dentry
> revalidation upon entry into any Unionfs method). So for files, we re-open
> the file on the leftmost lower branch that it can be found. For
> directories, we open all of the lower directories with the same name, and
> merge them left-to-right.
>
> Now, if I recall, we've had cases where users inserted a new leftmost branch
> and they *wanted* to immediately see the new content: that was the main
> reason they inserted the new leftmost branch (e.g., for software package
> versioning).
>
Sounds risky. If a process is in the middle of reading a file and the
file gets replaced without notice, there's no way for the process to be
sure whether it's reading the new file. The process must be aware of the
oncoming change and seek to a known location in the file before it's
replaced. Even then, the process may be getting some data from the old
file due to caching.
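To make the risk concrete, here is a small sketch of what a reader sees
when the content behind its descriptor changes between two read() calls.
It does not use unionfs at all. It only imitates the effect by rewriting
the same file in place, and the function name is made up for the
illustration (Python 3).

```python
#!/usr/bin/env python3
# Imitation of a "torn" read: the data behind an open descriptor changes
# between two read() calls. Not unionfs; an in-place rewrite stands in for it.
import os
import tempfile


def torn_read():
    fd, path = tempfile.mkstemp(prefix="torn-")
    os.close(fd)
    with open(path, "wb") as f:
        f.write(b"A" * 8)                 # the "old" version

    # buffering=0 keeps Python's own read buffer out of the picture.
    with open(path, "rb", buffering=0) as reader:
        first = reader.read(4)            # first half, old version
        with open(path, "r+b") as writer:
            writer.write(b"B" * 8)        # content changes mid-read
        second = reader.read(4)           # second half, new version

    os.unlink(path)
    return first + second


if __name__ == "__main__":
    # The reader ends up with a mix that matches neither version.
    print(torn_read())
```

With a buffered reader the result could instead be entirely old data,
which is the caching effect mentioned above. Either way the process has no
way to tell which version it got.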
Software package versioning is precisely why I'm worried. Software
updates usually require atomic replacement. Unless the software was
written specifically to deal with non-atomic changes, its behavior is
undefined when some of its files get replaced during execution. Note how
software distribution packages work. For example, when installing a
replacement package for some daemon, the files are replaced, and the
service is restarted in order to start using the new files, after they
were _all_ replaced. Until it's restarted, the daemon keeps working with
the old files despite the fact that they were replaced on the filesystem.
In Linux, if you look in /proc/<pid>/fd/ of a daemon in that state, you
see references to files marked "(deleted)" but the daemon keeps working.
That's probably the only safe way to upgrade unaware software.
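As a rough sketch of that upgrade pattern (not taken from any particular
package manager): the new version is staged next to the target and then
renamed over it, so a process holding the old file open keeps a consistent
old copy until it reopens. The file names and the function name are
hypothetical, and the /proc lookup only works on Linux (Python 3).

```python
#!/usr/bin/env python3
# Sketch of an atomic file replacement while another descriptor is open.
# File names are hypothetical; the /proc lookup is Linux-specific.
import os
import shutil
import tempfile


def upgrade_while_open():
    workdir = tempfile.mkdtemp(prefix="upgrade-")
    target = os.path.join(workdir, "daemon.conf")
    with open(target, "w") as f:
        f.write("version 1\n")

    held = open(target)                   # the running "daemon"

    # Stage the new version, flush it to disk, then rename it into place.
    staged = target + ".new"
    with open(staged, "w") as f:
        f.write("version 2\n")
        f.flush()
        os.fsync(f.fileno())
    os.replace(staged, target)            # atomic rename over the target

    # On Linux the old file typically shows up here marked "(deleted)".
    link = None
    proc_entry = "/proc/self/fd/%d" % held.fileno()
    if os.path.islink(proc_entry):
        link = os.readlink(proc_entry)

    old_view = held.read()                # still the complete old version
    held.close()
    with open(target) as f:
        new_view = f.read()               # a fresh open sees the new one

    shutil.rmtree(workdir)
    return old_view, new_view, link


if __name__ == "__main__":
    old_view, new_view, link = upgrade_while_open()
    print("held descriptor reads %r (%s)" % (old_view, link))
    print("fresh open reads      %r" % new_view)
```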
> We also have cases where many users modify files on the lower branches
> directly, and expect those modifications to be immediately visible through
> the union, even to processes with open files. In the last few months we've
> made a lot of progress on supporting these kinds of features, and ensuring a
> uniform semantics when it comes to branch management and modifying lower
> branches (all falling under the general umbrella of "cache coherency").
>
That's cool! Still, I think these changes should only affect processes
which opened the leftmost file being modified.
As a side-note, I think users modifying files on lower branches directly
are taking a risk since they're relying on implementation details of
unionfs rather than its design, unless unionfs officially supports it and
guarantees consistent behavior (which requires full cache coherency).
This is no different from opening /dev/sda1 while it's mounted as ext3 and
manipulating the filesystem directly. It's possible but one should really
know what he/she is doing. :)
Users who use this feature are, by definition, aware of unionfs, so they
can adjust. Unaware programs (such as general software updates) can't
adjust so they rely on known semantics.
> There are also more complicated cases involving copyup. Suppose you have
> this situation:
>
> 1. you have a union with branch /a which has a file /a/f. branch /a is
> writable.
>
> 2. you open the file 'f' for writing through the union
>
> 3. now you insert a new leftmost branch /b which also has a file /b/f, and
> at the same time, you mark branch /a as readonly.
>
> 4. now you try to read file 'f' through the union: according to your
> suggestion, unionfs should return the older content from /a/f. Currently
> unionfs will return the newer content from /b/f.
>
No, I think the distinction is still clear:
If the read() is performed on a pre-branch file descriptor, it should get
the old content. Otherwise it should get the new content. That's how it
works with deleted/replaced files on posix systems. I'll demonstrate:
-------------------------------------------------------------------------
#!/usr/bin/env python
import os
pre=file("/tmp/file1","w+")
os.unlink("/tmp/file1")
post=file("/tmp/file1","w+")
# Write to the deleted file.
pre.write("pre")
# Write to the new file.
post.write("post")
pre.seek(0)
post.seek(0)
# Here's how the system sees it
os.system("ls -l /proc/self/fd/")
# Print the inode and content of both files.
print "%d: %s" % (os.fstat(pre.fileno()).st_ino, pre.read())
print "%d: %s" % (os.fstat(post.fileno()).st_ino, post.read())
os.unlink("/tmp/file1")
-------------------------------------------------------------------------
Output:
-------------------------------------------------------------------------
total 0
lrwx------ 1 root root 64 2007-07-10 01:15 0 -> /dev/pts/3
lrwx------ 1 root root 64 2007-07-10 01:15 1 -> /dev/pts/3
lrwx------ 1 root root 64 2007-07-10 01:15 2 -> /dev/pts/3
lrwx------ 1 root root 64 2007-07-10 01:15 3 -> /tmp/file1 (deleted)
lrwx------ 1 root root 64 2007-07-10 01:15 4 -> /tmp/file1
lr-x------ 1 root root 64 2007-07-10 01:15 5 -> /proc/20110/fd
826408: pre
826410: post
-------------------------------------------------------------------------
As the output shows, the semantics are preserved. As long as the files
have separate inodes, they're separate files, regardless of their current
pathname or state.
If I modify this script and replace the unlink with a unionctl/remount to
add a branch that changes /tmp/file1, I'd expect the script to still have
similar output because the operation is the same from a user's
perspective.
> 5. now suppose you try to write to file 'f' through the union. Since branch
> /a was marked as readonly, unionfs shouldn't write to that file. So that
> should result in a copyup of /a/f onto /b/f, right? But even that's not
> so clear: you have a 3-way merge problem now. You have the original file
> content /a/f, the newer content /b/f, and the also new content of /a/f
> which you're trying to write into. So which of the two "newer" ones
> should we use and when? It's confusing for sure, and it may depend on
> the specific needs of users at that time. Should we have allowed the
> read of 'f' to go to the old content when a new one was inserted? Should
> we have copied-up the old content of 'f' from /a/f to /b/f on the first
> read? On the first write?
>
Again, posix semantics come to the rescue. The operation where /a is
marked as readonly (which starts the whole confusion) is essentially
equivalent to "mount -o remount,ro /a". The system will simply fail this
operation, with mount(2) returning -EBUSY. You can't safely change a
filesystem to readonly if there are open descriptors to it in write mode.
That's what users have come to expect from mount(2) on other filesystems,
so they shouldn't be surprised when they encounter similar results upon a
remount of a unionfs.
> Folks, I don't mind changing Unionfs's semantics if that's what most users
> want. But I'd like to hear that this is indeed what users want. I don't
> want to make a change that will break the desired behavior for existing
> users who may depend on that behavior. If the issue is confined to having
> two different ways in which unionfs might handle *open* files, then which
> one of them should we pick: stick with the old content or go for the new?
> Or maybe it should be a user option? Ideally, I'd like to avoid having many
> different modes of operations to choose from at mount time.
>
Personally, I don't think it should be an option, since the immediate-change
behavior seems too volatile to be depended upon. Users are used to
posix semantics since they depend on them everywhere else, so they
shouldn't expect different behavior when using unionfs, at least where
posix behavior applies (and as I demonstrated above, posix behavior
applies in this particular case).
As for checking what most users indeed want, that would be hard for a
project as mature as unionfs. (I've been using unionfs for ages,
including on my mobile phone :) The reason is that most unionfs users
don't know that they're unionfs users. They use systems like Knoppix and
expect them to "just work". I think the best way to accommodate the needs
of these users (which are now the majority of unionfs users) is to stick
to the semantics they're used to.
Of course, I have a limited say in this, since I'm merely a user and not
a unionfs developer. I hope you'll agree with my reasoning on what's best
for the majority of users (who don't even know they're unionfs users).
> Cheers,
> Erez.
>
Cheers,
Yoav