[Unionfs] Strange behavior when adding a branch

Yoav Weiss union342 at unpatched.net
Mon Jul 9 18:55:01 EDT 2007


On Sun, 8 Jul 2007, Erez Zadok wrote:

[...]
>
> BTW, from your script, I see you're using unionctl and probably the older
> unionfs 1.x; branch management has changed significantly in 2.0 (for the
> better).  What I describe below is for unionfs 2.x, which may behave
> differently for you.  (Can you try unionfs2 please ?)
>

Unfortunately, unionfs 2.0 crashes in my test environment, and it happens 
too early to debug.  I need to set up a non-live-CD environment with 
unionfs 2.0 so I can debug it.  (I built a debian-live image based on 2.0, 
and the kernel crashes from the context of unionfs before I have a working 
rootfs, so it's tricky to debug: I have no filesystem to save the dmesg 
output to.)  I'll try to debug this crash later.

In any case, 2.0 doesn't fix the situation, according to 
Documentation/filesystems/unionfs/concepts.txt:

"...a process that continually re-reads the same file's data will see the 
NEW data as soon as the lower file had changed, upon the next read(2) 
syscall (even if the file is still open!)..."

So I understand this is actually the intended behavior in 2.0?

> Yoav/Shaya, it depends what users really want.
>

As Shaya Potter and Jefferson Ogata already noted, the semantics I 
suggested would be the natural behavior in a unix environment.  In any 
case, the current behavior, where files are replaced in the middle of 
reading without the process being aware of it, is risky and never safe to 
rely on.

> First, understand that there's little precedent to what Unionfs does.  In
> traditional posix OSs, a file or directory has a clear, unmistakable,
> single, unique location -- a one-to-one mapping of files to their locations.

Not necessarily.  The one-to-one mapping is with the inode rather than the 
filename.  Once a file descriptor has been associated with a file, it 
stays associated with the inode the file had at open time (rather than 
with the pathname).  The canonical representation of a file is the inode 
rather than a pathname.  If the file is unlinked, replaced by a new file 
of the same name, covered by another mount, or lazy-umounted, the pathname 
ends up referring to a different inode (or to none), and the existing 
descriptor keeps pointing to the old file.  In the case of unlink, the 
space is not freed until the process closes the file descriptor, because 
the inode has a non-zero refcount.

I'll demonstrate by another script:

--------------------------------------------------------------------------
#!/usr/bin/env python3

import os

# Prepare an environment with two same-named files.
os.system("rm -rf /tmp/uniontest")
os.mkdir("/tmp/uniontest"); os.chdir("/tmp/uniontest")
os.mkdir("dir1"); os.mkdir("dir2"); os.mkdir("dir3"); os.mkdir("mnt")
with open("dir1/file", "wb") as out:
    out.write(b"pre-branch\n" + b"\r"*4096)
with open("dir2/file", "wb") as out:
    out.write(b"\r"*4096 + b"post-branch\n")

# Mount the filesystem (needs root).
os.system("mount --bind ./dir1 ./mnt")

# Get a descriptor and read the first page.  The file is opened unbuffered
# so that each read() is a real read(2), not a hit in a userspace buffer.
f = open("mnt/file", "rb", buffering=0)
print(f.read(4096).decode())

# Cover the mount by a new version of the filesystem.
os.system("mount --bind ./dir2 ./mnt")
print(f.read(4096).decode())

# Point proven.  Cleanup.
f.close()
os.system("umount ./mnt")
os.system("umount ./mnt")
--------------------------------------------------------------------------

The results of this script show the standard behavior on a posix system. 
I think the script I pasted in my first mail should produce the same 
results.

> With Unionfs, however, you've got yourself a fan-out situation -- a
> one-to-MANY mapping of files to their locations.  Any time you have "MANY"
> to choose from, you have to answer the question "which one do I pick"?
>

I think the situation is quite similar to the above, so the path has 
already been laid for us.  We should pick whatever preserves the 
one-to-one mapping of inodes, and make sure a changed file always results 
in a changed inode.
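
To illustrate what "a changed file results in a changed inode" gives a 
program, here is a minimal sketch.  It assumes plain files in a temporary 
directory (no unionfs involved), and the helper name path_was_replaced is 
made up for the example:

```python
import os
import tempfile

def path_was_replaced(f, path):
    """Return True if `path` no longer names the file that `f` has open.

    Hypothetical helper: compares (device, inode) of the open descriptor
    with (device, inode) of whatever the pathname resolves to right now.
    """
    held = os.fstat(f.fileno())
    try:
        current = os.stat(path)
    except FileNotFoundError:
        return True  # the name is gone; only the descriptor is left
    return (held.st_dev, held.st_ino) != (current.st_dev, current.st_ino)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "file")
with open(path, "w") as out:
    out.write("old\n")

f = open(path)
before = path_was_replaced(f, path)

# Replace the file by creating a new inode and renaming it over the name.
with open(path + ".new", "w") as out:
    out.write("new\n")
os.replace(path + ".new", path)

after = path_was_replaced(f, path)
old_content = f.read()   # still served from the inode that was opened
f.close()
print(before, after, repr(old_content))
```

The descriptor keeps returning the old content, and a process that cares 
can find out that the name has moved on by comparing inodes.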

> Worse, branch management is a nasty action to take, no matter what.  Again,
> there's no precedent to having a file system whose content changes mid way.
> Inserting/removing branches is like replacing one drive of a RAID5 array and
> expecting the array to consistently reconfigure itself and present a
> coherent new view of files and directories.
>

It's more like mounting another drive over the same mountpoint, like the 
script above does.  mount(2) takes care of it and produces consistent 
results.

> If you remember throughout the history of unionfs 1.x and now 2.x, we've had
> various modes of operations (e.g., "delete all" and "delete first"), but we
> deprecated modes of operation which didn't appear to be used by users too
> much.  We wanted to keep the code as simple and functional as possible.
> Simple, clear, well understood semantics are very important for code
> maintainability (and user sanity :-)
>

I definitely agree with you and appreciate it as a user as well as a 
developer.  I'm just saying that the default (and only) mode should be the 
one that preserves existing descriptors, because that's the expected 
behavior in a posix environment.

> The situation you mention above, I believe, is one in which you insert a new
> top-level branch, which contains files with the same name, but different
> content.  And you have files with the same name open on the older branches.
>

Correct.

> Right now the semantics are simple: a file's content is going to point to
> the left-most version of that file (once you go through the file/dentry
> revalidation upon entry into any Unionfs method).  So for files, we re-open
> the file on the leftmost lower branch that it can be found.  For
> directories, we open all of the lower directories with the same name, and
> merge them left-to-right.
>
> Now, if I recall, we've had cases where users inserted a new leftmost branch
> and they *wanted* to immediately see the new content: that was the main
> reason they inserted the new leftmost branch (e.g., for software package
> versioning).
>

Sounds risky.  If a process is in the middle of reading a file and the 
file gets replaced without notice, there's no way for the process to be 
sure whether it's reading the new file.  The process must be aware of the 
oncoming change and seek to a known location in the file before it's 
replaced.  Even then, the process may be getting some data from the old 
file due to caching.
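
As an illustration of the risk, here is a sketch that fakes the switch 
with an in-place overwrite of a single file.  That is an assumption on my 
part: the overwrite only stands in for the descriptor being re-pointed at 
new content, it doesn't use unionfs at all, and the file layout is made up:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "config")

# Both versions have the same layout: a 10-byte header, an 11-byte body.
old = b"version=1\n" + b"body-for-1\n"
new = b"version=2\n" + b"body-for-2\n"
with open(path, "wb") as out:
    out.write(old)

# Unbuffered, so each read() is a real read(2) at the current offset.
reader = open(path, "rb", buffering=0)
header = reader.read(10)

# The content changes underneath the open descriptor, mid-read.
with open(path, "r+b") as out:
    out.write(new)

body = reader.read(11)
reader.close()
print(header + body)
```

The reader ends up with the header of version 1 and the body of version 2, 
a combination that never existed on disk as a whole file.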

Software package versioning is precisely why I'm worried.  Software 
updates usually require atomic replacement.  Unless the software was 
written specifically to deal with non-atomic changes, its behavior is 
undefined when some of its files get replaced during execution.  Note how 
software distribution packages work.  For example, when installing a 
replacement package for some daemon, the files are replaced, and the 
service is restarted in order to start using the new files, after they 
were _all_ replaced.  Until it's restarted, the daemon keeps working with 
the old files despite the fact that they were replaced on the filesystem.
In Linux, if you look in /proc/<pid>/fd/ of a daemon in that state, you 
see references to files marked "(deleted)" but the daemon keeps working. 
That's probably the only safe way to upgrade unaware software.
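
A rough sketch of that upgrade pattern, with a dictionary of open files 
standing in for the daemon.  The install() and start() helpers and the 
file names are invented for the example; real package managers do much 
more than this:

```python
import os
import tempfile

def install(directory, files):
    """Install each file by writing a temporary and renaming it into place."""
    for name, content in files.items():
        tmp = os.path.join(directory, name + ".tmp")
        with open(tmp, "w") as out:
            out.write(content)
        os.replace(tmp, os.path.join(directory, name))  # atomic per file

def start(directory, names):
    """Stand-in for a daemon start: open every file it depends on."""
    return {name: open(os.path.join(directory, name)) for name in names}

prefix = tempfile.mkdtemp()
names = ["code", "data"]
install(prefix, {"code": "code v1\n", "data": "data v1\n"})
daemon = start(prefix, names)

# Upgrade the package while the daemon is running.
install(prefix, {"code": "code v2\n", "data": "data v2\n"})

# The running daemon still sees a consistent v1 set through its descriptors.
running = {name: f.read() for name, f in daemon.items()}
for f in daemon.values():
    f.close()

# Only a restart (re-opening by name) picks up v2, and picks up all of it.
daemon = start(prefix, names)
restarted = {name: f.read() for name, f in daemon.items()}
for f in daemon.values():
    f.close()
print(running, restarted)
```

At no point does the running "daemon" see a mix of v1 and v2 files, which 
is the property an immediate switch of open files would take away.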

> We also have cases where many users modify files on the lower branches
> directly, and expect those modifications to be immediately visible through
> the union, even to processes with open files.  In the last few months we've
> made a lot of progress on supporting these kinds of features, and ensuring a
> uniform semantics when it comes to branch management and modifying lower
> branches (all falling under the general umbrella of "cache coherency").
>

That's cool!  Still, I think these changes should only affect processes 
which opened the leftmost file being modified.

As a side note, I think users modifying files on lower branches directly 
are taking a risk, since they're relying on implementation details of 
unionfs rather than its design, unless unionfs officially supports it and 
guarantees consistent behavior (which requires full cache coherency). 
This is no different from opening /dev/sda1 while it's mounted as ext3 and 
manipulating the filesystem directly.  It's possible, but one should 
really know what he/she is doing.  :)

Users who use this feature are, by definition, aware of unionfs, so they 
can adjust.  Unaware programs (such as general software updates) can't 
adjust so they rely on known semantics.

> There are also more complicated cases involving copyup.  Suppose you have
> this situation:
>
> 1. you have a union with branch /a which has a file /a/f.  branch /a is
>   writable.
>
> 2. you open the file 'f' for writing through the union
>
> 3. now you insert a new leftmost branch /b which also has a file /b/f, and
>   at the same time, you mark branch /a as readonly.
>
> 4. now you try to read file 'f' through the union: according to your
>   suggestion, unionfs should return the older content from /a/f.  Currently
>   unionfs will return the newer content from /b/f.
>

No, I think the distinction is still clear:
If the read() is performed on a pre-branch file descriptor, it should get 
the old content.  Otherwise it should get the new content.  That's how it 
works with deleted/replaced files on posix systems.  I'll demonstrate:

-------------------------------------------------------------------------
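#!/usr/bin/env python3

import os

pre = open("/tmp/file1", "w+")
os.unlink("/tmp/file1")
post = open("/tmp/file1", "w+")

# Python 3 opens descriptors close-on-exec; make these two inheritable so
# the "ls" child below still has them open.
os.set_inheritable(pre.fileno(), True)
os.set_inheritable(post.fileno(), True)

# Write to the deleted file.
pre.write("pre")
# Write to the new file.
post.write("post")

# Seeking flushes the buffered writes and rewinds for the reads below.
pre.seek(0)
post.seek(0)

# Here's how the system sees it
os.system("ls -l /proc/self/fd/")

# Print the inode and content of both files.
print("%d: %s" % (os.fstat(pre.fileno()).st_ino, pre.read()))
print("%d: %s" % (os.fstat(post.fileno()).st_ino, post.read()))

os.unlink("/tmp/file1")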
-------------------------------------------------------------------------

Output:
-------------------------------------------------------------------------
total 0
lrwx------ 1 root root 64 2007-07-10 01:15 0 -> /dev/pts/3
lrwx------ 1 root root 64 2007-07-10 01:15 1 -> /dev/pts/3
lrwx------ 1 root root 64 2007-07-10 01:15 2 -> /dev/pts/3
lrwx------ 1 root root 64 2007-07-10 01:15 3 -> /tmp/file1 (deleted)
lrwx------ 1 root root 64 2007-07-10 01:15 4 -> /tmp/file1
lr-x------ 1 root root 64 2007-07-10 01:15 5 -> /proc/20110/fd
826408: pre
826410: post
-------------------------------------------------------------------------

As the output shows, the semantics are preserved.  As long as the files 
have separate inodes, they're separate files, regardless of their current 
pathname or state.

If I modify this script and replace the unlink with a unionctl/remount to 
add a branch that changes /tmp/file1, I'd expect the script to still have 
similar output because the operation is the same from a user's 
perspective.

> 5. now suppose you try to write to file 'f' through the union.  Since branch
>   /a was marked as readonly, unionfs shouldn't write to that file.  So that
>   should result in a copyup of /a/f onto /b/f, right?  But even that's not
>   so clear: you have a 3-way merge problem now.  You have the original file
>   content /a/f, the newer content /b/f, and the also new content of /a/f
>   which you're trying to write into.  So which of the two "newer" ones
>   should we use and when?  It's confusing for sure, and it may depend on
>   the specific needs of users at that time.  Should we have allowed the
>   read of 'f' to go to the old content when a new one was inserted?  Should
>   we have copied-up the old content of 'f' from /a/f to /b/f on the first
>   read?  On the first write?
>

Again, posix semantics come to the rescue.  The operation where /a is 
marked as readonly (which starts the whole confusion) is essentially 
equivalent to "mount -o remount,ro /a".  The system will simply refuse 
this operation: mount(2) fails with EBUSY.  You can't safely change a 
filesystem to readonly if there are open descriptors to it in write mode.
That's what users have come to expect from mount(2) on other filesystems, 
so they shouldn't be surprised when they encounter similar results upon a 
remount of a unionfs.
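
For what it's worth, a user-space approximation of that "are there any 
writers?" check can be sketched on Linux by walking /proc.  This is only a 
sketch under assumptions: the function name writable_fds_under is made up, 
it looks at a single process, and the kernel's own test at remount time is 
not implemented this way:

```python
import os
import tempfile

def writable_fds_under(directory, pid="self"):
    """List (fd, path) pairs that `pid` holds open for writing under `directory`.

    Linux-only: reads /proc/<pid>/fd and /proc/<pid>/fdinfo.
    """
    directory = os.path.realpath(directory)
    found = []
    for fd in os.listdir("/proc/%s/fd" % pid):
        try:
            target = os.readlink("/proc/%s/fd/%s" % (pid, fd))
            with open("/proc/%s/fdinfo/%s" % (pid, fd)) as info:
                fields = dict(line.split(":", 1) for line in info if ":" in line)
        except OSError:
            continue  # the descriptor went away while we were looking
        flags = int(fields["flags"], 8)  # fdinfo prints the open flags in octal
        if flags & os.O_ACCMODE == os.O_RDONLY:
            continue  # read-only descriptors don't block going readonly
        if target.startswith(directory + os.sep):
            found.append((int(fd), target))
    return found

workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "r"), "w") as out:
    out.write("data\n")

reader = open(os.path.join(workdir, "r"))        # read-only descriptor
writer = open(os.path.join(workdir, "w"), "w")   # write-mode descriptor

busy = [path for _, path in writable_fds_under(workdir)]
writer.close()
idle = writable_fds_under(workdir)
reader.close()
print(busy, idle)
```

Only the write-mode descriptor is reported; once it is closed, nothing is 
left that would stand in the way of going readonly.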

> Folks, I don't mind changing Unionfs's semantics if that's what most users
> want.  But I'd like to hear that this is indeed what users want.  I don't
> want to make a change that will break the desired behavior for existing
> users who may depend on that behavior.  If the issue is confined to having
> two different ways in which unionfs might handle *open* files, then which
> one of them should we pick: stick with the old content or go for the new?
> Or maybe it should be a user option?  Ideally, I'd like to avoid having many
> different modes of operations to choose from at mount time.
>

Personally, I don't think it should be an option, since the immediate 
change behavior seems too volatile to be depended upon.  Users are used to 
posix semantics since they depend on them everywhere else, so they 
shouldn't expect different behavior when using unionfs, at least where 
posix behavior applies (and as I demonstrated above, posix behavior 
applies in this particular case).

As for checking what most users indeed want, that would be hard for a 
project as mature as unionfs.  (I've been using unionfs for ages, 
including on my mobile phone :)  The reason is that most unionfs users 
don't know that they're unionfs users.  They use systems like Knoppix and 
expect them to "just work".  I think the best way to accommodate the needs 
of these users (which are now the majority of unionfs users) is to stick 
to the semantics they're used to.

Of course, I have a limited say in this, since I'm merely a user and not 
a unionfs developer.  I hope you'll agree with my reasoning on what's best 
for the majority of users (who don't even know they're unionfs users).

> Cheers,
> Erez.
>

Cheers,
Yoav

