[Unionfs] Anybody looking at NFS exporting a unionfs ?

Josef Sipek jsipek at fsl.cs.sunysb.edu
Thu Sep 6 12:01:11 EDT 2007


On Thu, Sep 06, 2007 at 10:06:32AM -0400, Jesse I Pollard wrote:
> Josef Sipek wrote:
>> On Wed, Sep 05, 2007 at 01:03:43PM -0400, Jesse I Pollard wrote:
>>   
>>> David P. Quigley wrote:
>>>     
>>>> It is worth noting that this might not be a trivial implementation. In
>>>> the past to ensure that this functionality was correct we needed some
>>>> sort of persistent inode store. This may not be true anymore but if it
>>>> is then it isn't as simple as implementing 3 functions.
>>>>         
>>> That was why I was trying the 2.6.20-rc6-odf1 release. It uses an
>>> auxiliary filesystem to generate and track only the inode numbers, but
>>> the export capability had already disappeared. There ARE comments
>>> embedded in unionfs 2.1.2 that refer to the odf about NFS support.
>>>
>>> I would definitely be interested.
>>>     
>>
>> ODF was an experimental branch to see if the concept makes sense. ODF will
>> come back in the next few months :)
>>
>>   
>>> One alternative to the odf would be to add the largest possible inode
>>> number from the top level FS + 1 to an inode number from the next level
>>> down. This would introduce an "inode offset base" value to the table of
>>> branches - used to add to/subtract from the inode number, but would
>>> guarantee unique inodes as far as unionfs was concerned. Might add other
>>> numbers (maximum number of inodes in the branch fs) to make it easier to
>>> recalculate offsets during remount.  It also wouldn't easily work for
>>> writable branches that can dynamically add inode space.
>>>     
>>
>> I'm afraid that will not work. In the kernel, all the inode numbers are
>> 64 bits long, and there's no way for anyone to know the range of valid
>> inode numbers from a fs (except the fs itself). Given this fact, each
>> lower fs can give us an inode number in the range { 0 .. (2^64)-1 }. To
>> uniquely identify a file, unionfs needs a <branch index, lower inum> tuple
>> (~70 bits). Ideally, we would be able to take this info and shove it into
>> our own inode number, but we'd have to somehow map ~70 bits into 64 bits
>> of our own inum.
>>
>>   
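The bit-count argument above can be checked with a few lines of arithmetic.
This is only a sketch: the limit of 64 branches is an assumption, picked
because it matches the "~70 bits" estimate (6 bits of branch index plus 64
bits of lower inode number). The mail does not state the real limit, and
bits_needed() and pack() are hypothetical helper names.

```python
# Hypothetical illustration of the <branch index, lower inum> size argument.
LOWER_INUM_BITS = 64   # kernel inode numbers are 64 bits, as stated above
MAX_BRANCHES = 64      # assumption: chosen only to match the "~70 bits" estimate


def bits_needed(max_branches, lower_inum_bits=LOWER_INUM_BITS):
    """Bits required to store every possible <branch index, lower inum> pair."""
    branch_bits = (max_branches - 1).bit_length()
    return branch_bits + lower_inum_bits


def pack(branch, lower_inum):
    """Lossless packing of the tuple; the result may not fit in 64 bits."""
    return (branch << LOWER_INUM_BITS) | lower_inum


print(bits_needed(MAX_BRANCHES))    # total width of the packed tuple
print(pack(1, 0) >= 2**64)          # any branch other than 0 overflows 64 bits
```

Any lossless packing needs the full width, so fitting the tuple into a 64-bit
inum requires either a lookup table (the ODF approach) or extra knowledge
about which lower inode numbers can actually occur.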
> True, the numerical limit is 64 bits. I was thinking of the actual usage
> by the fs. I believe this information is available (like a kernel mode
> statfs, field f_files), giving the total file nodes in the fs.

This would work for filesystems that allocate inodes at mkfs time (as you
mention later on yourself).
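For the fixed-inode-table case, the "inode offset base" proposal can be
sketched as follows. This is not unionfs code. The function names are
hypothetical, and the sketch assumes each branch can report a trustworthy
maximum lower inode number (for example, derived from statfs f_files), which
is exactly the assumption under discussion.

```python
from bisect import bisect_right


def branch_bases(max_inums):
    """Return the offset base of each branch, top branch first.

    max_inums[i] is the largest inode number branch i is assumed to use.
    """
    bases, total = [], 0
    for max_inum in max_inums:
        bases.append(total)
        total += max_inum + 1          # "largest possible inode number + 1"
    if total > 2**64:
        # The combined range no longer fits in a 64-bit inode number.
        raise OverflowError("combined inode space exceeds 64 bits")
    return bases


def to_union_inum(bases, branch, lower_inum):
    """Map a <branch index, lower inum> pair to a single union inode number."""
    return bases[branch] + lower_inum


def from_union_inum(bases, union_inum):
    """Recover the <branch index, lower inum> pair from a union inode number."""
    branch = bisect_right(bases, union_inum) - 1
    return branch, union_inum - bases[branch]


bases = branch_bases([999, 499])       # two branches with made-up limits
print(to_union_inum(bases, 1, 7))
print(from_union_inum(bases, 1007))
```

The mapping is only reversible while every lower inode number stays at or
below the maximum assumed for its branch.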

> This would be a physical limitation (except for those that can dynamically
> expand the inode space), and could be mapped. A method of handling the
> dynamic situation would be to increase the total by 10 or 20% (mount
> option?); this would then be added to the base of the higher level branch,
> and hence become the base for the next lower level.

What do you mean by increasing the total? The total of currently allocated
inodes? If so, then that could still break on fs like XFS. XFS's inode
numbers are (indirectly) a function of the block number and offset into that
block. The entire XFS volume is divided into allocation groups (similar to
ext's block groups) and that's why you'll see jumps in the inode numbers.
For example:

root@batlh:/mnt# mkfs.xfs /dev/loop3
meta-data=/dev/loop3             isize=256    agcount=8, agsize=16000 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=128000, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=1200, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
root@batlh:/mnt# mount /dev/loop3 foo/
root@batlh:/mnt# mkdir foo/a foo/b
root@batlh:/mnt# ls -li foo/
total 0
   131 drwxr-xr-x 2 root root 6 2007-09-06 11:30 a
262272 drwxr-xr-x 2 root root 6 2007-09-06 11:30 b

The 'a' directory got onto the first allocation group, while 'b' got onto
the second.
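The two inode numbers can be decoded by hand. The sketch below assumes the
usual XFS layout, in which an inode number is the allocation group number,
the block within that group, and the inode slot within that block,
concatenated as bit fields. The field widths are derived from the mkfs.xfs
output above; decode() is a hypothetical helper, not an XFS tool.

```python
# Geometry taken from the mkfs.xfs output above.
AG_BLOCKS = 16000     # agsize, in filesystem blocks
BLOCK_SIZE = 4096     # bsize
INODE_SIZE = 256      # isize

# Field widths (assumed layout: | AG number | block in AG | inode in block |).
INODE_BITS = (BLOCK_SIZE // INODE_SIZE - 1).bit_length()   # 16 inodes per block
BLOCK_BITS = (AG_BLOCKS - 1).bit_length()                  # agsize rounded up to 2^n


def decode(ino):
    """Split an inode number into (AG number, block in AG, inode in block)."""
    slot = ino & ((1 << INODE_BITS) - 1)
    block = (ino >> INODE_BITS) & ((1 << BLOCK_BITS) - 1)
    ag = ino >> (INODE_BITS + BLOCK_BITS)
    return ag, block, slot


for name, ino in (("a", 131), ("b", 262272)):
    print(name, decode(ino))
```

Under these assumptions 'a' decodes to allocation group 0 and 'b' to
allocation group 1, both in block 8 of their group, which matches the jump
seen in the listing.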

root@batlh:/mnt# df -i foo/
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/loop3            512000       5  511995    1% /mnt/foo

statfs doesn't even give real numbers as far as I can see...

root@batlh:/mnt# xfs_db -r /dev/loop3
xfs_db> sb 0
xfs_db> print icount ifree
icount = 128
ifree = 123

Only 128 inodes allocated. :)
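To see how a statfs-based budget could break here, consider the sketch below.
It uses the numbers from the output above. The 20% slack, the 18-bit shift
per allocation group (assumed from this volume's geometry), and the idea that
an inode could later land in the last allocation group are all assumptions
made for illustration.

```python
# Values from the df -i and mkfs.xfs output above.
F_FILES = 512000      # "Inodes" column, i.e. what statfs reports
AG_COUNT = 8          # agcount

# Hypothetical parameters for the offset-base scheme.
SLACK_PERCENT = 20    # the "10 or 20%" headroom suggested in the thread
AG_SHIFT = 18         # assumption: bits below the AG number for this geometry

# Span of union inode numbers that would be reserved for this branch.
budget = F_FILES * (100 + SLACK_PERCENT) // 100

# Lowest inode number that could appear in the last allocation group.
first_ino_last_ag = (AG_COUNT - 1) << AG_SHIFT


def fits(lower_inum, span=budget):
    """True if the lower inode number stays inside the reserved span."""
    return lower_inum < span


print(budget)
print(fits(262272))             # directory 'b' from the listing
print(fits(first_ino_last_ag))  # an inode in the last allocation group
```

An inode number outside the reserved span would collide with the range handed
to the next branch, even though the volume has only 128 inodes allocated.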

...
> Also - if an overflow of the 64 bit inode space CAN/DOES occur, it would
> also be reasonable to disallow NFS exports for that specific mount. If it
> occurred during a remount to add a branch, it would be reasonable to
> disallow the remount. Of course, a suitable error/warning message
> describing the problem would have to be provided in both situations.

Voluntary self-mutilation...interesting :)

> The only advantage this technique has over the ODF is possibly speed,
> since the ODF method requires a cache/disk lookup of the inode to identify
> the branch/inode pair. The ODF also requires a file system that has the
> same consideration of inode space as the above mapping.  ODF is a more
> compact mapping though.

Right. That's why ideally, ODF would be its own on-disk format so that we
don't have to call through the VFS.

> I'm not currently aware of any single filesystem that has even 4 billion
> inodes anyway. The largest I have encountered was around 20-25 million
> used (a Solaris SAMFS site some years ago); and I have access to a raid
> with 91 million allocated (but only 778,000 used).

I use XFS on my laptop for / (77GB volume), and the highest inode number I
have is 523686943.

>
>> Beware, I've spent so long pondering this that I might be so fixated
>> on the <branch index, lower inum> concept that I don't see something
>> more creative.
>>   
> My thoughts are more operationally aimed than theoretical. Theory would
> focus more on guaranteeing inode availability; operationally, though, it
> doesn't seem to be necessary given the 64-bit range available.

I am just worried about "sparse" inode numbers like XFS provides. XFS is
very real, and very non-theoretical. I'm sure there are others that do
similar things.

Josef 'Jeff' Sipek.

-- 
I think there is a world market for maybe five computers.
		- Thomas Watson, chairman of IBM, 1943.

