
Neuralyte.FuseJshfs r1.11 - 13 Dec 2010 - 20:46 - TWikiGuest


Fuse-J-shfs

Fuse-J-shfs lets you easily implement a virtual filesystem in Unix shellscript.

And naturally, it already has some handy vfs implementations you can use straight away: gzip, rar, sparse, ...

To see how easy it can be to implement your own vfs with shfs_ez, take a look at this example ezfs config.

But be aware this software is not stable! Some of the filesystems do work as they should, depending on what you are doing. Read-only access or writing sequentially should work OK, but there are unresolved issues regarding concurrent writes. See "Status of project" and "Serious Issues" below.

Download

  • Dependencies: Some bits of the code make calls to shellscript functions which are not included in the tarballs.
  • So for FuseJshfs to work, you will need to install and run Jsh before you mount your vfs.
  • TODO: I should extract the shellscripts used from Jsh and distribute them with the tarball.
  • (For some uses you may prefer an earlier version. I think around versions 0.9 and 19 I made some major changes which may have made simple things less efficient (or even broken!), but more powerful things possible. Sorry I can't be more specific.)

Example shell filesystems

Virtual filesystem implementations already created for (Fuse-J-)shfs are (all in shellscript!):

  • gzipfs: mounts a tree of gzipped files as a tree of files.
    • You trade speed for disk-space, if that is what you want.

  • rarfs: mounts a rar archive as a filesystem.
    • Those rars sure are easy to manipulate. :) (But some writes generate a huge (~rarsize) tempfile :/ )

  • ezfs: makes it /really/ trivial to implement a new filesystem
    • To implement an ezfs filesystem, you just need to provide three shell functions:
      • list_files_in_dir()
      • list_subdirs_in_dir()
      • get_contents_of_file()
    • If file (pre)retrieval is costly, and you are able to, you may prefer to implement:
      • list_files_in_dir_with_sizes()
    • or to really treat your users, you could implement:
      • list_files_in_dir_with_dates_and_sizes()
    • Existing implementations of ezfs filesystems:
      • loopback - simplest example
      • httpfs - provided your webserver presents directories similarly to Apache
      • svnfs - mounts every revision in your repository!
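
A minimal sketch of what an ezfs config with the three required functions might look like. The function names come from the list above; everything else is an assumption made for illustration: that each function receives a path relative to the mount root as $1, that it answers on stdout, and that a variable called BACKING_DIR names the directory being presented. Check the real example config for the actual calling convention.

```shell
#!/bin/sh
# Hypothetical ezfs-style config: a read-only view of $BACKING_DIR.
# Assumption: each function gets a path relative to the mount root as $1
# and prints its answer on stdout. BACKING_DIR is an invented name.
BACKING_DIR="${BACKING_DIR:-/tmp}"

# Print the name of every regular file directly inside the directory.
# (Dotfiles are skipped because the glob does not match them.)
list_files_in_dir() {
    for entry in "$BACKING_DIR/$1"/*; do
        if [ -f "$entry" ]; then basename "$entry"; fi
    done
}

# Print the name of every subdirectory directly inside the directory.
list_subdirs_in_dir() {
    for entry in "$BACKING_DIR/$1"/*; do
        if [ -d "$entry" ]; then basename "$entry"; fi
    done
}

# Write the contents of one file to stdout.
get_contents_of_file() {
    cat "$BACKING_DIR/$1"
}
```

Under these assumptions a different filesystem only needs different bodies for the same three functions, e.g. fetching a listing over the network instead of globbing a local directory.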

  • loopfs: mounts a local directory looped back.
    • Not so practical, but a good example implementation, and useful for testing/debugging.

  • sparsefs: presents large files virtually but internally splits them up into small files for storage
    • What's the use in that?
      • Mainly, when used with gzipfs, it allows you to seek within the gzipped files, without having to decompress the whole file. :)
        • E.g., I have a huge VMWare FS file, large chunks of which are actually just blank (or extremely repetitive) data.
        • VMWare needs to see the file uncompressed, but if I could compress bits of it (the blank bits are never really accessed anyway), I would save a lot of hard-disk space.
        • So, a sparsefs layer can present that huge file, but actually split the file up into small files. And if I tell sparsefs to store each of the small files on another vfs, a gzipfs which compresses each file, then the blank data will no longer be taking up space on my disk!
        • I did manage to get "make svncviewer" working on this pair of VFSs (although I needed to implement optimdd yuk!), but VMWare soon ran into trouble with the "drive". I still think it's possible, I just need to track down the bugs...
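
The sparsefs-over-gzipfs idea can be illustrated outside of any vfs. The sketch below is not the sparsefs code: the function names, the numbered chunk naming and the CHUNK_SIZE variable are all invented here. It splits a file into fixed-size gzipped chunks, then reads a byte range by decompressing only the chunks that overlap it.

```shell
#!/bin/sh
# Illustrative only: names and chunk layout are assumptions, not sparsefs's.
CHUNK_SIZE="${CHUNK_SIZE:-1048576}"

# Split <file> into gzipped chunks named 0.gz, 1.gz, ... inside <chunkdir>.
store_chunks() {  # <file> <chunkdir>
    file=$1; dir=$2
    size=$(($(wc -c < "$file")))   # arithmetic strips any padding from wc
    mkdir -p "$dir" || return 1
    i=0
    while [ $((i * CHUNK_SIZE)) -lt "$size" ]; do
        dd if="$file" bs="$CHUNK_SIZE" skip="$i" count=1 2>/dev/null |
            gzip -c > "$dir/$i.gz"
        i=$((i + 1))
    done
}

# Print <length> bytes starting at <offset>, touching only the needed chunks.
read_byte_range() {  # <chunkdir> <offset> <length>
    dir=$1; offset=$2; length=$3
    first=$((offset / CHUNK_SIZE))
    last=$(((offset + length - 1) / CHUNK_SIZE))
    i=$first
    while [ "$i" -le "$last" ]; do
        gzip -dc "$dir/$i.gz"
        i=$((i + 1))
    done | tail -c +"$((offset - first * CHUNK_SIZE + 1))" | head -c "$length"
}
```

A chunk of blank data compresses to almost nothing, which is where the disk saving in the VMWare example would come from. Note that head -c is not in POSIX, although most systems provide it.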

Hierarchy

A little bit about how it works:

  • fuse is a module for your kernel which lets users mount their own types of virtual filesystem in userland. :)

  • Fuse-J is a project which passes off fuse filesystem requests to any Fuse-J Filesystem implemented in Java.

  • Fuse-J-shfs is one such Filesystem implementation, which in turn passes off the queries to one of the shell filesystem implementations.

  • Each shfs implementation is written entirely in shellscript, and could handle the requests by looking on the local filesystem, checking in an archive, performing a network request, or whatever you like...
    • In the case of one shfs implementation, shfs_ez, it might pass the query down through yet another layer of simplification, to the ezfs_config file.

Status of project

  • Reached a plateau with make now working on rar's and the first implementation of shfs_ez. Released version 1.0001. :D
  • But there are still things we could improve...
  • 4/Oct/05: We have http_vfs and svn_vfs. :)

Serious Issues

  • These only affect writeable filesystems.
  • Cachefile date resolution:
    • In gzip-fs at least: When we check a cachefile out, we give it the same date as the underlying gzipped file. We can check whether it needs committing or checking out again by comparing the dates.
    • The problem with this is that file dates are only accurate to 1 second, and some operations occur more quickly than this.
    • To catch these cases, I set the code to commit the cachefile if it has the same date as the gzipped file, but this causes it to be committed on every release. :/ Maybe without enough precision on dates, it would be better if Java determined whether a cachefile had been written to, and should be committed. :/
    • [ Hmm maybe I could push the existing system a little further, by remembering (or being reminded by Java) whether the release is from an open-RO or an open-RW. ]
  • Synchronization, concurrent access, locking.
    • Sometimes more than one process may be accessing a file.
    • This might invoke multiple shells to handle the requests.
    • But some of the operations performed at shell-level might not work as expected if they are running concurrently.
    • Too strong a locking scheme would break fifos, and gcc.
    • So we would need an "intelligent" locking algorithm...
    • [ This issue affects gzip if two processes are writing to a file by two different methods, e.g. writebytes and writefile, and also affects writes to rarfs per-archive, i.e. other writes should not be performed whilst one is being added to the rarfile. ]
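
The cachefile date problem described above can be reproduced with a few lines of shell. This is a sketch with invented function names, not the gzipfs code. It shows the two possible comparisons: a strict one that misses writes made within the timestamp resolution, and the lenient one described above that also fires when the dates are equal, and therefore fires on every release.

```shell
#!/bin/sh
# Hypothetical helpers; the names are not taken from the gzipfs source.
# The -nt and -ot tests are not in POSIX but exist in dash, bash and ksh.

# Check out: decompress into the cachefile, then copy the archive's date onto it.
checkout_cachefile() {  # <gzfile> <cachefile>
    gzip -dc "$1" > "$2" && touch -r "$1" "$2"
}

# Strict test: true only when the cachefile is newer than the archive.
# A write made within the timestamp resolution goes unnoticed.
cachefile_is_newer() {  # <gzfile> <cachefile>
    [ "$2" -nt "$1" ]
}

# Lenient test: true unless the cachefile is older than the archive, so
# equal dates count as "maybe modified". Safe, but true after every checkout.
cachefile_needs_commit() {  # <gzfile> <cachefile>
    ! [ "$2" -ot "$1" ]
}
```

Straight after a checkout the strict test says "no commit needed" and the lenient test says "commit", even though nothing was written, which is exactly the trade-off described in the list.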

  • My initial direction with this project was to pass as much as possible of the fuse queries from Java to the shell, partly because I am sick and I love shell and want to give it power and flexibility in creating vfs, but also because I didn't know how much/little would need to be handled where.
  • But now that I do know, I can safely migrate some functionality from the shell side to the Java side, without reducing shell's flexibility. All the things which I describe below as essentially common to all the shfs implementations could be more efficiently implemented in Java, without the need to invoke, or even talk to, the shell.
  • So from this prototype, I should be able to migrate to a more streamlined implementation, designed to allow creation of vfs in shell, but actually only calling shell when it's really needed, and doing a fair amount of stuff in Java. (E.g. the cachefile management.)
  • Update 02/2006: Unfortunately, like many hobbyist-programmers, I have lost interest in this project now that I have a working proof-of-concept. Efficiency was never really my goal - I just wanted to see if it was possible. So if you think this project is worth taking forwards, please contact me with some support/encouragement (or even offer to help!), otherwise I will probably never finish it! :P
  • Indeed, maybe this project should jump away from fuse and look at ways to integrate into "special files" in Hurd or reiserfs.

Cache files

  • Sometimes the system wants to random-access a file, i.e. jump along it picking up (or even putting down!) bytes in a non-sequential pattern. (E.g. "tail -n 10000 " is one way to get this behaviour)
  • Depending on the vfs you are implementing, providing this kind of access may not be directly possible.
  • So in such situations, a file which is being opened for random-access, is "checked out" of the vfs to a "cachefile" on the local filesystem. This cachefile can be easily read or written to in random ways by Java. Then, once the file is released from random-access by the external system, it is sent back to the vfs internally ("checked in").

  • At the moment, for each shfs implementation which needs this caching, a similar shellscript implementation of checking in and checking out is used. (Actually these are called "expand_cachefile_if_needed" and "compress_cachefile_if_needed" in the code, although that naming is equally flawed, biased towards gzipfs's rather than versioning fs's!)
  • The current default implementation will commit the cachefile as soon as it is released, and either remove the cachefile immediately, or never remove it.
    • (For an extreme example of why this is bad, consider when certain vfs's are mounted on top of other vfs's: each block of 4096 bytes written by the upper, causes a re-open and release of the file in the lower; but we don't want to re-gzip the file every time, only when it's finished being written!)
  • But, as we all know, in general the desirable behaviour for caching goes more like:
    • When the cachefile is released, don't necessarily commit it immediately to the vfs.
    • Instead we should keep the cachefile around, in case another process opens it for reading/writing soon.
    • Only if the cachefile hasn't been accessed for a while, or it is really about time we synced the vfs, then check the cachefile into the vfs.
  • That kind of behaviour would be far more efficient in many circumstances. But it is a little more difficult to implement in shellscript.
    • We could reclaim the oldest file when du reports cachedir too large. (But what if all those files really are open? We should report beyond-allowed-space to user.)
    • We could have a shell thread which accepts messages when cachefiles are released, and commits them when needed.
    • But maybe we should pass cachefile handling off to Java.
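
A rough sketch of the "commit only after a quiet period" idea from the list above, done in shell. The names sweep_cachedir, commit_cachefile and CACHE_IDLE_MINUTES are invented for illustration, and the commit step is a placeholder. It uses modification time rather than access time as a simplification, and it makes no attempt to detect cachefiles that are still open, which is one of the difficulties mentioned above.

```shell
#!/bin/sh
# Illustrative sketch; none of these names come from the Fuse-J-shfs source.
CACHE_IDLE_MINUTES="${CACHE_IDLE_MINUTES:-5}"

# Placeholder for the real check-in step of a given vfs.
commit_cachefile() {  # <cachefile>
    echo "would check in: $1"
}

# Check in, then remove, every cachefile left untouched for longer than
# CACHE_IDLE_MINUTES. Uses find -mmin, which GNU and BSD find support.
sweep_cachedir() {  # <cachedir>
    find "$1" -type f -mmin +"$CACHE_IDLE_MINUTES" |
    while IFS= read -r cachefile; do
        commit_cachefile "$cachefile" && rm -f "$cachefile"
    done
}
```

Something like this could be run periodically, or from the shell thread suggested above, instead of committing on every release.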

  • So TODO:
    • Implement properly-behaved "cachefiles" in the Java side of Fuse-j-shfs. This is probably the only major improvement FuseJshfs needs before it takes over the world.
    • Also, expose mount options to the user allowing them to select the cachefile method, or disable caching. In fact we should probably just let the user select which of the get_supported_ops() may and may not be used. [ Because the shell implementations of expand_cachefile_if_needed() and contract_cachefile_if_needed() may still be more useful if, in sh, they have (easy access to) more relevant information than the Java layer does. ]

Java's caches

  • NOTE: the Java class ShellFilesystem also has a cache (two HashMaps) of file meta-data, which should not be confused with the cachefiles.
  • These caches are useful because sometimes the shell calls to retrieve file meta-data are slow and should therefore be cached wherever possible.
  • But depending on the vfs implementation, and on the way it is used, the actual configuration for filemeta cache syncing could differ.
    • E.g. if retrieving file-metadata fresh from vfs is costly, then we want Java to cache it as much as possible.
    • But if the vfs is of a nature that its contents change without any change from user, then caching should refresh / clear as often as it needs to keep user up-to-date with the contents of the virtual filesystem.
  • See clearCache() in ShellFilesystem.java

  • So TODO:
    • Each shfs should recommend to Java which cache-clearance algorithm is most suitable.
    • Also, the user should be able to override the cache-clearance algorithm with one of their choosing, via an option to the mount.
      • (This would also allow us to batch up tests of each cache-clearance algorithm. :) )

Also TODO:

  • Mount option(s), and the shfs's own recommendation, relating to direct_io and to whether the user requires up-to-date information before seeing files, or whether file metadata should be dummied until it arrives, or just dummied permanently. (I'm talking about e.g. files whose size is slow to obtain: should we wait till we have it, or display 0 until we get it, or just display 0 and not bother obtaining the size because we are using direct_io and don't need it?) Also, note that direct_io broke make when I last tried it, so on its own it is dodgy, but it could be useful for efficiency/user's-point-of-view if handled correctly.

  • Coding niceties
    • Lots of little text notes and scriptlets in dev dir need tidying up (especially those which are out-of-date!).
    • Lots of code needs tidying, mainly removing old commented code which is now redundant, but some of the Java stream-handling code is a little convoluted and could be refactored to something equivalent but easier-to-follow.
    • Both Java and sh should be made better at self-checking and error-reporting.
      • There are still a few types of errors which can be produced by buggy shfs implementations, the reports of which get lost in Java streams, and never make it to the eyes of the developer/debugger. :/
      • Easy and appropriate solution: I think Java should not read stderr of shell process at all. I think shell should pipe all stderr into the logfile so the developer can see it, and Java can forget about it.
    • After removing a lot in an earlier blitz (re-use of loopfs functions), some duplicated code has crept back into the shfs implementations.
      • Refactor any possible common code out again.
      • Separate the code which makes each shfs (shfs's in general) work similarly, from the code which makes each shfs different.
    • Consider: Probably the best way to refactor is to turn all the existing shfs's into implementations for shfs_ez. shfs_ez will perform all the default stuff that needs doing, but it should allow the implementing config to perform any extra operations which it is useful to do (e.g. write_bytes at an offset directly, rather than shfs_ez using a cachefile for it). Yes indeed.
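
The stderr suggestion in the list above (let the shell send all of its stderr to the logfile, so Java never has to read it) needs very little code. A sketch, where the function name and the SHFS_LOGFILE variable are invented for illustration:

```shell
#!/bin/sh
# Hypothetical: SHFS_LOGFILE and the function name are not from the source.
SHFS_LOGFILE="${SHFS_LOGFILE:-/tmp/shfs.log}"

# Append everything this shell (and its children) write to stderr from now on
# to the logfile. "exec" with only a redirection changes the current shell.
redirect_stderr_to_log() {
    exec 2>>"$SHFS_LOGFILE"
}
```

Called once at the start of the shell process that serves requests, this would keep error reports from buggy shfs implementations out of the Java streams and in front of the developer.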

  • Testing
    • So far my most exciting test has been to "make svncviewer" in each vfs. This almost always broke new implementations, performing a number of unsightly interactions with the files, but eventually it showed me how to fix the system to act more closely to how a fs should.
    • This testing process easily could be and should be automated.
    • It would actually be better to compile Fuse-J-shfs as a test, rather than have a whole separate test package (svncviewer). But will ant or javac get quite so funky with the files as gcc and ld?!
    • Even better, would be to make a custom test program to access the filesystems in strange ways and try to break them. It would also test the various options for each vfs, when they are implemented. This would help further development of the project, by quickly letting the developer know if they broke something!

  • More shell vfs implementations of course!
    • Extending ezfs...
      • It would be nice to merge a load of ezfs configs into one mount, to present a vfs that does different things in different places.
      • And, allow easy creation of conditional configurations, e.g.:
        • Present the mounted filesystem as it is,
        • except whenever there is a zipfile, also present it unzipped as a virtual directory right next to it,
        • and whenever there is a web page, present a virtual directory next to it containing all the documents it links to,
        • and hide all files matching regexp '\.tmp$',
        • and next to each mp3 file, present a .lyrics.txt file, if and when one can be obtained from the net (e.g. by the jsh script seeklyrics).
    • svnfs should be made writable!
    • sparsefs should be renamed splitfs. It's gzipfs that actually simulates sparseness.
    • forkedfs:
      • Starts off looking like one of your existing FSs (the base FS), but when changes are made, they are stored separately from the base FS. So it lets you have two slightly different versions of a FS, without taking too much space.
        • Useful, for example, if you want to upgrade your system, but want to check it's all working before actually dropping the old version. :) (Therefore, forkedfs should allow you to apply the changes back to the original basefs.)
      • I want a version of forkedfs which works on a diskimage or disk-partition. (Can I actually base on /dev/hda1 ?)
        • E.g. I want to back up my /dev/hda1, but it's too large to fit on a CD, even compressed. So I would mount the partition image with forkedfs, and on the fork, delete some of the non-essential files not needed for the backup, overwriting them with 0's. The forked version of the fs (partition image) would then compress small enough. :)
      • Maybe forkedfs could be related to svnfs, but I might not want to check in the whole FS initially.
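
One of the conditional ezfs configurations wished for above, hiding all files matching '\.tmp$', could be layered on top of an existing listing function. This is only a sketch: base_list_files_in_dir is an invented stand-in for whatever the underlying config provides, and here it simply lists a local directory.

```shell
#!/bin/sh
# Hypothetical stand-in for the underlying config's file listing.
base_list_files_in_dir() {  # <dir>
    ls -1 "$1"
}

# The filtered listing: same output, minus any name ending in ".tmp".
list_files_in_dir() {  # <dir>
    base_list_files_in_dir "$1" | grep -v '\.tmp$'
}
```

The other conditions in that list would need more than a filter, since they add virtual entries rather than remove real ones.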

Why?

  • By giving you more control over your filesystems, which are very fundamental to your OS, accessed by all of your software, you can fix problems or do useful things on a FS which might be harder to do elsewhere.
    • E.g. a particular piece of software may be non-free so you can't change the way it reads or writes files, but with shfs you can change how the files are presented to it. :)

Why Java?

  • You're right, I didn't need to implement Fuse-J-shfs in Java, I could have written C code to communicate between fuse and the shell. But I found it easy to do it with Java.
  • So this Java implementation could be considered to be a prototype, which could indeed be replaced by a C/native implementation, without necessarily changing any of the shell code.

Why shell?

  • Well, because with shfs_ez it is now so easy to knock up a new vfs for whatever purpose you want, then forget about it afterwards.
  • For example:
    • It might be useful to be able to grep virtual files which you don't actually have stored on your disk.
    • Or run make in an automatically up-to-date-when-needed cvs forest of GNU source packages.
    • Or create little descriptors which say "there is a virtual file here whose contents are the output of the following command". (Unfortunately, since ext2 has no support for such special files, this would require mounting all relevant filesystems through shfs - inefficient. However, reiserfs, and the Hurd, do/will have support for this sort of thing...)
  • I wanted this ability in sh, because with Unix command-line tools, it is quick and easy to do many powerful things.
  • But if you prefer to program in Java, or C, then you don't need shfs, you can do it with fuse or Fuse-J. :)

Why so much shell?

  • That is indeed the question. The answer is probably: because I like writing shellscript!
  • It seems likely it would be far more efficient to implement more of the code in Java, and only call shellscript for (expose shellscript to) a very few operations (like ezfs does). That also might make it easier to implement some of the still-needed features / closing issues.
  • The only foreseeable disadvantage of implementing more in Java is that we lose an easy exit: if we keep the Java layer light, it could more easily be replaced at a later date by an even lighter Fuse->C->sh layer.

-- JoeyTwiddle - 08 Jun 2005

Conclusions

One limitation I came up against was that fuse does not seem to let us spring virtual files into existence when an attempt is made to access them. Instead fuse first queries the filelist in the directory, and if the requested file is not there, it returns a failure to open the file. I was hoping to be able to present an empty dir, but to fill out the file contents of, say, "/mounted_web/www.google.com/index.html" if the user tries to access it. The only workaround I can see for this would be for the user to create a file, e.g. by touching it (or otherwise telling the engine a new file should appear), and then read its content. [IIRC it was possible to present a zero-length file, but then show a lot of data when catting it. I could be wrong.]

My goal was to give full power to the shell, so the shell would have as much flexibility as possible.

  • Success: I was able to request the middle of a large file, e.g. a video, and the shell would stream the partial content from a remote webserver.

  • Problem: Shell is very inefficient at all this low-level stuff[1]. The shell performed the caching, when it would have been better for Java or C to do that. For maximum efficiency, the only work the shell need really do is to return the file lists and file contents (ezfs). In a new implementation, I would recommend that all the efficiency hacks (caching etc., other useful common behaviour) be performed at the Java or C level, but that the shell be allowed to override the bits it needs. So most FSs would be super-efficient, but the shell level would get the extra control if needed. (Some attempt was made at this configuration of Java behaviour by the shell, but only during fs startup.)
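
The "Success" case above, streaming the middle of a large remote file, comes down to an HTTP range request. The sketch below is not the httpfs code and its function names are invented. It relies on curl's -r/--range option and assumes the webserver honours Range requests.

```shell
#!/bin/sh
# Illustrative helpers; names are not taken from the httpfs implementation.

# Turn an offset and a length into an HTTP byte-range spec ("first-last").
byte_range_spec() {  # <offset> <length>
    echo "$1-$(($1 + $2 - 1))"
}

# Fetch only the requested slice of a remote file and write it to stdout.
# Assumes the server supports Range requests; otherwise the whole file
# may come back.
get_partial_contents() {  # <url> <offset> <length>
    curl -s -r "$(byte_range_spec "$2" "$3")" "$1"
}
```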

[1] Silly footnote: One of the funniest things in this project was the recursive shellscript to dd an exact length. Since dd works on multiples of count*block_size, for any efficiency, you need to split the length up into sizes with large multiples. Somewhere there is a script which sends a large multiple of the data, then calls itself again to deal with the data that still remains. It no longer produces primes as error messages. I suppose an alternative to this might have been to dd a little bit too much, then head -c <length>.
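
The alternative mentioned at the end of the footnote (dd slightly too much with a large block size, then trim) might look like the sketch below. It is not the recursive script from the project. The function name and BLOCK_SIZE variable are invented, and it reads from a regular file, where dd returns full blocks.

```shell
#!/bin/sh
# Illustrative only: not the project's recursive dd script.
BLOCK_SIZE="${BLOCK_SIZE:-65536}"

# Print exactly <length> bytes of <file> starting at byte <offset>.
# dd copies whole blocks that cover the range; tail and head trim the excess.
read_range() {  # <file> <offset> <length>
    file=$1; offset=$2; length=$3
    first_block=$((offset / BLOCK_SIZE))
    lead=$((offset - first_block * BLOCK_SIZE))   # unwanted bytes at the front
    blocks=$(((lead + length + BLOCK_SIZE - 1) / BLOCK_SIZE))
    dd if="$file" bs="$BLOCK_SIZE" skip="$first_block" count="$blocks" 2>/dev/null |
        tail -c +"$((lead + 1))" | head -c "$length"
}
```

At most two blocks' worth of extra data is read and thrown away, which is usually cheaper than many small dd calls.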

Copyright © 1999-2011 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.