So far this is just a copy of the nullfs example from
/usr/share/doc/python-fuse with some stuff renamed

To make it work:

- How do you get another arg in the options?
    - pydoc fuse shows some magic option parser stuff
    - we need this for the "source" directory, i.e. the backing storage
      area (see the option-parser sketch after this list)
- Better to compress chunks?  Or have a blob more like a zip?
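
A rough sketch of how python-fuse's option parser could take the backing
storage directory as a mount option.  The "store" option name and the
ChunkFS class are made up for illustration, not what the code does yet:

    #!/usr/bin/env python
    # Sketch of the option-parsing part only; "store" and ChunkFS are
    # hypothetical names.
    import fuse

    fuse.fuse_python_api = (0, 2)

    class ChunkFS(fuse.Fuse):
        pass    # filesystem methods (getattr, read, ...) would go here

    if __name__ == '__main__':
        server = ChunkFS(version="%prog " + fuse.__version__,
                         usage=fuse.Fuse.fusage, dash_s_do='setsingle')
        # mountopt= makes this usable as "-o store=PATH" on the command line
        server.parser.add_option(mountopt="store", metavar="PATH",
                                 help="directory used as backing chunk storage")
        # values=server drops the parsed option onto the server object,
        # so the path is available afterwards as server.store
        server.parse(values=server, errex=1)
        server.main()

Mounting would then presumably look something like
"./chunkfs.py -o store=/var/chunks /mnt/point".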

-----

TODO:

+ Make inflate/deflate block-based as needed, so we don't have to do a
  bunch of work up front and waste a bunch of space on disk
    - done
+ Make files just contain a backing storage key; the key will reference
  what we keep inline now (the data list and stat info), so that completely
  duplicate files will not take up a few extra megs yet can still have
  their own permissions and ownership (see the chunk-store sketch after
  this list)
+ Copying read-only files doesn't work (permission denied on close, because
  that is the point at which we open and write to the original file)
    - done - we open a file handle at __init__ now and use that
- R/W is basically ignored at this point
- fsck:
    - verify that every chunk except the last is a full block size (this
      would also make a good assert)
+ delete unused chunks (refcounting)
- pack multiple chunks into "super chunks" like cromfs/squashfs to get better
  compression (presumably 4M of data will compress better than the same data
  split into 4 1M pieces and compressed individually)
- Speed it up?  Or is it "fast enough"?
- should die on errors accessing blocks; will also need some kind of fsck
  to find corrupt blocks and the files affected, so that if there is a
  problem and you have another copy of the file then the block can be
  recreated
- some kind of config/IOC to allow plugging in the hash, storage, etc.
  methods (see the plug-in sketch after this list).  Main components like
  FileSystem, Chunk, ChunkFile don't need to be swapped out since they are
  generic
- Maybe compression method doesn't belong in Chunk?  It should be part of
  storage (for super-chunks?), or should super-chunks be a "standard" part?
+ Add refcounting so we can expire chunks, or would a reverse list be
  better?
- If we separate metadata from chunks then we can just rebuild the
  metadata for fsck
- Find a good way to test that refcounting isn't purging things too soon or
  keeping them too long
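
A minimal sketch of the backing-storage-key and refcounting ideas above,
assuming chunks are keyed by a SHA-1 of their uncompressed content and
compressed individually with zlib.  ChunkStore, put()/get()/unref() and the
sidecar ".refs" layout are all made-up names, not the current code:

    # Sketch: content-addressed, individually compressed chunks with
    # refcounts.  All names and the on-disk layout are assumptions.
    import hashlib
    import os
    import zlib

    class ChunkStore(object):
        def __init__(self, root):
            self.root = root

        def _path(self, key):
            return os.path.join(self.root, key)

        def put(self, data):
            """Store one chunk; identical chunks share one file (dedup)."""
            key = hashlib.sha1(data).hexdigest()
            path = self._path(key)
            if not os.path.exists(path):
                open(path, 'wb').write(zlib.compress(data))
                self._set_refs(key, 0)
            self._set_refs(key, self._get_refs(key) + 1)
            return key

        def get(self, key):
            return zlib.decompress(open(self._path(key), 'rb').read())

        def unref(self, key):
            """Drop one reference; delete the chunk when nothing uses it."""
            refs = self._get_refs(key) - 1
            if refs <= 0:
                os.unlink(self._path(key))
                os.unlink(self._path(key) + '.refs')
            else:
                self._set_refs(key, refs)

        def _get_refs(self, key):
            return int(open(self._path(key) + '.refs').read())

        def _set_refs(self, key, n):
            open(self._path(key) + '.refs', 'w').write(str(n))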
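
And a sketch of the config/IOC idea: the hash and compression functions get
injected instead of hard-coded, so FileSystem/Chunk/ChunkFile can stay
generic.  Every name here is hypothetical:

    # Sketch: plug the hash/compression methods in via a config object.
    # FileSystemConfig and its defaults are assumptions.
    import hashlib
    import zlib

    class FileSystemConfig(object):
        def __init__(self,
                     hash_func=lambda data: hashlib.sha1(data).hexdigest(),
                     compress=zlib.compress,
                     decompress=zlib.decompress):
            self.hash_func = hash_func
            self.compress = compress
            self.decompress = decompress

    # e.g. switch to md5 without touching FileSystem/Chunk/ChunkFile:
    # cfg = FileSystemConfig(hash_func=lambda d: hashlib.md5(d).hexdigest())
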
-----

Other thoughts:

- If there were an easy way to "open a file" or something and have it
  "touch" all its pieces, you could just run that in the mounted tree,
  then "find storage/ -mtime +1" and delete that stuff to clean out cruft
- Alternatively have it keep track of block usage counts and delete a block
  when its count goes to zero
    - Change load/save to be refcounted?  Or have separate "lock" and
      "release" methods to say "Yeah, I'm using this" or "This is garbage
      now"?
- Possibly better compression to be had if you use a squashfs sort of block
  of blocks.  You keep the redundancy detection of small blocks (32k or
  whatever) and pack those together into big blocks (say 2-4M), then
  compress the big block, so you get better compression across the big
  block (see the super-chunk sketch at the end).  The question is whether
  this constant inflating and deflating of blocks will be too much of a
  performance hit
      - Maybe have a "working set" of pre-expanded sub blocks?  And
        automatically freeze out blocks when all the files are closed?
- This might work well over a remote link for random access to large files
  using sshfs or ftpfs or something, since you don't have to download the
  whole original file to get chunks out; you download the index and then
  just the chunks you want
- Get rid of cPickle; it's way more than we need for saving essentially a
  few ints and a block list, even though it is very convenient (see the
  struct sketch at the end)
- Because of the way we refcount blocks I don't think we can open/unlink a
  file like you would for temp files, but that's not the purpose anyway
- Do some profiling using a loopback filesystem; since most of it will be
  in memory, we can see where the "real" bottlenecks in the code are by
  taking out the disk-access unknowns
- ext3 uses a lot of space for directory inodes, which reduces the savings
  quite a bit
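
A rough sketch of the "block of blocks" idea: small chunks are still
deduplicated individually, but several of them are concatenated into one
super chunk before compression so zlib sees a larger window.  The names,
the 2M target size, and the (offset, length) index are all assumptions:

    # Sketch: pack several small chunks into one compressed "super chunk".
    # SUPER_SIZE, pack_super_chunk() and the index layout are assumptions.
    import zlib

    SUPER_SIZE = 2 * 1024 * 1024   # target uncompressed size of a super chunk

    def pack_super_chunk(chunks):
        """chunks: list of (key, data).  Returns (compressed_blob, index)
        where index maps key -> (offset, length) in the uncompressed blob."""
        index = {}
        offset = 0
        parts = []
        for key, data in chunks:
            index[key] = (offset, len(data))
            parts.append(data)
            offset += len(data)
        return zlib.compress(''.join(parts)), index

    def read_from_super_chunk(blob, index, key):
        """Inflate the whole super chunk, then slice out one small chunk.
        This is the inflate cost the note above worries about."""
        data = zlib.decompress(blob)
        offset, length = index[key]
        return data[offset:offset + length]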
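
A sketch of what replacing cPickle could look like, given the metadata is
essentially a handful of stat ints plus an ordered list of chunk keys.  The
record layout here is invented for illustration:

    # Sketch: save/load file metadata without cPickle.
    # Layout (invented): 6 stat ints, a count, then fixed-width hex keys.
    import struct

    KEY_LEN = 40   # length of a sha1 hexdigest

    def dump_meta(stat_ints, chunk_keys):
        """stat_ints: (mode, uid, gid, size, mtime, ctime);
        chunk_keys: ordered list of hex key strings."""
        header = struct.pack('!6QI', *(list(stat_ints) + [len(chunk_keys)]))
        return header + ''.join(chunk_keys)

    def load_meta(blob):
        header_len = struct.calcsize('!6QI')
        fields = struct.unpack('!6QI', blob[:header_len])
        stat_ints, count = fields[:6], fields[6]
        keys = [blob[header_len + i * KEY_LEN:header_len + (i + 1) * KEY_LEN]
                for i in range(count)]
        return stat_ints, keys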