java - How to quickly find added / removed files?


I am writing a small program that creates an index of all the files in my directories. It basically iterates over each file on the disk and stores it in a searchable database, much like Unix's locate. The problem is that index generation is quite slow because I have about one million files.

Once I have generated an index, is there a quick way to find out which files have been added to or removed from the disk since the last run?

Edit: I do not want to monitor file system events. I think the risk of getting out of sync is too high. I would much prefer something like a quick re-scan that quickly finds where files have been added / removed. Maybe using the directory's last-modified date or something else?

A small benchmark

I just ran a small benchmark. Running

  dir /b /s M:\test\ > C:\out.txt

takes 0.9 seconds and gives me all the information I need. When I use a Java implementation, it takes about 4.5 seconds. Any ideas on how to improve at least this brute-force approach?


I have done this in my tool metamake. Here is the recipe:

  1. If the index is empty, add the root directory to the index with timestamp == dir.lastModified() - 1.
  2. Find all directories in the index.
  3. Compare the timestamp of each directory in the index with the one from the file system. This is a fast operation because you have the full path (no scanning of all the files / directories in the tree is involved).
  4. If the timestamp has changed, then you have a change in this directory. Rescan it and update the index.
  5. If you encounter missing directories in this step, delete the subtree from the index.
  6. If you encounter an existing directory, ignore it (it will be checked in step 2).
  7. If you encounter a new directory, add it to the index with timestamp == dir.lastModified() - 1. Make sure it gets considered in step 2.
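
The recipe above can be sketched in Java roughly as follows. This is only a minimal illustration, not the actual metamake code: the class DirIndex, its fields and the in-memory maps are hypothetical names made up for this example, and a real tool would persist the index between runs. It also assumes the file system updates a directory's timestamp whenever an entry is added to or removed from it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.*;
import java.util.stream.Stream;

// Hypothetical in-memory index; a real tool would store these maps on disk.
final class DirIndex {
    final Map<Path, Long> dirTimes = new TreeMap<>();   // directory -> timestamp at last scan
    final Map<Path, Set<Path>> files = new HashMap<>(); // directory -> files seen at last scan
    final List<Path> added = new ArrayList<>();         // files found by the latest update()
    final List<Path> removed = new ArrayList<>();       // files lost since the previous update()

    void update(Path root) throws IOException {
        added.clear();
        removed.clear();
        // Step 1: seed an empty index with the root, one millisecond "too old".
        if (dirTimes.isEmpty()) {
            dirTimes.put(root, Files.getLastModifiedTime(root).toMillis() - 1);
        }
        // Step 2: visit every known directory (parents sort before children).
        Deque<Path> queue = new ArrayDeque<>(dirTimes.keySet());
        while (!queue.isEmpty()) {
            Path dir = queue.poll();
            if (!dirTimes.containsKey(dir)) continue;   // already dropped with its parent
            if (!Files.isDirectory(dir, LinkOption.NOFOLLOW_LINKS)) {
                dropSubtree(dir);                       // step 5: directory is gone
                continue;
            }
            // Step 3: one cheap lookup on a known path, no enumeration.
            long now = Files.getLastModifiedTime(dir).toMillis();
            if (now == dirTimes.get(dir)) continue;
            rescan(dir, queue);                         // step 4
            dirTimes.put(dir, now);
        }
    }

    private void rescan(Path dir, Deque<Path> queue) throws IOException {
        List<Path> entries = new ArrayList<>();
        try (Stream<Path> s = Files.list(dir)) {
            s.forEach(entries::add);
        }
        Set<Path> current = new HashSet<>();
        for (Path p : entries) {
            if (Files.isDirectory(p, LinkOption.NOFOLLOW_LINKS)) {
                // Step 6: known directories are skipped here; step 7: new ones are queued.
                if (!dirTimes.containsKey(p)) {
                    dirTimes.put(p, Files.getLastModifiedTime(p).toMillis() - 1);
                    queue.add(p);
                }
            } else {
                current.add(p);
            }
        }
        Set<Path> known = files.getOrDefault(dir, Collections.emptySet());
        for (Path p : known) if (!current.contains(p)) removed.add(p);
        for (Path p : current) if (!known.contains(p)) added.add(p);
        files.put(dir, current);
    }

    // Forget a directory and everything below it, reporting its files as removed.
    private void dropSubtree(Path dir) {
        dirTimes.keySet().removeIf(d -> d.startsWith(dir));
        Iterator<Map.Entry<Path, Set<Path>>> it = files.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Path, Set<Path>> e = it.next();
            if (e.getKey().startsWith(dir)) {
                removed.addAll(e.getValue());
                it.remove();
            }
        }
    }
}
```

Calling update(root) once builds the index; every later call only enumerates directories whose timestamp differs from the stored one, and leaves the changes in added and removed.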

This will allow you to notice new and deleted files efficiently. Since you only look up known paths in step #2, it will be very fast. File systems are bad at enumerating all the entries in a directory, but they are fast when you know the exact name.

Drawback: You will not notice changed files. If you edit a file, this will not be reflected in a change of the directory. If you need this information too, you will have to repeat the algorithm above for the file nodes in your index. This time, you can ignore new / deleted files because they have already been updated during the run over the directories.
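
A rough sketch of that second pass, again only as an illustration: the class FileStamps and its fileTimes map are hypothetical names, standing in for whatever per-file timestamps your index stores. It still needs one timestamp lookup per indexed file, but each lookup is on a known path, so no directory is enumerated.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-file part of the index: file path -> timestamp at last check.
final class FileStamps {
    final Map<Path, Long> fileTimes = new HashMap<>();

    // Returns the indexed files whose timestamp differs from the stored one
    // and refreshes the stored value for them.
    List<Path> findModified() throws IOException {
        List<Path> modified = new ArrayList<>();
        for (Map.Entry<Path, Long> e : fileTimes.entrySet()) {
            Path file = e.getKey();
            // Deleted files are skipped; the directory pass already reports them.
            if (!Files.isRegularFile(file)) continue;
            long now = Files.getLastModifiedTime(file).toMillis();
            if (now != e.getValue()) {
                modified.add(file);
                e.setValue(now);
            }
        }
        return modified;
    }
}
```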

Zach has mentioned that timestamps are not enough. My answer is: there simply is no other way to do this. The notion of "size" is completely undefined for directories and changes from implementation to implementation. There is no API where you can register "I want to be notified of any change being made to something in the file system". There are APIs that work while your application is running, but if it stops or misses an event, then you are out of sync.

If the file system is remote, things get worse because all kinds of network problems can cause you to get out of sync. So while my solution might not be 100% accurate and watertight, it will work for all but the most contrived exceptional cases. And it is the only solution that even gets this far.

Now there is a single kind of application that would want to preserve the timestamp of a directory after modifying it: a virus or worm. This will obviously break my algorithm, but then, it is not meant to protect against a virus infection. If you want to protect against that, you need a completely different approach.

The only other way to achieve this is to build a new filesystem that permanently logs this information somewhere, sell it to Microsoft and wait a few years (probably 10 or more) until everyone uses it.

