SlackerAl

Efficiently examining user disk usage


Hi All,

Overview

I have a network shared file system of reasonable size (25 TB) which is mounted as a mapped NTFS drive to the PCs of about 30 users. The storage is forever full due to a variety of poor practices beyond my control. One of the main problems is that the users have their data distributed over a wide and deep directory structure and they struggle to find their old data to archive. The ownership of the files within the directory structure is mixed.

I wanted to write a tool which would help individual users find their data heavy directories, preferably without thrashing the file system to death.

I've written a tree viewer that works well, provided I do not want to restrict my summary to a specific owner name (I employ some filters / selective starting positions to restrict the search range within the directory structure).

Problem

I need to code up a replacement for DirGetSize that returns the size for a specific username (files are domain-user owned, e.g. EUROPE\slacker) and cascades through the sub-directories with the same username requirement. I'm trying to avoid an ugly looping function of unknown cost that checks each file individually. Are there any better approaches / suggestions, e.g. calls to an existing API?

Thanks

Al


Problem solving step 1: Write a simple, self-contained, running, replicator of your problem.


Why not use something like TreeSize?




TreeSize's network drive scans require a commercial licence for each user.

It needs admin privileges to install (not always available).

And it automatically starts scanning at the drive root (the load that generates is a concern with multiple users on one large drive).




I would suggest this basic recursive algorithm :

#include <Constants.au3>

Opt("MustDeclareVars", 1)

Global $oShellApplication = ObjCreate("Shell.Application")

MsgBox ($MB_SYSTEMMODAL,"",GetOwnerSize("C:\Apps\AutoIt", "EUROPE\slacker"))

Func GetOwnerSize($sFolder, $sOwner)

  ;ConsoleWrite ("Folder Name = " & $sFolder & @CRLF)
  Local $oShellFolder = $oShellApplication.NameSpace($sFolder)
  Local $oShellFolderItems = $oShellFolder.Items()
  $oShellFolderItems.Filter(0x40, "*") ; 0x40 = SHCONTF_NONFOLDERS: restrict the collection to files only
  ;ConsoleWrite("File count = " & $oShellFolderItems.count & @CRLF)
  Local $nCount = 0
  For $oShellFolderItem In $oShellFolderItems
    ; Details column 10 = Owner, column 1 = Size (column indices can vary by system)
    If $oShellFolder.GetDetailsOf($oShellFolderItem, 10) <> $sOwner Then ContinueLoop
    $nCount += $oShellFolder.GetDetailsOf($oShellFolderItem, 1)
  Next

  $oShellFolderItems.Filter(0x20, "*") ; 0x20 = SHCONTF_FOLDERS: now recurse into the sub-folders
  ;ConsoleWrite("Folder count = " & $oShellFolderItems.count & @CRLF)
  For $oShellFolderItem In $oShellFolderItems
    $nCount += GetOwnerSize($sFolder & "\" & $oShellFolderItem.name, $sOwner)
  Next
  Return $nCount

EndFunc   ;==>GetOwnerSize
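For comparison, the same recursion can be sketched in Python. This is only a hedged illustration of the algorithm, not a drop-in equivalent: it reads POSIX ownership via `st_uid` and the `pwd` database, whereas on the NTFS share in question you would query each file's security descriptor instead (e.g. with pywin32).

```python
import os
import pwd

def get_owner_size(folder: str, owner: str) -> int:
    """Total bytes of files under *folder* owned by *owner*,
    recursing into sub-directories like the AutoIt GetOwnerSize."""
    total = 0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_symlink():
                continue  # skip links to avoid cycles and double counting
            if entry.is_dir():
                total += get_owner_size(entry.path, owner)  # recurse
            else:
                st = entry.stat()
                # Resolve the numeric uid to a user name and compare
                if pwd.getpwuid(st.st_uid).pw_name == owner:
                    total += st.st_size
    return total
```

Like the AutoIt version, this touches every directory entry once per query, so the cost grows with the size of the tree, not with the amount of matching data.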

 


Thanks for that, I've now got something working.




The issue is always going to be speed. You have to, in essence, grab every file/folder object and look at its ACLs to determine the owner and size, keeping a running total. You could use BrewManNH's excellent _FileGetProperty UDF (below); just change the _FileListToArray call to _FileListToArrayRec. There is also an older function called _GetExtProperty that still works pretty well:

_FileGetPropertyUDF

_GetExtProperty (example below)

$aProps = _GetExtProperty(FileOpenDialog("Choose File", @UserProfileDir, "ALL (*.*)"), -1)
If IsArray($aProps) Then ConsoleWrite($aProps[1] & ", " & $aProps[10] & @CRLF)

 

In either case, however, I think you're going to run into a speed issue if you're parsing a large number of files. Running _GetExtProperty against a directory with ~45,000 files and pulling only the files that match a specific user took more than 10 minutes. You may have to resort to PowerShell; the same query took only 3 minutes.
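If more than one user needs these numbers, a single pass that buckets sizes per owner avoids re-scanning the tree once per name. A hedged Python sketch of that idea (again using POSIX `st_uid` ownership purely for illustration; an NTFS share would need the security descriptor instead):

```python
import os
import pwd
from collections import defaultdict

def sizes_by_owner(root: str) -> dict:
    """Walk *root* once and accumulate bytes per owner name."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable mid-scan
            totals[pwd.getpwuid(st.st_uid).pw_name] += st.st_size
    return dict(totals)
```

One scan then answers the question for all 30 users at once, which matters when the per-file owner lookup is the expensive step.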

 

Edit: Dang page refresh. Glad you found a solution.




Thanks JLogan3o13. I did indeed use _FileGetProperty. I took advantage of the fact that below a certain directory level users stop swapping around, so by forcing them to check one chunk of the disk at a time, performance was OK.




This is typically the kind of problem you get when a company uses a file server as a file-sharing platform across all users with no clear plan of action or management.

The result is files and folders spread everywhere with no logic, no maintenance and horrible rights everywhere.

One way to solve it is to enforce strict, specific rules on how to use the file server and correct the existing file system to match, which can represent a lot of work.

The other way is to set up a new file server properly from the ground up (either a traditional file server or something more sophisticated like Nextcloud), then ask and warn the users that they have until a given date to move their files to the new server under its rules.

Given your situation, the second option would probably be better, if it is possible. Otherwise, good luck ;)


Hi Neutro,

Fair comments for many situations. Here there is a clear use plan, generally with quite a good structure; there is some trade-off between freedom of working methods (needed for various complex problems) and a completely rigid structure.

There are two main causes of the problem: no quota system, as this is unwanted by those ultimately in charge (there is an expectation of self-management, which works for 90% of the users), and a lack of supervision of live project spaces, because those running the projects want to spend their time on other, more productive things.

The file system is the transient data store for live HPC projects. The cluster is able to rapidly generate large volumes of data, so a cloud solution is not ideal.

Best of all - I'm a user not an administrator 🙂 Now that my fellow users and I can find our occasional chunks of forgotten data, we are back to a working space.




Hey,

What you are dealing with right now is something that the people managing your IT system should have anticipated and dealt with before it became a problem for you. Sorry to be a bit blunt, but even if they are nice people, they're not doing their job right.

It's OK to have no quota system, but the admins should have a server monitoring interface that alerts them when there is a space problem and lets them see immediately where it is coming from, without having to re-scan the whole server's data.

Also, the volume of data available has nothing to do with the software running it. That is a property of the hardware layer, not the software :)

"Cloud solution" = data hosted remotely, which is technically what you already have right now. But if you had a Nextcloud server managing your data instead of a simple file server, file management would be more granular and you probably wouldn't have to deal with these data problems :)

A cloud solution can also be hosted LAN-only; it doesn't have to be reachable over the internet as well. Internet access is more convenient, but it requires a very fast symmetrical line, which is not always possible to get.


