At least one embodiment of the present invention pertains to networked storage systems, and more particularly to a method and apparatus for collecting and reporting data pertaining to files stored on a storage server.
A file server is a type of storage server which operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical storage based disks. The mass storage devices are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). One configuration in which file servers can be used is a network attached storage (NAS) configuration. In a NAS configuration, a file server can be implemented in the form of an appliance, called a filer, that attaches to a network, such as a local area network (LAN) or a corporate intranet. An example of such an appliance is any of the NetApp Filer products made by Network Appliance, Inc. in Sunnyvale, Calif.
A filer may be connected to a network, and may serve as a storage device for several users, or clients, of the network. For example, the filer may store user directories and files for a corporate or other network, such as a LAN or a wide area network (WAN). Users of the network can be assigned an individual directory in which they can store personal files. A user's directory can then be accessed from computers connected to the network.
A system administrator can maintain the filer, ensuring that the filer continues to have adequate space, that certain users are not monopolizing storage on the filer, etc. A Multi-Appliance Management Application (MMA) can be used to monitor the storage on the filer. An example of such an MMA is any of the Data Fabric Monitor (DFM) products made by Network Appliance, Inc. in Sunnyvale, Calif. The MMA may provide a Graphical User Interface (GUI) that allows the administrator to more easily observe the condition of the filer.
The MMA needs to collect information about files stored on the filer to report back to the administrator. This typically involves a scan or “file walk” of storage on the filer. During the file walk, the MMA can determine characteristics of files stored on the filer, as well as a basic structure, or directory tree, of the directories stored thereon. These results can be accumulated, sorted, and stored in a database, where the administrator can later access them.
On a large system, the file walk can be a very resource intensive process. Additionally, on a typical system having a large amount of storage, the results of the file walk can be very large. As a result, traversing the results of the file walk stored in the database can also be very resource intensive. What is needed is a way to store the results of a file walk so that they can easily be accessed and searched by an administrator.
A method for creating a file information database is disclosed. A storage server having a directory structure is scanned. Data regarding the directory structure is collected. Identification (ID) numbers are assigned to directories in the directory structure according to a depth first search (DFS) order. A table including the ID numbers is then written.
Other aspects of the invention will be apparent from the accompanying figures and from the detailed description which follows.
One or more embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
Described herein are methods and apparatuses for representing a directory structure using a depth first search (DFS) order. Note that in this description, references to “one embodiment” or “an embodiment” mean that the feature being referred to is included in at least one embodiment of the present invention. Further, separate references to “one embodiment” or “an embodiment” in this description do not necessarily refer to the same embodiment; however, such embodiments are also not mutually exclusive unless so stated, and except as will be readily apparent to those skilled in the art from the description. For example, a feature, structure, act, etc. described in one embodiment may also be included in other embodiments. Thus, the present invention can include a variety of combinations and/or integrations of the embodiments described herein.
According to an embodiment of the present invention, a filer or other storage server is coupled to a network to store files for users of the network. An agent is coupled to the filer and performs a scan or file walk of the file system of the filer for a Multi-Appliance Management Application (MMA), which is coupled to the filer and can monitor and manage the filer. The agent assigns identification (ID) numbers to the directories in the file system while scanning them. The ID numbers are assigned in a depth first search (DFS) order so that the results are less difficult and require fewer resources to traverse. Assigning the IDs facilitates efficient queries that may be useful to a system administrator monitoring the filer or other storage server. Several types of queries, including determining the parent of a node, determining all of the children of a node, determining the immediate children of a node, and determining all of the ancestors of a node may be easily accomplished using the ID numbers.
According to one embodiment of the invention, the agents 112 and 114 may use a file system different from the one used by the filer 102. For example, the agent 112 uses the Common Internet File System (CIFS), while the agent 114 uses the Network File System (NFS). Here, either agent 112 or 114 is able to perform the file walk of the filer 102, regardless of the file system used by the filer 102. The agent 112 also has storage 116 to store the results of a file walk while the walk is occurring and before they are transferred to the MMA 104. The agent 114 may also have attached storage for this purpose.
The results of a file walk may be transferred to and stored on the database server 108 after the file walk is complete. The database server 108 can then be accessed by the GUI 110, so that an administrator can search the results of the file walk. The GUI may allow the administrator to easily parse the results of a specific file walk, including allowing the administrator to monitor the total size of files stored on the filer, the size of particular directories and their subdirectories, the parents of specific directories, etc. These queries will be discussed in more detail below. The file walk may also collect statistics about the files on the filer, such as the total size of files, the most accessed files, the types of files being stored, etc. According to one embodiment, the GUI 110 may be a web-based Java application.
In this context, a tree represents a directory structure. A “node” is a point on the tree from which the tree branches off to other nodes or terminates. For example, the elements 201-210 are all nodes of the tree 200. A node, as used here, will represent a directory or file on the storage or file server. A “parent” of a first node is a second node located immediately above the first node in the tree. A “child” is the first node in relation to the second node. For example, a parent directory will have a child directory located within it. Here, the node 201 is the parent of the node 202, and the node 202 is the child of the node 201. A “sibling” is a node that has the same parent as another node, and is therefore on the same level of the tree. For example, two directories found embedded in the same parent directory are siblings. Here, the node 202 and the node 207 are siblings.
Identification numbers (IDs) can be issued to each node to facilitate searching or querying the tree. In one embodiment, the IDs are issued during the file walk. The IDs can easily identify a node such as a specific directory or file. The IDs can also identify specific relationships between nodes, depending on the type of tree chosen.
The tree 200 has several nodes 201-210 that may represent directories stored on the filer 102. The nodes 201-210 have corresponding ID numbers 1-10, all in a DFS order. The DFS order assigns ID numbers to the nodes 201-210 by traversing down to the end of the tree first, and across the tree next. For example, the ID number 1 is assigned to the node 201, the number 2 to the node 202, and so on until the node 204 is reached. The node 204 has no children, i.e., it has no embedded directories. Since the numbering system has reached the “deepest” directory, the process will move on to the siblings of the node 204. In this case, the node 204 has one sibling, the node 205. The node 205 will be assigned the next ID number, or 5. Since the node 205 has no children, the process will move up the tree looking for the next unassigned sibling, which is the node 206 here. The node 206 is then assigned an ID number of 6, the next available number. This process is repeated until all nodes 201-210 have been assigned ID numbers. As a result of the DFS ordering, all children of a particular node have an ID greater than that of the particular node, and all siblings of the node have either a smaller ID than the node or an ID greater than those of all of the children of the node.
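The numbering described above can be illustrated with a minimal Python sketch. The directory names and the `TREE` dictionary are hypothetical; the shape of the subtree under `a` follows the relationships the text gives for the nodes 201-207, while the placement of the last three directories under `e` is an assumption made only to reach ten nodes.

```python
from itertools import count

# Hypothetical directory tree: each name maps to its subdirectories.
# "root", "a", "b", "b1", "b2", "d", "e" stand in for nodes 201-207;
# the children of "e" are assumed for illustration.
TREE = {
    "root": ["a", "e"],
    "a": ["b", "d"],
    "b": ["b1", "b2"],
    "b1": [], "b2": [], "d": [],
    "e": ["f", "g", "h"],
    "f": [], "g": [], "h": [],
}

def assign_dfs_ids(tree, root):
    """Return {name: id}, numbering directories in depth-first (preorder) order."""
    ids = {}
    counter = count(1)

    def visit(name):
        ids[name] = next(counter)   # number the directory when it is first reached
        for child in tree[name]:    # then descend before moving across
            visit(child)

    visit(root)
    return ids

def descendant_ids(tree, ids, name):
    """Collect the IDs of every directory below `name`."""
    found = []
    for child in tree[name]:
        found.append(ids[child])
        found.extend(descendant_ids(tree, ids, child))
    return found

ids = assign_dfs_ids(TREE, "root")
below_a = descendant_ids(TREE, ids, "a")
print(ids["a"], sorted(below_a), ids["e"])
```

In this sketch every ID below `a` is greater than the ID of `a` and smaller than the ID of its sibling `e`, which is the ordering property the queries rely on.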
In block 404, data is collected regarding a directory structure. The data refers to the location of directories and characteristics of files stored in those directories. For example, this may be data generated for the table 350. The scan may create relationships between the directories so that a tree, such as the tree 200, can be created. This data may be stored on a database server such as the database server 108 and can be reported to a system administrator through a GUI 110.
In block 406, IDs are assigned to directories according to a DFS order while collecting the data. The agent is responsible for determining the organization of the directories into a directory tree. While the agent is organizing the directories, the agent can assign DFS IDs to each directory it encounters. These IDs can later be used to perform efficient queries, such as determining all the children of a specific directory, determining the parent of a specific directory, determining the ancestors of a directory, or the immediate children of a directory. The IDs are assigned in the order in which the directories are scanned, while data is being collected about the directory structure. In an operational sense, the agent also scans the directories in a DFS order.
In block 408, a table including the ID numbers is written. The table may include a list of the ID numbers, cross referenced with a name of the directory and the ID of the parent of the directory. This table can be used by a monitoring device or other server to determine the results of the queries mentioned above. The table can be written to a DB server such as the DB server 108. Once the table has been written, the process 400 is finished.
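As one possible illustration of block 408, the sketch below writes such a table to an in-memory SQLite database using Python's standard `sqlite3` module. The table name `directories`, the column names, and the directory names are hypothetical, and storing no parent for the root as `NULL` is an assumption; the parent IDs for the IDs 1-7 follow the relationships given for the tree 200.

```python
import sqlite3

# (ID, directory name, parent ID); names are placeholders for illustration.
ROWS = [
    (1, "root", None),   # assumption: the root has no parent entry
    (2, "a", 1),
    (3, "b", 2),
    (4, "b1", 3),
    (5, "b2", 3),
    (6, "d", 2),
    (7, "e", 1),
]

def write_table(conn, rows):
    """Create the ID table and insert one row per directory."""
    conn.execute(
        "CREATE TABLE directories ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, parent_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO directories (id, name, parent_id) VALUES (?, ?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
write_table(conn, ROWS)
stored = conn.execute(
    "SELECT id, name, parent_id FROM directories ORDER BY id"
).fetchall()
print(stored)
```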
According to an embodiment of the invention, the file walk is performed by two threads: a directory walking thread and a file thread. A thread may be a program capable of operating independently of other programs. Together, the threads may traverse and examine all files and directories found on a storage server to establish a logical tree. The threads may be configured to examine the contents of the server in a DFS order, so that each identified directory is assigned an ID in a DFS order.
The directories are chronologically assigned IDs. In other words, the first directory examined by the directory walking thread will be assigned the ID ‘1,’ the second directory will be assigned the ID ‘2,’ etc. If both threads operate simultaneously, they would be unable to maintain a DFS order. A condition variable and a mutex can be used to ensure the proper order. A mutex, or mutual exclusion object, allows multiple threads to share the same resource. While one thread is using the resource, access to the resource is denied to all other threads. The condition variable allows a resource to be blocked based on a condition. For example, when the file queue is empty, the condition variable may be signaled, allowing the directory walking thread to continue. Essentially, only one thread may operate at any given time to ensure the proper order is maintained. In practice, after the directory walking thread determines the children of a specific directory, the directory walking thread will cease examining directories, and allow the file thread to examine the file queue. Once the file queue is empty, the process resumes with the directory walking thread examining the next directory.
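The hand-off between the two threads might be sketched as follows, using Python's `threading.Condition`, which pairs a condition variable with a mutex. The in-memory file system `FS`, the `state` dictionary, and all function names are hypothetical; this is an illustration of the turn-taking idea under those assumptions, not the disclosed implementation.

```python
import threading
from collections import deque

# Hypothetical file system: directories map to their entries, files map to None.
FS = {
    "/": ["/a", "/b", "/x.txt"],
    "/a": ["/a/c", "/a/y.txt"],
    "/a/c": [],
    "/b": [],
    "/x.txt": None,
    "/a/y.txt": None,
}

def walk(fs, root):
    cond = threading.Condition()   # condition variable plus its mutex
    dir_queue = deque([root])
    file_queue = deque()
    state = {"turn": "dir", "done": False}
    ids, files = {}, []

    def directory_thread():
        while True:
            with cond:                                    # only one thread runs at a time
                cond.wait_for(lambda: state["turn"] == "dir")
                if not dir_queue:                         # nothing left to examine
                    state["done"] = True
                else:
                    current = dir_queue.popleft()
                    ids[current] = len(ids) + 1           # next available ID
                    file_queue.extend(fs[current])
                state["turn"] = "file"                    # hand over to the file thread
                cond.notify_all()
                if state["done"]:
                    return

    def file_thread():
        while True:
            with cond:
                cond.wait_for(lambda: state["turn"] == "file")
                if state["done"]:
                    return
                while file_queue:                         # drain the file queue
                    entry = file_queue.popleft()
                    if fs[entry] is None:
                        files.append(entry)               # record data about the file
                    else:
                        dir_queue.appendleft(entry)       # front of the directory queue
                state["turn"] = "dir"                     # file queue empty: signal
                cond.notify_all()

    threads = [threading.Thread(target=directory_thread),
               threading.Thread(target=file_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids, files

ids, files = walk(FS, "/")
print(ids, files)
```

Because each subdirectory is pushed to the front of the directory queue individually, siblings are visited in the reverse of their listing order in this sketch; the numbering is still depth first.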
The process 500 illustrates the operation of the two threads. In block 502, the root directory is added to the directory queue. The root directory is the main directory of a file system, represented by the root, or top, node of a file tree. In block 504, it is determined whether there are any directories remaining in the directory queue. If not, the process ends, since every directory on the volume has been examined. If there are more directories, the process continues to block 506.
In block 506, the directory walking thread examines the next directory in the directory queue. The examination of the directory reveals the children of the directory, or the directories and files stored within the directory. In block 508, the children of the directory being examined are placed on the file queue, and the current directory is assigned the next available ID. The root directory is assigned the ID ‘1’.
In block 510, it is determined whether there are any more entries in the file queue. If there are, the process continues on to block 512. If not, the process returns to block 504, where the next directory will be examined. In block 512, the file thread examines the next entry in the file queue. In block 514, it is determined whether that entry is a file or a directory. If it is a directory, in block 516, the directory is added to the front of the directory queue, and the process returns to block 510. If the entry is a file, in block 518 data about the file is recorded, and the process returns to block 510 to examine the next entry in the file queue.
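The queue handling of the process 500 can be traced with a single-threaded Python sketch in which the block numbers appear as comments. The file system `FS` and the function name `file_walk` are hypothetical. The sketch drains the file queue completely before the next directory is examined, following the earlier statement that the directory walking thread resumes once the file queue is empty.

```python
from collections import deque

# Hypothetical file system: directories map to their entries, files map to None.
FS = {
    "/vol": ["/vol/home", "/vol/readme.txt", "/vol/tmp"],
    "/vol/home": ["/vol/home/ann", "/vol/home/bob"],
    "/vol/home/ann": ["/vol/home/ann/notes.txt"],
    "/vol/home/bob": [],
    "/vol/tmp": [],
    "/vol/readme.txt": None,
    "/vol/home/ann/notes.txt": None,
}

def file_walk(fs, root):
    dir_queue = deque([root])                  # block 502: root onto the directory queue
    file_queue = deque()
    ids, file_records = {}, []
    next_id = 1
    while dir_queue:                           # block 504: any directories remaining?
        current = dir_queue.popleft()          # block 506: examine the next directory
        file_queue.extend(fs[current])         # block 508: children onto the file queue
        ids[current] = next_id                 #            and assign the next ID
        next_id += 1
        while file_queue:                      # block 510: any entries remaining?
            entry = file_queue.popleft()       # block 512: examine the next entry
            if fs[entry] is not None:          # block 514: directory or file?
                dir_queue.appendleft(entry)    # block 516: front of the directory queue
            else:
                file_records.append(entry)     # block 518: record data about the file
    return ids, file_records

ids, file_records = file_walk(FS, "/vol")
print(ids)
print(file_records)
```

Adding subdirectories to the front of the directory queue makes the queue behave like a stack, which is what keeps each subtree's IDs contiguous.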
In block 702, if a user wants to perform a ‘Parent’ query, the process continues to block 704. If not, the process continues to block 706. Since the parent node is listed in the column 304, this query is trivial, requiring only one inquiry. In block 704, the parent column is referenced for the particular node. The node's ID is found in the column 302, and the entry in the same row in column 304 is the parent ID. For example, if the parent of node 2 is to be determined, the system can search the column 302 for the node 2, and then reference the corresponding entry in the column 304 to determine that the node 1 is the parent of the node 2. Once the query is completed, the process moves to block 708, where it is determined whether more queries should be performed.
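A ‘Parent’ query of this kind might look like the following Python sketch. The `TABLE` list is a hypothetical stand-in for the ID and parent columns, holding only the rows whose parents the text states; using `None` as the root's parent is an assumption.

```python
# (ID, parent ID) pairs standing in for the two columns of the table.
TABLE = [(1, None), (2, 1), (3, 2), (4, 3), (5, 3), (6, 2), (7, 1)]

def parent_of(table, node_id):
    """Find the row for node_id and return the parent ID stored beside it."""
    for row_id, parent_id in table:
        if row_id == node_id:
            return parent_id
    raise KeyError(f"no row for node {node_id}")

parent = parent_of(TABLE, 2)
print(parent)
```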
In block 706, if a user wants to perform the ‘Immediate Children’ query, the process continues to block 710. If not, the process continues to block 712. A node's immediate children are the nodes found directly beneath it. In block 710, the immediate children of a node can be determined by searching the parent column 304 for instances of that node. For example, the children of node 2 can be found by searching the column 304 for any occurrences of the ID 2. It can be seen that the ID 2 is found next to the nodes 3 and 6, which are the immediate children of node 2.
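The ‘Immediate Children’ query reduces to a scan of the parent column, as in this Python sketch. The `TABLE` list is again a hypothetical stand-in for the table, limited to the rows whose parents the text states.

```python
# (ID, parent ID) pairs; None for the root's parent is an assumption.
TABLE = [(1, None), (2, 1), (3, 2), (4, 3), (5, 3), (6, 2), (7, 1)]

def immediate_children(table, node_id):
    """Return the IDs of every row whose parent column holds node_id."""
    return [row_id for row_id, parent_id in table if parent_id == node_id]

children = immediate_children(TABLE, 2)
print(children)
```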
In block 712, if a user wants to perform the ‘All Children’ query, the process continues to block 714. If not, the process moves to block 720. All of the children of a specific node, i.e. a subtree located beneath the specific node, are found by first determining the ID of the sibling of the specific node. If the node has more than one sibling, the sibling having the next highest ID number (after the specific node) will be used. The sibling of the specific node can be determined by first determining the parent of the specific node in block 714 (see block 704). For example, if a user wanted to find all the children of the node 2, it is first determined that the parent of node 2 is node 1 by referencing the appropriate column in the table 300. In block 716, the sibling of the node is determined by searching for the next highest ID number that is also a child of node 1. Here, the only sibling of node 2 is node 7. As mentioned above, it is a characteristic of the DFS ordering that all of the children of a specific node will be assigned IDs before the sibling of that node is. So, as determined in block 718, the children of node 2 must be the nodes 3, 4, 5 and 6.
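The ‘All Children’ query can be sketched in Python as a range lookup bounded by the next sibling's ID. The `TABLE` list and function names are hypothetical. The text does not say what happens when the node has no later sibling; the fallback used here, which climbs to the nearest ancestor that has a later sibling and otherwise runs to the end of the table, is an assumption.

```python
# (ID, parent ID) pairs; None for the root's parent is an assumption.
TABLE = [(1, None), (2, 1), (3, 2), (4, 3), (5, 3), (6, 2), (7, 1)]

def next_sibling(table, node_id):
    """Blocks 714 and 716: find the parent, then the next-highest sibling ID."""
    parents = dict(table)
    parent = parents[node_id]
    later = [i for i, p in table if p == parent and i > node_id]
    return min(later) if later else None

def all_children(table, node_id):
    """Block 718: every ID between the node and its next sibling is a descendant."""
    parents = dict(table)
    upper = max(parents) + 1          # assumed bound when no later sibling exists
    current = node_id
    while current is not None:        # assumed fallback: climb toward the root
        sibling = next_sibling(table, current)
        if sibling is not None:
            upper = sibling
            break
        current = parents[current]
    return [i for i in sorted(parents) if node_id < i < upper]

subtree = all_children(TABLE, 2)
print(subtree)
```

No recursion over the tree is needed: a single comparison against two ID bounds selects the whole subtree, which is the benefit of the DFS numbering.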
In block 720, if a user wants to perform an ‘Ancestors’ query, the process continues to block 722. An ancestors query determines a node's parent, grandparent, etc. until the root node is reached. In block 722, the parent query is performed to determine the parent of the specified node. For example, if we want to find the ancestors of node 5, we first determine that the node 3 is the parent of the node 5. Next, in block 724, it is determined what the parent of the parent of the requested node is; here the parent of node 3 is node 2. This process continues until the parent is the node 1, or the root. The result of an ancestors query on the node 5 would be 3, 2, 1. The ancestors query requires a number of requests that depends on the location of the requested node within the tree, and that is at most equal to the depth of the tree.
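An ‘Ancestors’ query is a repeated parent lookup, as in the Python sketch below. The `TABLE` list is a hypothetical stand-in for the table, and marking the root with a parent of `None` is an assumption used to stop the loop.

```python
# (ID, parent ID) pairs; None for the root's parent is an assumption.
TABLE = [(1, None), (2, 1), (3, 2), (4, 3), (5, 3), (6, 2), (7, 1)]

def ancestors(table, node_id):
    """Follow the parent column upward until the root has been reported."""
    parents = dict(table)
    chain = []
    current = parents[node_id]        # block 722: parent of the requested node
    while current is not None:
        chain.append(current)
        current = parents[current]    # block 724: parent of the parent
    return chain

lineage = ancestors(TABLE, 5)
print(lineage)
```

The loop performs one lookup per level between the requested node and the root.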
The techniques introduced above have been described in the context of a NAS environment. However, these techniques can also be applied in various other contexts. For example, the techniques introduced above can be applied in a storage area network (SAN) environment. A SAN is a highly efficient network of interconnected, shared storage devices. One difference between NAS and SAN is that in a SAN, the storage server (which may be an appliance) provides a remote host with block-level access to stored data, whereas in a NAS configuration, the storage server provides clients with file-level access to stored data. Thus, the techniques introduced above are not limited to use in a file server or in a NAS environment.
This invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident to persons having the benefit of this disclosure that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. The specification and drawings are accordingly to be regarded in an illustrative, rather than in a restrictive sense.