Loading file system information into PostgreSQL

« previous entry | next entry »
Dec. 18th, 2008 | 11:03 am

GNU findutils is extremely useful. Unfortunately, the find command doesn't always scale well. For instance, running find on the entire machine can take a long time. And should you learn something that requires modifying your find expression, you have to start the command all over again from the beginning.

The sister program, locate, command helps by being much faster. However, it is missing the expressiveness of find. The locate command should support all the features of find -- save for maybe the -exec expression for security reasons.

Even still, the query syntax for find commands are also not scalable or very user-friendly.

I always wanted to import file system meta information into a database, and use SQL queries to find information about the file system. SQL has its own set of problems, but it would make asking questions about the files on a computer much more worthwhile and maybe even a bit exciting. Very interesting queries could be made, and they would be answered very quickly -- without having to wait.

So, I finally gave it a try. I describe here how I was able to load the results of find into PostgreSQL

The -printf expression of findutils was great for this task. It can generate the information about the file system. And it can have its output formatted to a tab-delimited file.

The following use of the find command makes a text file with entries for every file in your user directory.

$ find ~ -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%AY-%Am-� %AH:%AM:%AS\t%TY-%Tm-%Td %TH:%TM:%TS\t%CY-%Cm-� %CH:%CM:%CS\t%F\t%D\n' > finddb.txt

It's a pretty hideous command. These are all the fields it produces in the output:

  • inode
  • name
  • wholename
  • path
  • type (file, link, directory, device, ...)
  • user (symbolic)
  • user_id (number)
  • group
  • group_id
  • perm (e.g. "-rw-rw-r--")
  • perm_octal (e.g. "0664")
  • bytes
  • blocks (512-byte blocks used of disk)
  • kblocks (1k-blocks)
  • symlink
  • links (number of hard links)
  • atime (last access time)
  • mtime (last modification)
  • ctime (last time modified or status changed)
  • fstype (e.g. "ext3")
  • dev_id (device number)

These represent the header fields in the output. Later, they will represent the table columns in the database.

Unfortunately, the output of find is one-file-per-line and tab-delimited. That means files with newline or tab characters in their names won't cooperate. A suboptimal solution is to just ignore those files on the system. That's easy to do with the find command.

$ find / ! -regex ".*[$(echo -ne '\n\t')].*" -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%AY-%Am-� %AH:%AM:%AS\t%TY-%Tm-%Td %TH:%TM:%TS\t%CY-%Cm-� %CH:%CM:%CS\t%F\t%D\n'

Better yet, complain to the user on standard error (STDERR) every time the find command comes across one of these rare files.

$ find / \( -regex ".*[$(echo -ne '\n\t')].*" -exec sh -c 'echo >&2 "$0": File name has tab or newline' '{}' \; \) -o -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%AY-%Am-� %AH:%AM:%AS\t%TY-%Tm-%Td %TH:%TM:%TS\t%CY-%Cm-� %CH:%CM:%CS\t%F\t%D\n' > finddb.txt

If you want to know if your system has these wickedly named files, run the following locate command.

$ locate -r ".*[$(echo -ne '\n\t')].*"

To convert the file to a comma-separated value (CSV) file, I like to use Perl.

$ perl -mText::CSV_XS -e 'my $csv = Text::CSV_XS->new({ binary => 1, eol => $/ });' -ne 'chomp; split(/\t/); if ($csv->combine(@_)) {print $csv->string;} else {printf STDERR $csv->error_input;}' finddb.txt > finddb.csv

This is a table definition for PostgreSQL that can be loaded with the CSV or tab-delimited text file.

CREATE TABLE finddb (
    inode bigint NOT NULL,
    name text DEFAULT '' NOT NULL,
    wholename text DEFAULT '' NOT NULL,
    PRIMARY KEY (inode, wholename),
    path text DEFAULT '' NOT NULL,
    type character varying(1) NOT NULL,
    "user" text DEFAULT '' NOT NULL,
    user_id integer NOT NULL,
    "group" text DEFAULT '' NOT NULL,
    group_id integer NOT NULL,
    perm character varying(10) DEFAULT '' NOT NULL,
    perm_octal character varying(6) DEFAULT '' NOT NULL,
    bytes bigint DEFAULT 0 NOT NULL,
    blocks bigint DEFAULT 0 NOT NULL,
    kblocks bigint DEFAULT 0 NOT NULL,
    symlink text DEFAULT '' NOT NULL,
    links integer DEFAULT 0 NOT NULL,
    atime timestamp without time zone
          DEFAULT '1970-01-01 00:00:00' NOT NULL,
    mtime timestamp without time zone
          DEFAULT '1970-01-01 00:00:00' NOT NULL,
    ctime timestamp without time zone
          DEFAULT '1970-01-01 00:00:00' NOT NULL,
    fstype text DEFAULT '' NOT NULL,
    dev_id integer NOT NULL
);

Loading the text file into PostgreSQL is as easy as:

$ psql -c '\copy finddb from STDIN' < finddb.txt

for the CSV file:

$ psql -c '\copy finddb from STDIN CSV FORCE NOT NULL path, symlink' < finddb.csv

I did come across a few names on a file system that -- I believe -- Postgres would complain about, because of improperly encoded characters. Postgres on my system expects everything to be UTF-8 encoded. According to James Youngman, the maintainer of GNU findutils,

Character encoding is of course a significant problem. The Unix file system API offers no way to record the character encoding in effect at the time the file is created/renamed, so files on a file system will often have differing encodings.

After the load is completed, here's an example query and a returned row.

=> SELECT * FROM finddb WHERE wholename = '/home/aaronh/.emacs';
-[ RECORD 1 ]--------------------
inode        | 18842194
name         | .emacs
wholename    | /home/aaronh/.emacs
path         | /home/aaronh
type         | f
user         | aaronh
user_id      | 500
group        | aaronh
group_id     | 500
perm         | -rw-rw-r--
perm_octal   | 664
bytes         | 2884
blocks       | 8
kblocks      | 4
symlink      | 
links        | 1
atime        | 2008-11-04 12:26:46
mtime        | 2008-10-24 13:10:24
ctime        | 2008-10-24 13:10:24
fstype       | ext3

The following are some more examples of queries on this database table.

This query finds the 5 largest graphic files that were last modified in 2007, but ignores the auxiliary files of many a version control system.

SELECT wholename, bytes, mtime
FROM finddb
WHERE "type" = 'f' AND "name" ~ '.jpe?g'
      AND path not like '%/.svn/%'
      AND path not like '%/.git/%'
      AND path not like '%/.hg/%'
      AND path not like '%/.bzr/%'
      AND path not like '%/{arch}/%'
      AND path not like E'%/\\_darcs/%'
      AND mtime >= TIMESTAMP '2007-01-01 00:00:00'
      AND mtime <= TIMESTAMP '2007-12-31 23:59:59'
ORDER BY bytes DESC
LIMIT 5;

This query shows every user owning a file on the system, with the the total megabytes used, and with the biggest users first in the list.

SELECT "user", SUM(kblocks) / 1000.0 AS "mbytes"
FROM finddb
GROUP BY "user"
ORDER BY SUM(bytes) DESC;

This query tries to mimic the output of the ls -l / command.

SELECT perm, links, "user", "group", bytes, mtime, name
FROM finddb
WHERE path = '' AND name NOT LIKE '.%' ORDER BY name;
    perm    | links | user | group | bytes |        mtime        |    name    
------------ ------- ------ ------- ------- --------------------- ------------
 drwxr-xr-x |     3 | root | root  |  4096 | 2008-09-22 05:17:44 | backup
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-10-29 18:30:19 | bin
 drwxr-xr-x |     5 | root | root  |  1024 | 2008-10-28 12:43:47 | boot
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-09-10 04:11:52 | cdrom
 drwxr-xr-x |    13 | root | root  |  4460 | 2008-11-12 21:58:07 | dev
 drwxr-xr-x |   117 | root | root  |  8192 | 2008-11-12 21:57:53 | etc
 drwxr-xr-x |     5 | root | root  |  4096 | 2008-11-07 16:20:54 | home
 drwxr-xr-x |    16 | root | root  |  8192 | 2008-10-29 18:29:59 | lib
 drwx------ |     2 | root | root  | 16384 | 2008-09-10 03:05:56 | lost found
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-11-12 21:57:28 | media
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-04-07 17:44:40 | mnt
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-04-07 17:44:40 | opt
 dr-xr-xr-x |   107 | root | root  |     0 | 2008-11-12 21:55:52 | proc
 drwxr-x--- |     6 | root | root  |  4096 | 2008-11-12 20:02:29 | root
 drwxr-xr-x |     2 | root | root  |  8192 | 2008-10-29 18:30:18 | sbin
 drwxr-xr-x |     7 | root | root  |     0 | 2008-11-12 21:55:52 | selinux
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-04-07 17:44:40 | srv
 drwxr-xr-x |    11 | root | root  |     0 | 2008-11-12 21:55:52 | sys
 drwxrwxrwt |    74 | root | root  |  4096 | 2008-11-13 00:56:28 | tmp
 drwxr-xr-x |    13 | root | root  |  4096 | 2008-09-10 03:15:15 | usr
 drwxr-xr-x |    21 | root | root  |  4096 | 2008-09-24 06:53:23 | var

Sending queries against the data is loads of fun, but it really needs some improvements to match the strength of findutils matching expressions -- for example, the permissions matching rules and the -empty predicate. Some new tables with alternative perspectives on the data could accommodate better queries.

In a follow-up, I will present on how to handle all possible file names by using a null-delimited file rather than a tab-delimited one. There may be a piece on loading into MySQL. And in the last piece, I'll give more tastings on possible queries can be made of this file system information and provide the script that helps me manage the database loads from find.

Link | Leave a comment | Add to Memories | Tell a Friend

Comments {0}