Loading file system information into PostgreSQL
« previous entry | next entry »
Dec. 18th, 2008 | 11:03 am
GNU findutils is extremely useful. Unfortunately, the find command doesn't always scale well. For instance, running find on the entire machine can take a long time. And should you learn something that requires modifying your find expression, you have to start the command all over again from the beginning.
The sister program, locate, command helps by being much faster. However, it is missing the expressiveness of find. The locate command should support all the features of find -- save for maybe the -exec expression for security reasons.
Even still, the query syntax for find commands are also not scalable or very user-friendly.
I always wanted to import file system meta information into a database, and use SQL queries to find information about the file system. SQL has its own set of problems, but it would make asking questions about the files on a computer much more worthwhile and maybe even a bit exciting. Very interesting queries could be made, and they would be answered very quickly -- without having to wait.
So, I finally gave it a try. I describe here how I was able to load the results of find into PostgreSQL
The -printf expression of findutils was great for this task. It can generate the information about the file system. And it can have its output formatted to a tab-delimited file.
The following use of the find command makes a text file with entries for every file in your user directory.
$ find ~ -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%A
It's a pretty hideous command. These are all the fields it produces in the output:
- inode
- name
- wholename
- path
- type (file, link, directory, device, ...)
- user (symbolic)
- user_id (number)
- group
- group_id
- perm (e.g. "-rw-rw-r--")
- perm_octal (e.g. "0664")
- bytes
- blocks (512-byte blocks used of disk)
- kblocks (1k-blocks)
- symlink
- links (number of hard links)
- atime (last access time)
- mtime (last modification)
- ctime (last time modified or status changed)
- fstype (e.g. "ext3")
- dev_id (device number)
These represent the header fields in the output. Later, they will represent the table columns in the database.
Unfortunately, the output of find is one-file-per-line and tab-delimited. That means files with newline or tab characters in their names won't cooperate. A suboptimal solution is to just ignore those files on the system. That's easy to do with the find command.
$ find / ! -regex ".*[$(echo -ne '\n\t')].*" -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%A
Better yet, complain to the user on standard error (STDERR) every time the find command comes across one of these rare files.
$ find / \( -regex ".*[$(echo -ne '\n\t')].*" -exec sh -c 'echo >&2 "$0": File name has tab or newline' '{}' \; \) -o -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%A
If you want to know if your system has these wickedly named files, run the following locate command.
$ locate -r ".*[$(echo -ne '\n\t')].*"
To convert the file to a comma-separated value (CSV) file, I like to use Perl.
$ perl -mText::CSV_XS -e 'my $csv = Text::CSV_XS->new({ binary => 1, eol => $/ });' -ne 'chomp; split(/\t/); if ($csv->combine(@_)) {print $csv->string;} else {printf STDERR $csv->error_input;}' finddb.txt > finddb.csv
This is a table definition for PostgreSQL that can be loaded with the CSV or tab-delimited text file.
CREATE TABLE finddb (
inode bigint NOT NULL,
name text DEFAULT '' NOT NULL,
wholename text DEFAULT '' NOT NULL,
PRIMARY KEY (inode, wholename),
path text DEFAULT '' NOT NULL,
type character varying(1) NOT NULL,
"user" text DEFAULT '' NOT NULL,
user_id integer NOT NULL,
"group" text DEFAULT '' NOT NULL,
group_id integer NOT NULL,
perm character varying(10) DEFAULT '' NOT NULL,
perm_octal character varying(6) DEFAULT '' NOT NULL,
bytes bigint DEFAULT 0 NOT NULL,
blocks bigint DEFAULT 0 NOT NULL,
kblocks bigint DEFAULT 0 NOT NULL,
symlink text DEFAULT '' NOT NULL,
links integer DEFAULT 0 NOT NULL,
atime timestamp without time zone
DEFAULT '1970-01-01 00:00:00' NOT NULL,
mtime timestamp without time zone
DEFAULT '1970-01-01 00:00:00' NOT NULL,
ctime timestamp without time zone
DEFAULT '1970-01-01 00:00:00' NOT NULL,
fstype text DEFAULT '' NOT NULL,
dev_id integer NOT NULL
);
Loading the text file into PostgreSQL is as easy as:
$ psql -c '\copy finddb from STDIN' < finddb.txt
for the CSV file:
$ psql -c '\copy finddb from STDIN CSV FORCE NOT NULL path, symlink' < finddb.csv
I did come across a few names on a file system that -- I believe -- Postgres would complain about, because of improperly encoded characters. Postgres on my system expects everything to be UTF-8 encoded. According to James Youngman, the maintainer of GNU findutils,
Character encoding is of course a significant problem. The Unix file system API offers no way to record the character encoding in effect at the time the file is created/renamed, so files on a file system will often have differing encodings.
After the load is completed, here's an example query and a returned row.
=> SELECT * FROM finddb WHERE wholename = '/home/aaronh/.emacs'; -[ RECORD 1 ]-------------------- inode | 18842194 name | .emacs wholename | /home/aaronh/.emacs path | /home/aaronh type | f user | aaronh user_id | 500 group | aaronh group_id | 500 perm | -rw-rw-r-- perm_octal | 664 bytes | 2884 blocks | 8 kblocks | 4 symlink | links | 1 atime | 2008-11-04 12:26:46 mtime | 2008-10-24 13:10:24 ctime | 2008-10-24 13:10:24 fstype | ext3
The following are some more examples of queries on this database table.
This query finds the 5 largest graphic files that were last modified in 2007, but ignores the auxiliary files of many a version control system.
SELECT wholename, bytes, mtime
FROM finddb
WHERE "type" = 'f' AND "name" ~ '.jpe?g'
AND path not like '%/.svn/%'
AND path not like '%/.git/%'
AND path not like '%/.hg/%'
AND path not like '%/.bzr/%'
AND path not like '%/{arch}/%'
AND path not like E'%/\\_darcs/%'
AND mtime >= TIMESTAMP '2007-01-01 00:00:00'
AND mtime <= TIMESTAMP '2007-12-31 23:59:59'
ORDER BY bytes DESC
LIMIT 5;
This query shows every user owning a file on the system, with the the total megabytes used, and with the biggest users first in the list.
SELECT "user", SUM(kblocks) / 1000.0 AS "mbytes" FROM finddb GROUP BY "user" ORDER BY SUM(bytes) DESC;
This query tries to mimic the output of the ls -l / command.
SELECT perm, links, "user", "group", bytes, mtime, name
FROM finddb
WHERE path = '' AND name NOT LIKE '.%' ORDER BY name;
perm | links | user | group | bytes | mtime | name
------------ ------- ------ ------- ------- --------------------- ------------
drwxr-xr-x | 3 | root | root | 4096 | 2008-09-22 05:17:44 | backup
drwxr-xr-x | 2 | root | root | 4096 | 2008-10-29 18:30:19 | bin
drwxr-xr-x | 5 | root | root | 1024 | 2008-10-28 12:43:47 | boot
drwxr-xr-x | 2 | root | root | 4096 | 2008-09-10 04:11:52 | cdrom
drwxr-xr-x | 13 | root | root | 4460 | 2008-11-12 21:58:07 | dev
drwxr-xr-x | 117 | root | root | 8192 | 2008-11-12 21:57:53 | etc
drwxr-xr-x | 5 | root | root | 4096 | 2008-11-07 16:20:54 | home
drwxr-xr-x | 16 | root | root | 8192 | 2008-10-29 18:29:59 | lib
drwx------ | 2 | root | root | 16384 | 2008-09-10 03:05:56 | lost found
drwxr-xr-x | 2 | root | root | 4096 | 2008-11-12 21:57:28 | media
drwxr-xr-x | 2 | root | root | 4096 | 2008-04-07 17:44:40 | mnt
drwxr-xr-x | 2 | root | root | 4096 | 2008-04-07 17:44:40 | opt
dr-xr-xr-x | 107 | root | root | 0 | 2008-11-12 21:55:52 | proc
drwxr-x--- | 6 | root | root | 4096 | 2008-11-12 20:02:29 | root
drwxr-xr-x | 2 | root | root | 8192 | 2008-10-29 18:30:18 | sbin
drwxr-xr-x | 7 | root | root | 0 | 2008-11-12 21:55:52 | selinux
drwxr-xr-x | 2 | root | root | 4096 | 2008-04-07 17:44:40 | srv
drwxr-xr-x | 11 | root | root | 0 | 2008-11-12 21:55:52 | sys
drwxrwxrwt | 74 | root | root | 4096 | 2008-11-13 00:56:28 | tmp
drwxr-xr-x | 13 | root | root | 4096 | 2008-09-10 03:15:15 | usr
drwxr-xr-x | 21 | root | root | 4096 | 2008-09-24 06:53:23 | var
Sending queries against the data is loads of fun, but it really needs some improvements to match the strength of findutils matching expressions -- for example, the permissions matching rules and the -empty predicate. Some new tables with alternative perspectives on the data could accommodate better queries.
In a follow-up, I will present on how to handle all possible file names by using a null-delimited file rather than a tab-delimited one. There may be a piece on loading into MySQL. And in the last piece, I'll give more tastings on possible queries can be made of this file system information and provide the script that helps me manage the database loads from find.