Sorting UTF-8 strings in PHP

May. 28th, 2009 | 03:23 pm

With Unicode text, in this case the popular UTF-8 encoding, you sometimes need to convert characters to ASCII to get things done in PHP. For sorting Unicode, there are the existing solutions of collator_sort() for PHP 5 and strcoll() since PHP 4. However, both assume a locale. A locale-agnostic hack would just "normalize" Unicode characters to ASCII.

This is far from complete, but seems to do the right thing.

    <?php

    /**
     * Normalize international characters for purposes like sorting and
     * searching, using a heuristic that maps them to ASCII (English
     * alphabet ordering) as a multilingual solution with no locale setting.
     */
    header("Content-type: text/plain; charset=utf-8");

    /**
     * Iñtërnâtiônàlizætiøn
     *
     * Example from Sam Ruby
     * http://intertwingly.net/stories/2004/04/14/i18n.html
     * 
     * By way of WACT team
     * http://www.phpwact.org/php/i18n/charsets
     */
    $internationalization = array(
                                  "I", // I
                                  "\xC3\xB1", // ñ
                                  "t", // t
                                  "\xC3\xAB", // ë
                                  "r", // r
                                  "n", // n
                                  "\xC3\xA2", // â
                                  "t", // t
                                  "i", // i
                                  "\xC3\xB4", // ô
                                  "n", // n
                                  "\xC3\xA0", // à
                                  "l", // l
                                  "i", // i
                                  "z", // z
                                  "\xC3\xA6", // æ
                                  "t", // t
                                  "i", // i
                                  "\xC3\xB8", // ø
                                  "n"); // n
    
    /** 
     * Use strtr() with this dictionary to convert to ASCII.
     * This data structure is not comprehensive.
     */
    $utf8_dict = array("\xC3\x80" => "A", // À
                       "\xC3\x81" => "A", // Á
                       "\xC3\x82" => "A", // Â
                       "\xC3\x83" => "A", // Ã
                       "\xC3\x84" => "A", // Ä
                       "\xC3\x85" => "A", // Å
                       "\xC3\x86" => "A", // Æ
                       "\xC3\x9E" => "B", // Þ
                       "\xC3\x87" => "C", // Ç
                       "\xC4\x86" => "C", // Ć
                       "\xC4\x8C" => "C", // Č
                       "\xC4\x90" => "Dj", // Đ
                       "\xC3\x88" => "E", // È
                       "\xC3\x89" => "E", // É
                       "\xC3\x8A" => "E", // Ê
                       "\xC3\x8B" => "E", // Ë
                       "\xC4\x9E" => "G", // Ğ
                       "\xC3\x8C" => "I", // Ì
                       "\xC3\x8D" => "I", // Í
                       "\xC3\x8E" => "I", // Î
                       "\xC3\x8F" => "I", // Ï
                       "\xC4\xB0" => "I", // İ
                       "\xC3\x91" => "N", // Ñ
                       "\xC3\x92" => "O", // Ò
                       "\xC3\x93" => "O", // Ó
                       "\xC3\x94" => "O", // Ô
                       "\xC3\x95" => "O", // Õ
                       "\xC3\x96" => "O", // Ö
                       "\xC3\x98" => "O", // Ø
                       "\xC3\x9F" => "Ss", // ß
                       "\xC3\x99" => "U", // Ù
                       "\xC3\x9A" => "U", // Ú
                       "\xC3\x9B" => "U", // Û
                       "\xC3\x9C" => "U", // Ü
                       "\xC3\x9D" => "Y", // Ý
                       "\xC3\xA0" => "a", // à
                       "\xC3\xA1" => "a", // á
                       "\xC3\xA2" => "a", // â
                       "\xC3\xA3" => "a", // ã
                       "\xC3\xA4" => "a", // ä
                       "\xC3\xA5" => "a", // å
                       "\xC3\xA6" => "a", // æ
                       "\xC3\xBE" => "b", // þ
                       "\xC3\xA7" => "c", // ç
                       "\xC4\x87" => "c", // ć
                       "\xC4\x8D" => "c", // č
                       "\xC4\x91" => "dj", // đ
                       "\xC3\xA8" => "e", // è
                       "\xC3\xA9" => "e", // é
                       "\xC3\xAA" => "e", // ê
                       "\xC3\xAB" => "e", // ë
                       "\xC3\xAC" => "i", // ì
                       "\xC3\xAD" => "i", // í
                       "\xC3\xAE" => "i", // î
                       "\xC3\xAF" => "i", // ï
                       "\xC3\xB0" => "o", // ð
                       "\xC3\xB1" => "n", // ñ
                       "\xC3\xB2" => "o", // ò
                       "\xC3\xB3" => "o", // ó
                       "\xC3\xB4" => "o", // ô
                       "\xC3\xB5" => "o", // õ
                       "\xC3\xB6" => "o", // ö
                       "\xC3\xB8" => "o", // ø
                       "\xC5\x94" => "R", // Ŕ
                       "\xC5\x95" => "r", // ŕ
                       "\xC5\xA0" => "S", // Š
                       "\xC5\x9E" => "S", // Ş
                       "\xC5\xA1" => "s", // š
                       "\xC3\xB9" => "u", // ù
                       "\xC3\xBA" => "u", // ú
                       "\xC3\xBB" => "u", // û
                       "\xC3\xBC" => "u", // ü
                       "\xC3\xBD" => "y", // ý
                       "\xC3\xBD" => "y", // ý
                       "\xC3\xBF" => "y", // ÿ
                       "\xC5\xBD" => "Z", // Ž
                       "\xC5\xBE" => "z"); // ž
    
    $i18n = join("", $internationalization);
    print $i18n . "\n";

    /**
     * UTF-8 regular expression from
     * http://php.net/manual/en/function.utf8-decode.php (comment 57069)
     */
    $utf8_re = "/^([\\x00-\\x7f]|"
      . "[\\xc2-\\xdf][\\x80-\\xbf]|"
      . "\\xe0[\\xa0-\\xbf][\\x80-\\xbf]|"
      . "[\\xe1-\\xec][\\x80-\\xbf]{2}|"
      . "\\xed[\\x80-\\x9f][\\x80-\\xbf]|"
      . "\\xef[\\x80-\\xbf][\\x80-\\xbc]|"
      . "\\xee[\\x80-\\xbf]{2}|"
      . "\\xf0[\\x90-\\xbf][\\x80-\\xbf]{2}|"
      . "[\\xf1-\\xf3][\\x80-\\xbf]{3}|"
      . "\\xf4[\\x80-\\x8f][\\x80-\\xbf]{2})*$/";

    print "Valid UTF-8?: " . (preg_match($utf8_re, $i18n) > 0
                              ? "true" : "false") . "\n";

    print strtr($i18n, $utf8_dict) . "\n";

    // Doesn't work in PHP4?
    $sorted = preg_split("//u", $i18n, -1, PREG_SPLIT_NO_EMPTY);
    // So, just use the original array, instead.
    $sorted = $internationalization;

    function compare($s1, $s2)
    {
      global $utf8_dict;
      return strcasecmp(strtr($s1, $utf8_dict),
                        strtr($s2, $utf8_dict));
    }

    usort($sorted, "compare");
    print join("", $sorted) . "\n";

    /**
     * Results:
     * 
     * Iñtërnâtiônàlizætiøn
     * Valid UTF-8?: true
     * Internationalization
     * àæâëIiiilñnnnøôrtttz
     */
    ?>

I tried the I18N_UnicodeNormalizer from the PHP PEAR project, and it didn't do what I wanted.

    <?php

    require_once('I18N/UnicodeNormalizer.php');

    print I18N_UnicodeNormalizer::toNFD($i18n) . "\n";
    print I18N_UnicodeNormalizer::toNFC($i18n) . "\n";
    ?>

There's a good chance I don't know what I'm doing there with the PEAR library, however.

Unicode hex in PHP string

May. 27th, 2009 | 08:32 pm

In Emacs, this command inserts the UTF-8 bytes of the character at point as hex escapes suitable for a PHP string.

(defun php-hex-for-char ()
  "Insert the UTF-8 bytes of the character at point as PHP \\xNN escapes."
  (interactive)
  (insert
   (mapconcat (lambda (x) (format "\\x%02X" x))
              (encode-coding-char (char-after (point)) 'utf-8)
              "")))

Lisp lifted from `describe-char' and `encoded-string-description'.
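
Outside Emacs, roughly the same escapes can be produced in the shell. This is only a sketch, assuming a POSIX shell with od, tr and sed; the function name php_hex is made up for illustration.

```shell
# php_hex STRING
# Print the bytes of STRING as PHP-style \xNN escapes.
# (php_hex is a hypothetical helper name, not part of any tool.)
php_hex() {
    printf '%s' "$1" \
     | od -A n -v -t x1 \
     | tr -d ' \n' \
     | tr 'abcdef' 'ABCDEF' \
     | sed 's/../\\x&/g'
}
```

In a UTF-8 terminal, php_hex ñ prints \xC3\xB1.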

Shell hack: Files with some DOS lines

May. 19th, 2009 | 11:22 am

I came across a project whose source code contains both DOS text files and Unix text files. Some of the Unix files contain lines ending in carriage returns. Then again, perhaps they were DOS files with some Unix line endings! I wanted to suggest converting the files with mixed line endings to Unix.

Sometimes, the file command is helpful for showing what files have a mixed end of line style, but not always. For example, the file command will say "ASCII C program text, with CRLF, LF line terminators". That's perfect. However, sometimes the command just says, "PHP script text".

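I wrote this find expression to get files that contain DOS carriage returns, but are not entirely DOS files. (Here ^V^M stands for typing Ctrl-V followed by Ctrl-M, which inserts a literal carriage return at the shell prompt.)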

$ find -type f -execdir grep -qe '^V^M$' {} \; \
       ! -execdir awk 'BEGIN{is_dos=1;}!/\r$/{is_dos=0}END{exit(!is_dos);}' {} \; \
       -print

The above doesn't quite work, since the last line of many DOS files has no line terminator at all (neither a carriage return nor a newline), whereas Unix text files end in a newline.

Awk still counts that last line as a line, but since it has no carriage return, the file is not considered a DOS file by the logic I've written. This results in a false negative for the DOS test.

This change to the Awk script makes this hack work as it should.

$ find -type f -execdir grep -qe '^V^M$' {} \; \
       ! -execdir awk 'BEGIN{is_dos=1;}
                       !/\r$/ && is_dos{is_dos=0;n=NR}
                       END{exit(!is_dos && n != NR);}' {} \; \
       -print
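
To check a single file by hand, the same idea can be wrapped in a small function. This is a sketch rather than a drop-in replacement for the find expression; eol_style is a hypothetical name, and it assumes an awk that understands \r in regular expressions.

```shell
# eol_style FILE
# Print "dos", "unix", "mixed" or "empty" for FILE, based on how many of
# its lines end in a carriage return.  (eol_style is a hypothetical name.)
eol_style() {
    awk '
        /\r$/ { crlf++; next }
              { plain++; last_plain = NR }
        END {
            # Forgive one CR-less line if it is the very last line, since
            # DOS files often lack a final line terminator.
            if (plain == 1 && last_plain == NR && crlf > 0) plain = 0
            if (NR == 0)         print "empty"
            else if (crlf == 0)  print "unix"
            else if (plain == 0) print "dos"
            else                 print "mixed"
        }' "$1"
}
```

Only the files reported as "mixed" are candidates for conversion.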

Change log entries for HTML files

May. 5th, 2009 | 01:01 pm

Someone asked me if there was a good way to annotate the changes of an HTML file. It sounded like the person had to maintain some legacy, HTML-hell, home-brewed, template files for some business Web site.

I suggested using the ChangeLog support of Emacs, and using HTML comments to organize sections of an HTML source file. Here's a simple, made-up example of such an HTML file.

<html>
<head>
<title>Sample only</title>
</head>
<body>
<!-- begin header -->
<p>[ <a id="top" href="#bottom">bottom</a> ]</p>
<!-- end header -->
<h1>Sample title</h1>
<!-- BEGIN: PAGE_CONTENT -->
<div>
<p>Testing.</p>
</div>
<!-- END: PAGE_CONTENT --
  -- footer-bottom start -->
<p>[ <a id="bottom" href="#top">top</a> ]</p>
<!-- footer-bottom end -->
</body>
</html>

Unfortunately, the HTML mode that ships with Emacs supports neither the sectioning style above nor any alternative. This is understandable because there is no consistent standard for doing this, and people use variations beyond even those covered in the example. Not to mention, HTML comments are used for purposes other than naming regions of a file.

Regardless, I've put together the following regular expression for add-log-current-defun-header-regexp. It handles the cases in the example above. It is set for all buffers using HTML mode. Just put the following in your .emacs file.

    (add-hook 'html-mode-hook
        (lambda ()
          (make-local-variable
           'add-log-current-defun-header-regexp)
          (setq add-log-current-defun-header-regexp
                (concat "^[ \t]*<?!?--[ \t]*\\(?:begin\\|BEGIN\\|start\\)?"
                        "[ \t:]*\\([-_[:alnum:]]+\\)"
                        "[ \t]*\\(?:begin\\|BEGIN\\|start\\)?[ \t]*--"))))

Use it by typing `C-x 4 a' (add-change-log-entry-other-window). An entry like the following will be added in a nearby ChangeLog file:

2009-05-05  Aaron S. Hawley  <aaronhawley@livejournal.com>

        * file.html (PAGE_CONTENT): Add a test paragraph.
        (footer-bottom): Added link to "#top".

This setup will work for most cases except for scenarios where there is nested sectioning or where you've run `C-x 4 a' from a point outside of a "section" and get a false-positive.

Shell hack: Avoiding built-ins

May. 1st, 2009 | 04:40 am

To avoid using a builtin command of a Bourne or Bash shell in a shell script, one can use the full path of the executable command. For example, rather than

$ echo Hello, World\!
Hello, World!

you could write

$ /bin/echo Hello, World\!
Hello, World!

Here's a way to show the difference--and make fun of the GNU coding standards at the same time.

$ echo --version
--version
$ /bin/echo --version
echo (GNU coreutils) 6.12
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Brian Fox and Chet Ramey.

I prefer using exec over the full path of a command, so that the PATH environment variable is used and the script doesn't break should the full path to a binary change some day.

Unfortunately, a consequence of exec is that it runs the command in the current process and therefore will exit on completion, thus cutting short the life of your shell script. To avoid that, just wrap an exec statement in a sub-shell by using parens:

$ ( exec echo --version )
echo (GNU coreutils) 6.12
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Brian Fox and Chet Ramey.

I have never seen this written in a script before. Perhaps there's another, more canonical way to do this. The construct is redundant and contradictory: "exec something in the current shell, but also in a sub-shell". Further, it's almost always right to opt for the shell built-in; there are next to no cases where you want to avoid it. My only scenarios are timing processes in the shell.

According to the Limitations of Shell Builtins section of the GNU Autoconf manual,

When it is desired to avoid a regular shell built-in, the workaround is to use some other forwarding command, such as env or nice, that will ensure a path search:

          $ pdksh -c 'exec true --version' | head -n1

          $ pdksh -c 'nice true --version' | head -n1
          true (GNU coreutils) 6.10
          $ pdksh -c 'env true --version' | head -n1
          true (GNU coreutils) 6.10
     

That manual has everything in it. I guess I'll go with env. It doesn't sound as nice as "exec", but it's a good mnemonic, since it uses the environment's PATH variable to run the command.

$ env echo --version
echo (GNU coreutils) 6.12
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Brian Fox and Chet Ramey.
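
In a script, the env approach could be wrapped in a tiny function. This is just a sketch; the name external is made up, and what type reports differs from shell to shell.

```shell
# external COMMAND [ARG ...]
# Run COMMAND as a program found via PATH, bypassing any shell built-in
# of the same name.  (external is a hypothetical helper name.)
external() {
    env "$@"
}

type echo             # in most shells, reports that echo is a built-in
external echo hello   # runs the echo program found via PATH
```

One caveat of this sketch: env treats a leading NAME=VALUE argument as a variable assignment rather than as the command to run.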

Shell hack: Date work

Apr. 26th, 2009 | 01:21 am

I needed to make some Apache redirects for some links on a Unix users group Web site I maintain. The new site is based on a wiki, and a member of the group moved all the pages with meeting announcements by hand, using more readable page names. The old pages had the date as a four-digit year, a two-digit month, then a two-digit day (for example, 20061219). The new pages have the spelled-out weekday and month (for example, Tuesday, December 19, 2006).

Here's a sample of what I needed for the .htaccess file.

Redirect /group/meeting-20061219.html   http://host.org/group/wiki/index.php/Tuesday,_December_19,_2006
Redirect /group/meeting-20070417.html   http://host.org/group/wiki/index.php/Tuesday,_April_17,_2007
Redirect /group/meeting-20070515.html   http://host.org/group/wiki/index.php/Tuesday,_May_15,_2007
Redirect /group/meeting-20070717.html   http://host.org/group/wiki/index.php/Tuesday,_July_17,_2007
Redirect /group/meeting-20071128.html   http://host.org/group/wiki/index.php/Wednesday,_November_28,_2007
Redirect /group/meeting-20080618.html   http://host.org/group/wiki/index.php/Wednesday,_June_18,_2008

I could do this by hand, but I'd rather get a shell script to do it right the first time. I found it easy to do with an extended grep expression, Perl, awk and the date command that comes with GNU coreutils.

$ ls -1 \
  | grep -Ee '[0-9]{8}\.html$' \
  | perl -pe 's/([0-9]{4})([0-9]{2})([0-9]{2})\.html$/$&\t$1-$2-$3/' \
  | awk '{printf "%s\t", $1;
          system("date +\"%A,_%B_%-d,_%Y\" -d "  $2);}' \
  | awk '{print "Redirect", "/group/" $1,
                "http://host.org/group/wiki/index.php/" $2;}'

I'm thankful I consistently used a file naming convention with the old site.
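
The same conversion can also be written as a shell function that handles one file name at a time. This is a sketch assuming GNU date; redirect_line is a hypothetical name, and /group/ and host.org are just the sample values from above. It uses %-d (a GNU extension), because %e pads single-digit days with a space, which would end up in the URL.

```shell
# redirect_line FILE
# Print an Apache Redirect line for a page named like meeting-YYYYMMDD.html.
# (redirect_line is a hypothetical helper name; GNU date is assumed.)
redirect_line() {
    # Pull YYYY-MM-DD out of the eight digits in front of ".html".
    ymd=$(printf '%s\n' "$1" \
          | sed -n 's/.*\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\.html$/\1-\2-\3/p')
    [ -n "$ymd" ] || return 1
    # LC_ALL=C keeps the weekday and month names in English.
    page=$(LC_ALL=C date -d "$ymd" +'%A,_%B_%-d,_%Y') || return 1
    printf 'Redirect /group/%s\thttp://host.org/group/wiki/index.php/%s\n' \
           "$1" "$page"
}
```

A loop such as for f in meeting-*.html; do redirect_line "$f"; done would then produce the lines for the .htaccess file.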

Shell hack: Min function

Mar. 27th, 2009 | 03:06 pm

I couldn't find a minimum function for my shell scripting, nor a utility on GNU/Linux, so I am using this function.

###
 # min NUM ...
 #
 # Find smallest value of NUMs.
 ##
function min() {
    echo "$@" | tr '[:space:]' '\n' \
     | grep -Ee '^-?[[:digit:],]+(\.[[:digit:]]*)?$' \
     | sort -n | sed 1q
} ## end min

It supports floating-point numbers in decimal notation, but not exponential or other notations.

Example:

min 1 0.2 1,023.56 -0 -1.3

Gives: -1.3

A max function is the same thing, except for one of the following changes:

  • replace sort -n with sort -nr
  • replace sed 1q with sed '$p;d'
  • replace sed 1q with sed -n '$p'
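
Spelled out with the last of those options, a max function might look like the following sketch. Unlike min, it leaves out the thousands separator, and it sorts in the C locale so that the decimal point is always a period.

```shell
###
 # max NUM ...
 #
 # Find largest value of NUMs.
 ##
max() {
    printf '%s\n' "$@" | tr '[:space:]' '\n' \
     | grep -Ee '^-?[[:digit:]]+(\.[[:digit:]]*)?$' \
     | LC_ALL=C sort -n | sed -n '$p'
} ## end max
```

For example, max 1 0.2 -0 -1.3 gives 1.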

Load all meta data of files into PostgreSQL

Dec. 23rd, 2008 | 03:56 pm

In the previous installment, I quickly showed how to use GNU findutils to load file system meta information into a PostgreSQL database. I did so using a comma-separated value (CSV) file generated from a tab-delimited file. The use of tabs limited the data set to files without newlines or tabs in their names. Here I will show how to load any file name.

The find command can output every possible file name by separating the fields in the output with nulls, and each record with double nulls.

Here's the new -printf statement that outputs a null between fields, and two nulls between records.

$ find / -printf '%i\0%f\0%p\0%h\0%y\0%u\0%U\0%g\0%G\0%M\0%m\0%s\0%b\0%k\0%l\0%n\0%AY-%Am-%Ad %AH:%AM:%AS\0%TY-%Tm-%Td %TH:%TM:%TS\0%CY-%Cm-%Cd %CH:%CM:%CS\0%F\0%D\0\0' > finddb.txt

This Perl scriptlet will convert the output to CSV.

$ perl -mText::CSV_XS -e 'my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ }); my $n = 21; my @c = (); local $/ = "\0\0";' -ne '$_ .= "\n"; push(@c, split(/\0/)); pop(@c); if ($#c + 1 < $n) {next;} elsif ($#c + 1 > $n) {pop; if ($csv->combine(@c[0 .. $n - 1])) {print $csv->string;} else {printf STDERR $csv->error_input;} @c = @c[$n .. $#_];}' -e 'if (@c > 0) {printf STDERR ("Extra fields at the end\n");}' finddb.txt > finddb.csv

Unfortunately, double nulls don't reliably delimit records in the output, since an empty field followed by a null looks the same. So the script above keeps an internal tally of the fields in a record, with the expected count hard-coded as 21. Once that many null-delimited fields have been read, the record is complete and the next one begins.

With the resulting CSV file, loading into PostgreSQL is as easy as the following command.

$ psql -c '\copy finddb from STDIN CSV FORCE NOT NULL path, symlink' < finddb.csv

I put the CSV file generation bits together in a Perl script, find-csv.pl. It tries to maintain the consistency between GNU findutils -printf formatting fields, the names of database columns, and the Perl code for generating CSV files.

In a follow-up, I will give more tastings on possible database queries that can be made of this file system information.

Thanks to James Youngman for reading a previous version of this article. This post is the result of just one of his generous comments.

Loading file system information into PostgreSQL

Dec. 18th, 2008 | 11:03 am

GNU findutils is extremely useful. Unfortunately, the find command doesn't always scale well. For instance, running find on the entire machine can take a long time. And should you learn something that requires modifying your find expression, you have to start the command all over again from the beginning.

Its sister program, the locate command, helps by being much faster. However, it is missing the expressiveness of find. The locate command should support all the features of find -- save for maybe the -exec expression for security reasons.

Even so, the query syntax of find is neither scalable nor very user-friendly.

I always wanted to import file system meta information into a database, and use SQL queries to find information about the file system. SQL has its own set of problems, but it would make asking questions about the files on a computer much more worthwhile and maybe even a bit exciting. Very interesting queries could be made, and they would be answered very quickly -- without having to wait.

So, I finally gave it a try. I describe here how I was able to load the results of find into PostgreSQL.

The -printf expression of findutils was great for this task. It can generate the information about the file system. And it can have its output formatted to a tab-delimited file.

The following use of the find command makes a text file with entries for every file in your user directory.

$ find ~ -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%AY-%Am-%Ad %AH:%AM:%AS\t%TY-%Tm-%Td %TH:%TM:%TS\t%CY-%Cm-%Cd %CH:%CM:%CS\t%F\t%D\n' > finddb.txt

It's a pretty hideous command. These are all the fields it produces in the output:

  • inode
  • name
  • wholename
  • path
  • type (file, link, directory, device, ...)
  • user (symbolic)
  • user_id (number)
  • group
  • group_id
  • perm (e.g. "-rw-rw-r--")
  • perm_octal (e.g. "0664")
  • bytes
  • blocks (512-byte blocks used of disk)
  • kblocks (1k-blocks)
  • symlink
  • links (number of hard links)
  • atime (last access time)
  • mtime (last modification)
  • ctime (last time modified or status changed)
  • fstype (e.g. "ext3")
  • dev_id (device number)

These represent the header fields in the output. Later, they will represent the table columns in the database.

Unfortunately, the output of find is one-file-per-line and tab-delimited. That means files with newline or tab characters in their names won't cooperate. A suboptimal solution is to just ignore those files on the system. That's easy to do with the find command.

$ find / ! -regex ".*[$(echo -ne '\n\t')].*" -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%AY-%Am-%Ad %AH:%AM:%AS\t%TY-%Tm-%Td %TH:%TM:%TS\t%CY-%Cm-%Cd %CH:%CM:%CS\t%F\t%D\n'

Better yet, complain to the user on standard error (STDERR) every time the find command comes across one of these rare files.

$ find / \( -regex ".*[$(echo -ne '\n\t')].*" -exec sh -c 'echo >&2 "$0": File name has tab or newline' '{}' \; \) -o -printf '%i\t%f\t%p\t%h\t%y\t%u\t%U\t%g\t%G\t%M\t%m\t%s\t%b\t%k\t%l\t%n\t%AY-%Am-%Ad %AH:%AM:%AS\t%TY-%Tm-%Td %TH:%TM:%TS\t%CY-%Cm-%Cd %CH:%CM:%CS\t%F\t%D\n' > finddb.txt

If you want to know if your system has these wickedly named files, run the following locate command.

$ locate -r ".*[$(echo -ne '\n\t')].*"
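
The locate command only knows what is in its database. To check a single directory tree directly, a find-based sketch like the one below may do. It matches on the file's base name with -name, and uses printf rather than echo -ne for portability; odd_names is a hypothetical name.

```shell
# odd_names DIR
# List files under DIR whose names contain a tab or a newline.
# (odd_names is a hypothetical helper name.)
odd_names() {
    tab=$(printf '\t')
    nl='
'
    find "$1" \( -name "*${tab}*" -o -name "*${nl}*" \) -print
}
```

Note that a name containing a newline is itself printed across two lines.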

To convert the file to a comma-separated value (CSV) file, I like to use Perl.

$ perl -mText::CSV_XS -ne 'BEGIN { $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" }); } chomp; my @f = split(/\t/, $_, -1); if ($csv->combine(@f)) { print $csv->string; } else { print STDERR $csv->error_input, "\n"; }' finddb.txt > finddb.csv

This is a table definition for PostgreSQL that can be loaded with the CSV or tab-delimited text file.

CREATE TABLE finddb (
    inode bigint NOT NULL,
    name text DEFAULT '' NOT NULL,
    wholename text DEFAULT '' NOT NULL,
    PRIMARY KEY (inode, wholename),
    path text DEFAULT '' NOT NULL,
    type character varying(1) NOT NULL,
    "user" text DEFAULT '' NOT NULL,
    user_id integer NOT NULL,
    "group" text DEFAULT '' NOT NULL,
    group_id integer NOT NULL,
    perm character varying(10) DEFAULT '' NOT NULL,
    perm_octal character varying(6) DEFAULT '' NOT NULL,
    bytes bigint DEFAULT 0 NOT NULL,
    blocks bigint DEFAULT 0 NOT NULL,
    kblocks bigint DEFAULT 0 NOT NULL,
    symlink text DEFAULT '' NOT NULL,
    links integer DEFAULT 0 NOT NULL,
    atime timestamp without time zone
          DEFAULT '1970-01-01 00:00:00' NOT NULL,
    mtime timestamp without time zone
          DEFAULT '1970-01-01 00:00:00' NOT NULL,
    ctime timestamp without time zone
          DEFAULT '1970-01-01 00:00:00' NOT NULL,
    fstype text DEFAULT '' NOT NULL,
    dev_id integer NOT NULL
);

Loading the text file into PostgreSQL is as easy as:

$ psql -c '\copy finddb from STDIN' < finddb.txt

for the CSV file:

$ psql -c '\copy finddb from STDIN CSV FORCE NOT NULL path, symlink' < finddb.csv

I did come across a few names on a file system that -- I believe -- Postgres would complain about, because of improperly encoded characters. Postgres on my system expects everything to be UTF-8 encoded. According to James Youngman, the maintainer of GNU findutils,

Character encoding is of course a significant problem. The Unix file system API offers no way to record the character encoding in effect at the time the file is created/renamed, so files on a file system will often have differing encodings.

After the load is completed, here's an example query and a returned row.

=> SELECT * FROM finddb WHERE wholename = '/home/aaronh/.emacs';
-[ RECORD 1 ]--------------------
inode        | 18842194
name         | .emacs
wholename    | /home/aaronh/.emacs
path         | /home/aaronh
type         | f
user         | aaronh
user_id      | 500
group        | aaronh
group_id     | 500
perm         | -rw-rw-r--
perm_octal   | 664
bytes        | 2884
blocks       | 8
kblocks      | 4
symlink      | 
links        | 1
atime        | 2008-11-04 12:26:46
mtime        | 2008-10-24 13:10:24
ctime        | 2008-10-24 13:10:24
fstype       | ext3

The following are some more examples of queries on this database table.

This query finds the 5 largest graphic files that were last modified in 2007, but ignores the auxiliary files of many a version control system.

SELECT wholename, bytes, mtime
FROM finddb
WHERE "type" = 'f' AND "name" ~ E'\\.jpe?g$'
      AND path not like '%/.svn/%'
      AND path not like '%/.git/%'
      AND path not like '%/.hg/%'
      AND path not like '%/.bzr/%'
      AND path not like '%/{arch}/%'
      AND path not like E'%/\\_darcs/%'
      AND mtime >= TIMESTAMP '2007-01-01 00:00:00'
      AND mtime <= TIMESTAMP '2007-12-31 23:59:59'
ORDER BY bytes DESC
LIMIT 5;

This query shows every user owning a file on the system, with the total megabytes used, and with the biggest users first in the list.

SELECT "user", SUM(kblocks) / 1000.0 AS "mbytes"
FROM finddb
GROUP BY "user"
ORDER BY SUM(bytes) DESC;

This query tries to mimic the output of the ls -l / command.

SELECT perm, links, "user", "group", bytes, mtime, name
FROM finddb
WHERE path = '' AND name NOT LIKE '.%' ORDER BY name;
    perm    | links | user | group | bytes |        mtime        |    name    
------------+-------+------+-------+-------+---------------------+------------
 drwxr-xr-x |     3 | root | root  |  4096 | 2008-09-22 05:17:44 | backup
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-10-29 18:30:19 | bin
 drwxr-xr-x |     5 | root | root  |  1024 | 2008-10-28 12:43:47 | boot
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-09-10 04:11:52 | cdrom
 drwxr-xr-x |    13 | root | root  |  4460 | 2008-11-12 21:58:07 | dev
 drwxr-xr-x |   117 | root | root  |  8192 | 2008-11-12 21:57:53 | etc
 drwxr-xr-x |     5 | root | root  |  4096 | 2008-11-07 16:20:54 | home
 drwxr-xr-x |    16 | root | root  |  8192 | 2008-10-29 18:29:59 | lib
 drwx------ |     2 | root | root  | 16384 | 2008-09-10 03:05:56 | lost+found
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-11-12 21:57:28 | media
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-04-07 17:44:40 | mnt
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-04-07 17:44:40 | opt
 dr-xr-xr-x |   107 | root | root  |     0 | 2008-11-12 21:55:52 | proc
 drwxr-x--- |     6 | root | root  |  4096 | 2008-11-12 20:02:29 | root
 drwxr-xr-x |     2 | root | root  |  8192 | 2008-10-29 18:30:18 | sbin
 drwxr-xr-x |     7 | root | root  |     0 | 2008-11-12 21:55:52 | selinux
 drwxr-xr-x |     2 | root | root  |  4096 | 2008-04-07 17:44:40 | srv
 drwxr-xr-x |    11 | root | root  |     0 | 2008-11-12 21:55:52 | sys
 drwxrwxrwt |    74 | root | root  |  4096 | 2008-11-13 00:56:28 | tmp
 drwxr-xr-x |    13 | root | root  |  4096 | 2008-09-10 03:15:15 | usr
 drwxr-xr-x |    21 | root | root  |  4096 | 2008-09-24 06:53:23 | var

Sending queries against the data is loads of fun, but it really needs some improvements to match the strength of findutils matching expressions -- for example, the permissions matching rules and the -empty predicate. Some new tables with alternative perspectives on the data could accommodate better queries.

In a follow-up, I will show how to handle all possible file names by using a null-delimited file rather than a tab-delimited one. There may be a piece on loading into MySQL. And in the last piece, I'll give more tastings of the queries that can be made of this file system information, and provide the script that helps me manage the database loads from find.

Kickstarting a QEMU image with Fedora

Nov. 16th, 2008 | 03:31 pm

Making a QEMU image with kickstart would make it easier to build virtual Fedora systems. Consequently, this could help create a team of virtual RPM-building servers at my work.

Turn on, build everything, shut off.

rpmbuild -tb tarball

Nov. 14th, 2008 | 11:24 am

The usual way to build a package with the Red Hat Package Manager (RPM) is to run rpmbuild on the RPM spec file.

$ rpmbuild -bb package.spec

This presumes that all the source files for the RPM are already copied to the SOURCES directory accessible by RPM (%_topdir/SOURCES).

It's possible to build a source RPM (SRPM) from a spec file.

$ rpmbuild -bs package.spec

The benefit of an SRPM is that it contains all the source files necessary to rebuild the RPM.

$ rpmbuild --rebuild package-1.0-1.src.rpm

Using RPM to build an SRPM guarantees those files are included, and will put files in your SOURCES directory for you.

RPM has an additional feature where it can build an RPM from a tarball. Only a spec file needs to exist in the archive for it to work.

$ rpmbuild -tb package-1.0.tar.gz

This isn't a popular feature, nor is it very well documented. It is likely a relic of another time, when RPMs were not maintained by a distribution, but software maintainers would try and have their source packages install using RPM. This feature of RPM is made more and more obsolete with the success of large RPM-based distributions with a large and vibrant posse of packagers.

In its tarball mode, rpmbuild will find a spec file even if it's not in the top directory of the package. For example, it's not uncommon to put a spec file inside pkg/fedora.

The tarball mode of rpmbuild presumes you're in the SOURCES directory (More proof that RPM's tarball mode is probably a legacy feature). So copy the tarball to the SOURCES directory and run rpmbuild on it from there.

Most software packages simply need to be built from their source archive and don't need any additional files. However, it's not uncommon for packages to need to be specially configured by RPM on some systems by including particular files. On this chance, the included RPM spec file will name other source files or patch files besides the tarball (Source1, Source2, Patch0, Patch1 and so on). These files will need to be copied to the SOURCES directory as well. (Did I mention you need to run rpmbuild -tb in your SOURCES directory?)

I presumed there was some way to tell RPM where to find these SOURCE files in the tarball. For example, if you put these extra source files in the same place as the spec file, pkg/fedora, then RPM would find them. Unfortunately, RPM's tarball mode doesn't know to copy anything to the SOURCES directory for you. However, it should be easy to modify the spec file to have it copy the source files in pkg/fedora to the SOURCES directory.

Add the following tar command to the %prep section of the RPM spec file to copy the source files to the SOURCES directory.

tar -C %{name}-%{version}/pkg/fedora -cf - . | tar -C %{_sourcedir} -xf -

Alternatively, a single tar command on the actual tarball could extract the files into the SOURCES directory.

tar --strip-components=3 --wildcards -C %{_sourcedir} -zxf %{SOURCE0} %{name}-%{version}/pkg/fedora/\*

The latter would only use a single execution of the tar command. The former may be more reliable should GNU tar not be available.
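
To try the first form outside of a spec file, here's a rough sketch with ordinary directories standing in for the RPM macros. The copy_tree name and the paths in the usage line are made up for illustration; only the tar pipe itself comes from the command above.

```shell
# copy_tree SRC DST -- copy everything under SRC into DST with a tar pipe.
# (Hypothetical helper; in the spec file the two directories would be
# %{name}-%{version}/pkg/fedora and %{_sourcedir}.)
copy_tree() {
    src=$1
    dst=$2
    # Make sure the destination exists, then stream an archive of SRC
    # on standard output straight into an extracting tar running in DST.
    mkdir -p "$dst" &&
        tar -C "$src" -cf - . | tar -C "$dst" -xf -
}
```

For example, copy_tree package-1.0/pkg/fedora ~/rpmbuild/SOURCES, assuming those are the directories on your system.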

With that line inserted, a tar archive with such a spec file can bootstrap its own RPM. Running rpmbuild -ta will build both the binary and source RPMs.

Unfortunately, the rpmbuild -ts command will not work in this scenario until the source files are present. You can copy the files yourself for it to work, or run the %prep stage of rpmbuild to get the task done.

$ rpmbuild -tp
$ rpmbuild -ts

One final word of warning: don't make changes to the tarball's source files in the SOURCES directory. Since the source files are extracted on every build, any changes to these files will be overwritten unless you "short-circuit" the rpmbuild, although short-circuiting in RPM will not allow you to actually build the package.

Being able to build an RPM from the tarball source package is something for software maintainers to advertise to their users, but isn't a reliable way to develop RPM packages.


Free software maintainer manual

Oct. 24th, 2008 | 04:53 pm

When you get stumped on how to handle a situation with your free software project, it's always good to find advice. I found the Free Software Project Management HOWTO by Benjamin Mako Hill to be very useful. It has very sharp and level-headed commentary, but it also does a good job of quoting other worthy sources on the topic.

It was updated as recently as August of 2008. However, the booklet covers topics that have lasting currency.


HTTP/1.1 request with telnet

Sep. 26th, 2008 | 04:15 pm

Here's how I make an HTTP/1.1 request from the command-line using telnet.

  $ ( echo "GET / HTTP/1.1";
      echo "Host: www.yahoo.com";
      echo "User-Agent: $(bash --version | head -n 1)";
      echo "Connection: close";
      echo; echo;
      sleep 1 ) | telnet www.microsoft.com 80

Sometimes you need to increase the sleep time if the Web server is taking longer to return the response, and you want to keep telnet from closing the connection prematurely.

The output sent to the Web server can be shown by simply removing the pipe to telnet from above.

  $ ( echo "GET / HTTP/1.1";
      echo "Host: www.yahoo.com";
      echo "User-Agent: $(bash --version | head -n 1)";
      echo "Connection: close";
      echo; echo;
      sleep 1 )
  GET / HTTP/1.1
  Host: www.yahoo.com
  User-Agent: GNU bash, version 3.2.33(1)-release (i386-redhat-linux-gnu)
  Connection: close
  [2 empty lines] 

I really enjoy reporting my User-Agent as the GNU Bash shell.

Yes, the domain names used in this example are only a humorous mention of Carl Icahn's latest proxy battle.
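
As a sketch, the request can also be assembled with printf, which makes the CRLF line endings that HTTP/1.1 specifies explicit (echo only sends a bare newline, which most servers tolerate). The http_request name and www.example.org are placeholders, not anything from the commands above.

```shell
# http_request HOST [PATH] -- print a minimal HTTP/1.1 GET request.
# (Hypothetical helper; PATH defaults to "/".)
http_request() {
    host=$1
    path=${2:-/}
    printf 'GET %s HTTP/1.1\r\n' "$path"
    printf 'Host: %s\r\n' "$host"
    printf 'Connection: close\r\n'
    # An empty line ends the request headers.
    printf '\r\n'
}
```

It drops into the same pipeline: ( http_request www.example.org; sleep 1 ) | telnet www.example.org 80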

Telnet is a bit of a pain, so you might as well just use GNU Wget if you can.

  $ wget -S -O - http://www.gnewsense.org/

That's much shorter to type.


Getting directories with GNU Wget

Sep. 5th, 2008 | 02:45 pm

Sometimes there are files that are available from a Web server using Apache's auto index module (mod_autoindex), and you want to copy them to your machine. And you're satisfied retrieving them over HTTP this one time, rather than another file transfer method like SSH, FTP or rsync for that matter.

I usually feel confident retrieving things over HTTP with GNU Wget, but its command-line arguments are hard to memorize. It took me a long time to put together, but the following will copy a directory on a Web server to your current directory.

  $ wget -r -N -nH -nd -np -R "index.html*" -P nyc-2008 \
         'http://localhost/~ashawley/photos/nyc-2008/'

The command deletes all the file listings -- index.html* -- created by Apache's autoindex module. These files are used by Wget for retrieving your files recursively, but that's it. There should probably be an option for this in Wget.

The long option alternatives of Wget are easier to read, but don't help me much in remembering them.

  $ wget --recursive --timestamping --no-host-directories \
         --no-directories --no-parent --reject "index.html*" \
         --directory-prefix=nyc-2008 \
         http://localhost/~ashawley/photos/nyc-2008/

Now this post will help me remember them.

In an idealized microkernel environment like GNU/Hurd, you could have a translator that converts the HTTP protocol to a file system that can be accessed the same as the other files on your machine. For copying, you would just use the command you're used to using for copying files.

  $ cp -pr /http/localhost/~ashawley/photos/nyc-2008/ .

Or use your favorite more complex unix commands to get only the things you want.

  $ find /http/localhost/~ashawley/photos/nyc-2008/ \
         -type f -name '*.jpg' -size -1024k -print0 \
      | cpio -0 -pd nyc-2008 

Someday I'll have my pie in the sky.


Yum remember movie-player

Sep. 2nd, 2008 | 03:28 pm

After making a few upgrades with Yum in Fedora, you learn to quickly uninstall packages that seem superfluous at the time. I mean, why download and upgrade the video player and its associated plug-ins and codecs if you don't really use it? Uninstalling is also a good survival tactic if Yum is having no success finding some dependencies or doing a poor job of getting the newer version of every package.

Today, I wanted to see the GNU Birthday wish from Stephen Fry. Of course, I don't have a video player since the last upgrade. And I don't remember what the package is called.

I'm sure there are individuals with Fedora credentials or otherwise who can quickly tell you the name of popular free software multimedia packages. I don't follow free software distributions as closely as I used to, however. The user interface of Fedora doesn't help you here, either. The menu item for the movie player is called "Movie Player". Of course, that has its benefits for beginners, but is less than informative. Running yum install movie-player doesn't do anything. The group information of Yum does have what you need, though:

See if there is a group with the word "video" in it.

  $ yum grouplist | grep -i video
     Sound and Video

Yes. See what the default packages for the group are.

  $ yum groupinfo "Sound and Video" \
      | sed -ne '/^ Default/,/^ Optional/p' \
      | sed -e '$d'
   Default Packages:
     alsa-plugins-pulseaudio
     bluez-utils-alsa
     cdparanoia
     codeina
     genisoimage
     gstreamer-plugins-good
     gstreamer-plugins-pulse
     icedax
     pavucontrol
     pulseaudio
     pulseaudio-utils
     rhythmbox
     sound-juicer
     sox
     totem
     totem-mozplugin
     totem-nautilus
     wodim

I guess totem is what I wanted.
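
The two sed commands above can be folded into one. Here's a sketch, assuming the groupinfo output keeps the " Default Packages:" and " Optional Packages:" headings shown above; the default_packages name is made up.

```shell
# default_packages -- read "yum groupinfo" output on standard input and
# print only the Default Packages section. (Hypothetical helper.)
default_packages() {
    # Select the range from the Default heading to the Optional heading,
    # printing every line in it except the Optional heading itself.
    sed -n '/^ Default/,/^ Optional/{/^ Optional/!p;}'
}
```

Use it as yum groupinfo "Sound and Video" | default_packages. One difference from the original pipeline: if a group has no Optional section, this prints through to the end of the output instead of dropping the last package.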


Feeding entropy to GnuPG on Fedora

Aug. 28th, 2008 | 03:08 pm

In a previous post, I mentioned we are putting together an RPM build server at work. The RPMs that are built are signed by an encryption key and uploaded to the Yum server. The GnuPG (GPG) signing will give us confidence that the RPMs were from the build server and weren't tampered with since they were built and copied to the Yum repository.

At this point, the security of the signing key is not important. I say this confidently even after the recent package signing compromise at Fedora and Red Hat. We want to have automated package signing and we're only building packages for distribution inside the office.

One nice feature of GnuPG is its automatic key generation. The RPM build server generates its own key, preferably as non-interactively as possible. Unfortunately, this requires entropy to work consistently.

For information about automatically generating keys with GPG see the section "Unattended key generation" in the DETAILS file that comes with GnuPG. That documentation can be found on a GNU/Linux system with the following command.

  $ less -p "^Unattended" /usr/share/doc/gnupg-*/DETAILS

As the summary says:

This feature allows unattended generation of keys controlled by a parameter file. To use this feature, you use --gen-key together with --batch and feed the parameters either from stdin or from a file given on the command line [sic].

Here's an example of automatically generating a secret GPG key.

  $ cat gpg-key.conf
  %echo Generating a package signing key
  Key-Type: DSA
  Key-Length: 1024
  Subkey-Type: ELG-E
  Subkey-Length: 2048
  Name-Real: Build Server
  Name-Email: builds@site.org
  Expire-Date: 0
  Passphrase: Does not ex1st!
  %commit
  %echo Done
  $ gpg --batch --gen-key gpg-key.conf \
        > gpg-keygen.log \
        2> gpg-keygen_error.log
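
On a build server you may want to template that parameter file rather than keep a copy per key. A minimal sketch, reusing the key settings from the example above; the gpg_key_params name and its three arguments are made up for illustration.

```shell
# gpg_key_params NAME EMAIL PASSPHRASE -- print a GnuPG batch parameter
# file on standard output. (Hypothetical helper; the key types and
# lengths are copied from the example file, not recommendations.)
gpg_key_params() {
    cat <<EOF
%echo Generating a package signing key
Key-Type: DSA
Key-Length: 1024
Subkey-Type: ELG-E
Subkey-Length: 2048
Name-Real: $1
Name-Email: $2
Expire-Date: 0
Passphrase: $3
%commit
%echo Done
EOF
}
```

Redirect it to a file, as in gpg_key_params 'Build Server' builds@site.org 'Does not ex1st!' > gpg-key.conf, and pass that file to gpg as before.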

Those familiar with generating keys know that it is an extremely interactive process. Not just for entering the details about the key, but because you need to inject entropy into the computer to ensure the newly generated key is random. (Debian had erroneously weakened the random number generation in a security-related package necessitating a significant response to those systems affected by the vulnerability.) Usually, GnuPG receives entropy by jiggling the mouse or banging on the keyboard. As the GnuPG README says:

If you see no progress during key generation you should start some other activities such as moving the mouse or hitting the CTRL and SHIFT keys. Generate a key only on a machine where you have direct physical access - don't do it over the network or on a machine also used by others, especially if you have no access to the root account. (original emphasis)

This becomes a problem on servers that don't have mice or keyboards attached. One would typically see the following message from GnuPG complaining about not having enough entropy.

  $ gpg --batch --gen-key gpg-key.conf
  gpg: Generating a package signing key
  .++++++++++++++++++++...+++++..++++++++++++++++++++++++++++++++++++++++++++++++
  +++++++.+++++++++++++++++++++++++++++++++++++++++++++++++++++++..>+++++...+++++

  Not enough random bytes available.  Please do some other work to give
  the OS a chance to collect more entropy! (Need 123 more bytes)

  gpg: Interrupt caught ... exiting

As a sidebar, the "Key generation" section of the DETAILS file explains all those special characters spit to the screen when the key is generated.

    Key generation shows progress by printing different characters to
    stderr:
	     "."  Last 10 Miller-Rabin tests failed
	     "+"  Miller-Rabin test succeeded
	     "!"  Reloading the pool with fresh prime numbers
	     "^"  Checking a new value for the generator
	     "<"  Size of one factor decreased
	     ">"  Size of one factor increased

I tried various complicated strategies of creating entropy on a headless system with no success. One of them was piping the output of /dev/random into /dev/urandom and vice versa. Let's see if I can rehash it here.

  $ b=2048; \
    future=$(date -d'+6 seconds' +'%s' ); \
    while [ ${future} -gt $(date +'%s') ]; do \
      head -c ${b} /dev/random > /dev/urandom; \
      head -c ${b} /dev/urandom > /dev/random; \
    done &
  $ gpg --batch --gen-key gpg-key.conf

Anyway, it didn't work.

Running this does, though.

  # rngd -r /dev/urandom

The rngd service provides "true random number generation" (RNG). It comes from the rng-tools project, which Fedora packages as rng-utils.

According to the documentation in the Linux kernel:

The hw_random framework is software that makes use of a special hardware feature on your CPU or motherboard, a Random Number Generator (RNG). The software has two parts: a core providing the /dev/hw_random character device and its sysfs support, plus a hardware-specific driver that plugs into that core.

In Fedora, this package can be installed with Yum.

  # yum install rng-utils

I've arrived on Planet Fedora. Planet Fedora is an aggregation of article feeds from members of the Fedora Project -- a community project affiliated with Red Hat that distributes the GNU/Linux operating system.


GPG signing RPMs unattended

Aug. 20th, 2008 | 10:55 am

At my office we package our work on Fedora systems into RPMs. I have been trying to cobble a build server to take our work from Subversion, build it, and upload it to our office's Yum repository. Sounds easy.

Fedora provides a lot of packages for automating a build server. Unfortunately, one of the final steps in the process, signing the packages with a GPG secret key, cannot be automated. I've tried using an empty passphrase, and using the gpg-agent feature of GnuPG. I had to write a wrapper to fake sending the passphrase (see below).

My understanding of the problem is that RPM uses the getpass function to get the key passphrase and there's no way around this.

According to package maintainer documentation on the Fedora Project Web site:

A Release Engineer [signs] and pushes out your updates. The signing step is currently a manual process, so your updates will not be instantly released once submitted to bodhi.

Relatedly, Fedora is working on putting together a signing server. Apparently, they've had issues with "separating 'Who can sign' from 'Who knows the gpg passphrase'".

To get around this RPM flaw, I wrote an Expect wrapper to automatically sign RPM packages. I'm not an expert at Expect programming, but fortunately autoexpect helps.

Using the wrapper is as simple as the RPM signing command.

  $ ./rpm-sign.exp PACKAGE-FILE

Here's the script.

  #!/usr/bin/expect -f
  
  ### rpm-sign.exp -- Sign RPMs by sending the passphrase.
   
  spawn rpm --addsign {*}$argv
  expect -exact "Enter pass phrase: "
  send -- "Secret passphrase\r"
  expect eof
  
  ## end of rpm-sign.exp

Thank goodness for Tcl hackers.


Playing Ogg

Aug. 6th, 2008 | 03:20 pm

I like to listen to music while I work. In support of the Play Ogg! movement, I converted some of my audio collection into the patent-unencumbered audio file format, Ogg Vorbis.

To convert my audio CDs to Ogg, I use the friendly abcde package. It converts the CDs to ogg in batch mode, and it does a nice job of retrieving and storing the album and track information and organizing things into neat folders.

To play the music I use the command-line program ogg123 which comes with the vorbis-tools package. Every day, I send the program the name of every music file I have on the computer and have it play them with shuffle turned on. The command has a shuffle option, but since I'm sending the files on standard input I shuffle them with the shuf command that comes with GNU coreutils.

Here's the command I type every day.

  $ find -type f -name '*.ogg' | shuf | ogg123 -q -@ -

This setup is missing many things most other popular audio player software has. It doesn't have a fancy graphical user interface nor even a handy Emacs interface. However, it's really reliable, and sometimes when you have to log out of your graphical windows environment and work in just a terminal screen -- you can still listen to music! I always found that feature very important when trying to proceed with an operating system upgrade.


Shell hack: md5sum file name output

Jul. 6th, 2008 | 08:40 am

Noticed an interesting thing about the md5sum command that comes with GNU Coreutils. If there is a newline or backslash in the filename, md5sum leads the output with a backslash.

I wasn't able to find this in the user manual. Although, I may have been searching for the keyword "null" (to understand this read on) rather than a more appropriate term like "slash". Instead, I tried to verify it myself.

  $ touch foo-bar
  $ md5sum foo-bar
  d41d8cd98f00b204e9800998ecf8427e  foo-bar
  $ touch foo\\bar
  $ md5sum foo\\bar
  \d41d8cd98f00b204e9800998ecf8427e  foo\\bar

I eventually did find where the feature is documented in the manual:

If FILE contains a backslash or newline, the line is started with a backslash, and each problematic character in the file name is escaped with a backslash, making the output unambiguous even in the presence of arbitrary file names.

I even found in the md5sum source code where this is intentionally done:

    /* Output a leading backslash if the file name contains
       a newline or backslash.  */
    if (strchr (file, '\n') || strchr (file, '\\'))
      putchar ('\\');

Clearly, the motivation for this is to handle arbitrarily named files. However, it departs from the GNU standard to use the null character to delimit lines, see "NUL Terminated File Names" in the GNU tar manual.

If you're using the output of md5sum as part of a process with your shell programming, this behavior of md5sum becomes less than helpful. No other command uses this format for representing arbitrary file names. Fortunately, it is easy enough to convert the output of md5sum to the null-terminated line format -- understood by many GNU programs -- and carry on with your work. Here's one solution using GNU gawk that converts md5sum output to null-line terminated format:

  $ touch foo\\bar 'foo^V^Jbar'
  $ md5sum foo\\bar 'foo^V^Jbar'
  $ cat md5sum2null.awk 
  #!/usr/bin/gawk -f

  /^\\/ {
      gsub(/^\\/, "");
      gsub(/\\\\/, "\\");
      gsub(/\\n/, "\n");
  }
  {
      printf "%s\0", $0;
  }
  $ md5sum foo\\bar 'foo^V^Jbar' | gawk -f md5sum2null.awk
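
One caveat with running the substitutions one after another: a file name containing a literal backslash followed by the letter n comes out of md5sum as \\n, and after the backslashes are collapsed the last gsub would turn it into a newline. Here's a sketch that walks the escapes left to right instead. It assumes only backslash and newline are escaped, as in the manual text quoted above, and it prints ordinary newline-terminated lines; the md5sum_unescape name is made up.

```shell
# md5sum_unescape -- undo md5sum's file name escaping on standard input.
# (Hypothetical helper; uses only POSIX awk features.)
md5sum_unescape() {
    awk '
    /^\\/ {
        # Drop the leading backslash that marks an escaped line.
        line = substr($0, 2)
        out = ""
        # Handle one backslash escape at a time, left to right, so an
        # escaped backslash is never re-read as the start of "\n".
        while (match(line, /\\./)) {
            c = substr(line, RSTART + 1, 1)
            out = out substr(line, 1, RSTART - 1) (c == "n" ? "\n" : c)
            line = substr(line, RSTART + 2)
        }
        $0 = out line
    }
    { print }
    '
}
```

Use it as md5sum foo\\bar | md5sum_unescape. For the null-terminated format, the print can be swapped for the printf "%s\0" from the gawk script above.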

Now back to your regularly scheduled programming.


Shell hack: Random password generator

Jun. 19th, 2008 | 12:29 pm

In a previous post about shell hacking, I wrote:

"The command tools in unix shell programming are general enough to do pretty monumental tasks with just using a small number of commands -- in both breadth and length. Even a little bit of properly written complex shell programming can allow you to write a pretty full-proof command -- as a proof-of-concept or as temporary solution until you discover a shortcoming. [...] Although rare, if the shell doesn't have what you need, then you're using the wrong tool."

In that spirit, I wanted to see how the shell and its sister tools in unix-land could handle generating random passwords.

After searching around a bit, I was able to find some good strategies for generating random passwords with the shell, but nothing I was entirely pleased with. The following explains my approach to this problem.

The best way to get an unlimited number of random bits on a unix system is with the system device /dev/urandom. For shell programming, it can handily spit out random characters for you. I don't want every character possible, however. For my purpose of generating random passwords, the alpha-numeric and punctuation characters would be enough and the more randomness the better. I don't need the passwords to be human-readable or memorable.

The tr command can filter to those characters you want, and the head command can limit the number of characters you want. To get 6 random characters that are either alpha-numeric or punctuation you can use the following command in GNU Bash.

  $ tr -dc "[:alnum:][:punct:]" < /dev/urandom | head -c 6 && echo
  S>t^V`

The echo inserts a newline for display purposes after the characters are printed, since none of tr, /dev/urandom, or head inserts a newline character for you.

According to the GNU Grep user manual, there are 32 punctuation characters. That means there are a total of 94 distinct characters available to a random password here. If we generated just 8 character passwords, that would give -- 94 to the power of 8 (94^8) -- 6,095,689,385,410,816 (6.1e15) different possibilities, roughly 2 to the power of 52 (2^52).
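
The count is easy to check in the shell itself. A small sketch, assuming the shell's integers are 64 bits wide (true of most current shells, but an assumption all the same); the count_passwords name is made up.

```shell
# count_passwords SYMBOLS LENGTH -- print SYMBOLS to the power of LENGTH,
# the number of passwords of that length. (Hypothetical helper; the
# result has to fit in the shell's integers.)
count_passwords() {
    symbols=$1
    length=$2
    total=1
    i=0
    # Multiply by the alphabet size once per character position.
    while [ "$i" -lt "$length" ]; do
        total=$(( total * symbols ))
        i=$(( i + 1 ))
    done
    echo "$total"
}

count_passwords 94 8    # prints 6095689385410816
```

Lengths much past 9 overflow 64-bit integers with 94 symbols, so longer passwords need a calculator with big numbers.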

The other desirable characteristic of a random password generator is to have variable password lengths. How to generate a random integer in the shell? Many Bourne-style shells -- including GNU Bash -- have a built-in RANDOM shell variable that returns a random number.

  $ echo $RANDOM
  6472

To generate a random number between 8 and 16 -- inclusive:

  $ ( min_length=8; \
      max_length=16; \
      echo $(( $RANDOM % ($max_length - $min_length + 1) + $min_length )) )
  15

Combine this all together.

  password=$(tr -dc "[:alnum:][:punct:]" < /dev/urandom \
             | head -c $( RANDOM=$$; echo $(( $RANDOM % (8 + 1) + 8 )) ) )
  echo "${password}";
  P50.6kw41

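Note that RANDOM=$$; seeds the random number generator with the current process number, which most shells don't need since they already initialize it properly. Because $$ keeps the parent shell's process number inside the command substitution, repeated runs from the same shell will pick the same length; drop the seed if that matters.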
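
The pieces can be wrapped into a function. This sketch swaps in one thing that isn't in the commands above: the length is drawn from /dev/urandom with od instead of RANDOM, so it also works in shells without a RANDOM variable. The random_password name and its arguments are made up.

```shell
# random_password [MIN [MAX]] -- print a random password whose length is
# between MIN and MAX inclusive (defaults 8 and 16). (Hypothetical helper.)
random_password() {
    min=${1:-8}
    max=${2:-16}
    # Read two random bytes as an unsigned integer to choose the length.
    n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')
    len=$(( n % (max - min + 1) + min ))
    # LC_ALL=C makes tr treat the random bytes as single-byte characters.
    LC_ALL=C tr -dc '[:alnum:][:punct:]' < /dev/urandom | head -c "$len"
    echo
}
```

Calling random_password 8 16 gives the same range of lengths as the command above.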

According to my handy Emacs calculator, the sum of the series of 94 to the power of k where k ranges from 8 to 16 gives 37,556,971,331,618,802,283,689,774,779,136 (3.76e31) different possible passwords -- approximately 2 to the power of 104 (2^104).

I needed this to automatically reset a system password for accounts at my workplace. You can take the result of the shell random password generator and send it to the passwd command.

# ( password=$(tr -dc "[:alnum:][:punct:]" < /dev/urandom \
               | head -c $( RANDOM=$$; \
                            echo $(( $RANDOM % (8 + 1) + 8 )) ) ); \
    echo "${password}"; echo "${password}"; ) | passwd USERNAME
New UNIX password: Retype new UNIX password: passwd: password updated successfully

More appropriately, password administration on most GNU/Linux systems can be done with chpasswd.

# ( user=warehouse; \
    password=$(tr -dc "[:alnum:][:punct:]" < /dev/urandom \
               | head -c $( RANDOM=$$; \
                            echo $(( $RANDOM % (8 + 1) + 8 )) ) ); \
    echo "${user}:${password}" ) | chpasswd

Admittedly, this scriptlet is about as good as the pwgen command with the -s and -y options.

For further reading, see pwgen.sh where I have accumulated all of this together into a shell script.

