Shell hack: Accessing words (the epilogue)

Mar. 12th, 2008 | 12:52 am

In my last post on shell programming, I described the problem of accessing elements in a list. I made the challenge easy and set the bar low for the Bourne shell: I wanted only the first element in the list. I received many good suggestions from readers, several of which I had already found myself. I will show that each is lacking in some respect. I'll go over them quickly, but first I'll argue why the shell should be able to meet this challenge.

In the last post I said, "Words are important to shells because every command you type in is composed of one or more distinct words."

For example, the command for listing files in the current directory is "ls". It is one word:

$ ls

Changing to a different directory is two words ("cd" and "folder"):

$ cd folder

Listing the file properties for files in /etc is three words:

$ ls -l /etc

The shell numbers these words in order, starting from zero -- "ls" is word 0, "-l" is word 1 and "/etc" is word 2 -- and passes them to the underlying operating system for execution. Most of us rely on the shell's ability to parse words without ever noticing it.
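
To make the numbering concrete, here is a minimal sketch that prints each word of a command next to its index. The function name number_words is made up for this illustration; it is not a standard utility.

```shell
#!/bin/sh
# number_words (hypothetical helper): print every argument with its index,
# counting the command name itself as word 0.
number_words() {
    i=0
    for word in "$@"; do
        printf '%d: %s\n' "$i" "$word"
        i=$((i + 1))
    done
}

number_words ls -l /etc
```

Run as shown, it should print "0: ls", "1: -l" and "2: /etc", one per line.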

Advanced shell users combine this word recognition with command history expansion. For instance, to run a backup of /etc right after listing its files as above, you could type the following in GNU Bash, at least, and in other shells that support csh-style history expansion:

$ backup !!:2

The shell expands this to:

$ backup /etc

The savings in keystrokes are negligible in this example, but it's not hard to imagine commands with longer arguments where history expansion would save real typing.

And the really advanced shell users -- the shell programmers -- use the word splitting feature all the time. It's not uncommon to see this idiom in programs written in shell script:

#!/bin/sh

first_argument=$1
shift
second_argument=$1
shift

The script above saves the first argument, then removes it with shift, after which $1 holds what was originally the second argument. These numeric variables are called the positional parameters, and they are consistent with the word numbering I described earlier.
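
The same idiom extends to any number of arguments. Here is a small sketch, with a made-up function name show_args, that keeps shifting until nothing is left:

```shell
#!/bin/sh
# show_args (hypothetical helper): consume the positional parameters one at a
# time with shift, reporting how many are left after each step.
show_args() {
    while [ "$#" -gt 0 ]; do
        printf 'next=%s remaining=%d\n' "$1" "$(($# - 1))"
        shift
    done
}

show_args alpha beta gamma
```

Each pass reads $1 and then discards it, so the loop ends when $# reaches zero.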

The shell has a setting it uses to determine how to split words. This setting is the IFS variable. Let's see what the default is set to in GNU Bash:

$ echo -n "${IFS}" | od -c
0000000      \t  \n
0000003

Looks like the space, tab and newline characters. This would make sense -- any unquoted whitespace splits up words.

$ echo -n "${IFS}" | od -t d1
0000000   32    9   10
0000003

Yup, printing them as decimal shows 32, 9 and 10: the ASCII space, tab and newline characters.

Some shell programmers temporarily modify the well-established IFS variable so that the shell splits words for them -- for example, on the slashes in a directory path, or on the colons in a setting. I was going to add splitting on the empty string ("") to find the individual characters of a string, but there I'm confusing the shell with the Awk language, where some implementations do that; in the shell, an empty IFS turns word splitting off altogether. Hopefully you get the point of this variable, and believe that the shell understands how to split words.
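
As a rough sketch of the colon case, assuming a POSIX-style shell: the helper name split_on_colon and the variable old_ifs are my own inventions for the example, and the function assumes IFS was set (not unset) and that filename expansion was enabled when it was called.

```shell
#!/bin/sh
# split_on_colon (hypothetical helper): print each colon-separated field of
# its first argument on a line of its own.
split_on_colon() {
    old_ifs=$IFS      # remember the current separators
    IFS=:             # split on colons only
    set -f            # keep the unquoted $1 from being filename-expanded
    set -- $1         # unquoted on purpose: word splitting happens here
    set +f
    IFS=$old_ifs      # put the separators back
    for field in "$@"; do
        printf '%s\n' "$field"
    done
}

split_on_colon "/usr/local/bin:/usr/bin:/bin"
```

Because the set built-in runs inside a function, it replaces only the function's own positional parameters, not the caller's.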

But even with all the shell's familiarity with word splitting, I still had trouble accessing individual words. I was sort of stunned that I couldn't come up with some trick to pull out the first word from a list in my shell programming.

The classic way to get things done in the shell is to let a separate utility do the work. The cut command is good for extracting columns:

$ echo 1 2 3 | cut -d' ' -f1
1

There are also utilities for extracting lines, if we put each word on a separate line. Here's how to use three useful unix tools -- head, sed and awk. I'm not going to explain each of these in detail. If the documentation on your system doesn't do it, you should complain to your local -- help desk, administrator, customer support rep, union, whatever.

$ (echo 1; echo 2; echo 3)
1
2
3
$ (echo 1; echo 2; echo 3) | head -n 1
1
$ (echo 1; echo 2; echo 3) | sed -ne 1p
1
$ (echo 1; echo 2; echo 3) | sed 1q
1
$ (echo 1; echo 2; echo 3) | awk 'NR == 1'
1

Approaches that use external utilities seem to get the job done, but they only work around the shell's lack of word access. That makes the shell look silly, and it also has performance implications. Starting a separate application to do this task is expensive -- spawning a process, loading the binary from disk, occupying system memory. The problem is exacerbated when you use one of these idioms repeatedly in your shell script. Shouldn't the shell be able to handle the word splitting on its own?

The for-loop does word splitting, so we can run just one iteration by adding a break statement.

$ for n in 1 2 3; do echo "${n}"; break; done
1

Nice. The for syntax is part of the shell. Still seems like an expensive operation, and it is a little verbose.
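
When the list lives in a variable rather than on the command line, the same loop can be wrapped in a function. This is only a sketch; first_of_list is a hypothetical name, and it assumes filename expansion was enabled when the function was called.

```shell
#!/bin/sh
# first_of_list (hypothetical helper): print the first IFS-separated word of
# the string passed as $1, or return 1 if the string holds no words.
first_of_list() {
    set -f                  # no filename expansion on the unquoted $1
    for word in $1; do      # unquoted on purpose so IFS splits the string
        set +f
        printf '%s\n' "$word"
        return 0            # leave after the first word, like break
    done
    set +f
    return 1                # the list held no words
}

list="  1 2 3"
first_of_list "$list"
```

With the default IFS, leading whitespace in the list is skipped, so this should print 1.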

How about using the positional parameters, as a shell programmer would?

$ bash -c 'echo "$0"' 1 2 3
1

That looks pretty good, but it requires spawning an external shell.

The Bourne shell does support parameter expansion, where you can set a variable and manipulate it with various operations. Unfortunately, we have to assign the value to a parameter to manipulate it with these features. This is expected in an imperative language -- as opposed to functional.

This expansion deletes the first space and everything after it:

$ echo "$(var="1 2 3"; echo "${var%% *}")"
1

Really, we should use the variable IFS mentioned previously to find the words.

$ echo "$(var="1 2 3"; echo "${var%%[${IFS}]*}")"
1

Nice.
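
One caveat: if the string begins with whitespace, the expansion matches at the very first character and returns an empty string. The following sketch trims the leading separators first. It assumes the default IFS of space, tab and newline, and the names first_field, s and leading are invented for the example.

```shell
#!/bin/sh
# first_field (hypothetical helper): print the first word of $1 using only
# parameter expansion, tolerating leading whitespace.
first_field() {
    s=$1
    leading=${s%%[!$IFS]*}          # the run of IFS characters at the start, if any
    s=${s#"$leading"}               # drop that run
    printf '%s\n' "${s%%[$IFS]*}"   # cut at the first remaining IFS character
}

first_field "   1 2 3"
```

This still treats the value as one flat string, so it inherits every other limitation of the expansion above.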

Our real savior here is GNU Bash's array notation, but you have to admit it looks pretty disgusting, and it's not really a one-liner.

$ echo "$(declare -a var; var=(1 2 3); echo "${var[0]}")"
1

There's also a variation on the word splitting of the sub-shell command: the set built-in.

$ set -- 1 2 3; echo "$1"
1

This is short, it uses only shell syntax, and it doesn't spawn a sub-shell. Unfortunately, any positional parameters you were using have now been replaced with the new values. This can be a nasty and unwanted side effect.
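
One way to contain that side effect, sketched here with a made-up function name first_word, is to let a shell function receive the words. A function call gets its own positional parameters for the duration of the call, and the caller's are restored when it returns.

```shell
#!/bin/sh
# first_word (hypothetical helper): print the first of its arguments.
first_word() {
    printf '%s\n' "$1"
}

set -- caller-one caller-two    # stand-ins for the script's own arguments
first_word 1 2 3                # prints 1
printf '%s\n' "$1"              # still prints caller-one
```

Capturing the result in a variable would still need a command substitution, so this trades the side effect for a sub-shell in that case.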

The problem with all these candidates is that, besides being visually unappealing, it's unknown how well they perform in more realistic situations than these watered-down examples. If the words contained any quoted whitespace, several of the examples would fail. I, for one, don't live and operate in a theoretical vacuum. The words I want to process add another complication: they are the result of filename expansion, and some of the filenames may very well contain whitespace. You can never know.

Consider the files foo bar\n1, foo bar\n2 and foo bar\n3 -- where "\n" is the newline.

$ touch "foo bar^V^J"{1,2,3}

Yes, you insert the newline in a shell with Ctrl-V and then Ctrl-J.

With filename expansion, you can't always be sure what order the words will appear in. GNU Bash is pretty polite and puts them in the alphabetic order you'd expect:

$ echo "foo bar^V^J"?
foo bar
1 foo bar
2 foo bar
3

For my specific purposes and for the purpose of these examples, it doesn't matter. We just want one word to be extracted.

Time to run the gauntlet.

$ echo "foo bar^V^J"? | cut -d' ' -f1
foo
1
2
3

Right, cut fails because it processes every line of input separately, and its delimiter is limited to a single character.

The for-loop hack:

$ for f in "foo bar^V^J"?; do echo "${f}"; break; done
foo bar
1

Pretty good as one would expect.

How about the three line-oriented utilities?

$ ls -1 "foo bar^V^J"? | head -n 1
foo bar
$ ls -1 "foo bar^V^J"? | sed -ne 1p
foo bar
$ ls -1 "foo bar^V^J"? | sed 1q
foo bar
$ ls -1 "foo bar^V^J"? | awk 'NR == 1'
foo bar

They suffer a fate similar to cut's: each one sees only the first line of the first filename.

How about the shell subprocess?

$ bash -c 'echo "$0"' "foo bar^V^J"?
foo bar
1

Nice.

What about the set built-in?

$ set -- "foo bar^V^J"?; echo "$1"
foo bar
1

Also good.

Parameter expansion is going to have a rough time, I predict.

$ echo "$(var=$(echo "foo bar^V^J"?); echo "${var%% *}")"
foo
$ echo "$(var=$(echo "foo bar^V^J"?); echo "${var%%[${IFS}]*}")"
foo

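Since parameter expansion is just ignorant pattern matching on a single string, it has no notion of where one word ends and the next begins. Once echo has flattened the filenames into one string, the space inside a filename looks exactly like a space between filenames.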

Let's try arrays:

$ echo "$(declare -a var; var=("foo bar^V^J"?); echo "${var[0]}")"
foo bar
1

Good as one would expect.

If I had to pick one, I like the sub-shell alternative (bash -c 'echo "$0"' 1 2 3) the best: it is done in one statement -- one line -- and doesn't have any unwanted side effects. Clearly, several alternatives are equally robust at getting the correct value. I suppose some tests would need to be written to see how robust each of these alternatives truly is in actual use. Just as important is determining which is clearest to other programmers who may read your shell script later. So even after the analysis I've just completed, I don't really have a strong opinion about any of them. Let me end instead by making a general statement about shell programming.
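
A rough sketch of what such a test might look like follows. It assumes a system with mktemp -d (common, but not guaranteed everywhere) and a filesystem that allows newlines in filenames; check_methods and its variable names are invented for the example, and only four of the candidates are covered.

```shell
#!/bin/sh
# check_methods (hypothetical helper): create the three awkward filenames in a
# scratch directory and report which candidates return the first one intact.
check_methods() {
    dir=$(mktemp -d) || return 1
    nl='
'
    expected="foo bar${nl}1"
    (
        cd "$dir" || exit 1
        for n in 1 2 3; do : > "foo bar${nl}${n}"; done

        # Candidate: the set built-in
        set -- "foo bar${nl}"?
        [ "$1" = "$expected" ] && echo "set: ok" || echo "set: FAIL"

        # Candidate: the for-loop with break
        for f in "foo bar${nl}"?; do break; done
        [ "$f" = "$expected" ] && echo "for: ok" || echo "for: FAIL"

        # Candidate: parameter expansion on flattened echo output
        var=$(echo "foo bar${nl}"?)
        [ "${var%% *}" = "$expected" ] && echo "expansion: ok" || echo "expansion: FAIL"

        # Candidate: a line-oriented utility
        first=$(ls -1 "foo bar${nl}"? | head -n 1)
        [ "$first" = "$expected" ] && echo "head: ok" || echo "head: FAIL"
    )
    rm -rf "$dir"
}

check_methods
```

On a typical Linux system I would expect this to report ok for set and the for-loop and FAIL for the other two, matching the transcripts above.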

The command tools in unix shell programming are general enough to accomplish pretty monumental tasks with a small number of short commands. Even a little properly written shell programming can give you a pretty foolproof command -- as a proof of concept, or as a temporary solution until you discover a shortcoming. However, the shell is missing even the most basic programming language features. Where they can be made available, they're a bit circuitous and unintuitive. On the rare occasion that the shell doesn't have what you need, you're using the wrong tool. More likely, there's a way to do it but no idiomatic way to do it. I'm not trying to propose idioms in the management sense of controlling the labor conditions of programmers. I'm just saying that programming with the shell can be less than straightforward for beginners and even for advanced learners. Fortunately, books and other resources exist, and code sharing within the free software community can bring some semblance of sanity to the situation. Sometimes, you just need to use another tool. In my case, I'm annoyed by an imperative and lacking programming language with strange evaluation and expansion rules, when what I really want is a well-defined, functional programming language. I miss my lambda.

Remember kids: "Always practice safe shell by wearing a quotation mark on your expressions -- every time and all the time."


Comments {1}

(no subject)

from: anonymous
date: Apr. 1st, 2008 10:28 pm (UTC)

Have your car and drive it too: scsh, baby.
