Trailing blanks, vim and git

Trailing blanks may be␣␣␣␣␣
evil when git displays diffs.␣␣␣␣␣␣␣
Time to remove them.

I'm currently going through a major transition on my main machine in that I have configured my vim to strip trailing blanks, that is, to automatically remove space characters (as in U+0020) immediately before the ends of lines[1].

Why do I do this? I suppose it mainly started with PEP 8, a style guide for Python source code which says trailing whitespace is evil. It has a point, but I have to say trailing whitespace really became a problem only when style checkers started rejecting trailing blanks, which then made all kinds of tools – including other people's editors – automatically strip trailing whitespace.

That, in turn, causes the diffs coming out of version control systems to inflate, usually without anyone – neither the people leaving the trailing whitespace nor the ones whose tools remove it – actually wanting that. And well, I tackled this about now because I was fed up with humongous continuous integration runs failing at the very end because they found a blank at the end of some source file.

So, while I can't say I'm convinced trailing whitespace actually is as evil as all that, I still have to stomp it out to preserve everyone's nerves.

Configuring vim to replace trailing blanks with nothing when saving files is relatively straightforward (at least if you're willing to accept a cursor jump now and then). The internet is full of guides explaining what to do to just about any depth and sophistication.

Me, I am using a variant of a venerable vintage 2010 recipe that uses an extra function to preserve the state over a search/replace operation to avoid jumping cursors. What I particularly like about it is that the Preserve function may come in handy in other contexts, too:

function! Preserve(command)
  " run command without changing vim's internal state (much)
  let _s=@/
  let prevpos = getcurpos()
  execute a:command
  let @/=_s
  call cursor(prevpos[1], prevpos[2])
endfunction

au BufWritePre * if !&binary | call Preserve("%s/  *$//e") | endif

That is now in my ~/.vimrc.

But I still have all the repositories containing files having trailing blanks. To keep their histories comprehensible, I want to remove all trailing blanks in one commit and have that commit only do these whitespace fixes. The trouble is that even with version control (that lets you back out of overzealous edits) you will want to be careful what files you change. Strip trailing blanks in a (more or less) binary file and you will probably break that file.
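
To make that danger a bit more tangible, here is a small sketch of what the in-place sed used further down does to a pretend binary file. The file name and its ten bytes are made up for the illustration, and I am assuming GNU sed here (-i is not in POSIX):

```shell
#!/bin/sh
# Sketch: the blank-stripping sed run on a made-up "binary" file.
# Assumes GNU sed; the file name and contents are invented.
set -eu
tmpdir=$(mktemp -d)
blob="$tmpdir/fake.png"

# Ten bytes: a PNG-like magic number, a blank and a newline, then two
# control bytes, a blank and a newline. To sed, each 0x20 0x0a pair
# looks just like a trailing blank in a text file.
printf '\211PNG \n\001\002 \n' > "$blob"
before=$(($(wc -c < "$blob")))
sum_before=$(cksum < "$blob")

sed -i 's/  *$//' "$blob"

after=$(($(wc -c < "$blob")))
sum_after=$(cksum < "$blob")

echo "size before: $before bytes, after: $after bytes"
if [ "$sum_before" != "$sum_after" ]; then
  echo "contents changed: a real image would probably be broken now"
fi
rm -rf "$tmpdir"
```

The two blanks are gone, the file is two bytes shorter, and whatever used to read it will probably no longer be amused.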

So, here is what I do to fix trailing blanks in files that need it while leaving alone the ones that would break, using this blog's VCS (about) as an example:

  1. In preparation, make sure you have committed all other changes. Bulk operations are dangerous, and you may want to roll back everything in case of a fateful typo. Also, you don't want to pollute some other, meaningful commit with all the whitespace noise.

  2. In the root of the repository, look for candidate files containing trailing blanks, combining find and grep like this:

    find . -type f | xargs grep -l ' $'
    

    A brief reminder of what's going on here: grep -l just lists the names of files with matches of the regular expression, ' $' is a regular expression matching a blank at the end of a line; xargs is a brilliant program reading command line arguments for the program named in its arguments from stdin, and the find invocation prints all names of actual files (as opposed to directories) below the current directory.

    It may be preferable to use some grep with built-in find functionality (I sometimes use ripgrep), but if I can make do with basic GNU or even better POSIX, I do, because that's something that's on many boxes rather reliably.

    The price to pay in this particular case: this recipe won't work if you have blanks in your file names (using -print0 in find and -0 in xargs would fix things here, but then the next step would break). Do yourself a favour and don't have blanks in your filenames. Having dashes in them looks-better-anyway: it makes you look like a die-hard-LISP-person.

  3. Now use egrep -v to filter file names, building patterns of names to ignore and process later, respectively. For instance, depending on your VCS, you will usually have lots of matches in .git or .svn or whatever, and most of these will break when you touch them (not to mention they won't spoil your history anyway). Conversely, it is clear that I want to strip trailing blanks on reStructuredText files. My two patterns now grow in two separate egrep calls, one for files I don't want to look at, the other for files I will want to strip trailing blanks in:

    find . -type f |\
      egrep -v '\.git' |\
      egrep -v '\.rst$' | xargs grep -l ' $'
    

    This prints a much smaller list of names of files for which I have not yet decided whether or not to strip them.

  4. Repeat this: On the side of files I shouldn't touch, I spot some names ending in .jpeg, .png, and .db. On the side of files that need processing, I notice .html, .css, and .py. So, my next iteration is:

    find . -type f |\
      egrep -v '\.git|\.(jpeg|png|db)$' |\
      egrep -v '\.(rst|html|css|py)$' |\
      xargs grep -l ' $'
    

    That's a still smaller list of file names, among which I spot the index files used by my search engine in .xapian_db, .pyc files used by Python, and a vim .swp file. On the other hand I do want to process some files without an extension, so my next search command ends up as:

    find . -type f |\
      egrep -v '\.git|\.xapian_db|\.(jpeg|png|db|pyc|swp)$' |\
      egrep -v 'README|build-one|\.(rst|html|css|py)$' |\
      xargs grep -l ' $'
    

    That's it – this only leaves a few files as undecided, and I can quickly eyeball their names to ascertain I do not want to touch them. My second pattern now describes the set of files that I want to strip trailing blanks from.

  5. Stripping trailing blanks is easily done from the command line with sed and its in-place (-i) option: sed -i 's/  *$//' <file1> <file2>...[2]. The file names I can produce with find alone, because at least GNU find supports the extended regular expressions I have just produced in my patterns; it needs a -regextype option to correctly interpret them, though:

    find . -type f -regextype egrep \
      -regex '.*(README|build-one).*|.*\.(rst|html|css|py)$' |\
      xargs grep -l ' $'

    The advantage of using find alone over simply inverting the egrep (by dropping the -v) is that my gut feeling is the likelihood of false positives slipping through is lower this way. However, contrary to the egrep above, find's -regex needs to match the entire path (which, with find ., always starts with ./), and so I need the .* before my pattern of extensions and on both sides of the bare names, and editing REs might very well produce false positives to begin with… Ah well.

    Have a last look at the list and then run the in-place sed:

    find . -type f -regextype egrep \
      -regex '.*(README|build-one).*|.*\.(rst|html|css|py)$' |\
      xargs grep -l ' $' |\
      xargs sed -i 's/  *$//'
    
  6. Skim the output of git diff (or svn diff or whatever). Using the blacklist built above, you can also check whether you have indeed removed trailing whitespace from all the files you wanted to process; the following should now list only the few undecided files from step 4:

    find . -type f |\
      egrep -v '\.git|\.xapian_db|\.(jpeg|png|db|pyc|swp)$' |\
      xargs grep -l ' $'
    

    If these checks have given you some confidence that the trailing blanks have vanished and nothing else has been damaged, commit with a comment stressing that only whitespace has been changed. Then take a deep breath before tackling the next repo in this way.
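
For the record, here is a sketch of how the steps above might look if you do have blanks in your file names. It relies on GNU extensions that I have not used above: grep -z reads NUL-terminated "lines", grep -Z writes NUL-terminated file names, and xargs -r does not run anything on empty input. The sandbox, the file names and the two shortened patterns (skip and want) are invented for the demonstration, so take it as an illustration under these assumptions and not as a tested replacement for the recipe:

```shell
#!/bin/sh
# Sketch: the recipe with NUL-separated file names, so blanks in names
# survive. Assumes GNU find, grep, xargs and sed. Everything happens
# in a throwaway directory; names and patterns are made up.
set -eu
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p posts .git
printf 'Title  \n=====\n' > 'posts/my post.rst'
printf 'x = 1 \n' > tool.py
printf '\211PNG \n' > logo.png
printf 'ref: refs/heads/main \n' > .git/HEAD

skip='\.git/|\.(jpeg|png|db|pyc|swp)$'   # never touch these
want='\.(rst|html|css|py)$'              # strip blanks in these

# Steps 2 to 5 in one pipeline; with -z, $ anchors at the end of each
# file name because the names are the NUL-terminated records.
find . -type f -print0 |
  grep -zEv "$skip" |
  grep -zE "$want" |
  xargs -0 -r grep -lZ ' $' |
  xargs -0 -r sed -i 's/  *$//'

# Step 6: which files outside the skip pattern still have trailing blanks?
leftover=$(find . -type f -print0 |
  grep -zEv "$skip" |
  xargs -0 -r grep -l ' $' || true)
echo "still dirty: ${leftover:-none}"
# Remove the sandbox by hand when done: rm -rf "$sandbox"
```

In this toy setup, the .rst and .py files end up clean while logo.png and .git/HEAD keep their blanks, which is exactly the split the two patterns are supposed to produce.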

[1]This post assumes your sed and you agree on what marks the end of the line. Given it's been quite a while since I last had to think about CRs or CRLFs, it would seem that's far less of a problem these days than it used to be.
[2]Incidentally, that's a nice example of why I was so hesitant about stripping whitespace for all these years: Imagine some edits make it so a line break sneaks in between sed -i 's/ and *$//'. Then both blanks that are there are gone, and even if the text is reflowed again later, it will still be broken (though not catastrophically so in this particular case).
Category: edv
