Thursday 26 August 2010

How to remove commas from quoted strings in csv files

I needed to remove commas (,) from within double quotes (") in a Comma Separated Variable (CSV) file.

For example, you can use cut -d, -f2,4,5 to extract fields 2, 4 and 5.

But if there is a comma in the text of a field like this "hello, world" you are generally stuck.

Also SQLite can also import csv files, but again commas in quotes cause problems.

I was stuck until I managed to create this sed script that will work in many cases.


# This bash function uses sed to remove up to 4 individual commas
# sequences from a quoted string in a csv file.


# eg. "Hello,, World, nice day." -> "Hello World nice day."



function removeCommas(){
  while read data; do
    echo "$data" | sed -e 's/^/,/g' | sed -e "s/$/,/g" \
      | sed -e 's/\(,\"[^,]*\),*\([^,]*\),*\([^,]*\),*\([^,]*\),*\(.*\",\)/\1\2\3\4\5/g' \
      | sed -e 's/^,//g' | sed -e 's/,$//g'
  done
}


A friend mentioned that the sed commands can be combined as follows:


    echo "$data" \
      | sed \
      -e 's/^/,/g' \
      -e "s/$/,/g" \
      -e 's/\(,\"[^,]*\),*\([^,]*\),*\([^,]*\),*\([^,]*\),*\(.*\",\)/\1\2\3\4\5/g' \
      -e 's/^,//g' \

      -e 's/,$//g'


How it works


The engine uses sed.


This sed statement takes input from stdin and replaces the regular expression 'from' with 'to' for any and all occurrences of 'from'.


sed -e 's/from/to/g'

First, I add a comma to the start and end of each line with this:
sed -e 's/^/,/g' | sed -e "s/$/,/g"

These make the function work for special cases and they get removed after the commas have been removed.

The 'from' string starts of by finding the beginning of a quoted string ',\"' then all non-comma characters '[^,]*'.

This pattern is enclosed in brackets '\(' and '\)' to assign the matching pattern to, in this case, part 1.

Then it matches the next one or more commas ',*'. This is not in brackets since we don't want them.

The next part of the pattern '\([^,]*\)' is like the first: it finds a string of non-comma characters and keeps the values as part 2.

Then we skip over any commas again.

This sequence can be repeated as many times as you like. I did it 3 times.

The pattern ends like it starts. But this time it reads any character including commas up to the trailing quote using '.*'. Then it reads the trailing quote '\"' and comma ','. All of this makes part 5.

This means that it will only filter the first 4 sequences of commas. To make it do more, repeat the middle pattern ',*\([^,]*\)' and increase the output parts (below).

The 'to' expression is simply the concatenation of the 5 parts of the quoted string that are guaranteed to not contain a comma (well, except possibly for the last part), where '\N' is the nth part.

\1\2\3\4\5

Testing

I tested the function with this.


function testthis(){
  echo -n "\"$1\" --> \"$2\"..."
  echo "$1" | removeCommas | grep -qE "$2" && echo "Pass" \
    || echo "FAIL"
}


# test removeCommas
testthis "this test should fail to test the 'tester'" "anything but this"
testthis "" ""
testthis "," ","
testthis ",," ",,"
testthis ",\"\"," ",\"\","
testthis ",\",\"," ",\"\","
testthis "1,2,\"3,4 5 6 7\",8,9,10" "1,2,\"34 5 6 7\",8,9,10"
testthis "1,2,\"3,4,5 6 7\",8,9,10" "1,2,\"345 6 7\",8,9,10"
testthis "1,2,\"3,4,5,6 7\",8,9,10" "1,2,\"3456 7\",8,9,10"
testthis "1,2,\"3,4,5,6,7\",8,9,10" "1,2,\"34567\",8,9,10"
testthis "1,2,\"3,,4,,5,,6,,7\",8,9,10" "1,2,\"34567\",8,9,10"
testthis "1,2,\"3,,4,,5,,6,,7,8\",8,9,10" "1,2,\"34567,8\",8,9,10"
testthis "\"tricky one where the quoted string starts the line 3,,4,,5,,6,,7,8\",8,9,10" "\"tricky one where the quoted string starts the line 34567,8\",8,9,10
testthis "1,2,\"3,,4,,5,,6,,7,8 tricky one where the quoted string ends the line \"" "1,2,\"34567,8 tricky one where the quoted string ends the line \""