I have recently had the opportunity to use awk and SQLite on a project that I'm working on.
My first thought was that awk should be able to handle the parsing, and I found some code that someone else wrote that parses quote-and-comma-delimited files.
I started off by installing SQLite on Cygwin.
After getting SQLite installed, I needed some test data to work with, so I did a search and found this site:
https://www.briandunning.com/sample-data/
I grabbed the free file for testing.
```
wget https://www.briandunning.com/sample-data/us-500.zip
```

The file is a comma-delimited file with these fields:
- First Name
- Last Name
- Company
- Address
- City
- County (where applicable)
- State/Province (where applicable)
- ZIP/Postal Code
- Phone 1
- Phone 2
- Email
- Web
```sql
CREATE TABLE us500 (
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    company TEXT NOT NULL,
    address TEXT NOT NULL,
    city TEXT NOT NULL,
    county TEXT NOT NULL,
    state TEXT NOT NULL,
    zipcode TEXT NOT NULL,
    phone1 TEXT NOT NULL,
    phone2 TEXT NOT NULL,
    email TEXT NOT NULL,
    web TEXT NOT NULL);
```

Here I used the SQLite command line to issue the command to create the table.
```
sqlite> CREATE TABLE us500 (
   ...> first_name TEXT NOT NULL,
   ...> last_name TEXT NOT NULL,
   ...> company TEXT NOT NULL,
   ...> address TEXT NOT NULL,
   ...> city TEXT NOT NULL,
   ...> county TEXT NOT NULL,
   ...> state TEXT NOT NULL,
   ...> zipcode TEXT NOT NULL,
   ...> phone1 TEXT NOT NULL,
   ...> phone2 TEXT NOT NULL,
   ...> email TEXT NOT NULL,
   ...> web TEXT NOT NULL);
sqlite> .tables
us500
sqlite>
```
I attempted to import the CSV file but got an error.

```
sqlite> .mode csv
sqlite> .import us-500lf.csv us500
Error: us-500lf.csv line 2: expected 12 columns of data but found 13
```
I can tell that the records came from a Macintosh system because the file uses a carriage return (hex 0d) as the record delimiter, so I wanted to view a hexdump of the file. I used cat, tr, head, and hexdump.
You can see in the hex dump below that, after the tr conversion, the records now end with hex 0a line feeds.
```
$ cat us-500.csv | tr "\r" "\n" | head -n 3 | hexdump -C
00000000  22 66 69 72 73 74 5f 6e 61 6d 65 22 2c 22 6c 61  |"first_name","la|
00000010  73 74 5f 6e 61 6d 65 22 2c 22 63 6f 6d 70 61 6e  |st_name","compan|
00000020  79 5f 6e 61 6d 65 22 2c 22 61 64 64 72 65 73 73  |y_name","address|
00000030  22 2c 22 63 69 74 79 22 2c 22 63 6f 75 6e 74 79  |","city","county|
00000040  22 2c 22 73 74 61 74 65 22 2c 22 7a 69 70 22 2c  |","state","zip",|
00000050  22 70 68 6f 6e 65 31 22 2c 22 70 68 6f 6e 65 32  |"phone1","phone2|
00000060  22 2c 22 65 6d 61 69 6c 22 2c 22 77 65 62 22 0a  |","email","web".|
00000070  22 4a 61 6d 65 73 22 2c 22 42 75 74 74 22 2c 22  |"James","Butt","|
00000080  42 65 6e 74 6f 6e 2c 20 4a 6f 68 6e 20 42 20 4a  |Benton, John B J|
00000090  72 22 2c 22 36 36 34 39 20 4e 20 42 6c 75 65 20  |r","6649 N Blue |
000000a0  47 75 6d 20 53 74 22 2c 22 4e 65 77 20 4f 72 6c  |Gum St","New Orl|
000000b0  65 61 6e 73 22 2c 22 4f 72 6c 65 61 6e 73 22 2c  |eans","Orleans",|
000000c0  22 4c 41 22 2c 37 30 31 31 36 2c 22 35 30 34 2d  |"LA",70116,"504-|
000000d0  36 32 31 2d 38 39 32 37 22 2c 22 35 30 34 2d 38  |621-8927","504-8|
000000e0  34 35 2d 31 34 32 37 22 2c 22 6a 62 75 74 74 40  |45-1427","jbutt@|
000000f0  67 6d 61 69 6c 2e 63 6f 6d 22 2c 22 68 74 74 70  |gmail.com","http|
00000100  3a 2f 2f 77 77 77 2e 62 65 6e 74 6f 6e 6a 6f 68  |://www.bentonjoh|
00000110  6e 62 6a 72 2e 63 6f 6d 22 0a 22 4a 6f 73 65 70  |nbjr.com"."Josep|
00000120  68 69 6e 65 22 2c 22 44 61 72 61 6b 6a 79 22 2c  |hine","Darakjy",|
00000130  22 43 68 61 6e 61 79 2c 20 4a 65 66 66 72 65 79  |"Chanay, Jeffrey|
00000140  20 41 20 45 73 71 22 2c 22 34 20 42 20 42 6c 75  | A Esq","4 B Blu|
00000150  65 20 52 69 64 67 65 20 42 6c 76 64 22 2c 22 42  |e Ridge Blvd","B|
00000160  72 69 67 68 74 6f 6e 22 2c 22 4c 69 76 69 6e 67  |righton","Living|
00000170  73 74 6f 6e 22 2c 22 4d 49 22 2c 34 38 31 31 36  |ston","MI",48116|
00000180  2c 22 38 31 30 2d 32 39 32 2d 39 33 38 38 22 2c  |,"810-292-9388",|
00000190  22 38 31 30 2d 33 37 34 2d 39 38 34 30 22 2c 22  |"810-374-9840","|
000001a0  6a 6f 73 65 70 68 69 6e 65 5f 64 61 72 61 6b 6a  |josephine_darakj|
000001b0  79 40 64 61 72 61 6b 6a 79 2e 6f 72 67 22 2c 22  |y@darakjy.org","|
000001c0  68 74 74 70 3a 2f 2f 77 77 77 2e 63 68 61 6e 61  |http://www.chana|
000001d0  79 6a 65 66 66 72 65 79 61 65 73 71 2e 63 6f 6d  |yjeffreyaesq.com|
000001e0  22 0a                                            |".|
000001e2
```
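The .import attempt above referenced us-500lf.csv, which I never showed being created; it came from running the same tr conversion over the whole file. A minimal sketch (the printf line fabricates a tiny stand-in for us-500.csv just so the example runs on its own):

```shell
# Stand-in file with CR (0x0d) record delimiters, standing in for us-500.csv.
printf 'a,b\rc,d\r' > us-500.csv
# Swap each carriage return (0x0d) for a line feed (0x0a).
tr "\r" "\n" < us-500.csv > us-500lf.csv
```

With the real download, only the tr line is needed.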
The remaining problem is how to handle quote-and-comma-delimited files. I did a search to see if someone already had a solution for that and found one.
The code contains a function called parse_csv(). You can look at the usage of the parameters at the link. The important part is how to call this function.
```awk
# parse_csv() itself is defined in the linked parsecsv.awk source.
{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\n", 1);
    if (num_fields < 0) {
        printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
    } else {
        #printf "%s -> ", $0;
        #printf "%s", num_fields;
        for (i = 0; i < num_fields; i++) {
            fld = csv[i];
            gsub(/\"/, "", fld);
            if (i < num_fields - 1)
                printf "%s\t", fld;
            else
                printf "%s\n", fld;
        }
    }
}
```

```
cat us-500.csv | tr "\r" "\n" | awk -f parsecsv.awk > test1.pip
```
I'm using awk to convert the CSV file into a delimited file. Despite the .pip extension, the script above emits tab (hex 09) separators, as the hex dump below shows.
```
$ cat test1.pip | head -n 3 | hexdump -C
00000000  66 69 72 73 74 5f 6e 61 6d 65 09 6c 61 73 74 5f  |first_name.last_|
00000010  6e 61 6d 65 09 63 6f 6d 70 61 6e 79 5f 6e 61 6d  |name.company_nam|
00000020  65 09 61 64 64 72 65 73 73 09 63 69 74 79 09 63  |e.address.city.c|
00000030  6f 75 6e 74 79 09 73 74 61 74 65 09 7a 69 70 09  |ounty.state.zip.|
00000040  70 68 6f 6e 65 31 09 70 68 6f 6e 65 32 09 65 6d  |phone1.phone2.em|
00000050  61 69 6c 09 77 65 62 0a 4a 61 6d 65 73 09 42 75  |ail.web.James.Bu|
00000060  74 74 09 42 65 6e 74 6f 6e 2c 20 4a 6f 68 6e 20  |tt.Benton, John |
00000070  42 20 4a 72 09 36 36 34 39 20 4e 20 42 6c 75 65  |B Jr.6649 N Blue|
00000080  20 47 75 6d 20 53 74 09 4e 65 77 20 4f 72 6c 65  | Gum St.New Orle|
00000090  61 6e 73 09 4f 72 6c 65 61 6e 73 09 4c 41 09 37  |ans.Orleans.LA.7|
000000a0  30 31 31 36 09 35 30 34 2d 36 32 31 2d 38 39 32  |0116.504-621-892|
000000b0  37 09 35 30 34 2d 38 34 35 2d 31 34 32 37 09 6a  |7.504-845-1427.j|
000000c0  62 75 74 74 40 67 6d 61 69 6c 2e 63 6f 6d 09 68  |butt@gmail.com.h|
000000d0  74 74 70 3a 2f 2f 77 77 77 2e 62 65 6e 74 6f 6e  |ttp://www.benton|
000000e0  6a 6f 68 6e 62 6a 72 2e 63 6f 6d 0a 4a 6f 73 65  |johnbjr.com.Jose|
000000f0  70 68 69 6e 65 09 44 61 72 61 6b 6a 79 09 43 68  |phine.Darakjy.Ch|
00000100  61 6e 61 79 2c 20 4a 65 66 66 72 65 79 20 41 20  |anay, Jeffrey A |
00000110  45 73 71 09 34 20 42 20 42 6c 75 65 20 52 69 64  |Esq.4 B Blue Rid|
00000120  67 65 20 42 6c 76 64 09 42 72 69 67 68 74 6f 6e  |ge Blvd.Brighton|
00000130  09 4c 69 76 69 6e 67 73 74 6f 6e 09 4d 49 09 34  |.Livingston.MI.4|
00000140  38 31 31 36 09 38 31 30 2d 32 39 32 2d 39 33 38  |8116.810-292-938|
00000150  38 09 38 31 30 2d 33 37 34 2d 39 38 34 30 09 6a  |8.810-374-9840.j|
00000160  6f 73 65 70 68 69 6e 65 5f 64 61 72 61 6b 6a 79  |osephine_darakjy|
00000170  40 64 61 72 61 6b 6a 79 2e 6f 72 67 09 68 74 74  |@darakjy.org.htt|
00000180  70 3a 2f 2f 77 77 77 2e 63 68 61 6e 61 79 6a 65  |p://www.chanayje|
00000190  66 66 72 65 79 61 65 73 71 2e 63 6f 6d 0a        |ffreyaesq.com.|
```
I then imported the data into the database.
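I didn't capture the exact import commands, so here is a sketch of what they look like, assuming sqlite3 is on the PATH. A two-column stand-in table and data file keep the example self-contained; the real run uses the 12-column us500 table and the full test1.pip.

```shell
# Stand-in data; the real test1.pip is the awk output above.
printf 'James\tButt\nJosephine\tDarakjy\n' > test1.pip

sqlite3 us500.db <<'EOF'
-- Stand-in table; the real one is the 12-column us500 created earlier.
CREATE TABLE IF NOT EXISTS us500 (first_name TEXT NOT NULL, last_name TEXT NOT NULL);
.separator "\t"
.import test1.pip us500
EOF
```

The .separator setting tells .import to split each line on tabs instead of the default pipe character.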
```
sqlite> select first_name,last_name from us500 limit 5;
first_name|last_name
James|Butt
Josephine|Darakjy
Art|Venere
Lenna|Paprocki
```
I'm seeing that the first row contains the column titles. I'll look into how to import without that line later.
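One straightforward fix, sketched here on a two-line stand-in for test1.pip, is to strip the header with tail -n +2 before importing (newer versions of the sqlite3 shell also support a --skip option on .import):

```shell
# Stand-in for test1.pip; the real file is the awk output above.
printf 'first_name\tlast_name\nJames\tButt\n' > test1.pip

# tail -n +2 starts output at line 2, dropping the column-title row.
tail -n +2 test1.pip > test1-noheader.pip
```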