The Problem
Earlier this week, I had to do a mass-insert into a database table from a CSV file.
The CSV file had ~20 columns and ~500k rows. I only really cared about 4 of the columns.
Visually, the problem can be represented as taking a wide CSV, keeping only the handful of columns we care about, and turning each remaining row into an insert statement.
Previously, I had seen this problem solved with massive amounts of over-engineering, and sometimes with clever usage of Microsoft Excel. However, in my never-ending quest to use the tools available to me more effectively, I challenged myself to see if I could generate the inserts I needed without leaving my terminal window.
First Steps
Some Example Data
For this post I am going to use the heart.csv data set sourced from Kaggle.
It has the following (14) columns:
age: age in years
sex: 1 = male; 0 = female
cp: chest pain type
trestbps: resting blood pressure (in mm Hg on admission to the hospital)
chol: serum cholesterol in mg/dl
fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: resting electrocardiographic results
thalach: maximum heart rate achieved
exang: exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment
ca: number of major vessels (0-3) colored by fluoroscopy
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
target: 1 or 0
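If you want to sanity-check the header yourself, a quick one-liner prints the column names one per line (assuming the file is saved as heart.csv):

head -n 1 heart.csv | tr ',' '\n'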
Enter bash and IFS
Let's read this using a small script that leverages IFS:
#!/bin/bash
# Read a CSV file and presumably do stuff with what we read
# Configure the Internal Field Separator (see: Delimiter)
IFS=','
# while stuff in FILE, read COLUMNS using $IFS
while read -r age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
do
  # do things
  echo "$age $sex $cp $trestbps $chol $fbs $restecg $thalach $exang $oldpeak $slope $ca $thal $target"
done < "$1"
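One note on the script above: setting IFS at the top changes it for the whole script. A common alternative is to scope it to the read command itself, so nothing else in the loop is affected. A minimal sketch of the same loop written that way:

#!/bin/bash
# IFS is set only for the duration of the read command itself,
# leaving the shell's default word splitting untouched elsewhere
while IFS=',' read -r age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
do
  echo "$age $sex $cp $trestbps $chol $fbs $restecg $thalach $exang $oldpeak $slope $ca $thal $target"
done < "$1"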
We can use this script by running the following:
# make it executable
chmod +x <SCRIPT_NAME>
# run it
./<SCRIPT_NAME> <PATH_TO_CSV_FILE>
Which should yield:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
57 1 0 140 192 0 1 148 0 0.4 1 0 1 1
56 0 1 140 294 0 0 153 0 1.3 1 0 2 1
44 1 1 120 263 0 1 173 0 0 2 0 3 1
...
57 0 0 140 241 0 1 123 1 0.2 1 0 3 0
45 1 3 110 264 0 1 132 0 1.2 1 0 3 0
68 1 0 144 193 1 1 141 0 3.4 1 2 3 0
57 1 0 130 131 0 1 115 1 1.2 1 1 3 0
57 0 1 130 236 0 0 174 0 0 1 1 2 0
Notice that the header row comes through as the first line of the output; we'll deal with that in a moment. Let's further assume that we have a database schema DATA that contains a table called HEART.
The table was created with the following SQL statement:
CREATE TABLE DATA.HEART(
ID INT NOT NULL,
AGE INT NOT NULL,
THAL INT,
CHOL DECIMAL,
PRIMARY KEY (ID)
);
Generating SQL Inserts
Let's modify our pre-existing script to create a bunch of insert statements from our heart.csv
file for this table:
#!/bin/bash
# Read a CSV file and generate an insert statement for each data row
# Configure the Internal Field Separator (see: Delimiter)
IFS=','
# Keep track of the sequence (this also doubles as the primary key)
declare -i id=0
# while stuff in FILE, read COLUMNS using $IFS
while read -r age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
do
  # ignore the first row since it contains column headers,
  # otherwise craft an insert statement and append it to the output file
  if [ "$id" -eq 0 ]
  then
    echo "ignoring first row"
  else
    echo "INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES ($id, $age, $thal, $chol);" >> "$2"
  fi
  id=$((id+1))
done < "$1"
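As an aside, we only ever use three of the fourteen CSV columns (age, thal, chol). If naming all fourteen feels noisy, read lets you dump the unused fields into a throwaway variable. A minimal sketch, assuming the same column order as heart.csv:

#!/bin/bash
# each _ swallows one unused column; only age, chol, and thal get real names
while IFS=',' read -r age _ _ _ chol _ _ _ _ _ _ _ thal _
do
  echo "$age $thal $chol"
done < "$1"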
We can now run this with a second parameter (the output file name):
./<SCRIPT_NAME> <IN_FILE_NAME> <OUT_FILE_NAME>
Which gives us something like this:
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (1, 63, 1, 233);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (2, 37, 2, 250);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (3, 41, 2, 204);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (4, 56, 2, 236);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (5, 57, 2, 354);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (6, 57, 1, 192);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (7, 56, 2, 294);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (8, 44, 3, 263);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (9, 52, 3, 199);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (10, 57, 2, 168);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (11, 54, 2, 239);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (12, 48, 2, 275);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (13, 49, 2, 266);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (14, 64, 2, 211);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (15, 58, 2, 283);
INSERT INTO DATA.HEART (ID, AGE, THAL, CHOL) VALUES (16, 50, 2, 219);
...
And you're done! (Kind of; you still have to actually run the generated inserts.)
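How you run the file depends on your database. As one example (the post doesn't assume a particular database, so treat the client and database name as hypothetical), with a PostgreSQL database named mydb you could feed the file to psql; the -1 flag wraps everything in a single transaction, which is noticeably faster at a few hundred thousand rows:

# run the generated inserts in one transaction (psql shown; other clients have equivalents)
psql -1 -d mydb -f <OUT_FILE_NAME>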
Caveats
This post doesn't cover how to deal with embedded commas, quoted fields, or special-character escaping. But it is a good starting point for a bare-bones CSV parser.
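To see the failure mode concretely, here is a quoted field containing a comma being split in two by IFS (a minimal sketch with made-up values):

# the comma inside the quotes is treated as just another separator
echo '63,"chest pain, typical",145' | while IFS=',' read -r a b c
do
  echo "b=$b"   # prints: b="chest pain
done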
I might follow this up with another small post about using this script to easily pull out and aggregate different columns.
One of the things I've learned from using this approach to deal with CSV files is that the simplest tool is often the best and can yield great results without much overhead.
If you enjoyed this post or have any thoughts, feel free to comment or find me on Twitter.
Thanks for reading!
References
IFS Usage - https://bash.cyberciti.biz/guide/$IFS