Dataset Massage (head, tail, split)

Recently at work, I found an opportunity to revise and level up my bash-fu. It probably sounds mundane and trivial to some, but I had fun solving my little problem on the command line instead of retreating into Python or JavaScript.

I received a CSV file containing millions of rows. My job was to split this file into several smaller files, with each file containing more rows than the one before it.

Assume there are no CSV headers (minor details like this can be postponed, but not forgotten!), and suppose I want to split the file into chunks of these sizes:

1,000 lines
2,000 lines
5,000 lines
10,000 lines
20,000 lines
50,000 lines
100,000 lines
And so on

The first command I stumbled upon was split.

I tried solving my problem with split, like this:

1. Split into files containing 1,000 lines each
2. Keep the first file from (1)
3. Shave off 1,000 lines from the original file
4. Split the file from (3) into 2,000 lines each
5. Keep the first file from (4)
6. Repeat steps 3-5 but remember to increase the number of lines to split/shave
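
The steps above could be scripted roughly as follows. This is only a sketch of the idea, not the exact commands I ran: the 10,000-line sample file, the piece_ prefix, and the output_N.csv and rest_N.csv names are all made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Work in a scratch directory with a made-up 10,000-line sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 10000 > file.csv

remaining=file.csv
i=1
for size in 1000 2000 5000; do
  # Steps 1 and 4: split the remaining rows into pieces of $size lines each.
  split -l "$size" "$remaining" piece_
  # Steps 2 and 5: keep only the first piece (split names it piece_aa).
  mv piece_aa "output_$i.csv"
  rm -f piece_*
  # Step 3: shave the kept lines off the front of the remaining rows.
  tail -n +"$((size + 1))" "$remaining" > "rest_$i.csv"
  remaining="rest_$i.csv"
  i=$((i + 1))
done

wc -l output_*.csv
```

Every pass writes the whole remainder to disk twice (once as split pieces, once as rest_N.csv), which is where the extra I/O comes from.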

I used tail to shave the file. Usage:

# Write to output.csv, starting from the 1001st line of file.csv
tail -n +1001 file.csv > output.csv

Why -n and why +?

man tail

# tail – display the last part of a file

# -n number, --lines=number
#         The location is number lines.

# Numbers having a leading plus ('+') sign are relative to the beginning of the input
# Numbers having a leading minus ('-') sign or no explicit sign are relative to the end of the input
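
To see the difference between the signs, here is a tiny demo on a made-up five-line input generated with seq (not my real data):

```shell
# A five-line input, so the output is easy to check by eye.
seq 1 5 | tail -n +3   # from line 3 onwards: prints 3, 4, 5
seq 1 5 | tail -n 2    # last two lines: prints 4, 5
seq 1 5 | tail -n -2   # a leading minus behaves the same: prints 4, 5
```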

Well, split and tail worked. But that was a lot of unnecessary I/O. If I were to use a programming language, I would iterate through the lines of the file instead of copying parts of the file multiple times.
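
For comparison, iterating through the lines once might look something like the awk sketch below. To be clear, this is not what I ended up using; the chunk sizes, the sample file, and the single_pass_N.csv names are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory with a made-up 10,000-line sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 10000 > file.csv

awk -v sizes="1000 2000 5000" '
  BEGIN { n = split(sizes, size, " "); chunk = 1; written = 0 }
  # Stop reading as soon as every requested chunk has been filled.
  chunk > n { exit }
  {
    out = "single_pass_" chunk ".csv"
    print > out
    # Once the current chunk is full, close it and move to the next size.
    if (++written == size[chunk]) { close(out); chunk++; written = 0 }
  }
' file.csv

wc -l single_pass_*.csv
```

Here the input is read only once, and nothing past the last requested chunk is processed.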

I consulted Pair and it suggested the use of head. Putting tail and head together, I get:

head -n 1000 file.csv > output_1.csv
tail -n +1001 file.csv | head -n 2000 > output_2.csv

It’s also possible to use head first, before piping the output into tail. Starting with tail helped me better visualise the front part of the file being cut away, and then the first X lines being taken from what remains.

Using head first would mean taking a larger chunk from the front of the file (say, the first 3,000 lines) and then reading from the back of this chunk (the last 2,000 lines) to shave off the very first part. According to the manual (man head), head traverses in only one direction: from the start.
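
As a quick sanity check, both orderings should produce the same second chunk. The sketch below uses a made-up 10,000-line sample file, and the 3,000 is simply 1,000 + 2,000, i.e. everything up to the end of the second chunk:

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory with a made-up 10,000-line sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 10000 > file.csv

# tail first: drop the first 1,000 lines, then take the next 2,000.
tail -n +1001 file.csv | head -n 2000 > tail_first.csv

# head first: take the first 3,000 lines, then keep the last 2,000 of them.
head -n 3000 file.csv | tail -n 2000 > head_first.csv

cmp tail_first.csv head_first.csv && echo "identical"
```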

Finally, I had a couple of tail ... | head ... commands. While this meant the file was opened and read from the start again and again, at least I didn’t have to produce so many intermediate output files. For my use case, it was still a quick operation, and I didn’t optimise the script any further.
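
Instead of writing each tail ... | head ... command by hand, the starting line can be tracked in a loop. This is a sketch under assumptions: the sizes, the sample file, and the start and i variables are illustrative, not my actual script.

```shell
#!/usr/bin/env bash
# No pipefail here: head closing the pipe early can make tail exit
# with a SIGPIPE status, which pipefail would treat as a failure.
set -eu

# Scratch directory with a made-up 10,000-line sample file.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 10000 > file.csv

start=1   # 1-based line number where the next chunk begins
i=1
for size in 1000 2000 5000; do
  # Skip everything before $start, then keep the next $size lines.
  tail -n +"$start" file.csv | head -n "$size" > "output_$i.csv"
  start=$((start + size))
  i=$((i + 1))
done

wc -l output_*.csv
```

The file is still reopened for every chunk, but no intermediate remainder files are written.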
