Data Wrangling

I managed to catch one of the NUS Hackers’ workshops, “Hacker Tools: Data Wrangling” by Julius. I learned many new things, since I don’t usually use the CLI for text processing. Julius’ slides are in his GitHub repo.

Below is a very brief and dry summary of what mattered most to me. If I “miss” something out, it’s likely because I already knew it. This summary does no justice to the workshop, but I want to write these notes down before I forget.

`cat TEXT | grep -vE REGEX`

Get all lines from TEXT which don’t match REGEX. -v means invert match and -E means extended regex.
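A minimal sketch of the inverted match. The log lines and the DEBUG/TRACE pattern are made up for illustration and are not from the workshop:

```shell
# Hypothetical sample log, invented for this example.
log='INFO start
DEBUG cache miss
ERROR disk full
TRACE tick'

# -v drops every line that matches; -E allows the (A|B) alternation.
kept=$(printf '%s\n' "$log" | grep -vE '^(DEBUG|TRACE)')
printf '%s\n' "$kept"
```

Only the INFO and ERROR lines survive, because they are the ones the regex does not match.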

`cat TEXT | grep -E '.*' | sed 's/REGEX/SUBSTITUTION/FLAGS'`

Substitute REGEX match(es) in TEXT with SUBSTITUTION. SUBSTITUTION can refer to capture groups from REGEX with \1, \2, \3 and so on. Without the g flag, only the first match on each line is replaced.
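To make capture groups and flags concrete, here is a small sketch. The input strings are invented, and I use sed -E (extended regex, as with grep -E) so the groups need no backslashes:

```shell
# \1 and \2 refer to the first and second (...) groups in the regex.
swapped=$(printf 'name=alice\nrole=admin\n' | sed -E 's/^([a-z]+)=([a-z]+)$/\2: \1/')
printf '%s\n' "$swapped"

# The g flag replaces every match on a line instead of only the first one.
all=$(echo 'a-b-c' | sed 's/-/_/g')
first=$(echo 'a-b-c' | sed 's/-/_/')
printf '%s %s\n' "$all" "$first"
```

The first command turns each key=value line into "value: key". The second pair shows the difference the g flag makes on a line with two matches.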

`| sort -nk1,2`

Sort lines numerically rather than lexicographically (-n), i.e. 1, 2, 10 instead of 1, 10, 2. Use the 1st through 2nd whitespace-separated columns as the sort key (-k1,2). -k stands for sort key. Search for KEYDEF in man sort to find out more.
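A quick sketch of both options on made-up data. Note that the last command uses -k2,2 (second column only), which is my own variation on the -k1,2 above:

```shell
# LC_ALL=C pins plain byte-wise ordering for the non-numeric sort.
lexical=$(printf '10\n2\n1\n' | LC_ALL=C sort)
numeric=$(printf '10\n2\n1\n' | sort -n)
printf '%s\n' "$lexical"   # 1, 10, 2
printf '%s\n' "$numeric"   # 1, 2, 10

# Restrict the key to the 2nd column and compare it numerically.
by_count=$(printf 'apple 10\nbanana 2\ncherry 1\n' | sort -n -k2,2)
printf '%s\n' "$by_count"
```

The last command orders the rows by the number in the second column, so cherry comes first and apple last.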

`| uniq -c`

Collapse adjacent lines which are duplicates of each other. Prefix each unique line with the total number of lines which collapsed into one (the count).
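Because only adjacent duplicates collapse, uniq is usually fed sorted input. A small sketch with invented lines shows the difference:

```shell
# Without sorting, the two runs of "a" are counted separately.
unsorted=$(printf 'a\na\nb\na\n' | uniq -c)
printf '%s\n' "$unsorted"

# Sorting first makes all duplicates adjacent: one count per distinct line.
sorted=$(printf 'a\na\nb\na\n' | sort | uniq -c)
printf '%s\n' "$sorted"
```

The unsorted run reports a twice (counts 2 and 1), while the sorted run reports a once with count 3. The amount of padding before each count varies between implementations.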

`| awk '$1 == 1 && $2 ~ /^r[^ ]*t$/ { print $3 }'`

awk is a programming language with the basic syntax: awk PATTERN { DO THIS IF PATTERN MATCHES }. In the (partial) command above, the pattern is $1 == 1 && $2 ~ /^r[^ ]*t$/.

$1 and $2 are the first and second fields of each line, with whitespace as the default delimiter. For the lines which match the pattern, print the 3rd field ($3). For the lines which don’t match the pattern, do nothing. Since only matching lines produce output, we can count the matches with | wc -l.
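Here is the same awk pattern run against a few made-up rows (the data is hypothetical, chosen so that some rows fail each half of the condition):

```shell
data='1 root alpha
2 reset beta
1 rt gamma
1 rat delta
1 rocks epsilon'

# Keep rows whose 1st field is 1 and whose 2nd field starts with r and ends with t.
matches=$(printf '%s\n' "$data" | awk '$1 == 1 && $2 ~ /^r[^ ]*t$/ { print $3 }')
printf '%s\n' "$matches"

# Count the matching rows.
count=$(printf '%s\n' "$matches" | wc -l)
echo $count
```

The "reset" row fails because its first field is 2, and the "rocks" row fails because it does not end in t, so three third fields are printed.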

`| paste -sd, -`

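Join lines from standard input into a single line, separated by the specified delimiter (, in this case). -s means serial: merge all the lines of one input into a single line, instead of pasting lines from several files side by side. -d sets the delimiter, and the trailing - tells paste to read from standard input.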
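A tiny sketch with made-up numbers. The second command swaps the comma for a plus sign, which is my own variation:

```shell
# Three input lines become one comma-separated line.
joined=$(printf '1\n2\n3\n' | paste -sd, -)
echo "$joined"

# Any single character works as the delimiter, e.g. + to build a sum expression.
expr_text=$(printf '1\n2\n3\n' | paste -sd+ - )
echo "$expr_text"
```

The second form is handy because its output is an arithmetic expression that another tool can evaluate.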

`| bc`

A calculator which reads and interprets text i.e. something which can return “3” when you pass in “1 + 2”.

`| xargs COMMAND`

Split standard input (the output of the command before the pipe) into whitespace-delimited arguments and feed them to COMMAND.
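A small sketch using echo as the COMMAND, with invented input. The -n 1 variant, which is not in my notes from the workshop, runs the command once per argument:

```shell
# Spaces and newlines both separate arguments, so echo receives: a b c
one_call=$(printf 'a b\nc\n' | xargs echo)
echo "$one_call"

# -n 1 passes one argument per invocation, so echo runs three times.
per_arg=$(printf 'a b\nc\n' | xargs -n 1 echo)
printf '%s\n' "$per_arg"
```

The first form prints everything on one line, and the second prints one argument per line.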

`cat TEXT | tr "a" "z"`

Replace (or “translate”) every “a” character with “z” in TEXT. tr maps single characters, not strings or regexes.
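A short sketch with made-up words. The second command is my own addition to show that tr maps characters position by position when given longer sets:

```shell
# Every a becomes z.
single=$(echo 'banana' | tr "a" "z")
echo "$single"

# With longer sets, a->x, b->y, c->z.
mapped=$(echo 'cab' | tr 'abc' 'xyz')
echo "$mapped"
```

The first command prints bznznz, and the second prints zxy.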