full source code is available here.
performance is important, and yet our intuition about it is often wrong. previously we deployed optimized python across a cluster of machines to analyze the nyc taxi dataset. was its performance good? how would we even know?
let's try to discover a reasonable baseline for data processing performance and build intuition that can guide our decisions. we'll do this by experimenting with simple transformations of generated data using various formats, techniques, and languages on a single cpu core.
whether we are configuring and using off-the-shelf software or building bespoke systems, we need the ability to intuit problems and detect low-hanging fruit.
we'll say that our data is a sequence of rows, that a row is made of 8 columns, and that a column is a random dictionary word.
we'll generate our dataset as csv with the following script.
# gen_csv.py
import sys
import random

words = [
    ...
]

num_rows = int(sys.argv[1])
for _ in range(num_rows):
    row = [random.choice(words) for _ in range(8)]
    print(','.join(row))
our first transformation will be selecting a subset of columns.
let's try python.
# select.py
import sys

for line in sys.stdin:
    columns = line.split(',')
    print(f'{columns[2]},{columns[6]}')
first we need some data.
>> pypy3 gen_csv.py 1000000 > /tmp/data.csv
>> ls -lh /tmp/data.csv | awk '{print $5}'
72M
we're gonna need more data.
>> pypy3 gen_csv.py 15000000 > /tmp/data.csv
>> ls -lh /tmp/data.csv | awk '{print $5}'
1.1G
that'll do. now let's try our selection. we'll make sure a subset of the result is sane, then check the hash of the entire result with xxhsum. for all other runs we'll discard the output and time the execution.
>> python select.py </tmp/data.csv | head -n3
epigram,Madeleine
strategies,briefed
Doritos,putsch
>> python select.py </tmp/data.csv | xxhsum
12927f314ca6e9eb
seems sane. let's time it.
>> time python select.py </tmp/data.csv >/dev/null
real 0m10.076s
user 0m9.779s
sys 0m0.200s
let's try coreutils cut.
>> cut -d, -f3,7 </tmp/data.csv | xxhsum
12927f314ca6e9eb
>> time cut -d, -f3,7 </tmp/data.csv >/dev/null
real 0m3.534s
user 0m3.341s
sys 0m0.159s
faster. we may need to look at compiled languages for a reasonable baseline.
let's optimize by avoiding allocations and doing as little work as possible. we'll pull rows off a buffered reader, set up columns as offsets into that buffer, and access columns by slicing the row data.
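the optimized sources aren't shown in this post, so here's a rough sketch of the idea in python. it's a sketch only, with hypothetical details; the actual go, rust, c, and pypy versions below will differ.

# select_buffered.py, a hypothetical sketch of the technique:
# buffered reads, column boundaries tracked as offsets into the
# buffer, and columns accessed by slicing. assumes 8 columns per
# row and input that ends with a newline.
import sys

def main():
    stdin = sys.stdin.buffer
    stdout = sys.stdout.buffer
    rest = b''
    while True:
        chunk = stdin.read(1 << 16)
        if not chunk:
            break
        rows = (rest + chunk).split(b'\n')
        rest = rows.pop()  # carry any partial trailing row forward
        for row in rows:
            # record the offset of every comma in the row
            offsets = [-1]
            pos = -1
            for _ in range(7):
                pos = row.index(b',', pos + 1)
                offsets.append(pos)
            offsets.append(len(row))
            # slice columns 2 and 6 straight out of the row data
            stdout.write(row[offsets[2] + 1:offsets[3]] + b',' +
                         row[offsets[6] + 1:offsets[7]] + b'\n')

main()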
let's try go.
>> go build -o select_go select.go
>> ./select_go </tmp/data.csv | xxhsum
12927f314ca6e9eb
>> time ./select_go </tmp/data.csv >/dev/null
real 0m2.832s
user 0m2.559s
sys 0m0.312s
faster than cut. this is progress.
let's try rust.
>> rustc -O -o select_rust select.rs
>> ./select_rust </tmp/data.csv | xxhsum
12927f314ca6e9eb
>> time ./select_rust </tmp/data.csv >/dev/null
real 0m2.602s
user 0m2.491s
sys 0m0.110s
pretty much the same. let's try c. we'll grab a few header-only dependencies for csv parsing and buffered writing.
>> gcc -Iutils -O3 -flto -march=native -mtune=native -o select_c select.c
>> ./select_c </tmp/data.csv | xxhsum
12927f314ca6e9eb
>> time ./select_c </tmp/data.csv >/dev/null
real 0m2.716s
user 0m2.569s
sys 0m0.120s
so rust, go, and c are very similar. we may have established a baseline when working with csv.
let's try a similar optimization with pypy.
>> pypy select_inlined.py </tmp/data.csv | xxhsum
12927f314ca6e9eb
>> time pypy select_inlined.py </tmp/data.csv >/dev/null
real 0m4.491s
user 0m4.293s
sys 0m0.170s
not bad.
let's try using protobuf and go. we'll call the data format psv.
>> (cd psv && protoc -I=row --go_out=row row/row.proto)
>> (cd psv && go build -o psv psv.go)
>> (cd psv && go build -o select select.go)
>> ./psv/psv </tmp/data.csv >/tmp/data.psv
>> ./psv/select </tmp/data.psv | xxhsum
12927f314ca6e9eb
>> time ./psv/select </tmp/data.psv >/dev/null
real 0m10.424s
user 0m10.465s
sys 0m0.251s
interesting. slower than naive python and csv.
is reading and writing the data format the majority of the work?
let's think about our optimized code from before. our representation of a row is three pieces of data: a byte array of content, an array of column start positions, and an array of column sizes. writing a row as csv was easy, but reading one was hard.
what if we made it easier? all we want is an array of bytes and two int arrays.
let's let a row written as bytes be:
| u16:max | u16:size | ... | u8[]:column | ... |
this should be easy to write, and more importantly easy to read. we can read max, which tells us how many sizes to read. from the sizes we can reconstruct the offsets and the size of the row data. we can then read the row data, and access the columns by offset and size.
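as a sketch in python, with assumptions labeled loudly: that max is the highest zero-based column index, so max + 1 sizes follow it, and that integers are little-endian. the real bsv code may differ on both counts.

# bsv_row.py, a hypothetical sketch of the row layout above
import struct

def dump_row(columns):
    # assumption: max is the highest zero-based column index
    sizes = [len(column) for column in columns]
    header = struct.pack(f'<H{len(sizes)}H', len(columns) - 1, *sizes)
    return header + b''.join(columns)

def load_row(buf, pos=0):
    max_index, = struct.unpack_from('<H', buf, pos)
    pos += 2
    sizes = struct.unpack_from(f'<{max_index + 1}H', buf, pos)
    pos += 2 * (max_index + 1)
    # reconstruct offsets from sizes and slice columns out of the buffer
    columns = []
    for size in sizes:
        columns.append(buf[pos:pos + size])
        pos += size
    return columns, pos  # pos now points at the start of the next row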
our optimized code also buffered reads and writes into large chunks.
let's let a chunk of rows written as bytes be:
| i32:size | u8[]:row | ... |
let's constrain a chunk to only contain complete rows and be smaller than some maximum size.
we'll call this format bsv. we'll implement buffered reading and writing of chunks, as well as loading and dumping of rows.
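reading a chunk is then a short python sketch away, reusing load_row from above and assuming the i32 size is also little-endian:

# bsv_chunk.py, a hypothetical sketch of chunked reading
import struct
from bsv_row import load_row  # from the sketch above

def read_rows(f):
    while True:
        header = f.read(4)
        if len(header) < 4:
            break
        size, = struct.unpack('<i', header)
        chunk = f.read(size)
        # a chunk contains only complete rows, so walk it to the end
        pos = 0
        while pos < len(chunk):
            row, pos = load_row(chunk, pos)
            yield row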
let's implement our transformation using bsv in c.
>> gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/bsv bsv/bsv.c
>> ./bsv/bsv </tmp/data.csv >/tmp/data.bsv
>> gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/select bsv/select.c
>> ./bsv/select </tmp/data.bsv | xxhsum
12927f314ca6e9eb
>> time ./bsv/select </tmp/data.bsv >/dev/null
real 0m0.479s
user 0m0.339s
sys 0m0.140s
we've processed the same data, and system time has been fairly consistent, but user time has varied significantly.
let's try a second transformation where we reverse the columns of every row.
we'll implement it with csv in python, pypy and c, then with bsv in c.
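the scripts themselves aren't shown in this post, but the naive python version is presumably along these lines:

# reverse.py, a hypothetical sketch of the naive version
import sys

for line in sys.stdin:
    columns = line.rstrip('\n').split(',')
    print(','.join(reversed(columns)))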
>> python reverse.py </tmp/data.csv | xxhsum
e221974c95d356f9
>> time python reverse.py </tmp/data.csv >/dev/null
real 0m13.915s
user 0m13.743s
sys 0m0.170s
>> pypy3 reverse_inlined.py </tmp/data.csv | xxhsum
e221974c95d356f9
>> time pypy3 reverse_inlined.py </tmp/data.csv >/dev/null
real 0m6.141s
user 0m5.880s
sys 0m0.220s
>> gcc -Iutils -O3 -flto -march=native -mtune=native -o reverse reverse.c
>> ./reverse </tmp/data.csv | xxhsum
e221974c95d356f9
>> time ./reverse </tmp/data.csv >/dev/null
real 0m2.890s
user 0m2.719s
sys 0m0.170s
>> gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/reverse bsv/reverse.c
>> ./bsv/reverse </tmp/data.bsv | xxhsum
e221974c95d356f9
>> time ./bsv/reverse </tmp/data.bsv >/dev/null
real 0m1.052s
user 0m0.891s
sys 0m0.161s
let's try a third transformation where we count every row where the first character of the first column is "f".
we'll implement it with csv in python, pypy and c, then with bsv in c.
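as before the scripts aren't shown; a sketch of the naive python version might be:

# count.py, a hypothetical sketch of the naive version. since the
# first column begins the row, checking the first character of the
# line is enough.
import sys

count = 0
for line in sys.stdin:
    if line.startswith('f'):
        count += 1
print(count)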
>> time python count.py </tmp/data.csv
467002
real 0m6.385s
user 0m6.223s
sys 0m0.160s
>> time pypy3 count_inlined.py </tmp/data.csv
467002
real 0m3.147s
user 0m2.938s
sys 0m0.180s
>> gcc -Iutils -O3 -flto -march=native -mtune=native -o count count.c
>> time ./count </tmp/data.csv
467002
real 0m2.367s
user 0m2.245s
sys 0m0.121s
>> gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/count bsv/count.c
>> time bsv/count </tmp/data.bsv
467002
real 0m0.260s
user 0m0.135s
sys 0m0.125s
in transformations 2 and 3 we again see significant variance in user time.
let's put our user time results in a table.
first we have our select transformation, which outputs 25% of its input.
format | language | user seconds | gigabytes / second |
---|---|---|---|
psv | go | 10.4 | 0.1 |
csv | python | 9.7 | 0.1 |
csv | pypy | 4.3 | 0.2 |
csv | go | 2.6 | 0.4 |
csv | c | 2.6 | 0.4 |
csv | rust | 2.5 | 0.4 |
bsv | c | 0.3 | 3.3 |
second we have our reverse transformation, which outputs 100% of its input.
format | language | user seconds | gigabytes / second |
---|---|---|---|
csv | python | 13.7 | 0.1 |
csv | pypy | 5.8 | 0.2 |
csv | c | 2.7 | 0.4 |
bsv | c | 0.9 | 1.1 |
third we have our count transformation, which outputs <0.001% of its input.
format | language | user seconds | gigabytes / second |
---|---|---|---|
csv | python | 6.2 | 0.2 |
csv | pypy | 2.9 | 0.3 |
csv | c | 2.2 | 0.5 |
bsv | c | 0.1 | 10 |
interesting. let's take a closer look at the csv and bsv results for c based on the ratio of inputs to outputs.
inputs / outputs | format | language | user seconds | gigabytes / second |
---|---|---|---|---|
1 / 1 | csv | c | 2.7 | 0.4 |
4 / 1 | csv | c | 2.6 | 0.4 |
1000 / 1 | csv | c | 2.2 | 0.5 |
inputs / outputs | format | language | user seconds | gigabytes / second |
---|---|---|---|---|
1 / 1 | bsv | c | 0.9 | 1.1 |
4 / 1 | bsv | c | 0.3 | 3.3 |
1000 / 1 | bsv | c | 0.1 | 10 |
now this is interesting. when dealing with csv, the ratio of inputs to outputs has almost no impact on performance, but when dealing with bsv the impact is roughly 3x at each step. this suggests that for csv, parsing the input dominates, while for bsv, writing the output dominates. this raises an interesting question: how can we optimize output? for simplicity, the bsv code is outputting csv. it may be worth experimenting with other output formats, but we'll skip that for now.
do we have enough information to establish a baseline? perhaps.
we've seen python process csv and go process protobuf at 100 megabytes / second.
we've seen c, go, and rust process csv at 400 megabytes / second.
we've seen c process bsv at 1-10 gigabytes / second.
why don't we start with the following baseline? we'll think of it as napkin math.
category | rate |
---|---|
slow | <=100 megabytes / second / cpu core |
decent | ~500 megabytes / second / cpu core |
fast | >=1000 megabytes / second / cpu core |
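to make the napkin math concrete: at the decent rate, a single core gets through our 1.1 gigabyte dataset in a bit over 2 seconds, and through a terabyte in around half an hour.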
as we do data processing, either by configuring and using off-the-shelf software, or by building bespoke systems, we can keep these rates in mind.
if you are interested in bsv, you can find it here.
for further experimentation with go, rust, and c, look here.
for examples of applying bsv to distributed compute, look here.