<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0">
  <channel>
    <title>nathants.com</title>
    <link>https://nathants.com</link>
    <description>posts and projects from nathants.com</description>
    <docs>http://www.rssboard.org/rss-specification</docs>
    <generator>python-feedgen</generator>
    <language>en</language>
    <lastBuildDate>Fri, 13 Oct 2023 00:23:16 +0000</lastBuildDate>
    <item>
      <title>scaling python data processing vertically</title>
      <link>https://nathants.com/posts/scaling-python-data-processing-vertically</link>
      <description>
                
                        
&lt;p&gt;full source code is available &lt;a href="https://github.com/nathants/posts/tree/001/001_scaling_python_data_processing_vertically"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;processing inconveniently large data is a common task these days, and there are many tools and techniques available to help. here we are going to explore how far we can take python on a single machine.&lt;/p&gt;
&lt;p&gt;we'll be working with the &lt;a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" rel="nofollow"&gt;nyc taxi&lt;/a&gt; dataset in the aws region where it lives, us-east-1. bandwidth between ec2 and s3 is only free within the same region, so make sure you are in us-east-1 if you are following along.&lt;/p&gt;
&lt;p&gt;we'll be using some &lt;a href="https://gist.github.com/nathants/741b066af9faa15f3ed50ed6cf677d67"&gt;bash functions&lt;/a&gt;, &lt;a href="https://github.com/nathants/cli-aws"&gt;aws tooling&lt;/a&gt;, and the &lt;a href="https://aws.amazon.com/cli/" rel="nofollow"&gt;official aws cli&lt;/a&gt;. one could also use other tools without much trouble.&lt;/p&gt;
&lt;p&gt;how is the dataset organized?&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; head

2016-08-11 07:16:22          0
2016-08-11 07:32:21   85733063 fhv_tripdata_2015-01.csv
2016-08-11 07:33:04   97863482 fhv_tripdata_2015-02.csv
2016-08-11 07:33:40  102220197 fhv_tripdata_2015-03.csv
2016-08-11 07:34:24  121250461 fhv_tripdata_2015-04.csv
2016-08-11 07:35:14  133469666 fhv_tripdata_2015-05.csv
2016-08-11 07:35:48  132209226 fhv_tripdata_2015-06.csv
2016-08-11 07:36:09  137153004 fhv_tripdata_2015-07.csv
2016-08-11 07:36:45  164291700 fhv_tripdata_2015-08.csv
2016-08-11 07:37:37  205607912 fhv_tripdata_2015-09.csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;looks like a bunch of csv in a folder. are the prefixes constant?&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; cut -d_ -f1 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; sort \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; uniq -c

      1 0
     64 fhv
     17 fhvhv
     83 green
    138 yellow&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;nope. we probably want the yellow data. let's check on the sizes first.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; grep yellow \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $3}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; py &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"{:,}".format(sum(int(x) for x in i.splitlines()))&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

251,267,607,652&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;looks like about 250GB. what about the others?&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; cut -d_ -f1 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -u \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -n+2 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;read&lt;/span&gt; prefix&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
          &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$prefix&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;s3://nyc-tlc/trip data/&lt;span class="pl-smi"&gt;${prefix}&lt;/span&gt;_&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $3}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          &lt;span class="pl-k"&gt;|&lt;/span&gt; py &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;"{:,}".format(sum(int(x) for x in i.splitlines())).rjust(20, ".")&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
      &lt;span class="pl-k"&gt;done&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; column -t

fhv     ......37,567,264,171
fhvhv   ......19,542,027,956
green   ......10,381,632,797
yellow  .....251,267,607,652&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;definitely the yellow dataset then. let's set up some convenience variables.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prefix=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; keys=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;    &lt;span class="pl-k"&gt;|&lt;/span&gt; grep yellow \&lt;/span&gt;
&lt;span class="pl-s"&gt;    &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's take a peek at the headers of the first file for each year, selecting the first 8 columns.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; (for key &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;echo &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$keys&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;NR % 12 == 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
       aws s3 cp &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; - &lt;span class="pl-k"&gt;2&amp;gt;&lt;/span&gt;/dev/null \
        &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n1 \
        &lt;span class="pl-k"&gt;|&lt;/span&gt; cut -d, -f1-8 &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;
    &lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; wait) &lt;span class="pl-k"&gt;|&lt;/span&gt; column -s, -t

VendorID     tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count   trip_distance   pickup_longitude   pickup_latitude     RateCodeID
VendorID     tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count   trip_distance   RatecodeID         store_and_fwd_flag  PULocationID
vendor_id    pickup_datetime       dropoff_datetime       passenger_count   trip_distance   pickup_longitude   pickup_latitude     rate_code
vendor_id    pickup_datetime       dropoff_datetime       passenger_count   trip_distance   pickup_longitude   pickup_latitude     rate_code
VendorID     tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count   trip_distance   RatecodeID         store_and_fwd_flag  PULocationID
vendor_id    pickup_datetime       dropoff_datetime       passenger_count   trip_distance   pickup_longitude   pickup_latitude     rate_code
VendorID     tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count   trip_distance   pickup_longitude   pickup_latitude     RatecodeID
vendor_id    pickup_datetime       dropoff_datetime       passenger_count   trip_distance   pickup_longitude   pickup_latitude     rate_code
VendorID     tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count   trip_distance   RatecodeID         store_and_fwd_flag  PULocationID
vendor_name  Trip_Pickup_DateTime  Trip_Dropoff_DateTime  Passenger_Count   Trip_Distance   Start_Lon          Start_Lat           Rate_Code
VendorID     tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count   trip_distance   RatecodeID         store_and_fwd_flag  PULocationID
vendor_id    pickup_datetime       dropoff_datetime       passenger_count   trip_distance   pickup_longitude   pickup_latitude     rate_code&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;looks like the first 5 columns are consistent in meaning, even though their names vary, and then it gets messy. we can punt on data cleanup by just working with those first 5, which contain interesting data like distance, passengers, and date.&lt;/p&gt;
&lt;p&gt;before we jump on ec2, let's grab the first million rows of the first file to our local environment and prototype our data scripts.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 cp &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;echo &lt;span class="pl-smi"&gt;$keys&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $1}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; - &lt;span class="pl-k"&gt;2&amp;gt;&lt;/span&gt;/dev/null \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n1000000 \
    &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /tmp/taxi.csv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ls -lh /tmp/taxi.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $5}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

172M

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; head /tmp/taxi.csv \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; cut -d, -f1-5 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; column -s, -t

vendor_name  Trip_Pickup_DateTime  Trip_Dropoff_DateTime  Passenger_Count  Trip_Distance
VTS          2009-01-04 02:52:00   2009-01-04 03:02:00    1                2.6299999999999999
VTS          2009-01-04 03:31:00   2009-01-04 03:38:00    3                4.5499999999999998
VTS          2009-01-03 15:43:00   2009-01-03 15:57:00    5                10.35
DDS          2009-01-01 20:52:58   2009-01-01 21:14:00    1                5
DDS          2009-01-24 16:18:23   2009-01-24 16:24:56    1                0.40000000000000002
DDS          2009-01-16 22:35:59   2009-01-16 22:43:35    2                1.2
DDS          2009-01-21 08:55:57   2009-01-21 09:05:42    1                0.40000000000000002
VTS          2009-01-04 04:31:00   2009-01-04 04:36:00    1                1.72&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now that we have data, it's time to ask questions. let's group by passengers and count.&lt;/p&gt;
&lt;p&gt;first let's try python's &lt;a href="https://docs.python.org/3/library/csv.html" rel="nofollow"&gt;csv&lt;/a&gt; module.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# passenger_counts_stdlib.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;csv&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;

&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;.&lt;span class="pl-en"&gt;readline&lt;/span&gt;() &lt;span class="pl-c"&gt;# skip the header&lt;/span&gt;

&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;.&lt;span class="pl-en"&gt;defaultdict&lt;/span&gt;(&lt;span class="pl-s1"&gt;int&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;cols&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;csv&lt;/span&gt;.&lt;span class="pl-en"&gt;reader&lt;/span&gt;(&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;):
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;passengers&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;cols&lt;/span&gt;[&lt;span class="pl-c1"&gt;3&lt;/span&gt;]
    &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-v"&gt;IndexError&lt;/span&gt;:
        &lt;span class="pl-k"&gt;continue&lt;/span&gt;
    &lt;span class="pl-k"&gt;else&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;result&lt;/span&gt;[&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;] &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;,&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;count&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/taxi.csv \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; python3 passenger_counts_stdlib.py \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -nr -k2 -t, \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; column -s, -t

1  669627
2  166658
5  93718
3  44360
4  20904
6  4685
0  46

real    0m2.316s
user    0m2.259s
sys     0m0.162s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's see how &lt;a href="https://pandas.pydata.org/" rel="nofollow"&gt;pandas&lt;/a&gt; compares.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# passenger_counts_pandas.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pandas&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;

&lt;span class="pl-s1"&gt;df&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;pandas&lt;/span&gt;.&lt;span class="pl-en"&gt;read_csv&lt;/span&gt;(&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;)

&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;df&lt;/span&gt;.&lt;span class="pl-s1"&gt;iloc&lt;/span&gt;[:,&lt;span class="pl-c1"&gt;3&lt;/span&gt;].&lt;span class="pl-en"&gt;value_counts&lt;/span&gt;())&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/taxi.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; python3 passenger_counts_pandas.py

1    669627
2    166658
5     93718
3     44360
4     20904
6      4685
0        46
Name: Passenger_Count   dtype: int64

real    0m2.164s
user    0m2.085s
sys     0m0.499s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;about the same.&lt;/p&gt;
&lt;p&gt;if we know that our input is well formed, without quotes or escaped delimiters, we can just split on comma. let's try that.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# passenger_counts.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;

&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;.&lt;span class="pl-en"&gt;readline&lt;/span&gt;() &lt;span class="pl-c"&gt;# skip the header&lt;/span&gt;

&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;.&lt;span class="pl-en"&gt;defaultdict&lt;/span&gt;(&lt;span class="pl-s1"&gt;int&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;cols&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;','&lt;/span&gt;)
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;passengers&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;cols&lt;/span&gt;[&lt;span class="pl-c1"&gt;3&lt;/span&gt;]
    &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-v"&gt;IndexError&lt;/span&gt;:
        &lt;span class="pl-k"&gt;continue&lt;/span&gt;
    &lt;span class="pl-k"&gt;else&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;result&lt;/span&gt;[&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;] &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;,&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;count&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/taxi.csv \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; python3 passenger_counts.py \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -nr -k2 -t, \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; column -s, -t

1  669627
2  166658
5  93718
3  44360
4  20904
6  4685
0  46

real    0m0.668s
user    0m0.633s
sys     0m0.099s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;that is a lot faster, about x3.5. if we can safely assume that the data is well formed, a simple split looks like a good idea. after peeking at this dataset for the fields we care about, this is likely ok.&lt;/p&gt;
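&lt;p&gt;to make the well formed assumption concrete, here is a minimal sketch of what goes wrong when it does not hold. the row below is made up for illustration and is not from the taxi dataset. a quoted field containing a comma is one field to the csv module, but two to a plain split, which shifts every later column by one.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# split_vs_csv.py (hypothetical file name, not part of the original post)
import csv
import io

# made up row: the second field contains a comma inside quotes
line = 'VTS,"2009-01-04, 02:52:00",1,2.63\n'

# plain split treats every comma as a delimiter, including the quoted one
naive = line.rstrip('\n').split(',')

# the csv module understands quoting and keeps the field intact
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;if a quick scan of the columns you care about shows no quotes, the cheaper split is probably fine. otherwise the csv module is the safer choice.&lt;/p&gt;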
&lt;p&gt;let's run it again with x25 more data by repeating the input over and over. using tail we can skip the header in all but the first input.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; (cat /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;i&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; {1..24}&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; tail -n+2 /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; done) \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; python3 passenger_counts.py &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m16.295s
user    0m16.101s
sys     0m2.771s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;what if we try &lt;a href="https://pypy.org" rel="nofollow"&gt;pypy&lt;/a&gt;?&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; (cat /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;i&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; {1..24}&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; tail -n+2 /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; done) \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; pypy3 passenger_counts.py &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m17.260s
user    0m16.386s
sys     0m4.011s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;well that's not ideal. let's see if we can apply performance lessons from compiled languages, which can be summarized as: avoid allocations and do as little work as possible. the following file has some &lt;a href="https://github.com/nathants/py-csv"&gt;boilerplate&lt;/a&gt; elided; refer to the full &lt;a href="https://github.com/nathants/posts/tree/001/001_scaling_python_data_processing_vertically/passenger_counts_inlined.py"&gt;source&lt;/a&gt; for the details.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# passenger_counts_inlined.py&lt;/span&gt;
...

&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;.&lt;span class="pl-en"&gt;defaultdict&lt;/span&gt;(&lt;span class="pl-s1"&gt;int&lt;/span&gt;)

... &lt;span class="pl-c"&gt;# FOR ROW IN STDIN&lt;/span&gt;
    ...

    &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;max&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;3&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;passengers&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;read_buffer&lt;/span&gt;[&lt;span class="pl-s1"&gt;starts&lt;/span&gt;[&lt;span class="pl-c1"&gt;3&lt;/span&gt;]:&lt;span class="pl-s1"&gt;ends&lt;/span&gt;[&lt;span class="pl-c1"&gt;3&lt;/span&gt;]]
        &lt;span class="pl-s1"&gt;result&lt;/span&gt;[&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;] &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;

...

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;.&lt;span class="pl-en"&gt;decode&lt;/span&gt;()&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;,&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;count&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; (cat /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;i&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; {1..24}&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; tail -n+2 /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; done) \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; pypy3 passenger_counts_inlined.py &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m10.245s
user    0m8.876s
sys     0m3.108s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;nearly a x2 improvement on user time, and about x1.7 on wall clock. we'll take it. if interested, see further optimizations in &lt;a href="https://github.com/nathants/bsv/tree/master/experiments/cut"&gt;go, rust, and c&lt;/a&gt;.&lt;/p&gt;
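&lt;p&gt;the general idea of avoiding allocations can be sketched without the elided boilerplate. this is not the code from the linked repo, just a rough illustration under some assumptions: input is read as bytes in large chunks, rows end in a newline, and fields contain no quotes. instead of splitting every row into a list of columns, it walks comma offsets and slices out only the one field it needs. the name count_field and its parameters are made up for this sketch.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# count_field_sketch.py (hypothetical, not the author's passenger_counts_inlined.py)
import collections
import io

def count_field(stream, index, chunk_size=1024 * 1024):
    # count each distinct value of the column at `index` in a binary stream
    result = collections.defaultdict(int)
    leftover = b''
    done = False
    while not done:
        chunk = stream.read(chunk_size)
        if not chunk:
            done = True
            # terminate a final row that has no trailing newline
            chunk = b'\n' if leftover else b''
        buffer = leftover + chunk
        start = 0
        while True:
            end = buffer.find(b'\n', start)
            if end == -1:
                break
            # skip `index` commas to find where the field starts
            field_start = start
            for _ in range(index):
                comma = buffer.find(b',', field_start, end)
                if comma == -1:
                    field_start = -1  # row is too short, ignore it
                    break
                field_start = comma + 1
            if field_start != -1:
                field_end = buffer.find(b',', field_start, end)
                if field_end == -1:
                    field_end = end
                # the only per row allocation is this small slice
                result[buffer[field_start:field_end]] += 1
            start = end + 1
        # keep the incomplete last row for the next chunk
        leftover = buffer[start:]
    return result

# small in-memory example with a tiny chunk size to exercise chunk boundaries
data = b'h1,h2,h3,h4,h5\nVTS,a,b,1,2.6\nVTS,a,b,3,4.5\nDDS,a,b,1,5\nshort,row\nDDS,a,b,2,1.2'
stream = io.BytesIO(data)
stream.readline()  # skip the header
counts = count_field(stream, 3, chunk_size=7)
for passengers, count in counts.items():
    print(f'{passengers.decode()},{count}')
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;on real input one would pass sys.stdin.buffer instead of the in-memory stream. whether this particular sketch is faster than split on a given interpreter is something to measure, not assume.&lt;/p&gt;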
&lt;p&gt;a final optimization we can make is to work with less data. since we know we only care about the first 5 columns, we can drop unused data upstream.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/taxi.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; cut -d, -f1-5 &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /tmp/taxi.csv.slim

real    0m0.409s
user    0m0.359s
sys     0m0.140s&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; (cat /tmp/taxi.csv.slim&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;i&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; {1..24}&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; tail -n+2 /tmp/taxi.csv.slim&lt;span class="pl-k"&gt;;&lt;/span&gt; done) \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; pypy3 passenger_counts_inlined.py &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m3.764s
user    0m3.196s
sys     0m1.155s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;nearly a x3 improvement on wall clock, we'll take it.&lt;/p&gt;
&lt;p&gt;we got our first significant improvement by avoiding allocations, and here we get another one by dropping unused data upstream.&lt;/p&gt;
&lt;p&gt;let's take another look at our improvements.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; (cat /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;i&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; {1..24}&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; tail -n+2 /tmp/taxi.csv&lt;span class="pl-k"&gt;;&lt;/span&gt; done) \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; python3 passenger_counts_stdlib.py &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m57.986s
user    0m57.854s
sys     0m3.610s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; (cat /tmp/taxi.csv.slim&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;i&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; {1..24}&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; tail -n+2 /tmp/taxi.csv.slim&lt;span class="pl-k"&gt;;&lt;/span&gt; done) \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; pypy3 passenger_counts_inlined.py &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m3.726s
user    0m3.401s
sys     0m0.907s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;by doing less work, manually inlining code, avoiding allocations, and reducing the data set, we went from about 58 seconds to under 4, roughly a x15 improvement.&lt;/p&gt;
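&lt;p&gt;for reference, the ratios between the runs can be worked out directly from the wall clock numbers. this is a small sketch that just copies the real lines from the x25 timings above and divides; the labels are mine.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# speedups.py (hypothetical helper, not part of the original post)

# wall clock seconds copied from the `real` lines of the x25 runs above
timings = {
    'python3 csv module, full rows': 57.986,
    'python3 split, full rows': 16.295,
    'pypy3 split, full rows': 17.260,
    'pypy3 inlined, full rows': 10.245,
    'pypy3 inlined, first 5 columns': 3.726,
}

# every run is compared against the stdlib csv baseline
baseline = timings['python3 csv module, full rows']
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, ratio in speedups.items():
    print(f'{name}: x{ratio:.1f}')
```
&lt;/pre&gt;&lt;/div&gt;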
&lt;p&gt;just for fun, let's take a look at going even faster. we'll explore this in a later &lt;a href="https://nathants.com/posts" rel="nofollow"&gt;post&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; cat /tmp/taxi.csv \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -n+2 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema &lt;span class="pl-k"&gt;*&lt;/span&gt;,&lt;span class="pl-k"&gt;*&lt;/span&gt;,&lt;span class="pl-k"&gt;*&lt;/span&gt;,&lt;span class="pl-k"&gt;*&lt;/span&gt;,&lt;span class="pl-k"&gt;*&lt;/span&gt;,... --filter \
    &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /tmp/taxi.bsv.slim

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; (for i &lt;span class="pl-k"&gt;in&lt;/span&gt; {1..25}&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt; cat /tmp/taxi.bsv.slim&lt;span class="pl-k"&gt;;&lt;/span&gt; done) \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; bcut 4 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; bcounteach-hash &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m0.742s
user    0m0.801s
sys     0m0.950s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;system time as the bottleneck is a really good problem to have.&lt;/p&gt;
&lt;p&gt;back to python, it's time to deploy and scale vertically. first we're going to need an ec2 instance. let's use an &lt;a href="https://aws.amazon.com/ec2/instance-types/i3en/" rel="nofollow"&gt;i3en.24xlarge&lt;/a&gt; with &lt;a href="https://wiki.archlinux.org/" rel="nofollow"&gt;archlinux&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-max-spot-price i3en.24xlarge

on demand: 10.848, spot offers 70% savings
us-east-1a 3.254400
us-east-1b 3.254400
us-east-1c 3.254400
us-east-1d 3.254400
us-east-1f 3.254400&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;looks like cost will be about $3.25/hour on spot.&lt;/p&gt;
&lt;p&gt;our machine is going to need s3 access to get the dataset, so let's make an instance profile.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-iam-ensure-instance-profile \
    --policy AmazonS3ReadOnlyAccess \
    s3-readonly&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we are also going to need a vpc, keypair, and security group access for port 22. if you already have aws set up you're probably fine, otherwise do something like this.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-vpc-new adhoc-vpc

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-authorize-ip &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;curl checkip.amazonaws.com&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt; adhoc-vpc --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-keypair-new &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;whoami&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/.ssh/id_rsa.pub&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;before we start, let's note the time.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; start=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;date +%s&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now it's time to spin up our machine.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; id=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws-ec2-new --type i3en.24xlarge \&lt;/span&gt;
&lt;span class="pl-s"&gt;                         --ami arch \&lt;/span&gt;
&lt;span class="pl-s"&gt;                         --profile s3-readonly \&lt;/span&gt;
&lt;span class="pl-s"&gt;                         test-machine&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

real    1m10.673s
user    0m2.510s
sys     0m0.434s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;it takes a moment to format the instance store ssd, so we wait.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       while true; do&lt;/span&gt;
&lt;span class="pl-s"&gt;           sleep 1&lt;/span&gt;
&lt;span class="pl-s"&gt;           df -h | grep /mnt &amp;amp;&amp;amp; break&lt;/span&gt;
&lt;span class="pl-s"&gt;       done&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now we need to install some things.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       sudo pacman -Sy --noconfirm python-pip pypy3 git&lt;/span&gt;
&lt;span class="pl-s"&gt;       sudo pip install awscli git+https://github.com/nathants/py-{util,shell,pool}&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then we bump linux limits, reboot, and wait for the machine to come back up.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       curl -s https://raw.githubusercontent.com/nathants/bootstraps/master/scripts/limits.sh | bash&lt;/span&gt;
&lt;span class="pl-s"&gt;       sudo reboot&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-wait-for-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;baking an &lt;a href="https://github.com/nathants/bootstraps/tree/master/amis"&gt;ami&lt;/a&gt; instead of starting from vanilla linux can save some bootstrap time.&lt;/p&gt;
&lt;p&gt;let's deploy our code.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp passenger_counts_inlined.py :/mnt &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;our data pipeline is going to look like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fetch the dataset&lt;/li&gt;
&lt;li&gt;select the columns we need&lt;/li&gt;
&lt;li&gt;group by and count&lt;/li&gt;
&lt;li&gt;merge results&lt;/li&gt;
&lt;/ul&gt;
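&lt;p&gt;to make the shape of the pipeline concrete, here is a minimal sketch of the four stages on a few made-up rows held in memory. the file names, the rows, and the helper functions are all hypothetical, and it assumes the passenger count is the 4th column. the real pipeline runs shell commands against s3 instead.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# toy sketch of the four pipeline stages, using in-memory rows instead of s3
import collections

# hypothetical inputs: two small "files" of csv rows
# assumption: the passenger count is the 4th column
RAW_FILES = {
    'yellow_a.csv': ['v1,t1,t2,1,0.5,x', 'v1,t1,t2,2,1.1,x', 'v2,t1,t2,1,3.0,x'],
    'yellow_b.csv': ['v2,t1,t2,1,0.7,x', 'v1,t1,t2,5,2.2,x'],
}

def fetch(name):
    # stand-in for "aws s3 cp": return the raw lines of one input
    return RAW_FILES[name]

def select(lines):
    # keep only the first five columns, like cut -d, -f1-5
    return [','.join(line.split(',')[:5]) for line in lines]

def group_and_count(lines):
    # count rows per passenger value (4th column)
    counts = collections.Counter()
    for line in lines:
        counts[line.split(',')[3]] += 1
    return counts

def merge(partials):
    # sum the per-file counters into a single result
    total = collections.Counter()
    for partial in partials:
        total.update(partial)
    return total

# stages 1 to 3 run once per input and are independent of each other
partials = [group_and_count(select(fetch(name))) for name in sorted(RAW_FILES)]

# stage 4 runs once over all partial results
result = merge(partials)
print(dict(result))
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;only the merge needs to see every partial result, which is why the first three stages can run in parallel.&lt;/p&gt;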
&lt;p&gt;step 1 will fetch and select passengers. this pipeline will run once per input key, and will run in parallel on all cpus.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# download_and_select.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;

&lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;'mkdir -p /mnt/data'&lt;/span&gt;)

&lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"s3://nyc-tlc/trip data"&lt;/span&gt;

&lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;x&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;()[&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;] &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 ls "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/"'&lt;/span&gt;).&lt;span class="pl-en"&gt;splitlines&lt;/span&gt;() &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s"&gt;'yellow'&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt;]

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;download&lt;/span&gt;(&lt;span class="pl-s1"&gt;key&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 cp "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;" - | cut -d, -f1-5 &amp;gt; /mnt/data/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;, &lt;span class="pl-s1"&gt;echo&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;cpu_count&lt;/span&gt;()

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;download&lt;/span&gt;, &lt;span class="pl-s1"&gt;keys&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp download_and_select.py :/mnt &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;python /mnt/download_and_select.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

real    1m43.209s
user    0m0.371s
sys     0m0.214s&lt;/pre&gt;&lt;/div&gt;
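&lt;p&gt;the &lt;code&gt;shell&lt;/code&gt; and &lt;code&gt;pool&lt;/code&gt; modules used above are small helper libraries. roughly the same pattern can be sketched with only the standard library. this is an approximation under that assumption, not how those libraries are implemented. the keys are hypothetical, and echo stands in for the s3 download so the sketch runs anywhere with a posix shell.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# stdlib sketch of the pattern in download_and_select.py:
# one shell pipeline per key, run from a thread pool sized to the cpu count
import concurrent.futures
import os
import subprocess

def run(command):
    # run one shell pipeline and return its stdout
    # note: check=True only sees the exit code of the last command in the pipe
    done = subprocess.run(command, shell=True, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()

def run_all(commands, size=None):
    # threads are enough here because the real work happens in child processes
    size = size or os.cpu_count()
    with concurrent.futures.ThreadPoolExecutor(max_workers=size) as executor:
        # map returns results in the same order as the inputs
        return list(executor.map(run, commands))

# hypothetical keys; echo stands in for "aws s3 cp ... -"
keys = ['a.csv', 'b.csv', 'c.csv']
commands = [f'echo "{key},1,2,3,4,5,6" | cut -d, -f1-5' for key in keys]
outputs = run_all(commands)
print(outputs)
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;each thread spends its time waiting on a child process, so the python interpreter itself is never the bottleneck.&lt;/p&gt;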
&lt;p&gt;step 2 will group by passengers and count. this pipeline will run once per input file, and will run in parallel on all cpus.&lt;/p&gt;
&lt;p&gt;we'll use shell redirection instead of cat for the input since it's more efficient.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# group_and_count.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;

&lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;'mkdir -p /mnt/results'&lt;/span&gt;)

&lt;span class="pl-s1"&gt;paths&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;files&lt;/span&gt;(&lt;span class="pl-s"&gt;'/mnt/data'&lt;/span&gt;, &lt;span class="pl-s1"&gt;abspath&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;process&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&amp;lt; &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;path&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; pypy3 /mnt/passenger_counts_inlined.py &amp;gt; /mnt/results/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;basename&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;)&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;, &lt;span class="pl-s1"&gt;echo&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;cpu_count&lt;/span&gt;()

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;process&lt;/span&gt;, &lt;span class="pl-s1"&gt;paths&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp group_and_count.py :/mnt &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;python /mnt/group_and_count.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

real    0m11.062s
user    0m0.262s
sys     0m0.018s&lt;/pre&gt;&lt;/div&gt;
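&lt;p&gt;the shell redirection in group_and_count.py hands the file straight to the child process, so no cat process has to copy bytes through a pipe. the same idea can be sketched with the standard library by passing an open file as stdin. the temp file and the line counting child below are made up for illustration.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# sketch: give a child process a file as its stdin,
# the subprocess equivalent of shell input redirection
import os
import subprocess
import sys
import tempfile

# hypothetical child program: count the lines arriving on stdin
child = 'import sys; print(sum(1 for _ in sys.stdin))'

# made-up input file with three rows
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('1,2\n3,4\n5,6\n')
    path = f.name

with open(path, 'rb') as stdin:
    # the child reads the file descriptor directly, nothing sits in between
    redirected = subprocess.run([sys.executable, '-c', child], stdin=stdin,
                                check=True, capture_output=True, text=True)

os.remove(path)

line_count = int(redirected.stdout)
print(line_count)
```
&lt;/pre&gt;&lt;/div&gt;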
&lt;p&gt;step 3 will merge the results from step 2. we haven't actually written this code yet, so let's do that now. this pipeline runs on a single core and takes all results as input.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# merge_results.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;

&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;.&lt;span class="pl-en"&gt;defaultdict&lt;/span&gt;(&lt;span class="pl-s1"&gt;int&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;','&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;result&lt;/span&gt;[&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;] &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;count&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;,&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;count&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp merge_results.py :/mnt &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       cat /mnt/results/* \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | python /mnt/merge_results.py \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | tr , " " \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | sort -nrk 2 \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | head -n9 \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | column -t&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

real    0m1.580s
user    0m0.189s
sys     0m0.038s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;a final optimization we can apply here is to combine steps 1 and 2, which will avoid iowait as a bottleneck since we never touch local disk.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# combined.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;

&lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;'mkdir -p /mnt/results'&lt;/span&gt;)

&lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"s3://nyc-tlc/trip data"&lt;/span&gt;

&lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;x&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;()[&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;] &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 ls "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/"'&lt;/span&gt;).&lt;span class="pl-en"&gt;splitlines&lt;/span&gt;() &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s"&gt;'yellow'&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt;]

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;process&lt;/span&gt;(&lt;span class="pl-s1"&gt;key&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 cp "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;" - '&lt;/span&gt;
              &lt;span class="pl-s"&gt;f'| cut -d, -f1-5'&lt;/span&gt;
              &lt;span class="pl-s"&gt;f'| pypy3 /mnt/passenger_counts_inlined.py'&lt;/span&gt;
              &lt;span class="pl-s"&gt;f'&amp;gt; /mnt/results/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;,
              &lt;span class="pl-s1"&gt;echo&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;cpu_count&lt;/span&gt;()

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;process&lt;/span&gt;, &lt;span class="pl-s1"&gt;keys&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp combined.py :/mnt &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;python /mnt/combined.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

real    0m53.036s
user    0m0.334s
sys     0m0.069s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;interesting. reading from the network is faster than writing to disk, and in this case gets us a 2x wall clock improvement.&lt;/p&gt;
&lt;p&gt;since we are paying $3/hour for this instance, let's shut it down.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-rm &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's see how much money we spent getting this result.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; job took &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$((&lt;/span&gt; ($(date &lt;span class="pl-k"&gt;+%&lt;/span&gt;s) &lt;span class="pl-k"&gt;-&lt;/span&gt; &lt;span class="pl-smi"&gt;$start&lt;/span&gt;) &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-pds"&gt;))&lt;/span&gt;&lt;/span&gt; minutes

job took 6 minutes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;for less than $1, we analyzed a 250GB dataset with python. an individual query took as little as 10 seconds reading from local disk, or 60 seconds reading from s3. vertical scaling with python is a decent technique. now that we've maxed out our instance size, the only way to scale further is to go &lt;a href="/posts/scaling-python-data-processing-horizontally"&gt;horizontal&lt;/a&gt;.&lt;/p&gt;
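&lt;p&gt;as a rough check on that cost, here is the arithmetic, assuming the spot price listed earlier and the 6 minute wall clock reported by the timer, and ignoring any other charges.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# rough cost check for the whole session
# assumption: billed at the listed spot price for the measured wall clock time
spot_price_per_hour = 3.2544  # i3en.24xlarge spot price from the listing
job_minutes = 6               # wall clock reported by the job timer

cost = spot_price_per_hour * job_minutes / 60
print(f'${cost:.2f}')
```
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;that comes to about a third of a dollar.&lt;/p&gt;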
&lt;p&gt;when analyzing data, it's always good to check the results with an alternate implementation. if they disagree, at least one of them is wrong. you can find alternate implementations of this analysis &lt;a href="https://github.com/nathants/s4/tree/go/examples/nyc_taxi_bsv"&gt;here&lt;/a&gt;.&lt;/p&gt;
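&lt;p&gt;the same idea works at a small scale too. here is a minimal sketch that merges some made-up "passengers,count" rows with two independent implementations and compares them. the rows and both function names are hypothetical, and only the first function mirrors merge_results.py.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;
```python
# sketch: check one merge implementation against an independent one
import collections
import itertools

# made-up partial results in "passengers,count" form
lines = ['1,10', '2,4', '1,5', '5,1', '2,6']

def merge_defaultdict(lines):
    # same approach as merge_results.py: accumulate into a dict
    result = collections.defaultdict(int)
    for line in lines:
        passengers, count = line.split(',')
        result[passengers] += int(count)
    return dict(result)

def merge_sorted_groupby(lines):
    # alternate approach: sort by key, then sum each run of equal keys
    rows = sorted(line.split(',') for line in lines)
    return {key: sum(int(count) for _, count in group)
            for key, group in itertools.groupby(rows, key=lambda row: row[0])}

a = merge_defaultdict(lines)
b = merge_sorted_groupby(lines)

# if the two disagree, at least one of them is wrong
assert a == b, 'implementations disagree'
print(a)
```
&lt;/pre&gt;&lt;/div&gt;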

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/posts/scaling-python-data-processing-vertically</guid>
    </item>
    <item>
      <title>scaling python data processing horizontally</title>
      <link>https://nathants.com/posts/scaling-python-data-processing-horizontally</link>
      <description>
                
                        
&lt;p&gt;full source code is available &lt;a href="https://github.com/nathants/posts/tree/002/002_scaling_python_data_processing_horizontally"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;we scaled an analysis of the nyc taxi dataset &lt;a href="/posts/scaling-python-data-processing-vertically"&gt;vertically&lt;/a&gt; on a single machine. now let's scale horizontally on multiple machines. instead of a single i3en.24xlarge we'll use twelve i3en.2xlarge.&lt;/p&gt;
&lt;p&gt;we'll be working with the &lt;a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" rel="nofollow"&gt;nyc taxi&lt;/a&gt; dataset in the aws region where it lives, us-east-1. bandwidth between ec2 and s3 is only free within the same region, so make sure you are in us-east-1 if you are following along.&lt;/p&gt;
&lt;p&gt;we'll be using some &lt;a href="https://github.com/nathants/cli-aws"&gt;aws tooling&lt;/a&gt; and the &lt;a href="https://aws.amazon.com/cli/" rel="nofollow"&gt;official aws cli&lt;/a&gt;. one could also use other tools without much trouble.&lt;/p&gt;
&lt;p&gt;we'll be using the same code and aws setup from &lt;a href="/posts/scaling-python-data-processing-vertically"&gt;vertical scaling&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;first we're going to need some ec2 instances.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-max-spot-price i3en.2xlarge

on demand: 0.904, spot offers 70% savings
us-east-1b 0.271200
us-east-1c 0.271200
us-east-1d 0.271200
us-east-1f 0.272200
us-east-1a 0.288600&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;at about $0.27/hour/instance, cost for twelve instances will be a bit over $3/hour.&lt;/p&gt;
&lt;p&gt;before we start, let's note the time.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; start=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;date +%s&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now it's time to spin up our machines. the following may look familiar. it is almost identical to how we instantiated our machine for &lt;a href="/posts/scaling-python-data-processing-vertically"&gt;vertical scaling&lt;/a&gt;, except that we capture and use multiple instance &lt;code&gt;$ids&lt;/code&gt; instead of just one &lt;code&gt;$id&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ids=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws-ec2-new --type i3en.2xlarge \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          --num 12 \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          --ami arch \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          --profile s3-readonly \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          test-machines&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

real    1m57.050s
user    0m3.154s
sys     0m0.744s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;it takes a moment to format the instance store ssd, so we wait.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       while true; do&lt;/span&gt;
&lt;span class="pl-s"&gt;           sleep 1&lt;/span&gt;
&lt;span class="pl-s"&gt;           df -h | grep /mnt &amp;amp;&amp;amp; break&lt;/span&gt;
&lt;span class="pl-s"&gt;       done&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now we need to install some things.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       sudo pacman -Sy --noconfirm python-pip pypy3 git&lt;/span&gt;
&lt;span class="pl-s"&gt;       sudo pip install awscli git+https://github.com/nathants/py-{util,shell,pool}&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then we bump linux limits, reboot, and wait for the machines to come back up.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       curl -s https://raw.githubusercontent.com/nathants/bootstraps/master/scripts/limits.sh | bash&lt;/span&gt;
&lt;span class="pl-s"&gt;       sudo reboot&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-wait-for-ssh &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;baking an &lt;a href="https://github.com/nathants/bootstraps/tree/master/amis"&gt;ami&lt;/a&gt; instead of starting from vanilla linux can save some bootstrap time.&lt;/p&gt;
&lt;p&gt;our data pipeline is going to look like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fetch the dataset&lt;/li&gt;
&lt;li&gt;select the columns we need&lt;/li&gt;
&lt;li&gt;group by and count&lt;/li&gt;
&lt;li&gt;merge results&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;step 1 will fetch and select passengers. this pipeline will run once per input key, and will run in parallel on all cpus of every machine.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# download_and_select.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;

&lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;'mkdir -p /mnt/data'&lt;/span&gt;)

&lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"s3://nyc-tlc/trip data"&lt;/span&gt;

&lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;.&lt;span class="pl-en"&gt;read&lt;/span&gt;().&lt;span class="pl-en"&gt;splitlines&lt;/span&gt;()

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;download&lt;/span&gt;(&lt;span class="pl-s1"&gt;key&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 cp "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;" - | cut -d, -f1-5 &amp;gt; /mnt/data/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;, &lt;span class="pl-s1"&gt;echo&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;cpu_count&lt;/span&gt;()

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;download&lt;/span&gt;, &lt;span class="pl-s1"&gt;keys&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;since we are running on multiple machines, we'll need to orchestrate the activity. we'll be using a local process and ssh. the local process will divide the keys to process across the machines and monitor their execution.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# orchestrate_download_and_select.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;util&lt;/span&gt;.&lt;span class="pl-s1"&gt;iter&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;

&lt;span class="pl-s1"&gt;ids&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;argv&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;:]

&lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"s3://nyc-tlc/trip data"&lt;/span&gt;

&lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;x&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;()[&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;] &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 ls "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/"'&lt;/span&gt;).&lt;span class="pl-en"&gt;splitlines&lt;/span&gt;() &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s"&gt;'yellow'&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt;]

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;download&lt;/span&gt;(&lt;span class="pl-s1"&gt;arg&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;id&lt;/span&gt;, &lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;arg&lt;/span&gt;
    &lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;keys&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws-ec2-ssh &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;id&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; --yes --cmd "python /mnt/download_and_select.py" --stdin "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;keys&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;" &amp;gt;/dev/null'&lt;/span&gt;, &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;ids&lt;/span&gt;)

&lt;span class="pl-s1"&gt;args&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;zip&lt;/span&gt;(&lt;span class="pl-s1"&gt;ids&lt;/span&gt;, &lt;span class="pl-s1"&gt;util&lt;/span&gt;.&lt;span class="pl-s1"&gt;iter&lt;/span&gt;.&lt;span class="pl-en"&gt;chunks&lt;/span&gt;(&lt;span class="pl-s1"&gt;keys&lt;/span&gt;, &lt;span class="pl-s1"&gt;num_chunks&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;ids&lt;/span&gt;)))

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;download&lt;/span&gt;, &lt;span class="pl-s1"&gt;args&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp passenger_counts_inlined.py :/mnt &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp download_and_select.py :/mnt &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python orchestrate_download_and_select.py &lt;span class="pl-smi"&gt;$ids&lt;/span&gt;

real    1m23.778s
user    0m4.588s
sys     0m1.950s&lt;/pre&gt;&lt;/div&gt;
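&lt;p&gt;the orchestrator leans on util.iter.chunks to divide keys across machines. as a rough sketch of that idea, here is one way to split a list into a fixed number of chunks. split_into_chunks is a hypothetical stand-in, not the library's implementation, and the instance ids and keys are made up.&lt;/p&gt;

```python
# hypothetical stand-in for util.iter.chunks, not the library's actual implementation

def split_into_chunks(items, num_chunks):
    # deal items round-robin so chunk sizes differ by at most one
    return [items[i::num_chunks] for i in range(num_chunks)]

# made-up instance ids and keys, for illustration only
ids = ['i-aaa', 'i-bbb', 'i-ccc']
keys = [f'yellow_tripdata_2009-{month:02d}.csv' for month in range(1, 8)]

# pair each machine with its share of the keys
assignments = dict(zip(ids, split_into_chunks(keys, num_chunks=len(ids))))

for instance_id, chunk in assignments.items():
    print(instance_id, len(chunk), chunk)
```

&lt;p&gt;round-robin is an assumption here. contiguous chunks would work just as well, as long as every key lands on exactly one machine.&lt;/p&gt;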
&lt;p&gt;step 2 will group by passengers and count. this pipeline will run once per input file, and will run in parallel on all cpus of every machine.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# group_and_count.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;

&lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;'mkdir -p /mnt/results'&lt;/span&gt;)

&lt;span class="pl-s1"&gt;paths&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;files&lt;/span&gt;(&lt;span class="pl-s"&gt;'/mnt/data'&lt;/span&gt;, &lt;span class="pl-s1"&gt;abspath&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;process&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&amp;lt; &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;path&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; pypy3 /mnt/passenger_counts_inlined.py &amp;gt; /mnt/results/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;basename&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;)&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;, &lt;span class="pl-s1"&gt;echo&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;cpu_count&lt;/span&gt;()

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;process&lt;/span&gt;, &lt;span class="pl-s1"&gt;paths&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;the local machine will orchestrate.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# orchestrate_group_and_count.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;

&lt;span class="pl-s1"&gt;ids&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;argv&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;:]

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;process&lt;/span&gt;(&lt;span class="pl-s1"&gt;id&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws-ec2-ssh &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;id&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; --yes --cmd "python /mnt/group_and_count.py" &amp;gt;/dev/null'&lt;/span&gt;, &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;ids&lt;/span&gt;)

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;process&lt;/span&gt;, &lt;span class="pl-s1"&gt;ids&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp group_and_count.py :/mnt &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python orchestrate_group_and_count.py &lt;span class="pl-smi"&gt;$ids&lt;/span&gt;

real    0m17.984s
user    0m2.980s
sys     0m0.933s&lt;/pre&gt;&lt;/div&gt;
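&lt;p&gt;for readers who don't have passenger_counts_inlined.py handy, the group and count step can be sketched roughly like this. this is not that script. it assumes the passenger count is the fourth comma separated column of the selected data, and the sample rows are made up.&lt;/p&gt;

```python
import collections
import io

def count_passengers(lines, column=3):
    # assumption: passenger count is the fourth comma separated column
    counts = collections.Counter()
    for line in lines:
        fields = line.rstrip('\n').split(',')
        try:
            counts[fields[column]] += 1
        except IndexError:
            continue  # skip blank or short lines
    return counts

# made-up rows shaped like the first five columns of a trip record
sample = io.StringIO(
    'CMT,2009-01-01 00:00:00,2009-01-01 00:10:00,1,2.5\n'
    'VTS,2009-01-01 00:05:00,2009-01-01 00:20:00,2,1.1\n'
    'CMT,2009-01-01 00:07:00,2009-01-01 00:09:00,1,0.4\n'
)

# emit the same "passengers,count" shape that the merge step reads
for passengers, count in count_passengers(sample).items():
    print(f'{passengers},{count}')
```

&lt;p&gt;a header row, if present, would be counted as just another value in this sketch.&lt;/p&gt;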
&lt;p&gt;step 3 will merge the results. this pipeline runs locally on a single core after fetching results from all machines with &lt;a href="https://wiki.archlinux.org/index.php/Rsync" rel="nofollow"&gt;rsync&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# merge_results.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;

&lt;span class="pl-s1"&gt;ids&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;' '&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;argv&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;:])

&lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;tempdir&lt;/span&gt;():
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws-ec2-rsync :/mnt/results/ ./results/ --yes &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;ids&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; 1&amp;gt;&amp;amp;2'&lt;/span&gt;, &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

    &lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;.&lt;span class="pl-en"&gt;defaultdict&lt;/span&gt;(&lt;span class="pl-s1"&gt;int&lt;/span&gt;)

    &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;path&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;files&lt;/span&gt;(&lt;span class="pl-s"&gt;'results'&lt;/span&gt;, &lt;span class="pl-s1"&gt;abspath&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;):
        &lt;span class="pl-k"&gt;with&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;path&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
            &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;f&lt;/span&gt;:
                &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;','&lt;/span&gt;)
                &lt;span class="pl-s1"&gt;result&lt;/span&gt;[&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;] &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;count&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;,&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;count&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python merge_results.py &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; tr , &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt; &lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -nrk 2 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n9 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; column -t

real    0m2.638s
user    0m0.465s
sys     0m0.095s&lt;/pre&gt;&lt;/div&gt;
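&lt;p&gt;the merge itself is just a sum of counts per key. a minimal sketch of the same idea using collections.Counter, with made-up per-file results in place of the files fetched by rsync:&lt;/p&gt;

```python
import collections

def merge_counts(lines):
    # sum the count for each passengers value across all result lines
    result = collections.Counter()
    for line in lines:
        passengers, count = line.strip().split(',')
        result[passengers] += int(count)
    return result

# made-up per-file results in the same "passengers,count" format
file_a = ['1,100', '2,40', '3,5']
file_b = ['1,250', '2,10', '5,7']

merged = merge_counts(file_a + file_b)

# roughly what sort -nrk 2 followed by head -n9 does in the shell pipeline
for passengers, count in merged.most_common(9):
    print(passengers, count)
```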
&lt;p&gt;an optimization we can apply here is to combine steps 1 and 2, which will avoid iowait as a bottleneck since we never touch local disk.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# combined.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;

&lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;'mkdir -p /mnt/results'&lt;/span&gt;)

&lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"s3://nyc-tlc/trip data"&lt;/span&gt;

&lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;.&lt;span class="pl-en"&gt;read&lt;/span&gt;().&lt;span class="pl-en"&gt;splitlines&lt;/span&gt;()

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;process&lt;/span&gt;(&lt;span class="pl-s1"&gt;key&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 cp "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;" - '&lt;/span&gt;
              &lt;span class="pl-s"&gt;f'| cut -d, -f1-5'&lt;/span&gt;
              &lt;span class="pl-s"&gt;f'| pypy3 /mnt/passenger_counts_inlined.py'&lt;/span&gt;
              &lt;span class="pl-s"&gt;f'&amp;gt; /mnt/results/&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;,
              &lt;span class="pl-s1"&gt;echo&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;cpu_count&lt;/span&gt;()

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;process&lt;/span&gt;, &lt;span class="pl-s1"&gt;keys&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# orchestrate_combined.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;util&lt;/span&gt;.&lt;span class="pl-s1"&gt;iter&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;

&lt;span class="pl-s1"&gt;ids&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;argv&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;:]

&lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"s3://nyc-tlc/trip data"&lt;/span&gt;

&lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;x&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;()[&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;] &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws s3 ls "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;prefix&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;/"'&lt;/span&gt;).&lt;span class="pl-en"&gt;splitlines&lt;/span&gt;() &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s"&gt;'yellow'&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;x&lt;/span&gt;]

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;process&lt;/span&gt;(&lt;span class="pl-s1"&gt;arg&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;id&lt;/span&gt;, &lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;arg&lt;/span&gt;
    &lt;span class="pl-s1"&gt;keys&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'&lt;span class="pl-cce"&gt;\n&lt;/span&gt;'&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;keys&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;shell&lt;/span&gt;.&lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;f'aws-ec2-ssh &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;id&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; --yes --cmd "python /mnt/combined.py" --stdin "&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;keys&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;" &amp;gt;/dev/null'&lt;/span&gt;, &lt;span class="pl-s1"&gt;stream&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)

&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;ids&lt;/span&gt;)

&lt;span class="pl-s1"&gt;args&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;zip&lt;/span&gt;(&lt;span class="pl-s1"&gt;ids&lt;/span&gt;, &lt;span class="pl-s1"&gt;util&lt;/span&gt;.&lt;span class="pl-s1"&gt;iter&lt;/span&gt;.&lt;span class="pl-en"&gt;chunks&lt;/span&gt;(&lt;span class="pl-s1"&gt;keys&lt;/span&gt;, &lt;span class="pl-s1"&gt;num_chunks&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-en"&gt;len&lt;/span&gt;(&lt;span class="pl-s1"&gt;ids&lt;/span&gt;)))

&lt;span class="pl-en"&gt;list&lt;/span&gt;(&lt;span class="pl-s1"&gt;pool&lt;/span&gt;.&lt;span class="pl-s1"&gt;thread&lt;/span&gt;.&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;process&lt;/span&gt;, &lt;span class="pl-s1"&gt;args&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp combined.py :/mnt &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python orchestrate_combined.py &lt;span class="pl-smi"&gt;$ids&lt;/span&gt;

real    1m19.735s
user    0m4.867s
sys     0m1.949s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;since we are paying $3/hour for this cluster, let's shut it down.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-rm &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's see how much money we spent getting this result.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; job took &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$((&lt;/span&gt; ($(date &lt;span class="pl-k"&gt;+%&lt;/span&gt;s) &lt;span class="pl-k"&gt;-&lt;/span&gt; &lt;span class="pl-smi"&gt;$start&lt;/span&gt;) &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-pds"&gt;))&lt;/span&gt;&lt;/span&gt; minutes

job took 8 minutes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;for less than $1, we analyzed a 250GB dataset with python on a cluster of twelve machines. an individual query took as little as 18 seconds reading from local disk, or 80 seconds reading from s3.&lt;/p&gt;
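&lt;p&gt;the cost claim is simple arithmetic. a quick check, assuming the $3/hour figure covers the whole cluster and that billing is prorated for the 8 minutes the job took:&lt;/p&gt;

```python
# figures taken from the text: roughly $3/hour, 8 minute job
# assumption: the hourly rate is for the whole cluster and is billed pro rata
dollars_per_hour = 3.0
job_minutes = 8

cost = dollars_per_hour * job_minutes / 60
print(f'approximate cost: ${cost:.2f}')
```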
&lt;p&gt;interestingly, this is up from 10 seconds and 60 seconds respectively in the &lt;a href="/posts/scaling-python-data-processing-vertically"&gt;vertical scaling&lt;/a&gt; post, suggesting that both network and disk performance vary with instance size.&lt;/p&gt;
&lt;p&gt;we've iterated rapidly on local code with a sample of data, and in production with all of the data. we've experimented with several options for a simple data pipeline on a large single machine and on multiple small machines. we've answered some questions, and discovered more. we did all of this simply, quickly, and for less than the cost of a cup of coffee. most importantly, it was fun.&lt;/p&gt;
&lt;p&gt;when analyzing data, it's always good to check the results with an alternate implementation. if they disagree, at least one of them is wrong. you can find alternate implementations of this analysis &lt;a href="https://github.com/nathants/s4/tree/go/examples/nyc_taxi_bsv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;just for fun, let's try vertical and horizontal scaling together with four i3en.24xlarge. we'll be using the &lt;a href="https://github.com/nathants/bootstraps/blob/master/amis/basic.sh"&gt;basic&lt;/a&gt; ami instead of live bootstrapping.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-max-spot-price i3en.24xlarge

on demand: 10.848, spot offers 70% savings
us-east-1a 3.254400
us-east-1b 3.254400
us-east-1c 3.254400
us-east-1d 3.254400
us-east-1f 3.254400

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; start=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;date +%s&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ids=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws-ec2-new --type i3en.24xlarge \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          --num 4 \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          --ami basic \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          --profile s3-readonly \&lt;/span&gt;
&lt;span class="pl-s"&gt;                          test-machines&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

real    4m11.740s
user    0m5.453s
sys     0m1.354s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       while true; do&lt;/span&gt;
&lt;span class="pl-s"&gt;           sleep 1&lt;/span&gt;
&lt;span class="pl-s"&gt;           df -h | grep /mnt &amp;amp;&amp;amp; break&lt;/span&gt;
&lt;span class="pl-s"&gt;       done&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp passenger_counts_inlined.py :/mnt &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp combined.py :/mnt &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python orchestrate_combined.py &lt;span class="pl-smi"&gt;$ids&lt;/span&gt;

real    0m32.145s
user    0m1.478s
sys     0m0.572s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python merge_results.py &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; tr , &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt; &lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -nrk 2 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n9 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; column -t

real    0m2.527s
user    0m0.336s
sys     0m0.057s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$ids&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;sudo poweroff&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; job took &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$((&lt;/span&gt; ($(date &lt;span class="pl-k"&gt;+%&lt;/span&gt;s) &lt;span class="pl-k"&gt;-&lt;/span&gt; &lt;span class="pl-smi"&gt;$start&lt;/span&gt;) &lt;span class="pl-k"&gt;/&lt;/span&gt; &lt;span class="pl-c1"&gt;60&lt;/span&gt; &lt;span class="pl-pds"&gt;))&lt;/span&gt;&lt;/span&gt; minutes

job took 6 minutes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;30 seconds, interesting. that's only 2x faster than the single machine used in &lt;a href="/posts/scaling-python-data-processing-vertically"&gt;vertical scaling&lt;/a&gt;, not the 4x we might expect. i wonder why?&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/posts/scaling-python-data-processing-horizontally</guid>
    </item>
    <item>
      <title>refactoring common distributed data patterns into s4</title>
      <link>https://nathants.com/posts/refactoring-common-distributed-data-patterns-into-s4</link>
      <description>
                
                        
&lt;p&gt;full source code is available &lt;a href="https://github.com/nathants/posts/tree/003/003_refactoring_common_distributed_data_patterns_into_s4"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;in &lt;a href="/posts/scaling-python-data-processing-horizontally"&gt;horizontal scaling&lt;/a&gt; we manually managed distributed compute across a cluster of machines. we used ssh to execute commands. we created directories and files to hold results. we used rsync to fetch results from multiple machines and merged them locally. we manually managed parallelism in our data scripts.&lt;/p&gt;
&lt;p&gt;this wasn't particularly difficult, but neither was it important. let's refactor and build some tooling so next time we can focus more on the data and less on low level details of distributed compute.&lt;/p&gt;
&lt;p&gt;our data pipeline looked like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fetch the dataset&lt;/li&gt;
&lt;li&gt;select the columns we need&lt;/li&gt;
&lt;li&gt;group by and count&lt;/li&gt;
&lt;li&gt;merge results&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;let's break that down a bit.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;input&lt;/th&gt;
&lt;th&gt;command&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;td&gt;fetch&lt;/td&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;td&gt;select columns&lt;/td&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;td&gt;group and count&lt;/td&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;td&gt;merge results&lt;/td&gt;
&lt;td&gt;file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;it looks like we have two types of things going on.&lt;/p&gt;
&lt;p&gt;first we have a 1:1 map of input files to output files through a command. we can imagine it as:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;file&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; inputs/&lt;span class="pl-k"&gt;*&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
    cat &lt;span class="pl-smi"&gt;$file&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-smi"&gt;$command&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; outputs/&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;basename &lt;span class="pl-smi"&gt;$file&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;second we have an n:1 map of input files to output file through a command. we can imagine it as:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat inputs/&lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-smi"&gt;$command&lt;/span&gt; &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; output&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we don't have it in this pipeline, but we can imagine a third type as a 1:n map of input file to output files through a command:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cat input &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-smi"&gt;$command&lt;/span&gt; --outdir=outputs/&lt;/pre&gt;&lt;/div&gt;
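&lt;p&gt;to make the three shapes concrete, here is a small python sketch. the helper names are hypothetical, in-memory dicts stand in for directories of files, and plain functions stand in for shell commands.&lt;/p&gt;

```python
import collections

# hypothetical helpers, with in-memory dicts standing in for directories of files

def map_1_to_1(inputs, command):
    # one output per input, under the same name
    return {name: command(data) for name, data in inputs.items()}

def map_n_to_1(inputs, command):
    # all inputs concatenated in name order, one output
    return command(''.join(inputs[name] for name in sorted(inputs)))

def map_1_to_n(data, command):
    # one input, the command decides the output names
    return command(data)

def count_lines(data):
    return f'{len(data.splitlines())}\n'

def total(data):
    return f'{sum(int(line) for line in data.splitlines())}\n'

def split_by_first_char(data):
    outputs = collections.defaultdict(str)
    for line in data.splitlines(keepends=True):
        outputs[line[0]] += line
    return dict(outputs)

inputs = {'a.csv': '1,x\n2,y\n', 'b.csv': '1,z\n'}

step1 = map_1_to_1(inputs, count_lines)
result = map_n_to_1(step1, total)
parts = map_1_to_n('1,x\n2,y\n1,z\n', split_by_first_char)
print(step1, result, parts)
```

&lt;p&gt;the 1:1 shape is the only one that parallelizes trivially, since each input can be handled without knowing about any other.&lt;/p&gt;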
&lt;p&gt;let's code by wishful thinking. what would our pipeline look like if we had something that helped us do these three types of things? let's imagine something like s3.&lt;/p&gt;
&lt;p&gt;first we fetch the dataset. our inputs will be keys, the outputs will be the key data, and the command will be copy. to start, we need to put the inputs.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prefix=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; keys=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;           &lt;span class="pl-k"&gt;|&lt;/span&gt; grep yellow \&lt;/span&gt;
&lt;span class="pl-s"&gt;           &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;key&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-smi"&gt;$keys&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
      &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; aws s3 cp - s3://inputs/&lt;span class="pl-smi"&gt;$key&lt;/span&gt;
   &lt;span class="pl-k"&gt;done&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 ls s3://inputs/ &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n3

yellow_tripdata_2009-01.csv
yellow_tripdata_2009-02.csv
yellow_tripdata_2009-03.csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now that we have our inputs, we can do a 1:1 map.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map \
    --in  s3://inputs/ \
    --out s3://step1/ \
    --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; key &amp;amp;&amp;amp; aws s3 cp $(cat key) -&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;next we select the columns with a 1:1 map.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map \
    --in  s3://step1/ \
    --out s3://step2/ \
    --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cut -d, -f1-5&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;next we group and count with a 1:1 map.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map \
    --in  s3://step2/ \
    --out s3://step3/ \
    --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;pypy3 passenger_counts_inlined.py&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;finally we merge the results with an n:1 map, and fetch the result.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map-from-n \
    --in  s3://step3/ \
    --out s3://result \
    --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;python merge_results.py&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 cp s3://result -&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's put that all together and see what we've got.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map        --in s3://inputs/ --out s3://step1/ --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; key &amp;amp;&amp;amp; aws s3 cp $(cat key) -&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map        --in s3://step1/  --out s3://step2/ --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cut -d, -f1-5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map        --in s3://step2/  --out s3://step3/ --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pypy3 passenger_counts_inlined.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 map-from-n --in s3://step3/  --out s3://result --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;python merge_results.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 cp s3://result -&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now we have a series of steps, mapping immutable inputs to immutable outputs. we have no details of infrastructure, data location, or data transfer. we can imagine taking any of these commands and running them locally to debug or optimize. this feels better than threadpools, rsync, and ssh. it's too bad none of this works.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://aws.amazon.com/s3/" rel="nofollow"&gt;s3&lt;/a&gt; is a pinnacle of modern engineering. it scales automatically, is comically durable, quite available, and significantly cheaper than &lt;a href="https://aws.amazon.com/ebs/" rel="nofollow"&gt;ebs&lt;/a&gt;. in its &lt;a href="https://aws.amazon.com/s3/storage-classes/#General_purpose" rel="nofollow"&gt;standard&lt;/a&gt; storage class it replicates across availability zones without bandwidth charges. within the same region, bandwidth between ec2 and s3 is free.&lt;/p&gt;
&lt;p&gt;we want to use s3 for durability and scalability. we also want simple distributed compute like we imagined above. let's spin up a system to complement s3. we'll call it &lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;for a moment, let's think about scope reduction and what we don't want.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we don't want it to be highly durable or available, because we already have s3 for that.&lt;/li&gt;
&lt;li&gt;we don't want it to use complicated failure handling, because we can retry idempotent commands on immutable data.&lt;/li&gt;
&lt;li&gt;we don't want it to handle security or authentication, because those can be network level concerns.&lt;/li&gt;
&lt;li&gt;we don't want it to allow updates to data unless explicitly deleted, because immutability is a simplifying constraint.&lt;/li&gt;
&lt;/ul&gt;
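&lt;p&gt;to make the last constraint concrete, here is a minimal sketch of write-once semantics. the class and its method names are hypothetical and exist only for illustration, this is not how s4 is implemented.&lt;/p&gt;

```python
# hypothetical write-once store, for illustration only (not s4's implementation)
class WriteOnceStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # refuse to overwrite: a key must be deleted before it can be written again
        if key in self._data:
            raise KeyError(f'key already exists: {key}')
        self._data[key] = value

    def get(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]


store = WriteOnceStore()
store.put('step1/a', 'v1')
try:
    store.put('step1/a', 'v2')  # rejected, the key already exists
    overwritten = True
except KeyError:
    overwritten = False
store.delete('step1/a')  # explicit delete
store.put('step1/a', 'v2')  # now the write succeeds
print(overwritten, store.get('step1/a'))
```

&lt;p&gt;because a key never changes once written, retrying a failed command is safe: it either finds the output already there or writes it for the first time.&lt;/p&gt;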
&lt;p&gt;this narrower scope means the system is easier to &lt;a href="https://github.com/nathants/s4#api"&gt;use&lt;/a&gt;, has &lt;a href="https://github.com/nathants/s4/tree/go/s4.go"&gt;simpler implementation&lt;/a&gt;, and is more likely to be &lt;a href="https://github.com/nathants/s4/tree/go/tests"&gt;correct&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;let's give it a try. first we install &lt;a href="https://github.com/nathants/s4#install"&gt;s4&lt;/a&gt; and then spin up a &lt;a href="https://github.com/nathants/s4/tree/go/scripts/new_cluster.sh"&gt;cluster&lt;/a&gt;. we'll size the cluster the same as we did in &lt;a href="/posts/scaling-python-data-processing-horizontally"&gt;horizontal scaling&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git clone https://github.com/nathants/s4

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;cd&lt;/span&gt; s4

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; python3 -m pip install -r requirements.txt &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; name=s4-cluster

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; type=i3en.xlarge num=12 bash scripts/new_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;

5m17.052s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;next we'll &lt;a href="https://github.com/nathants/s4/tree/go/scripts/connect_to_cluster.sh"&gt;proxy traffic&lt;/a&gt; through a machine in the cluster. assuming the security group only allows port 22, the machines are only accessible on their internal addresses. since we already have ssh set up, we'll use &lt;a href="https://github.com/sshuttle/sshuttle"&gt;sshuttle&lt;/a&gt;. run this in a second terminal, and don't forget to set region to us-east-1.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; name=s4-cluster

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash scripts/connect_to_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's check the cluster &lt;a href="https://github.com/nathants/s4#s4-health"&gt;health&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 health

healthy:   10.0.30.103:8080
healthy:   10.0.18.21:8080
healthy:   10.0.29.44:8080
healthy:   10.0.22.60:8080
healthy:   10.0.28.41:8080
healthy:   10.0.29.17:8080
healthy:   10.0.18.163:8080
healthy:   10.0.24.118:8080
healthy:   10.0.22.203:8080
healthy:   10.0.19.10:8080
healthy:   10.0.26.213:8080
healthy:   10.0.28.124:8080&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we want to be able to place keys on machines. we'll use &lt;a href="https://github.com/nathants/s4/search?q=%22func+hash%22&amp;amp;type=Code"&gt;consistent hashing&lt;/a&gt; to place keys automatically, or &lt;a href="https://github.com/nathants/s4/search?q=%22func+KeyPrefix%22&amp;amp;type=Code"&gt;numeric prefixes&lt;/a&gt; to place them explicitly around the cluster.&lt;/p&gt;
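&lt;p&gt;a rough sketch of the two placement modes follows. the function name, the choice of sha256, and the modulo rule are all assumptions made for illustration. it uses plain hash-modulo placement as a simple stand-in, and s4's real hashing may differ.&lt;/p&gt;

```python
import hashlib
import re


def pick_machine(key, num_machines):
    # hypothetical placement rule, for illustration only
    name = key.split('/')[-1]  # assumption: placement depends on the last path component
    match = re.match(r'(\d+)_', name)
    if match:
        # explicit placement: the numeric prefix selects the machine
        return int(match.group(1)) % num_machines
    # automatic placement: a stable hash of the name selects the machine
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest[:8], 'big') % num_machines


print(pick_machine('inputs/000_machine0', 12))  # prefix 000 selects machine 0
print(pick_machine('inputs/001_machine1', 12))  # prefix 001 selects machine 1
print(pick_machine('inputs/yellow_tripdata_2009-01.csv', 12))  # hashed placement
```

&lt;p&gt;the property that matters is determinism: the same key name always lands on the same machine, so anything that knows the name can find the data.&lt;/p&gt;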
&lt;p&gt;we want to be able to &lt;a href="https://github.com/nathants/s4#s4-cp"&gt;put&lt;/a&gt;, &lt;a href="https://github.com/nathants/s4#s4-cp"&gt;get&lt;/a&gt;, and &lt;a href="https://github.com/nathants/s4#s4-ls"&gt;list&lt;/a&gt; keys across a cluster of machines. let's try putting some data which is explicitly placed with numeric prefixes.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; input_a &lt;span class="pl-k"&gt;|&lt;/span&gt; s4 cp - s4://inputs/000_machine0

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; input_b &lt;span class="pl-k"&gt;|&lt;/span&gt; s4 cp - s4://inputs/001_machine1&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we want to be able to map &lt;a href="https://github.com/nathants/s4#s4-map"&gt;1:1&lt;/a&gt;. let's try replacing some text.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 map s4://inputs/ s4://mapped/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sed s/input/output/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;key&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;s4 ls -r s4://mapped/ &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
       &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$key&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;=&amp;gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://mapped/&lt;span class="pl-smi"&gt;$key&lt;/span&gt; cat&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
   &lt;span class="pl-k"&gt;done&lt;/span&gt;

000_machine0 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; output_a
001_machine1 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; output_b&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we want to be able to map &lt;a href="https://github.com/nathants/s4#s4-map-to-n"&gt;1:n&lt;/a&gt; to shuffle data around the cluster. each input key becomes an output directory filled with keys that will be placed around the cluster according to their name. let's try duplicating some content from the first two machines in the cluster to the next two.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 map-to-n s4://inputs/ s4://shuffled/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       cat &amp;gt; content&lt;/span&gt;
&lt;span class="pl-s"&gt;       for i in {2..3}; do&lt;/span&gt;
&lt;span class="pl-s"&gt;           file=$(printf "%03d" $i)_machine$i&lt;/span&gt;
&lt;span class="pl-s"&gt;           echo -n "$(cat content)$i" &amp;gt; $file&lt;/span&gt;
&lt;span class="pl-s"&gt;           echo $file&lt;/span&gt;
&lt;span class="pl-s"&gt;       done&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;key&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;s4 ls -r s4://shuffled/ &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
       &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$key&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;=&amp;gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://shuffled/&lt;span class="pl-smi"&gt;$key&lt;/span&gt; cat&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
   &lt;span class="pl-k"&gt;done&lt;/span&gt;

000_machine0/002_machine2 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; input_a2
000_machine0/003_machine3 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; input_a3
001_machine1/002_machine2 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; input_b2
001_machine1/003_machine3 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; input_b3&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we want to be able to merge shuffled data with &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;n:1&lt;/a&gt; maps. let's merge the content we just duplicated.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 map-from-n s4://shuffled/ s4://merged/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;xargs cat&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;key&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;s4 ls -r s4://merged/ &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
       &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$key&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;=&amp;gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://merged/&lt;span class="pl-smi"&gt;$key&lt;/span&gt; cat&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
   &lt;span class="pl-k"&gt;done&lt;/span&gt;

002_machine2 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; input_a2 input_b2
003_machine3 =&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; input_a3 input_b3&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;all files with the same name have been merged into a single file with that name.&lt;/p&gt;
&lt;p&gt;now that we've seen all of the maps in action, let's summarize their semantics.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;maps take directories as inputs and outputs and operate on the keys in those directories.&lt;/li&gt;
&lt;li&gt;1:1 map commands take data on stdin and emit data on stdout.&lt;/li&gt;
&lt;li&gt;1:n map commands take data on stdin, write files to disk, and emit file names on stdout.&lt;/li&gt;
&lt;li&gt;n:1 map commands take file names on stdin, and emit data on stdout.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;key names are important, since they define data placement around the cluster.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1:1 map operates on keys on a single machine.&lt;/li&gt;
&lt;li&gt;1:n map will shuffle output keys around the cluster.&lt;/li&gt;
&lt;li&gt;n:1 map operates on keys on a single machine.&lt;/li&gt;
&lt;/ul&gt;
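&lt;p&gt;these semantics can be sketched in memory with plain dicts. this is a toy model, not s4: the function names are made up, data is passed as strings instead of stdin, stdout, and files, and machines are ignored entirely. it reproduces the outputs of the toy examples above.&lt;/p&gt;

```python
import collections


def sim_map(inputs, fn):
    # 1:1, each input key produces one output key with the same name
    return {key: fn(data) for key, data in inputs.items()}


def sim_map_to_n(inputs, fn):
    # 1:n, each input key becomes a directory of named output keys
    outputs = {}
    for key, data in inputs.items():
        for name, content in fn(data).items():
            outputs[f'{key}/{name}'] = content
    return outputs


def sim_map_from_n(inputs, fn):
    # n:1, keys that share a name are merged into one key with that name
    groups = collections.defaultdict(list)
    for key in sorted(inputs):
        groups[key.split('/')[-1]].append(inputs[key])
    return {name: fn(chunks) for name, chunks in groups.items()}


inputs = {'000_machine0': 'input_a', '001_machine1': 'input_b'}
mapped = sim_map(inputs, lambda data: data.replace('input', 'output'))
shuffled = sim_map_to_n(
    inputs,
    lambda data: {f'{i:03d}_machine{i}': f'{data}{i}' for i in (2, 3)},
)
merged = sim_map_from_n(shuffled, ' '.join)
print(mapped)
print(shuffled)
print(merged)
```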
&lt;p&gt;now let's try redoing the analysis from &lt;a href="/posts/scaling-python-data-processing-horizontally"&gt;horizontal scaling&lt;/a&gt; with &lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;we'll be working with the &lt;a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" rel="nofollow"&gt;nyc taxi&lt;/a&gt; dataset in the aws region where it lives, us-east-1. bandwidth between ec2 and s3 is only free within the same region, so make sure you are in us-east-1 if you are following along.&lt;/p&gt;
&lt;p&gt;we'll be using some &lt;a href="https://github.com/nathants/cli-aws"&gt;aws tooling&lt;/a&gt; and the &lt;a href="https://aws.amazon.com/cli/" rel="nofollow"&gt;official aws cli&lt;/a&gt;. one could also use other tools without much trouble.&lt;/p&gt;
&lt;p&gt;we've already spun up an s4 cluster in us-east-1, but let's delete it and make a new one. clusters spin up fast and should only contain ephemeral data. they spin up even faster when using a prebuilt &lt;a href="https://github.com/nathants/bootstraps/blob/master/amis/s4.sh"&gt;ami&lt;/a&gt; instead of live bootstrapping.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; name=s4-cluster

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-rm &lt;span class="pl-smi"&gt;$name&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ami=s4 type=i3en.xlarge num=12 bash scripts/new_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;

3m43.205s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;first we deploy our code to every machine. note that we'll be referring to ec2 instances by name instead of id.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp passenger_counts_inlined.py :/mnt &lt;span class="pl-smi"&gt;$name&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now we add the s3 keys of the input data to s4 so that we can map over them.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prefix=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; keys=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;           &lt;span class="pl-k"&gt;|&lt;/span&gt; grep yellow \&lt;/span&gt;
&lt;span class="pl-s"&gt;           &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;           &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;read&lt;/span&gt; key&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;                 &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;             done)&lt;/span&gt;
&lt;span class="pl-s"/&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$keys&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;read&lt;/span&gt; key&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; s4 cp - s4://inputs/&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;basename &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's take a peek at the data.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 ls s4://inputs/ &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n3 &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

yellow_tripdata_2009-01.csv
yellow_tripdata_2009-02.csv
yellow_tripdata_2009-03.csv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://inputs/yellow_tripdata_2009-01.csv cat

s3://nyc-tlc/trip data/yellow_tripdata_2009-01.csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now let's run our data pipeline.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map s4://inputs/ s4://step1/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; url &amp;amp;&amp;amp; aws s3 cp "$(cat url)" -&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

1m4.920s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map s4://step1/ s4://step2/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cut -d, -f1-5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m46.054s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map s4://step2/ s4://step3/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pypy3 /mnt/passenger_counts_inlined.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m20.310s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we can't merge our results until they are all on one machine, so we need to map 1:n, where n=1, sending all results to the same machine. to do this we are putting all data into keys with the same name, which places them on the same machine.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-to-n s4://step3/ s4://step4/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; results &amp;amp;&amp;amp; echo results&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m1.729s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now that all results are on the same machine, we can merge the results with an n:1 map.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# merge_results.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;

&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;collections&lt;/span&gt;.&lt;span class="pl-en"&gt;defaultdict&lt;/span&gt;(&lt;span class="pl-s1"&gt;int&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;','&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;result&lt;/span&gt;[&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;] &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;count&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;passengers&lt;/span&gt;, &lt;span class="pl-s1"&gt;count&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;,&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;count&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp merge_results.py :/mnt &lt;span class="pl-smi"&gt;$name&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-from-n s4://step4/ s4://step5/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat | python /mnt/merge_results.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m0.464s&lt;/pre&gt;&lt;/div&gt;
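&lt;p&gt;to see what the merge does, here is the same logic applied to a few made-up lines. the counts below are invented for illustration and are not from the dataset, and the merge function is just merge_results.py reshaped to read a list instead of stdin.&lt;/p&gt;

```python
import collections


def merge(lines):
    # same logic as merge_results.py, but reading from a list instead of stdin
    result = collections.defaultdict(int)
    for line in lines:
        passengers, count = line.split(',')
        result[passengers] += int(count)
    return dict(result)


# made-up partial results, as if concatenated from two input files
partial = ['1,10', '2,3', '1,5', '2,4', '3,1']
print(merge(partial))  # {'1': 15, '2': 7, '3': 1}
```

&lt;p&gt;since addition is associative, it doesn't matter how the partial counts were split across files or in what order they arrive.&lt;/p&gt;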
&lt;p&gt;finally we fetch the result.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://step5/results &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       tr , ' ' &lt;span class="pl-cce"&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       | sort -nrk2 &lt;span class="pl-cce"&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       | head -n9 &lt;span class="pl-cce"&gt;\&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       | column -t&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

1  1135227331
2  239684017
5  103036920
3  70434390
6  38585794
4  34074806
0  6881330
7  2040
8  1609&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's run the pipeline again. note that keys cannot be updated, so before we can rerun the pipeline we have to delete intermediate results. we'll delete everything except the inputs.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 rm -r s4://step

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map        s4://inputs/ s4://step1/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; url &amp;amp;&amp;amp; aws s3 cp "$(cat url)" -&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

1m5.620s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map        s4://step1/  s4://step2/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cut -d, -f1-5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m38.109s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map        s4://step2/  s4://step3/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pypy3 /mnt/passenger_counts_inlined.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m19.917s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-to-n   s4://step3/  s4://step4/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; results &amp;amp;&amp;amp; echo results&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m1.641s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-from-n s4://step4/  s4://step5/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat | python /mnt/merge_results.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m0.430s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we can optimize by merging some of these steps.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 rm -r s4://step

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map s4://inputs/ s4://step1/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; url&lt;/span&gt;
&lt;span class="pl-s"&gt;                                         aws s3 cp "$(cat url)" - \&lt;/span&gt;
&lt;span class="pl-s"&gt;                                          | cut -d, -f1-5 \&lt;/span&gt;
&lt;span class="pl-s"&gt;                                          | pypy3 /mnt/passenger_counts_inlined.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

1m56.197s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;performance improves, but we can no longer measure steps independently. sometimes we should combine steps, other times we should pull them apart.&lt;/p&gt;
&lt;p&gt;while we've got the cluster up, let's do one more thing. we haven't really flexed 1:n and n:1 maps properly yet, so let's do that. the taxi dataset is organized into files by date. let's reorganize it by passenger count. this will make it easier to answer questions about the trips for a given passenger count without scanning the entire dataset.&lt;/p&gt;
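&lt;p&gt;how much did we save? a quick bit of arithmetic on the timings above, comparing the combined step against the first three steps of the second run, which are the ones it replaces. the helper name is made up for illustration.&lt;/p&gt;

```python
def seconds(minutes, secs):
    # convert a "XmY.ZZZs" timing into seconds
    return minutes * 60 + secs


# timings copied from the second pipeline run and the combined run
separate = seconds(1, 5.620) + seconds(0, 38.109) + seconds(0, 19.917)
combined = seconds(1, 56.197)
saved = separate - combined

print(f'separate: {separate:.3f}s')
print(f'combined: {combined:.3f}s')
print(f'saved: {saved:.3f}s ({100 * saved / separate:.1f}%)')
```

&lt;p&gt;that is about 7.4 seconds, or roughly 6%, most likely from skipping the intermediate writes between steps.&lt;/p&gt;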
&lt;p&gt;we're going to need a new data script for our 1:n map. it will partition data by passenger count into separate files. these files will be shuffled around the cluster according to their name. then we'll merge files with the same name into a single file. we're going to further partition each passenger count randomly into multiple files to more evenly spread the data around the cluster. we'll make 12 files per passenger count, the same as cluster size.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# partition_by_passengers.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;random&lt;/span&gt;

&lt;span class="pl-s1"&gt;cluster_size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;argv&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;])

&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;.&lt;span class="pl-en"&gt;readline&lt;/span&gt;() &lt;span class="pl-c"&gt;# skip the header&lt;/span&gt;

&lt;span class="pl-s1"&gt;files&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; {}

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;cols&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;','&lt;/span&gt;)
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;passengers&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;cols&lt;/span&gt;[&lt;span class="pl-c1"&gt;3&lt;/span&gt;])
    &lt;span class="pl-k"&gt;except&lt;/span&gt; (&lt;span class="pl-v"&gt;IndexError&lt;/span&gt;, &lt;span class="pl-v"&gt;ValueError&lt;/span&gt;):
        &lt;span class="pl-k"&gt;continue&lt;/span&gt;
    &lt;span class="pl-k"&gt;else&lt;/span&gt;:
        &lt;span class="pl-s1"&gt;randint&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;random&lt;/span&gt;.&lt;span class="pl-en"&gt;randrange&lt;/span&gt;(&lt;span class="pl-s1"&gt;cluster_size&lt;/span&gt;) &lt;span class="pl-c"&gt;# 0 to cluster_size - 1, so exactly cluster_size files per passenger count&lt;/span&gt;
        &lt;span class="pl-s1"&gt;filename&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;f'passengers_&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;passengers&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;_&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;randint&lt;/span&gt;:03d&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;.csv'&lt;/span&gt;
        &lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;filename&lt;/span&gt; &lt;span class="pl-c1"&gt;not&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;files&lt;/span&gt;:
            &lt;span class="pl-s1"&gt;files&lt;/span&gt;[&lt;span class="pl-s1"&gt;filename&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;open&lt;/span&gt;(&lt;span class="pl-s1"&gt;filename&lt;/span&gt;, &lt;span class="pl-s"&gt;'w'&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;files&lt;/span&gt;[&lt;span class="pl-s1"&gt;filename&lt;/span&gt;].&lt;span class="pl-en"&gt;write&lt;/span&gt;(&lt;span class="pl-s1"&gt;line&lt;/span&gt;)

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;name&lt;/span&gt;, &lt;span class="pl-s1"&gt;file&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;files&lt;/span&gt;.&lt;span class="pl-en"&gt;items&lt;/span&gt;():
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;name&lt;/span&gt;)
    &lt;span class="pl-s1"&gt;file&lt;/span&gt;.&lt;span class="pl-en"&gt;close&lt;/span&gt;()&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp partition_by_passengers.py :/mnt &lt;span class="pl-smi"&gt;$name&lt;/span&gt; --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 rm -r s4://step

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map        s4://inputs/ s4://step1/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat &amp;gt; url &amp;amp;&amp;amp; aws s3 cp "$(cat url)" -&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

1m16.529s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map        s4://step1/  s4://step2/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cut -d, -f1-5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m38.528s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-to-n   s4://step2/  s4://step3/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;pypy3 /mnt/partition_by_passengers.py 12&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

2m11.914s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-from-n s4://step3/  s4://step4/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m25.288s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;earlier we did a 1:n map, where n=1, sending all results to a single machine. here we did a 1:n map, where n&amp;gt;1, sending results all around the cluster.&lt;/p&gt;
&lt;p&gt;earlier we followed that with a n:1 map which ran on a single machine, since only that machine had data. here we followed that with a n:1 map which ran on every machine, since every machine had data, merging the shuffled pieces of data back into single files.&lt;/p&gt;
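&lt;p&gt;to make the two steps concrete, here is a minimal in-memory sketch of the same 1:n then n:1 flow. it is not how s4 is implemented. the function names map_to_n and map_from_n, the sample rows, and the use of lists in place of files are all assumptions made for illustration.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;

```python
import random
from collections import defaultdict


def map_to_n(lines, num_splits, rng):
    # 1:n step: split one input's lines into named buckets,
    # mirroring what partition_by_passengers.py does with files
    buckets = defaultdict(list)
    for line in lines:
        cols = line.split(',')
        try:
            passengers = int(cols[3])
        except (IndexError, ValueError):
            continue  # skip rows without a usable passenger count
        split = rng.randrange(num_splits)
        name = f'passengers_{passengers}_{split:03d}.csv'
        buckets[name].append(line)
    return dict(buckets)


def map_from_n(outputs):
    # n:1 step: merge same-named buckets from every input,
    # which is the role 'xargs cat' plays in the pipeline
    merged = defaultdict(list)
    for buckets in outputs:
        for name, lines in buckets.items():
            merged[name].extend(lines)
    return dict(merged)


# hypothetical sample rows in the 5 column layout used above
input_a = [
    'VTS,2009-01-27 14:41:00,2009-01-27 14:48:00,5,1.13',
    'DDS,2009-01-06 06:46:08,2009-01-06 07:03:10,0,4.30',
    'bad,row',
]
input_b = [
    'VTS,2009-02-01 10:00:00,2009-02-01 10:09:00,5,2.50',
    'CMT,2009-02-03 08:15:00,2009-02-03 08:30:00,1,3.10',
]

rng = random.Random(0)  # seeded so the sketch is repeatable
step3 = [map_to_n(lines, 2, rng) for lines in (input_a, input_b)]
step4 = map_from_n(step3)

for name in sorted(step4):
    print(name, len(step4[name]))
```

&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;every input produces its own set of named buckets, and the merge only ever looks at names. that is why the n:1 step can run wherever the same-named pieces end up.&lt;/p&gt;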
&lt;p&gt;since we partitioned the data in a way that spread it evenly around the cluster, we &lt;a href="https://gist.github.com/nathants/fa0044092e4c098763e35326ba704769"&gt;could&lt;/a&gt; &lt;a href="https://nathants-public.s3-us-west-2.amazonaws.com/grid.gif" rel="nofollow"&gt;see&lt;/a&gt; during processing that all machines were busy and then all went idle at the same time. if we hadn't partitioned this way we likely would have seen a few machines staying busy while the rest went idle.&lt;/p&gt;
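&lt;p&gt;a rough way to see why the extra random split matters is to compare the load per machine with and without it. this is a sketch under stated assumptions: the row counts are made-up weights with a heavy skew, and placing a file by a crc32 hash of its name is only a stand-in for however s4 actually assigns files to machines.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;

```python
import zlib

NUM_MACHINES = 12

# illustrative row counts per passenger count, in millions.
# these are made-up weights with a heavy skew, not measurements.
rows = {0: 7, 1: 1100, 2: 240, 3: 70, 4: 34, 5: 100, 6: 38, 7: 1}


def machine_for(name):
    # assumption: a file lands on a machine chosen by hashing its name
    return zlib.crc32(name.encode()) % NUM_MACHINES


def loads(num_splits):
    # rows per machine when each passenger count is cut into num_splits files
    per_machine = [0.0] * NUM_MACHINES
    for passengers, count in rows.items():
        for split in range(num_splits):
            name = f'passengers_{passengers}_{split:03d}.csv'
            per_machine[machine_for(name)] += count / num_splits
    return per_machine


for num_splits in (1, 12):
    per_machine = loads(num_splits)
    busiest = max(per_machine)
    idle = per_machine.count(0.0)
    print(f'splits={num_splits} busiest={busiest:.0f}m idle_machines={idle}')
```

&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;with one file per passenger count there are only 8 files for 12 machines, so at least 4 machines get nothing, and whichever machine holds the single passenger file carries most of the rows no matter how names are hashed.&lt;/p&gt;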
&lt;p&gt;let's take a peek at the data.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 ls s4://step4/ \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $3, $4}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n3 \
    &lt;span class="pl-k"&gt;|&lt;/span&gt; column -t

29120189  passengers_0_000.csv
29084534  passengers_0_001.csv
29021334  passengers_0_002.csv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://step4/passengers_0_000.csv &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;head -n1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

DDS,2009-01-06 06:46:08,2009-01-06 07:03:10,0,4.2999999999999998

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://step4/passengers_5_000.csv &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;head -n1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

VTS,2009-01-27 14:41:00,2009-01-27 14:48:00,5,1.1299999999999999

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://step4/passengers_5_000.csv &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;cut -d, -f2 | grep -Eo '^.{4}' | sort | uniq -c | sort -nr&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;

1145682 2009
1095902 2010
1065193 2011
 927308 2012
 771382 2013
 713021 2014
 609841 2015
 521383 2016
 414996 2017
 353959 2018
 261314 2019
  43358 2020
      1 2008&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;normally at this point we'd push the results back to s3 to make them durable, but our cluster has read-only access, so we won't be doing that.&lt;/p&gt;
&lt;p&gt;while we've got a cluster up, let's take a look at performance. what's the biggest and smallest file in the dataset?&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 ls --recursive nyc-tlc/ &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -nk3 &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -n1

2016-08-15 08:50:21 2994922424 trip data/yellow_tripdata_2012-03.csv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 ls --recursive nyc-tlc/ &lt;span class="pl-k"&gt;|&lt;/span&gt; sort -nk3 &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n3

2016-08-11 07:16:22          0 trip data/
2016-08-17 07:54:39          0 misc/
2016-08-17 07:57:08      12322 misc/taxi _zone_lookup.csv&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 ls -r s4://step1/yellow_tripdata_2012-03.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $3, $4}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

2994922424 yellow_tripdata_2012-03.csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's copy the smallest file to s4.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws s3 cp &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;s3://nyc-tlc/misc/taxi _zone_lookup.csv&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; - &lt;span class="pl-k"&gt;|&lt;/span&gt; s4 cp - s4://small/data.csv&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's copy the biggest file from s3 and from s4. we'll run this test on the first machine in the cluster, since the big file doesn't live on that machine.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; id=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws-ec2-id &lt;span class="pl-smi"&gt;$name&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n1&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       time aws s3 cp "s3://nyc-tlc/trip data/yellow_tripdata_2012-03.csv" - &amp;gt;/dev/null&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m15.018s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       time s4 cp s4://step1/yellow_tripdata_2012-03.csv - &amp;gt;/dev/null&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m3.251s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now let's copy the smallest file several times in a loop.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       set -x&lt;/span&gt;
&lt;span class="pl-s"&gt;       time for i in {1..20}; do&lt;/span&gt;
&lt;span class="pl-s"&gt;           aws s3 cp "s3://nyc-tlc/misc/taxi _zone_lookup.csv" - &amp;gt;/dev/null&lt;/span&gt;
&lt;span class="pl-s"&gt;       done&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m7.909s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --yes --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       set -x&lt;/span&gt;
&lt;span class="pl-s"&gt;       time for i in {1..20}; do&lt;/span&gt;
&lt;span class="pl-s"&gt;           s4 cp "s4://small/data.csv" - &amp;gt;/dev/null&lt;/span&gt;
&lt;span class="pl-s"&gt;       done&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

0m2.193s&lt;/pre&gt;&lt;/div&gt;
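&lt;p&gt;the timings above are easier to compare as throughput and per copy cost. the arithmetic below uses only the numbers printed above. treat the results as rough, since each number comes from a single run and the small file loop includes process startup for every copy.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;

```python
# size of yellow_tripdata_2012-03.csv from the listing above
big_file_bytes = 2994922424

# wall clock seconds copied from the timings above
big_copy_seconds = {'s3': 15.018, 's4': 3.251}
small_loop_seconds = {'s3': 7.909, 's4': 2.193}
small_loop_copies = 20

# megabytes per second for the single large copy
throughput = {
    store: big_file_bytes / seconds / 1e6
    for store, seconds in big_copy_seconds.items()
}

# average seconds per copy of the small file
per_copy = {
    store: seconds / small_loop_copies
    for store, seconds in small_loop_seconds.items()
}

for store in ('s3', 's4'):
    print(f'{store}: {throughput[store]:.0f} MB/s, '
          f'{per_copy[store]:.3f}s per small copy')
```

&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;that works out to roughly 199 MB/s for s3 against 921 MB/s for s4 on the large file, and about 0.395s against 0.110s per copy of the small file.&lt;/p&gt;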
&lt;p&gt;we're done for now, so let's delete the cluster.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-rm &lt;span class="pl-smi"&gt;$name&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;clearly s3 and s4 have different performance characteristics, and if we think about their goals, we can understand why.&lt;/p&gt;
&lt;p&gt;s3 is durable, elastic, and authenticated. s4 is ephemeral, static, and unauthenticated.&lt;/p&gt;
&lt;p&gt;s3 goes slower and almost certainly won't lose data. s4 goes faster and probably won't lose data.&lt;/p&gt;
&lt;p&gt;s3 must not fail. s4 may fail and retry strategies must be considered.&lt;/p&gt;
&lt;p&gt;these two systems are perfect complements. we want durability, but we don't need it at every step. we want distributed compute, but we don't want to manually manage the details. we want data shuffle, but we don't want complicated infrastructure or poor performance.&lt;/p&gt;
&lt;p&gt;using s4 we can focus more on our data pipelines, and less on low level details of distributed compute. our data pipelines can start, end, and checkpoint to durable data in s3. everywhere in between they can use s4 to map arbitrary commands over ephemeral immutable data in 1:1, 1:n and n:1 operations.&lt;/p&gt;
&lt;p&gt;you can find more examples of s4 &lt;a href="https://github.com/nathants/s4/tree/go/examples"&gt;here&lt;/a&gt;, where further analysis of the nyc taxi dataset is done with python and &lt;a href="https://github.com/nathants/bsv"&gt;bsv&lt;/a&gt;. to verify results and provide a performance baseline the analysis is repeated with &lt;a href="https://prestodb.io/" rel="nofollow"&gt;presto&lt;/a&gt; on &lt;a href="https://aws.amazon.com/emr/" rel="nofollow"&gt;emr&lt;/a&gt;.&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/posts/refactoring-common-distributed-data-patterns-into-s4</guid>
    </item>
    <item>
      <title>performant batch processing with bsv s4 and presto</title>
      <link>https://nathants.com/posts/performant-batch-processing-with-bsv-s4-and-presto</link>
      <description>
                
                        
&lt;p&gt;full source code is available &lt;a href="https://github.com/nathants/posts/tree/005/005_performant_batch_processing_with_bsv_s4_and_presto"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;we looked at scaling python batch processing &lt;a href="/posts/scaling-python-data-processing-vertically"&gt;vertically&lt;/a&gt; and &lt;a href="/posts/scaling-python-data-processing-horizontally"&gt;horizontally&lt;/a&gt;. we &lt;a href="/posts/refactoring-common-distributed-data-patterns-into-s4"&gt;refactored&lt;/a&gt; the details of distributed compute out of our code. we discovered a &lt;a href="/posts/data-processing-performance-with-python-go-rust-and-c"&gt;reasonable baseline&lt;/a&gt; for data processing performance on a single cpu core.&lt;/p&gt;
&lt;p&gt;let's build on these experiences and revisit the &lt;a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" rel="nofollow"&gt;nyc taxi&lt;/a&gt; dataset. we'll use &lt;a href="https://prestodb.io/" rel="nofollow"&gt;presto&lt;/a&gt; as a performance and correctness baseline to evaluate identical analysis with &lt;a href="https://github.com/nathants/bsv"&gt;bsv&lt;/a&gt; on a &lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt; cluster.&lt;/p&gt;
&lt;p&gt;we'll be working with the &lt;a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" rel="nofollow"&gt;nyc taxi&lt;/a&gt; dataset in the aws region where it lives, us-east-1. bandwidth between ec2 and s3 is only free within the same region, so make sure you are in us-east-1 if you are following along.&lt;/p&gt;
&lt;p&gt;we'll be using some &lt;a href="https://github.com/nathants/cli-aws"&gt;aws tooling&lt;/a&gt; and the &lt;a href="https://aws.amazon.com/cli/" rel="nofollow"&gt;official aws cli&lt;/a&gt;. one could also use other tools without much trouble.&lt;/p&gt;
&lt;p&gt;we're only going to use the first 5 columns, since they are consistent across the dataset. we'll create two tables so we can transform the data from csv into orc and get decent performance.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; schema.hql&lt;/span&gt;
CREATE EXTERNAL TABLE IF NOT EXISTS &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;taxi_csv&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt; (
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;vendor&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;     string,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;pickup&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;     &lt;span class="pl-k"&gt;timestamp&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;dropoff&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;    &lt;span class="pl-k"&gt;timestamp&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;passengers&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;integer&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;distance&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;   double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;,&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
STORED &lt;span class="pl-k"&gt;AS&lt;/span&gt; TEXTFILE
LOCATION &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;/taxi_csv/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
tblproperties(&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;skip.header.line.count&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;);

CREATE EXTERNAL TABLE IF NOT EXISTS &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;taxi&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt; (
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;vendor&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;     string,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;pickup&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;     &lt;span class="pl-k"&gt;timestamp&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;dropoff&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;    &lt;span class="pl-k"&gt;timestamp&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;passengers&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;integer&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;distance&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt;   double
)
STORED &lt;span class="pl-k"&gt;AS&lt;/span&gt; ORC
LOCATION &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;/taxi/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's spin up an &lt;a href="https://aws.amazon.com/emr/" rel="nofollow"&gt;emr&lt;/a&gt; cluster with &lt;a href="https://hive.apache.org/" rel="nofollow"&gt;hive&lt;/a&gt; and &lt;a href="https://prestodb.io/" rel="nofollow"&gt;presto&lt;/a&gt;. we'll size it the same as in &lt;a href="/posts/scaling-python-data-processing-horizontally"&gt;horizontal scaling&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;if you haven't used &lt;a href="https://aws.amazon.com/emr/" rel="nofollow"&gt;emr&lt;/a&gt; before you may need to create some &lt;a href="https://github.com/nathants/cli-aws/blob/master/aws-iam/aws-iam-ensure-common-roles"&gt;default iam roles&lt;/a&gt; first. then we &lt;a href="https://github.com/nathants/cli-aws/blob/master/aws-emr/aws-emr-new"&gt;spin up&lt;/a&gt; the cluster.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-iam-ensure-common-roles

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; id=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws-emr-new --count 12 \&lt;/span&gt;
&lt;span class="pl-s"&gt;                    --type i3en.2xlarge \&lt;/span&gt;
&lt;span class="pl-s"&gt;                    --applications hive,presto \&lt;/span&gt;
&lt;span class="pl-s"&gt;                    test-cluster&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; aws-emr-wait-for-state &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --state running

7m37.834s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then we fetch the dataset.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; aws-emr-ssh &lt;span class="pl-smi"&gt;$id&lt;/span&gt; --cmd &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;       s3-dist-cp --src="s3://nyc-tlc/trip data/" \&lt;/span&gt;
&lt;span class="pl-s"&gt;                  --srcPattern=".*yellow.*" \&lt;/span&gt;
&lt;span class="pl-s"&gt;                  --dest=/taxi_csv/&lt;/span&gt;
&lt;span class="pl-s"&gt;   &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

2m52.909s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then we create the tables and translate csv to orc.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-hive -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; schema.hql

0m9.091s&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; csv_to_orc.pql&lt;/span&gt;
&lt;span class="pl-k"&gt;INSERT INTO&lt;/span&gt; taxi
&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; &lt;span class="pl-k"&gt;*&lt;/span&gt;
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; taxi_csv;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-presto -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; csv_to_orc.pql

2m48.524s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now that we have a cluster with data, we can do our analysis. let's ask a few questions of different types.&lt;/p&gt;
&lt;p&gt;grouping and counting.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; count_rides_by_passengers.pql&lt;/span&gt;
&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; passengers, &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; cnt
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; taxi
&lt;span class="pl-k"&gt;GROUP BY&lt;/span&gt; passengers
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; cnt &lt;span class="pl-k"&gt;desc&lt;/span&gt;
&lt;span class="pl-k"&gt;LIMIT&lt;/span&gt; &lt;span class="pl-c1"&gt;9&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-presto -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; count_rides_by_passengers.pql

          1 &lt;span class="pl-k"&gt;|&lt;/span&gt; 1135227331
          2 &lt;span class="pl-k"&gt;|&lt;/span&gt;  239684017
          5 &lt;span class="pl-k"&gt;|&lt;/span&gt;  103036920
          3 &lt;span class="pl-k"&gt;|&lt;/span&gt;   70434390
          6 &lt;span class="pl-k"&gt;|&lt;/span&gt;   38585794
          4 &lt;span class="pl-k"&gt;|&lt;/span&gt;   34074806
          0 &lt;span class="pl-k"&gt;|&lt;/span&gt;    6881330
       NULL &lt;span class="pl-k"&gt;|&lt;/span&gt;     527580
          7 &lt;span class="pl-k"&gt;|&lt;/span&gt;       2040

0m5.775s&lt;/pre&gt;&lt;/div&gt;
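&lt;p&gt;for readers more at home in python than sql, the query above is roughly equivalent to the sketch below when run over csv rows. the function name and sample rows are hypothetical, and counting unparseable passenger values under None is an assumption about where the NULL row in the output comes from.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;

```python
from collections import Counter


def count_rides_by_passengers(lines, limit=9):
    counts = Counter()
    for line in lines:
        cols = line.split(',')
        try:
            key = int(cols[3])
        except (IndexError, ValueError):
            key = None  # plays the role of the NULL group
        counts[key] += 1
    # most_common sorts by count descending, like ORDER BY cnt desc
    return counts.most_common(limit)


# hypothetical sample rows, not taken from the dataset
sample = [
    'VTS,2009-01-27 14:41:00,2009-01-27 14:48:00,1,1.13',
    'VTS,2009-01-27 15:02:00,2009-01-27 15:20:00,1,3.40',
    'DDS,2009-01-06 06:46:08,2009-01-06 07:03:10,2,4.30',
    'CMT,2009-01-09 09:00:00,2009-01-09 09:05:00,,0.80',
]

result = count_rides_by_passengers(sample)
for passengers, cnt in result:
    print(passengers, cnt)
```

&lt;/pre&gt;&lt;/div&gt;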
&lt;p&gt;more grouping and counting.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; count_rides_by_date.pql&lt;/span&gt;
&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; YEAR(pickup), MONTH(pickup), &lt;span class="pl-c1"&gt;count&lt;/span&gt;(&lt;span class="pl-k"&gt;*&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; cnt
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; taxi
&lt;span class="pl-k"&gt;GROUP BY&lt;/span&gt; YEAR(pickup), MONTH(pickup)
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; cnt &lt;span class="pl-k"&gt;desc&lt;/span&gt;
&lt;span class="pl-k"&gt;LIMIT&lt;/span&gt; &lt;span class="pl-c1"&gt;9&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-presto -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; count_rides_by_date.pql

  2012 &lt;span class="pl-k"&gt;|&lt;/span&gt;     3 &lt;span class="pl-k"&gt;|&lt;/span&gt; 16146923
  2011 &lt;span class="pl-k"&gt;|&lt;/span&gt;     3 &lt;span class="pl-k"&gt;|&lt;/span&gt; 16066350
  2013 &lt;span class="pl-k"&gt;|&lt;/span&gt;     3 &lt;span class="pl-k"&gt;|&lt;/span&gt; 15749228
  2011 &lt;span class="pl-k"&gt;|&lt;/span&gt;    10 &lt;span class="pl-k"&gt;|&lt;/span&gt; 15707756
  2009 &lt;span class="pl-k"&gt;|&lt;/span&gt;    10 &lt;span class="pl-k"&gt;|&lt;/span&gt; 15604551
  2012 &lt;span class="pl-k"&gt;|&lt;/span&gt;     5 &lt;span class="pl-k"&gt;|&lt;/span&gt; 15567525
  2011 &lt;span class="pl-k"&gt;|&lt;/span&gt;     5 &lt;span class="pl-k"&gt;|&lt;/span&gt; 15554868
  2010 &lt;span class="pl-k"&gt;|&lt;/span&gt;     9 &lt;span class="pl-k"&gt;|&lt;/span&gt; 15540209
  2010 &lt;span class="pl-k"&gt;|&lt;/span&gt;     5 &lt;span class="pl-k"&gt;|&lt;/span&gt; 15481351

0m10.556s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;grouping and accumulating.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; sum_distance_by_date.pql&lt;/span&gt;
&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; YEAR(pickup), MONTH(pickup), cast(floor(&lt;span class="pl-c1"&gt;sum&lt;/span&gt;(distance)) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;bigint&lt;/span&gt;) &lt;span class="pl-k"&gt;as&lt;/span&gt; dst
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; taxi
&lt;span class="pl-k"&gt;GROUP BY&lt;/span&gt; YEAR(pickup), MONTH(pickup)
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; dst &lt;span class="pl-k"&gt;desc&lt;/span&gt;
&lt;span class="pl-k"&gt;LIMIT&lt;/span&gt; &lt;span class="pl-c1"&gt;9&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-presto -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; sum_distance_by_date.pql

  2013 &lt;span class="pl-k"&gt;|&lt;/span&gt;     8 &lt;span class="pl-k"&gt;|&lt;/span&gt; 975457587
  2015 &lt;span class="pl-k"&gt;|&lt;/span&gt;     4 &lt;span class="pl-k"&gt;|&lt;/span&gt; 403568758
  2010 &lt;span class="pl-k"&gt;|&lt;/span&gt;     3 &lt;span class="pl-k"&gt;|&lt;/span&gt; 372299513
  2015 &lt;span class="pl-k"&gt;|&lt;/span&gt;    11 &lt;span class="pl-k"&gt;|&lt;/span&gt; 303443064
  2010 &lt;span class="pl-k"&gt;|&lt;/span&gt;     2 &lt;span class="pl-k"&gt;|&lt;/span&gt; 216050426
  2015 &lt;span class="pl-k"&gt;|&lt;/span&gt;     3 &lt;span class="pl-k"&gt;|&lt;/span&gt; 210197223
  2015 &lt;span class="pl-k"&gt;|&lt;/span&gt;     5 &lt;span class="pl-k"&gt;|&lt;/span&gt; 179394357
  2015 &lt;span class="pl-k"&gt;|&lt;/span&gt;     1 &lt;span class="pl-k"&gt;|&lt;/span&gt; 171590254
  2015 &lt;span class="pl-k"&gt;|&lt;/span&gt;     6 &lt;span class="pl-k"&gt;|&lt;/span&gt; 145792590

0m9.844s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;finding large values.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; top_n_by_distance.pql&lt;/span&gt;
&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; cast(floor(distance) &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-k"&gt;bigint&lt;/span&gt;)
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; taxi
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; distance &lt;span class="pl-k"&gt;desc&lt;/span&gt;
&lt;span class="pl-k"&gt;LIMIT&lt;/span&gt; &lt;span class="pl-c1"&gt;9&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-presto -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; top_n_by_distance.pql

 198623013
  59016609
  19072628
  16201631
  15700000
  15420061
  15420004
  15331800
  15328400

0m5.916s&lt;/pre&gt;&lt;/div&gt;
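&lt;p&gt;a top n query does not need a full sort. as a minimal sketch, assuming csv rows with distance in the fifth column, python's heapq.nlargest streams the input and keeps only n values in memory. the function name and sample rows are hypothetical, and this says nothing about how presto executes the query.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;

```python
import heapq
import math


def top_n_by_distance(lines, n=9):
    def distances():
        for line in lines:
            cols = line.split(',')
            try:
                yield float(cols[4])
            except (IndexError, ValueError):
                continue  # skip rows without a usable distance

    # nlargest keeps only n values in memory while streaming the input
    return [math.floor(d) for d in heapq.nlargest(n, distances())]


# hypothetical sample rows, not taken from the dataset
sample = [
    'VTS,2009-01-27 14:41:00,2009-01-27 14:48:00,1,1.13',
    'VTS,2009-01-27 15:02:00,2009-01-27 15:20:00,1,15.90',
    'DDS,2009-01-06 06:46:08,2009-01-06 07:03:10,2,4.30',
    'CMT,2009-01-09 09:00:00,2009-01-09 09:05:00,3,8.20',
]

top = top_n_by_distance(sample, n=2)
print(top)
```

&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;because only n values survive, the same idea extends to many inputs: take the top n of each input, then take the top n of those results.&lt;/p&gt;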
&lt;p&gt;distributed sort.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; sort_by_distance.hql&lt;/span&gt;
CREATE EXTERNAL TABLE &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;sorted&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt; (
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;`&lt;/span&gt;distance&lt;span class="pl-pds"&gt;`&lt;/span&gt;&lt;/span&gt; double
)
STORED &lt;span class="pl-k"&gt;AS&lt;/span&gt; ORC
LOCATION &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;/sorted/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; sort_by_distance.pql&lt;/span&gt;
&lt;span class="pl-k"&gt;INSERT INTO&lt;/span&gt; sorted
&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; distance
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; taxi
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; distance &lt;span class="pl-k"&gt;desc&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-hive   -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; sort_by_distance.hql

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-presto -i &lt;span class="pl-smi"&gt;$id&lt;/span&gt; sort_by_distance.pql

9m44.334s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;finally we shut down the cluster.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-emr-rm &lt;span class="pl-smi"&gt;$id&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now let's redo the analysis with &lt;a href="https://github.com/nathants/bsv"&gt;bsv&lt;/a&gt; and &lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;first we need to install &lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt; and &lt;a href="https://github.com/nathants/s4/tree/go/scripts/new_cluster.sh"&gt;spin up a cluster&lt;/a&gt;. we're going to use an &lt;a href="https://github.com/nathants/bootstraps/blob/master/amis/s4.sh"&gt;ami&lt;/a&gt; instead of live bootstrapping to save time.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git clone https://github.com/nathants/s4

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;cd&lt;/span&gt; s4

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; python3 -m pip install -r requirements.txt &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; name=s4-cluster

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; type=i3en.2xlarge ami=s4 num=12 bash scripts/new_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;

3m41.060s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;next we'll &lt;a href="https://github.com/nathants/s4/tree/go/scripts/connect_to_cluster.sh"&gt;proxy traffic&lt;/a&gt; through a machine in the cluster. assuming the security group only allows port 22, the machines are only accessible on their internal addresses. since we already have ssh set up, we'll use &lt;a href="https://github.com/sshuttle/sshuttle"&gt;sshuttle&lt;/a&gt;. run this in a second terminal, and don't forget to set region to us-east-1.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; name=s4-cluster

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash scripts/connect_to_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's check the cluster &lt;a href="https://github.com/nathants/s4#s4-health"&gt;health&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; s4 health

healthy:   10.0.3.111:8080
healthy:   10.0.2.192:8080
healthy:   10.0.14.51:8080
healthy:   10.0.9.243:8080
healthy:   10.0.15.97:8080
healthy:   10.0.14.223:8080
healthy:   10.0.15.25:8080
healthy:   10.0.5.197:8080
healthy:   10.0.15.201:8080
healthy:   10.0.7.71:8080
healthy:   10.0.5.249:8080
healthy:   10.0.14.19:8080&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now we fetch the dataset and convert it to bsv.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; schema.sh&lt;/span&gt;
&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#!&lt;/span&gt;/bin/bash&lt;/span&gt;
&lt;span class="pl-c1"&gt;set&lt;/span&gt; -euo pipefail

prefix=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

keys=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;        &lt;span class="pl-k"&gt;|&lt;/span&gt; grep yellow \&lt;/span&gt;
&lt;span class="pl-s"&gt;        &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;        &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;read&lt;/span&gt; key&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;           &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          done)&lt;/span&gt;
&lt;span class="pl-s"/&gt;
&lt;span class="pl-s"&gt;i=0&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$keys&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;read&lt;/span&gt; key&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    num=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;printf &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;%03d&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    yearmonth=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;echo &lt;span class="pl-smi"&gt;$key&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; tr -dc 0-9 &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -c6&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$key&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; s4 cp - s4://inputs/&lt;span class="pl-smi"&gt;${num}&lt;/span&gt;_&lt;span class="pl-smi"&gt;${yearmonth}&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    i=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$((&lt;/span&gt;i&lt;span class="pl-k"&gt;+&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-pds"&gt;))&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"/&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-c1"&gt;set&lt;/span&gt; -x&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-to-n s4://inputs/ s4://columns/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    cat &amp;gt; url&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;    aws s3 cp "$(cat url)" - \&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;     | tail -n+2 \&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;     | bsv \&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;     | bschema *,*,*,a:i64,a:f64,... --filter \&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;     | bunzip $filename&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's break down what's going on here.&lt;/p&gt;
&lt;p&gt;first we find all the s3 keys of the dataset.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;prefix=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s3://nyc-tlc/trip data&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

keys=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;aws s3 ls &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;        &lt;span class="pl-k"&gt;|&lt;/span&gt; grep yellow \&lt;/span&gt;
&lt;span class="pl-s"&gt;        &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $NF}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \&lt;/span&gt;
&lt;span class="pl-s"&gt;        &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;read&lt;/span&gt; key&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;           &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$prefix&lt;/span&gt;/&lt;span class="pl-smi"&gt;$key&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-k"&gt;;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;          done)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then we put those keys into s4. since there aren't many keys, we're using numeric prefixes here to ensure the keys are spread evenly across the cluster.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;i=0
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$keys&lt;/span&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; &lt;span class="pl-k"&gt;while&lt;/span&gt; &lt;span class="pl-c1"&gt;read&lt;/span&gt; key&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
    &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$key&lt;/span&gt;
    num=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;printf &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;%03d&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-smi"&gt;$i&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
    yearmonth=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;echo &lt;span class="pl-smi"&gt;$key&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; tr -dc 0-9 &lt;span class="pl-k"&gt;|&lt;/span&gt; tail -c6&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
    &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$key&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; s4 cp - s4://inputs/&lt;span class="pl-smi"&gt;${num}&lt;/span&gt;_&lt;span class="pl-smi"&gt;${yearmonth}&lt;/span&gt;
    i=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$((&lt;/span&gt;i&lt;span class="pl-k"&gt;+&lt;/span&gt;&lt;span class="pl-c1"&gt;1&lt;/span&gt;&lt;span class="pl-pds"&gt;))&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;done&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
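&lt;p&gt;to see what names this loop produces, here is a minimal local sketch. the sample keys are hypothetical stand-ins for the real listing, and the &lt;code&gt;s4 cp&lt;/code&gt; call is replaced with &lt;code&gt;echo&lt;/code&gt; so it runs without a cluster. note that &lt;code&gt;tr -dc 0-9&lt;/code&gt; also keeps the 3 from s3, which is why &lt;code&gt;tail -c6&lt;/code&gt; is needed to isolate the year and month.&lt;/p&gt;

```shell
# name_keys: read s3 urls on stdin, print the s4 key name each would get.
# this only echoes names, it never talks to s4.
name_keys() {
    i=0
    while read -r key; do
        # zero padded counter: 000, 001, ...
        num=$(printf "%03d" "$i")
        # keep only digits, then the last 6 of them: yyyymm
        yearmonth=$(echo "$key" | tr -dc 0-9 | tail -c6)
        echo "${num}_${yearmonth}"
        i=$((i+1))
    done
}

# hypothetical sample keys shaped like the real listing
printf '%s\n' \
    's3://nyc-tlc/trip data/yellow_tripdata_2009-01.csv' \
    's3://nyc-tlc/trip data/yellow_tripdata_2009-02.csv' \
    | name_keys
```

&lt;p&gt;the numeric prefix changes on every key, so names differ early in the string even when the year and month are close together.&lt;/p&gt;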
&lt;p&gt;then we fetch the dataset and convert it to bsv.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;time&lt;/span&gt; s4 map-to-n s4://inputs/ s4://columns/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;    cat &amp;gt; url&lt;/span&gt;
&lt;span class="pl-s"&gt;    aws s3 cp "$(cat url)" - \&lt;/span&gt;
&lt;span class="pl-s"&gt;     | tail -n+2 \&lt;/span&gt;
&lt;span class="pl-s"&gt;     | bsv \&lt;/span&gt;
&lt;span class="pl-s"&gt;     | bschema *,*,*,a:i64,a:f64,... --filter \&lt;/span&gt;
&lt;span class="pl-s"&gt;     | bunzip $filename&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's break that one down a bit more.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we use &lt;a href="https://github.com/nathants/s4#s4-map-to-n"&gt;map-to-n&lt;/a&gt; because our pipeline emits file names instead of data.&lt;/li&gt;
&lt;li&gt;fetch the data.&lt;/li&gt;
&lt;li&gt;skip the csv header.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/bsv#bsv"&gt;bsv&lt;/a&gt; converts csv to bsv.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; filters for rows with at least 5 columns and discards any with fewer.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; keeps the first 5 columns of valid rows.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; converts columns 4 and 5 from ascii to numerics.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/bsv#bunzip"&gt;bunzip&lt;/a&gt; splits a single stream of 5 columns into 5 streams of 1 column and emits their file names. the original file name is used as prefix.&lt;/li&gt;
&lt;/ul&gt;
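&lt;p&gt;as a rough illustration of the filtering and column selection steps above, here is a plain text stand-in built from coreutils. it is not bsv: it splits naively on commas, does no numeric conversion, and writes no files. the sample rows and the function names are hypothetical.&lt;/p&gt;

```shell
# first_five: skip the csv header, drop rows with fewer than 5 columns,
# and keep only the first 5 columns of the rows that remain.
first_five() {
    tail -n+2 \
        | grep -E '^([^,]*,){4}' \
        | cut -d, -f1-5
}

# hypothetical sample: a header, a full row, a short row, and a wide row
sample() {
    printf '%s\n' \
        'vendor,pickup,dropoff,passengers,distance,extra' \
        'VTS,2009-01-04 02:52:00,2009-01-04 03:02:00,1,2.63,x' \
        'bad,row' \
        'CMT,2009-01-05 10:00:00,2009-01-05 10:20:00,3,4.5,y,z'
}

# the short row is dropped, the others are trimmed to 5 columns
sample | first_five

# one column at a time, loosely like the per column streams bunzip emits
sample | first_five | cut -d, -f4
```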
&lt;p&gt;let's run it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash schema.sh

1m11.860s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;now that we have a cluster with data, we can do our analysis.&lt;/p&gt;
&lt;p&gt;grouping and counting.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; count_rides_by_passengers.sh&lt;/span&gt;

s4 map-to-n s4://columns/&lt;span class="pl-k"&gt;*&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;_4 s4://tmp/01/ \
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bcounteach-hash \&lt;/span&gt;
&lt;span class="pl-s"&gt;             | bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-from-n s4://tmp/01/ s4://tmp/02/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bsumeach-hash i64 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bschema i64:a,i64:a \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://tmp/02/0 \
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tr , " " \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | sort -nrk2 \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's break that down a bit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-rm"&gt;s4 rm&lt;/a&gt; because we need blank scratch space.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-to-n"&gt;s4 map-to-n&lt;/a&gt; on a single column, use &lt;a href="https://github.com/nathants/bsv#bcounteach-hash"&gt;bcounteach-hash&lt;/a&gt; to count the values, then &lt;a href="https://github.com/nathants/bsv#bpartition"&gt;bpartition&lt;/a&gt; by 1 sending all results from around the cluster to a single machine.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to merge the results. &lt;code&gt;xargs cat&lt;/code&gt; turns file names into data, &lt;a href="https://github.com/nathants/bsv#bsumeach-hash"&gt;bsumeach-hash&lt;/a&gt; merges the counts, then &lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; converts numerics back to ascii, and &lt;a href="https://github.com/nathants/bsv#csv"&gt;csv&lt;/a&gt; converts the result to csv.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-eval"&gt;s4 eval&lt;/a&gt; to fetch the result with &lt;code&gt;tr&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, and &lt;code&gt;head&lt;/code&gt; for formatting.&lt;/li&gt;
&lt;/ul&gt;
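&lt;p&gt;the count then merge pattern above can be sketched locally with standard tools. this is only an analogy, assuming &lt;code&gt;sort | uniq -c&lt;/code&gt; as a stand-in for counting and &lt;code&gt;awk&lt;/code&gt; as a stand-in for merging. the shards and function names are hypothetical.&lt;/p&gt;

```shell
# count_each: count occurrences of each value, emitting value,count rows.
count_each() {
    sort | uniq -c | awk '{print $2 "," $1}'
}

# sum_each: merge partial value,count rows by summing the counts per value.
sum_each() {
    awk -F, '{s[$1] += $2} END {for (k in s) print k "," s[k]}' | sort
}

# two hypothetical shards of the passenger count column
shard_a() { printf '%s\n' 1 1 2; }
shard_b() { printf '%s\n' 1 5 5 2; }

# phase 1 counts each shard on its own, phase 2 merges the partial counts,
# then the same tr, sort, and head formatting is applied
{ shard_a | count_each; shard_b | count_each; } \
    | sum_each \
    | tr , " " \
    | sort -nrk2 \
    | head -n9
```

&lt;p&gt;counting per shard first means only small partial results have to move to the single machine that merges them.&lt;/p&gt;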
&lt;p&gt;let's run it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash count_rides_by_passengers.sh

1 1135227331
2 239684017
5 103036920
3 70434390
6 38585794
4 34074806
0 7408814
7 2040
8 1609

0m2.616s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;more grouping and counting.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; count_rides_by_date.sh&lt;/span&gt;

s4 map-to-n s4://columns/&lt;span class="pl-k"&gt;*&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;_2 s4://tmp/01/ \
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bschema 7* \&lt;/span&gt;
&lt;span class="pl-s"&gt;             | bcounteach-hash \&lt;/span&gt;
&lt;span class="pl-s"&gt;             | bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-from-n s4://tmp/01/ s4://tmp/02/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bsumeach-hash i64 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bschema *,i64:a \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://tmp/02/0 \
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tr , " " \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | sort -nrk2 \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's break that down a bit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-rm"&gt;s4 rm&lt;/a&gt; because we need blank scratch space.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-to-n"&gt;s4 map-to-n&lt;/a&gt; on a single column, use &lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; to select the first 7 bytes, use &lt;a href="https://github.com/nathants/bsv#bcounteach-hash"&gt;bcounteach-hash&lt;/a&gt; to count the values, then &lt;a href="https://github.com/nathants/bsv#bpartition"&gt;bpartition&lt;/a&gt; by 1 sending all results from around the cluster to a single machine.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to merge the results. &lt;code&gt;xargs cat&lt;/code&gt; turns file names into data, &lt;a href="https://github.com/nathants/bsv#bsumeach-hash"&gt;bsumeach-hash&lt;/a&gt; merges the counts, then &lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; converts numerics back to ascii, and &lt;a href="https://github.com/nathants/bsv#csv"&gt;csv&lt;/a&gt; converts the result to csv.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-eval"&gt;s4 eval&lt;/a&gt; to fetch the result with &lt;code&gt;tr&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, and &lt;code&gt;head&lt;/code&gt; for formatting.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;let's run it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash count_rides_by_date.sh

2012-03 16146923
2011-03 16066350
2013-03 15749228
2011-10 15707756
2009-10 15604551
2012-05 15567525
2011-05 15554868
2010-09 15540209
2010-05 15481351

0m3.399s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;grouping and accumulating.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; sum_distance_by_date.sh&lt;/span&gt;

s4 map-from-n s4://columns/ s4://tmp/01/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bzip 2,5 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bschema 7*,8 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bsumeach-hash f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-to-n s4://tmp/01/ s4://tmp/02/ \
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-from-n s4://tmp/02/ s4://tmp/03/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bsumeach-hash f64 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bschema 7,f64:a \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://tmp/03/0 \
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tr , " " \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | sort -nrk2 \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's break that down a bit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-rm"&gt;s4 rm&lt;/a&gt; because we need blank scratch space.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#bzip"&gt;bzip&lt;/a&gt; together columns 2 and 5, then use &lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; to select the first 7 bytes of column 1, convert column 2 to numerics, then &lt;a href="https://github.com/nathants/bsv#bsumeach-hash"&gt;bsumeach-hash&lt;/a&gt; to sum column 2 by column 1.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-to-n"&gt;s4 map-to-n&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#bpartition"&gt;bpartition&lt;/a&gt; by 1 sending all results from around the cluster to a single machine.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to merge the results. &lt;code&gt;xargs cat&lt;/code&gt; turns file names into data, &lt;a href="https://github.com/nathants/bsv#bsumeach-hash"&gt;bsumeach-hash&lt;/a&gt; merges the sums, then &lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; converts numerics back to ascii, and &lt;a href="https://github.com/nathants/bsv#csv"&gt;csv&lt;/a&gt; converts the result to csv.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-eval"&gt;s4 eval&lt;/a&gt; to fetch the result with &lt;code&gt;tr&lt;/code&gt;, &lt;code&gt;sort&lt;/code&gt;, and &lt;code&gt;head&lt;/code&gt; for formatting.&lt;/li&gt;
&lt;/ul&gt;
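&lt;p&gt;here is a minimal local sketch of truncating a key and summing by it, assuming plain csv rows instead of bsv. the shards are hypothetical and are already zipped into datetime,distance rows. &lt;code&gt;awk&lt;/code&gt; stands in for both the truncation and the sum.&lt;/p&gt;

```shell
# sum_by_month: read datetime,distance rows, truncate the datetime to its
# first 7 bytes (yyyy-mm), and sum distance per month.
sum_by_month() {
    awk -F, '{s[substr($1, 1, 7)] += $2}
             END {for (k in s) printf "%s,%.2f\n", k, s[k]}' | sort
}

# hypothetical shards of zipped datetime,distance rows
shard_a() { printf '%s\n' '2009-01-04 02:52:00,2.50' '2009-02-01 10:00:00,1.25'; }
shard_b() { printf '%s\n' '2009-01-09 08:15:00,4.00' '2009-02-03 11:30:00,0.75'; }

# partial sums per shard, then a second pass merges them.
# the second pass works unchanged because yyyy-mm keys are already 7 bytes.
{ shard_a | sum_by_month; shard_b | sum_by_month; } | sum_by_month
```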
&lt;p&gt;let's run it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash sum_distance_by_date.sh

2013-08 975457587.2201815
2015-04 403568758.3299783
2010-03 372299513.2798572
2015-11 303443064.4099629
2010-02 216050426.449974
2015-03 210197223.1599888
2015-05 179394357.3799431
2015-01 171590254.990021
2015-06 145792590.1599617

0m7.130s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;finding large values.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; top_n_by_distance.sh&lt;/span&gt;

s4 map s4://columns/&lt;span class="pl-k"&gt;*&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;_5 s4://tmp/01/ \
       &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;btopn 9 f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-from-n s4://tmp/01/ s4://tmp/02/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -r f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-to-n s4://tmp/02/ s4://tmp/03/ \
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-from-n s4://tmp/03/ s4://tmp/04/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -r f64 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bhead 9 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | bschema f64:a \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://tmp/04/0 \
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cat&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's break that down a bit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-rm"&gt;s4 rm&lt;/a&gt; because we need blank scratch space.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map"&gt;s4 map&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#btopn"&gt;btopn&lt;/a&gt; over column 5, accumulating the top 9 values.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#bmerge"&gt;bmerge&lt;/a&gt; all results into a single result per machine.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-to-n"&gt;s4 map-to-n&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#bpartition"&gt;bpartition&lt;/a&gt; by 1 sending all results from around the cluster to a single machine.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to merge the results. &lt;a href="https://github.com/nathants/bsv#bmerge"&gt;bmerge&lt;/a&gt; combines the results, &lt;a href="https://github.com/nathants/bsv#bhead"&gt;bhead&lt;/a&gt; takes the top 9, then &lt;a href="https://github.com/nathants/bsv#bschema"&gt;bschema&lt;/a&gt; converts numerics back to ascii, and &lt;a href="https://github.com/nathants/bsv#csv"&gt;csv&lt;/a&gt; converts the result to csv.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-eval"&gt;s4 eval&lt;/a&gt; to fetch the result with &lt;code&gt;cat&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
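&lt;p&gt;the reason the steps above give the right answer can be shown with a small local sketch, assuming &lt;code&gt;sort -rn | head&lt;/code&gt; as a stand-in for top n. the shards and the function name are hypothetical.&lt;/p&gt;

```shell
# top_n: keep the n largest numeric values from stdin, largest first.
top_n() {
    sort -rn | head -n "$1"
}

# hypothetical shards of the distance column
shard_a() { printf '%s\n' 1.5 42 7 3; }
shard_b() { printf '%s\n' 99 0.2 8 15; }

# top 3 per shard, then top 3 of the combined candidates.
# any value in the global top 3 must be in the top 3 of its own shard,
# so nothing is lost by trimming early.
{ shard_a | top_n 3; shard_b | top_n 3; } | top_n 3
```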
&lt;p&gt;let's run it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash top_n_by_distance.sh

198623013.6
59016609.3
19072628.8
16201631.4
15700000
15420061
15420004.5
15331800
15328400

0m2.832s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;distributed sort.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; sort_by_distance.sh&lt;/span&gt;

s4 map s4://columns/&lt;span class="pl-k"&gt;*&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;_5 s4://tmp/01/ \
      &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bsort -r f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-from-n s4://tmp/01/ s4://tmp/02/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -r f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-to-n s4://tmp/02/ s4://tmp/03/ \
            &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition -l 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 map-from-n s4://tmp/03/ s4://tmp/04/ \
              &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -lr f64 \&lt;/span&gt;
&lt;span class="pl-s"&gt;               | blz4&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://tmp/04/0 \
        &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;blz4d \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | bschema f64:a \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | csv \&lt;/span&gt;
&lt;span class="pl-s"&gt;         | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's break that down a bit.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-rm"&gt;s4 rm&lt;/a&gt; because we need blank scratch space.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map"&gt;s4 map&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#bsort"&gt;bsort&lt;/a&gt; column 5.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#bmerge"&gt;bmerge&lt;/a&gt; all results into a single result per machine.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-to-n"&gt;s4 map-to-n&lt;/a&gt; to &lt;a href="https://github.com/nathants/bsv#bpartition"&gt;bpartition&lt;/a&gt; by 1 sending all results from around the cluster to a single machine.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt; to merge the results. &lt;a href="https://github.com/nathants/bsv#bmerge"&gt;bmerge&lt;/a&gt; combines the results.&lt;/li&gt;
&lt;li&gt;we &lt;a href="https://github.com/nathants/s4#s4-eval"&gt;s4 eval&lt;/a&gt; to fetch the first few rows, using &lt;code&gt;blz4d&lt;/code&gt; to decompress, &lt;code&gt;bschema&lt;/code&gt; and &lt;code&gt;csv&lt;/code&gt; to convert back to ascii, and &lt;code&gt;head&lt;/code&gt; to limit the output.&lt;/li&gt;
&lt;li&gt;we use lz4 compression at several steps to mitigate iowait.&lt;/li&gt;
&lt;/ul&gt;
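&lt;p&gt;the sort then merge pattern above can be tried locally with coreutils. this is a sketch under assumptions: &lt;code&gt;sort -rn&lt;/code&gt; stands in for sorting each shard, &lt;code&gt;sort -m&lt;/code&gt; stands in for the merge, and compression is left out. the shards and the function name are hypothetical.&lt;/p&gt;

```shell
# sort_shards: sort each shard on its own, then merge the sorted runs.
# sort -m only merges, it assumes every input file is already sorted.
sort_shards() {
    tmp=$(mktemp -d)

    # hypothetical shards of the distance column, sorted descending into files
    printf '%s\n' 3.1 12 0.4 | sort -rn -o "$tmp/a"
    printf '%s\n' 7 25.5 1 | sort -rn -o "$tmp/b"

    # merging is a single linear pass over the sorted runs
    sort -m -rn "$tmp/a" "$tmp/b"

    rm -r "$tmp"
}

sort_shards | head -n9
```

&lt;p&gt;the expensive sorting happens in parallel on small inputs, and only the cheap merge has to see everything.&lt;/p&gt;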
&lt;p&gt;let's run it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash sort_by_distance.sh

2m10.216s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we're done for now, so let's delete the cluster.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-rm &lt;span class="pl-smi"&gt;$name&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's put our results in a table.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;presto seconds&lt;/th&gt;
&lt;th&gt;s4 seconds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;count rides by passengers&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/count_rides_by_passengers.pql"&gt;6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/count_rides_by_passengers.sh"&gt;3&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;count rides by date&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/count_rides_by_date.pql"&gt;11&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/count_rides_by_date.sh"&gt;3&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sum distance by date&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/sum_distance_by_date.pql"&gt;10&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/sum_distance_by_date.sh"&gt;7&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top n by distance&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/top_n_by_distance.pql"&gt;6&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/top_n_by_distance.sh"&gt;3&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;distributed sort&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/sort_by_distance.pql"&gt;584&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/005/005_performant_batch_processing_with_bsv_s4_and_presto/sort_by_distance.sh"&gt;130&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
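&lt;p&gt;to put the table in relative terms, here is a small sketch that divides presto seconds by s4 seconds. the numbers are copied from the table above, and the function name is hypothetical.&lt;/p&gt;

```shell
# speedup: read query,presto_seconds,s4_seconds rows and print query,ratio.
speedup() {
    awk -F, '{printf "%s,%.1f\n", $1, $2 / $3}'
}

# seconds copied from the results table
printf '%s\n' \
    'count rides by passengers,6,3' \
    'count rides by date,11,3' \
    'sum distance by date,10,7' \
    'top n by distance,6,3' \
    'distributed sort,584,130' \
    | speedup
```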
&lt;p&gt;so &lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt; and &lt;a href="https://github.com/nathants/bsv"&gt;bsv&lt;/a&gt; exceed our performance baseline. we could use them for batch processing. should we? it depends.&lt;/p&gt;
&lt;p&gt;let's look again at one of the queries.&lt;/p&gt;
&lt;div class="highlight highlight-source-sql"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;--&lt;/span&gt; sort_by_distance.pql&lt;/span&gt;
&lt;span class="pl-k"&gt;INSERT INTO&lt;/span&gt; sorted
&lt;span class="pl-k"&gt;SELECT&lt;/span&gt; distance
&lt;span class="pl-k"&gt;FROM&lt;/span&gt; taxi
&lt;span class="pl-k"&gt;ORDER BY&lt;/span&gt; distance &lt;span class="pl-k"&gt;desc&lt;/span&gt;;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; sort_by_distance.sh&lt;/span&gt;
s4 map        s4://columns/&lt;span class="pl-k"&gt;*&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;_5 s4://tmp/01/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bsort -r f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 map-from-n s4://tmp/01/       s4://tmp/02/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -r f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 map-to-n   s4://tmp/02/       s4://tmp/03/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition -l 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 map-from-n s4://tmp/03/       s4://tmp/04/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -lr f64 | blz4&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;the presto query is high level. it expresses what we want to do, not how to do it.&lt;/p&gt;
&lt;p&gt;the s4 query is low level. it expresses how to do it, which, if correct, results in what we want.&lt;/p&gt;
&lt;p&gt;the presto query will be automatically transformed into executable steps by a query planner.&lt;/p&gt;
&lt;p&gt;the s4 query is the executable steps, manually planned.&lt;/p&gt;
&lt;p&gt;the presto query is difficult to extend in arbitrary ways.&lt;/p&gt;
&lt;p&gt;the s4 query is easy to extend in arbitrary ways. any executable or shell snippet can be inserted into the pipeline of an existing step or as a new step.&lt;/p&gt;
&lt;p&gt;the presto query has implicit intermediate results, which are not accessible.&lt;/p&gt;
&lt;p&gt;the s4 query has explicit intermediate results, which are accessible.&lt;/p&gt;
&lt;p&gt;the presto query has multiple implicit steps which are difficult to analyze and measure independently.&lt;/p&gt;
&lt;p&gt;the s4 query has multiple explicit steps which are easy to analyze and measure independently. in fact, we omitted it from the results before, but the s4 query timed each step.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash sort_by_distance.sh

+ s4 map &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;s4://columns/*/*_5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; s4://tmp/01/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bsort -r f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
ok ok ok ok ok ok ok ok ok ok ok ok
0m21.215s

+ s4 map-from-n s4://tmp/01/ s4://tmp/02/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -r f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
ok ok ok ok ok ok ok ok ok ok ok ok
0m1.815s

+ s4 map-to-n s4://tmp/02/ s4://tmp/03/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition -l 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
ok ok ok ok ok ok ok ok ok ok ok ok
0m1.432s

+ s4 map-from-n s4://tmp/03/ s4://tmp/04/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bmerge -lr f64 | blz4&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
ok
1m43.728s

2m10.216s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;as we might expect, the final merge on a single machine is slow, taking 1m44s of the 2m10s total. surprisingly, the merge and shuffle steps were very fast, at under 2 seconds each. i wonder how much time shuffle took for presto?&lt;/p&gt;
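&lt;p&gt;the sort then merge shape of the s4 steps above can be sketched in plain python on one machine. this is only an illustration of the pattern, not how bsort or bmerge are implemented. the function names are hypothetical, and random floats stand in for the distance column.&lt;/p&gt;

```python
import heapq
import random

def sort_runs(shards):
    # like 'bsort -r f64': sort every shard independently, descending
    return [sorted(shard, reverse=True) for shard in shards]

def merge_runs(runs):
    # like 'bmerge -r f64': merge runs that are already sorted descending
    return list(heapq.merge(*runs, reverse=True))

if __name__ == "__main__":
    random.seed(0)
    # 12 shards, one per machine in the cluster used above
    shards = [[random.random() for _ in range(1000)] for _ in range(12)]
    merged = merge_runs(sort_runs(shards))
    flat = [x for shard in shards for x in shard]
    print(merged == sorted(flat, reverse=True))
```

&lt;p&gt;the per shard sorts are independent, so they parallelize across machines. the final merge is a single ordered stream, which is consistent with it dominating the runtime above.&lt;/p&gt;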
&lt;p&gt;&lt;a href="https://prestodb.io/" rel="nofollow"&gt;presto&lt;/a&gt; is excellent, and significantly faster than the &lt;a href="https://hive.apache.org/" rel="nofollow"&gt;previous generation&lt;/a&gt;. it should be used, at a minimum, to check the correctness of your batch processing.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt; and &lt;a href="https://github.com/nathants/bsv"&gt;bsv&lt;/a&gt; are primitives for distributed data processing. they are low level, high performance, and flexible. they should be used, at a minimum, to establish a performance baseline.&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/posts/performant-batch-processing-with-bsv-s4-and-presto</guid>
    </item>
    <item>
      <title>optimizing a bsv data processing pipeline</title>
      <link>https://nathants.com/posts/optimizing-a-bsv-data-processing-pipeline</link>
      <description>
                
                        
&lt;p&gt;full source code is available &lt;a href="https://github.com/nathants/posts/tree/006/006_optimizing_a_bsv_data_processing_pipeline"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;in &lt;a href="/posts/performant-batch-processing-with-bsv-s4-and-presto"&gt;performant batch processing&lt;/a&gt; we composed &lt;a href="https://github.com/nathants/bsv#tools"&gt;simple tools&lt;/a&gt; into data pipelines. there are many benefits to this. simple tools are easier to write, test, and audit. they can even be shell snippets or existing unix utilities. they can be written in any language and rebuilt as needed. simple tools can compose into arbitrarily complex pipelines, and if something is out of reach you can always add another &lt;a href="https://github.com/nathants/bsv#bquantile-sketch"&gt;simple&lt;/a&gt; &lt;a href="https://github.com/nathants/bsv#bquantile-merge"&gt;tool&lt;/a&gt;. simple tools can even be &lt;a href="/posts/data-processing-performance-with-python-go-rust-and-c"&gt;performant&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;there is a cost to composing simple tools into data pipelines. primarily this cost is serialization and copies. &lt;a href="https://github.com/nathants/bsv#layout"&gt;efficient data formats&lt;/a&gt; and &lt;a href="https://github.com/nathants/bsv#install"&gt;increased pipe sizes&lt;/a&gt; mitigate this, but don't eliminate it.&lt;/p&gt;
&lt;p&gt;let's install &lt;a href="https://github.com/nathants/bsv#install"&gt;bsv&lt;/a&gt; then measure the cost.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; _gen_bsv 8 12000000 &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /tmp/data.bsv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ls -lh /tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $5}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
1.1G

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/data.bsv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m0.124s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; cat &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m0.302s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; cat &lt;span class="pl-k"&gt;|&lt;/span&gt; cat &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m0.439s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cat /tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; cat &lt;span class="pl-k"&gt;|&lt;/span&gt; cat &lt;span class="pl-k"&gt;|&lt;/span&gt; cat &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m0.537s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m0.890s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m1.137s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;|&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m1.228s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;|&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;|&lt;/span&gt; bcopy &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null
0m1.432s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;so even when we just copy bytes with cat, time goes up as the pipeline grows. the effect is even greater when parsing and serialization are performed at each step with &lt;a href="https://github.com/nathants/bsv/blob/master/src/bcopy.c"&gt;bcopy&lt;/a&gt;.&lt;/p&gt;
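&lt;p&gt;the cat measurement above can be repeated without bsv installed. a minimal sketch, assuming a posix shell with cat on the path. the helper names, the scratch file of zero bytes, and the 64mb size are all hypothetical choices, so absolute timings will not match the numbers above.&lt;/p&gt;

```python
import os
import shlex
import subprocess
import tempfile
import time

def time_pipeline(path, stages):
    # build "cat FILE | cat | cat ..." with the requested number of processes
    cmd = " | ".join(["cat " + shlex.quote(path)] + ["cat"] * (stages - 1))
    start = time.monotonic()
    # the last process inherits stdout, which is discarded
    subprocess.run(cmd, shell=True, check=True, stdout=subprocess.DEVNULL)
    return time.monotonic() - start

def measure(size_mb=64, max_stages=4):
    # write a scratch file, then time pipelines of 1..max_stages processes
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(1024 * 1024) * size_mb)
        path = f.name
    try:
        return {n: time_pipeline(path, n) for n in range(1, max_stages + 1)}
    finally:
        os.remove(path)

if __name__ == "__main__":
    for stages, seconds in measure().items():
        print(stages, "processes:", round(seconds, 3), "seconds")
```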
&lt;p&gt;when we are doing &lt;a href="/posts/refactoring-common-distributed-data-patterns-into-s4"&gt;distributed compute&lt;/a&gt; there will be serialization. it's required before data can go over the network. for convenience, we also use it between every process in the pipelines we compose, which simplifies their interface. the benefit is convenience, the cost is performance. this convenience helps us to quickly prototype pipelines and integrate new tools. once a pipeline has stabilized, we can optimize the intermediate serialization out.&lt;/p&gt;
&lt;p&gt;first we need to install &lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt; and &lt;a href="https://github.com/nathants/s4/tree/go/scripts/new_cluster.sh"&gt;spin up a cluster&lt;/a&gt;. we're going to use an &lt;a href="https://github.com/nathants/bootstraps/blob/master/amis/s4.sh"&gt;ami&lt;/a&gt; instead of live bootstrapping to save time.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git clone https://github.com/nathants/s4

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;cd&lt;/span&gt; s4

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; python3 -m pip install -r requirements.txt &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; name=s4-cluster

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; type=i3en.2xlarge ami=s4 num=12 bash scripts/new_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;next we'll &lt;a href="https://github.com/nathants/s4/tree/go/scripts/connect_to_cluster.sh"&gt;proxy traffic&lt;/a&gt; through a machine in the cluster. assuming the security group only allows port 22, the machines are only accessible on their internal addresses. since we already have ssh set up, we'll use &lt;a href="https://github.com/sshuttle/sshuttle"&gt;sshuttle&lt;/a&gt;. run this in a second terminal, and don't forget to set region to us-east-1.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;export&lt;/span&gt; region=us-east-1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; name=s4-cluster

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash scripts/connect_to_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's take a look at one of the queries from &lt;a href="https://nathants.com/posts/performant-batch-processing-with-bsv-s4-and-presto" rel="nofollow"&gt;performant batch processing&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; sum_distance_by_date.sh&lt;/span&gt;
s4 map-from-n s4://columns/ s4://tmp/01/   &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bzip 2,5 | bschema 7*,8 | bsumeach-hash f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 map-to-n   s4://tmp/01/  s4://tmp/02/   &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 map-from-n s4://tmp/02/  s4://tmp/03/   &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat | bsumeach-hash f64 | bschema 7,f64:a | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt;       s4://tmp/03/0                &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tr , " " | sort -nrk2 | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
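&lt;p&gt;the first three steps follow a common shape: aggregate locally on each machine, shuffle by key, then aggregate again. here is a minimal single process python sketch of that shape, not of s4 or bsv themselves. the function names are hypothetical, and crc32 is an assumption standing in for whatever hash bpartition actually uses.&lt;/p&gt;

```python
import zlib
from collections import defaultdict

def local_sums(rows):
    # sum values per key, like 'bsumeach-hash f64' on one machine
    sums = defaultdict(float)
    for key, value in rows:
        sums[key] += value
    return dict(sums)

def partition(sums, n):
    # assign every key to one of n buckets by hash, like 'bpartition n'
    buckets = [[] for _ in range(n)]
    for key, value in sums.items():
        buckets[zlib.crc32(key) % n].append((key, value))
    return buckets

def run(machines, n):
    # step 1: aggregate locally, step 2: shuffle, step 3: aggregate again
    shuffled = [[] for _ in range(n)]
    for rows in machines:
        for i, bucket in enumerate(partition(local_sums(rows), n)):
            shuffled[i].extend(bucket)
    totals = {}
    for bucket in shuffled:
        totals.update(local_sums(bucket))
    return totals

if __name__ == "__main__":
    machines = [
        [(b"2015-01", 1.5), (b"2015-02", 2.0)],
        [(b"2015-01", 2.5)],
    ]
    print(run(machines, 1))
```

&lt;p&gt;a key always hashes to the same bucket, so each final sum is computed in exactly one place. with n=1, every key lands in the same bucket.&lt;/p&gt;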
&lt;p&gt;let's run it and see how long each step takes.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash schema.sh

1m9.272s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash sum_distance_by_date.sh

+ s4 map-from-n s4://columns/ s4://tmp/01/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bzip 2,5 | bschema 7*,8 | bsumeach-hash f64&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m5.718s

+ s4 map-to-n s4://tmp/01/ s4://tmp/02/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m1.427s

+ s4 map-from-n s4://tmp/02/ s4://tmp/03/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat | bsumeach-hash f64 | bschema 7,f64:a | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m0.349s

+ s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://tmp/03/0 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tr , " " | sort -nrk2 | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m0.161s

0m7.655s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;the majority of runtime is in the first step. let's try to replace that pipeline with a single executable. we'll base it off &lt;a href="https://github.com/nathants/bsv/blob/master/src/bzip.c"&gt;bzip.c&lt;/a&gt;, and then insert functionality from &lt;a href="https://github.com/nathants/bsv/blob/master/src/bschema.c"&gt;bschema.c&lt;/a&gt; and &lt;a href="https://github.com/nathants/bsv/blob/master/src/bsumeach_hash.c"&gt;bsumeach_hash.c&lt;/a&gt;. let's look at the diff of our new &lt;a href="https://github.com/nathants/posts/blob/006/006_optimizing_a_bsv_data_processing_pipeline/step1.c"&gt;step1.c&lt;/a&gt; against the original &lt;a href="https://github.com/nathants/bsv/blob/master/src/bzip.c"&gt;bzip.c&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-c"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;diff&lt;/span&gt; &lt;span class="pl-c1"&gt;--&lt;/span&gt;&lt;span class="pl-s1"&gt;git&lt;/span&gt; &lt;span class="pl-s1"&gt;a&lt;/span&gt;/~/&lt;span class="pl-s1"&gt;repos&lt;/span&gt;/&lt;span class="pl-s1"&gt;bsv&lt;/span&gt;/&lt;span class="pl-s1"&gt;src&lt;/span&gt;/&lt;span class="pl-s1"&gt;bzip&lt;/span&gt;.&lt;span class="pl-c1"&gt;c&lt;/span&gt; &lt;span class="pl-s1"&gt;b&lt;/span&gt;/&lt;span class="pl-s1"&gt;step1&lt;/span&gt;.&lt;span class="pl-c1"&gt;c&lt;/span&gt;
&lt;span class="pl-s1"&gt;index&lt;/span&gt; &lt;span class="pl-s1"&gt;d393f10&lt;/span&gt;.&lt;span class="pl-c1"&gt;.4e12b7a&lt;/span&gt; &lt;span class="pl-c1"&gt;100644&lt;/span&gt;
&lt;span class="pl-c1"&gt;--&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt; &lt;span class="pl-s1"&gt;a&lt;/span&gt;/~/&lt;span class="pl-s1"&gt;repos&lt;/span&gt;/&lt;span class="pl-s1"&gt;bsv&lt;/span&gt;/&lt;span class="pl-s1"&gt;src&lt;/span&gt;/&lt;span class="pl-s1"&gt;bzip&lt;/span&gt;.&lt;span class="pl-c1"&gt;c&lt;/span&gt;
&lt;span class="pl-c1"&gt;++&lt;/span&gt;&lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s1"&gt;b&lt;/span&gt;/&lt;span class="pl-s1"&gt;step1&lt;/span&gt;.&lt;span class="pl-c1"&gt;c&lt;/span&gt;
@@ &lt;span class="pl-c1"&gt;-3&lt;/span&gt;,&lt;span class="pl-c1"&gt;6&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-c1"&gt;3&lt;/span&gt;,&lt;span class="pl-c1"&gt;7&lt;/span&gt; @@
 &lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"load.h"&lt;/span&gt;
 &lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"array.h"&lt;/span&gt;
 &lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"dump.h"&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"hashmap.h"&lt;/span&gt;

 &lt;span class="pl-k"&gt;#define&lt;/span&gt; &lt;span class="pl-c1"&gt;DESCRIPTION&lt;/span&gt; "combine single column inputs into a multi column output\n\n"
 &lt;span class="pl-k"&gt;#define&lt;/span&gt; &lt;span class="pl-c1"&gt;USAGE&lt;/span&gt; "ls column_* | bzip [COL1,...COLN] [-l|--lz4]\n\n"
@@ &lt;span class="pl-c1"&gt;-86&lt;/span&gt;,&lt;span class="pl-c1"&gt;6&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-c1"&gt;87&lt;/span&gt;,&lt;span class="pl-c1"&gt;13&lt;/span&gt; @@ &lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;(&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-s1"&gt;argc&lt;/span&gt;, &lt;span class="pl-smi"&gt;char&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;argv&lt;/span&gt;) {
     &lt;span class="pl-c"&gt;// setup output&lt;/span&gt;
     &lt;span class="pl-smi"&gt;writebuf_t&lt;/span&gt; &lt;span class="pl-s1"&gt;wbuf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;wbuf_init&lt;/span&gt;((&lt;span class="pl-smi"&gt;FILE&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;[]){&lt;span class="pl-s1"&gt;stdout&lt;/span&gt;}, &lt;span class="pl-c1"&gt;1&lt;/span&gt;, false);

&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-c"&gt;// bsumeach-hash state&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-s1"&gt;u8&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;key&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-s1"&gt;void&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-s1"&gt;element&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-s1"&gt;struct&lt;/span&gt; &lt;span class="pl-smi"&gt;hashmap_s&lt;/span&gt; &lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-s1"&gt;f64&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;sum_f64&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-en"&gt;ASSERT&lt;/span&gt;(&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-en"&gt;hashmap_create&lt;/span&gt;(&lt;span class="pl-c1"&gt;2&lt;/span&gt;, &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;), &lt;span class="pl-s"&gt;"fatal: hashmap init\n"&lt;/span&gt;);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;
     &lt;span class="pl-c"&gt;// process input row by row&lt;/span&gt;
     &lt;span class="pl-en"&gt;while&lt;/span&gt; (&lt;span class="pl-c1"&gt;1&lt;/span&gt;) {
         &lt;span class="pl-k"&gt;for&lt;/span&gt; (&lt;span class="pl-smi"&gt;i32&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt; &lt;span class="pl-en"&gt;ARRAY_SIZE&lt;/span&gt;(&lt;span class="pl-s1"&gt;selected&lt;/span&gt;); &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;) {
@@ &lt;span class="pl-c1"&gt;-99&lt;/span&gt;,&lt;span class="pl-c1"&gt;8&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt;&lt;span class="pl-c1"&gt;107&lt;/span&gt;,&lt;span class="pl-c1"&gt;36&lt;/span&gt; @@ &lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;(&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-s1"&gt;argc&lt;/span&gt;, &lt;span class="pl-smi"&gt;char&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;argv&lt;/span&gt;) {
             &lt;span class="pl-en"&gt;ASSERT&lt;/span&gt;(&lt;span class="pl-en"&gt;memcmp&lt;/span&gt;(&lt;span class="pl-s1"&gt;stops&lt;/span&gt;, &lt;span class="pl-s1"&gt;do_stop&lt;/span&gt;, &lt;span class="pl-en"&gt;ARRAY_SIZE&lt;/span&gt;(&lt;span class="pl-s1"&gt;selected&lt;/span&gt;) &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;sizeof&lt;/span&gt;(&lt;span class="pl-s1"&gt;i32&lt;/span&gt;)) &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;, &lt;span class="pl-s"&gt;"fatal: all columns didn't end at the same length\n"&lt;/span&gt;);
             &lt;span class="pl-k"&gt;break&lt;/span&gt;;
         }
&lt;span class="pl-c1"&gt;-&lt;/span&gt;        &lt;span class="pl-en"&gt;dump&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;wbuf&lt;/span&gt;, &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;new&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        &lt;span class="pl-c"&gt;// bschema 7*,*&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;7&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        &lt;span class="pl-c"&gt;// bsumeach-hash f64&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        &lt;span class="pl-en"&gt;ASSERT&lt;/span&gt;(&lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;max&lt;/span&gt; &amp;gt;= &lt;span class="pl-c1"&gt;1&lt;/span&gt;, &lt;span class="pl-s"&gt;"fatal: need at least 2 columns\n"&lt;/span&gt;);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        &lt;span class="pl-en"&gt;ASSERT&lt;/span&gt;(&lt;span class="pl-c1"&gt;8&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;], &lt;span class="pl-s"&gt;"fatal: bad data size\n"&lt;/span&gt;);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        &lt;span class="pl-en"&gt;if&lt;/span&gt; (&lt;span class="pl-s1"&gt;element&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;hashmap_get&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;, &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;], &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;])) {
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-c1"&gt;*&lt;/span&gt;(&lt;span class="pl-smi"&gt;f64&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;)&lt;span class="pl-s1"&gt;element&lt;/span&gt; &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;(&lt;span class="pl-smi"&gt;f64&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;)&lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;];
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        } &lt;span class="pl-s1"&gt;else&lt;/span&gt; {
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-en"&gt;MALLOC&lt;/span&gt;(&lt;span class="pl-s1"&gt;key&lt;/span&gt;, &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;]);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-en"&gt;strncpy&lt;/span&gt;(&lt;span class="pl-s1"&gt;key&lt;/span&gt;, &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;], &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;]);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-en"&gt;MALLOC&lt;/span&gt;(&lt;span class="pl-s1"&gt;sum_f64&lt;/span&gt;, &lt;span class="pl-k"&gt;sizeof&lt;/span&gt;(&lt;span class="pl-s1"&gt;f64&lt;/span&gt;)); &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;sum_f64&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;(&lt;span class="pl-smi"&gt;f64&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;)&lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;];
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-en"&gt;ASSERT&lt;/span&gt;(&lt;span class="pl-c1"&gt;0&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-en"&gt;hashmap_put&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;, &lt;span class="pl-s1"&gt;key&lt;/span&gt;, &lt;span class="pl-s1"&gt;new&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;], &lt;span class="pl-s1"&gt;sum_f64&lt;/span&gt;), &lt;span class="pl-s"&gt;"fatal: hashmap put\n"&lt;/span&gt;);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        }
&lt;span class="pl-c1"&gt;+&lt;/span&gt;
     }
&lt;span class="pl-c1"&gt;+&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-c"&gt;// bsumeach-hash f64 dump results&lt;/span&gt;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    &lt;span class="pl-en"&gt;for&lt;/span&gt; (&lt;span class="pl-s1"&gt;i32&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt; &lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;.&lt;span class="pl-c1"&gt;table_size&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;) {
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        &lt;span class="pl-en"&gt;if&lt;/span&gt; (&lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;.&lt;span class="pl-c1"&gt;data&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;].&lt;span class="pl-c1"&gt;in_use&lt;/span&gt;) {
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;max&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;.&lt;span class="pl-c1"&gt;data&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;].&lt;span class="pl-c1"&gt;key&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;0&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;.&lt;span class="pl-c1"&gt;data&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;].&lt;span class="pl-c1"&gt;key_len&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;hashmap&lt;/span&gt;.&lt;span class="pl-c1"&gt;data&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;].&lt;span class="pl-c1"&gt;data&lt;/span&gt;;
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;sizeof&lt;/span&gt;(&lt;span class="pl-s1"&gt;f64&lt;/span&gt;);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;            &lt;span class="pl-en"&gt;dump&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;wbuf&lt;/span&gt;, &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;row&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;);
&lt;span class="pl-c1"&gt;+&lt;/span&gt;        }
&lt;span class="pl-c1"&gt;+&lt;/span&gt;    }
&lt;span class="pl-c1"&gt;+&lt;/span&gt;
     &lt;span class="pl-en"&gt;dump_flush&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;wbuf&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;);

 }&lt;/pre&gt;&lt;/div&gt;
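&lt;p&gt;for readers who don't want to trace the c, the logic fused into step1 can be sketched in python. this is an illustration only: the function name is hypothetical, and the sample keys assume the first column holds timestamps whose first 7 bytes are a year and month.&lt;/p&gt;

```python
from collections import defaultdict

def step1(dates, distances):
    # the c code asserts that all columns end at the same length
    assert len(dates) == len(distances), "columns differ in length"
    sums = defaultdict(float)
    # zip the single column inputs into rows, like bzip
    for date, distance in zip(dates, distances):
        # keep the first 7 bytes of the key, like new.sizes[0] = 7,
        # then sum per key in a hash map, like bsumeach-hash f64
        sums[date[:7]] += distance
    return dict(sums)

if __name__ == "__main__":
    dates = [b"2015-01-03", b"2015-01-09", b"2015-02-01"]
    distances = [1.5, 2.0, 4.0]
    print(step1(dates, distances))
```

&lt;p&gt;each row is zipped, truncated, and summed in one loop, so nothing is serialized between stages. only the per key sums are written out at the end.&lt;/p&gt;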
&lt;p&gt;let's ship it to the cluster and compile it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-scp step1.c &lt;span class="pl-c1"&gt;:&lt;/span&gt; s4-cluster --yes

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-ssh s4-cluster -yc &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;sudo gcc -Ibsv/util -Ibsv/vendor -flto -O3 -march=native -mtune=native -lm -o /usr/local/bin/step1 step1.c bsv/vendor/lz4.c&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;our optimized query looks like this.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; sum_distance_by_date_optimized.sh&lt;/span&gt;
s4 map-from-n s4://columns/ s4://tmp/01/   &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;step1 2,5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 map-to-n   s4://tmp/01/  s4://tmp/02/   &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 map-from-n s4://tmp/02/  s4://tmp/03/   &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat | bsumeach-hash f64 | bschema 7,f64:a | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt;       s4://tmp/03/0                &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tr , " " | sort -nrk2 | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's run it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bash sum_distance_by_date_optimized.sh

+ s4 map-from-n s4://columns/ s4://tmp/01/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;step1 2,5&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m2.034s

+ s4 map-to-n s4://tmp/01/ s4://tmp/02/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;bpartition 1&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m1.334s

+ s4 map-from-n s4://tmp/02/ s4://tmp/03/ &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xargs cat | bsumeach-hash f64 | bschema 7,f64:a | csv&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m0.336s

+ s4 &lt;span class="pl-c1"&gt;eval&lt;/span&gt; s4://tmp/03/0 &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;tr , " " | sort -nrk2 | head -n9&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
0m0.161s

0m3.866s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;looks like step1 went from 5.7 to 2.0 seconds, about 2.8x faster, and the whole query went from 7.7 to 3.9 seconds, about 2x faster.&lt;/p&gt;
&lt;p&gt;we're done for now, so let's delete the cluster.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; aws-ec2-rm &lt;span class="pl-smi"&gt;$name&lt;/span&gt; --yes&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;composing data pipelines from simple tools is an effective way to rapidly prototype.&lt;/p&gt;
&lt;p&gt;by reusing the same serialization for local and distributed processes, we can build and use tools that don't care whether data is coming from or going to a file, a pipe, or a socket.&lt;/p&gt;
&lt;p&gt;once our prototypes have stabilized, we can optimize them by collapsing pipelines into a single executable.&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/posts/optimizing-a-bsv-data-processing-pipeline</guid>
    </item>
    <item>
      <title>data processing performance with python go rust and c</title>
      <link>https://nathants.com/posts/data-processing-performance-with-python-go-rust-and-c</link>
      <description>
                
                        
&lt;p&gt;full source code is available &lt;a href="https://github.com/nathants/posts/tree/004/004_data_processing_performance_with_python_go_rust_and_c"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;performance is important, and yet our intuition about it is often wrong. &lt;a href="/posts/refactoring-common-distributed-data-patterns-into-s4"&gt;previously&lt;/a&gt; we deployed optimized python across a cluster of machines to analyze the &lt;a href="https://registry.opendata.aws/nyc-tlc-trip-records-pds/" rel="nofollow"&gt;nyc taxi&lt;/a&gt; dataset. how was its performance?&lt;/p&gt;
&lt;p&gt;let's try to discover a reasonable baseline for data processing performance and build intuition that can guide our decisions. we'll do this by experimenting with simple transformations of generated data using various formats, techniques, and languages on a single cpu core.&lt;/p&gt;
&lt;p&gt;whether we are configuring and using off-the-shelf software or building bespoke systems, we need the ability to intuit problems and detect low-hanging fruit.&lt;/p&gt;
&lt;p&gt;we'll say that our data is a sequence of rows, that a row is made of 8 columns, and that a column is a random dictionary word.&lt;/p&gt;
&lt;p&gt;we'll generate our dataset as csv with the following &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/gen_csv.py"&gt;script&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# gen_csv.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;random&lt;/span&gt;

&lt;span class="pl-s1"&gt;words&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [
    ...
]

&lt;span class="pl-s1"&gt;num_rows&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;argv&lt;/span&gt;[&lt;span class="pl-c1"&gt;1&lt;/span&gt;])

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;_&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-en"&gt;range&lt;/span&gt;(&lt;span class="pl-s1"&gt;num_rows&lt;/span&gt;):
    &lt;span class="pl-s1"&gt;row&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [&lt;span class="pl-s1"&gt;random&lt;/span&gt;.&lt;span class="pl-en"&gt;choice&lt;/span&gt;(&lt;span class="pl-s1"&gt;words&lt;/span&gt;) &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;_&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-en"&gt;range&lt;/span&gt;(&lt;span class="pl-c1"&gt;8&lt;/span&gt;)]
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;','&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-s1"&gt;row&lt;/span&gt;))&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;our first transformation will be selecting a subset of columns.&lt;/p&gt;
&lt;p&gt;let's try python.&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# select.py&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;

&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;columns&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;line&lt;/span&gt;.&lt;span class="pl-en"&gt;split&lt;/span&gt;(&lt;span class="pl-s"&gt;','&lt;/span&gt;)
    &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;2&lt;/span&gt;]&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;,&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;columns&lt;/span&gt;[&lt;span class="pl-c1"&gt;6&lt;/span&gt;]&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;first we need some data.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; pypy3 gen_csv.py 1000000 &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /tmp/data.csv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ls -lh /tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $5}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

72M&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we're gonna need more data.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; pypy3 gen_csv.py 15000000 &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; /tmp/data.csv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ls -lh /tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $5}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;

1.1G&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;that'll do. now let's try our selection. we'll make sure a subset of the result is sane, then we'll check the hash of the entire result using &lt;a href="https://www.archlinux.org/packages/community/x86_64/xxhash/" rel="nofollow"&gt;xxhsum&lt;/a&gt;. for all other runs we'll discard the output and time the execution.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; python select.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; head -n3

epigram,Madeleine
strategies,briefed
Doritos,putsch

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; python select.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

12927f314ca6e9eb&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;seems sane. let's time it.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python select.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m10.076s
user    0m9.779s
sys     0m0.200s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's try coreutils &lt;a href="https://github.com/coreutils/coreutils/blob/master/src/cut.c"&gt;cut&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; cut -d, -f3,7 &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum
12927f314ca6e9eb

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; cut -d, -f3,7 &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m3.534s
user    0m3.341s
sys     0m0.159s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;faster. we may need to look at compiled languages for a reasonable baseline.&lt;/p&gt;
&lt;p&gt;let's optimize by avoiding allocations and doing as little work as possible. we'll pull rows off a buffered reader, set up columns as offsets into that buffer, and access columns by slicing the row data.&lt;/p&gt;
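&lt;p&gt;to make the idea concrete, here is a minimal python sketch of tracking column starts and sizes and slicing the row instead of splitting it. the names column_bounds and select are made up for illustration, and this is not the linked go, rust, or c code, which also manage their own read and write buffers.&lt;/p&gt;

```python
import io


def column_bounds(line):
    # return (starts, sizes) for each comma separated column in a row of bytes
    starts = [0]
    sizes = []
    position = line.find(b',')
    while position != -1:
        sizes.append(position - starts[-1])
        starts.append(position + 1)
        position = line.find(b',', position + 1)
    sizes.append(len(line) - starts[-1])
    return starts, sizes


def select(reader, writer, wanted):
    # write only the wanted zero-based columns, slicing the row by offset and size
    for line in reader:
        line = line.rstrip(b'\n')
        starts, sizes = column_bounds(line)
        parts = [line[starts[i]:starts[i] + sizes[i]] for i in wanted]
        writer.write(b','.join(parts) + b'\n')


source = io.BytesIO(b'a,bb,ccc,dddd,e,ff,ggg,h\n')
sink = io.BytesIO()
select(source, sink, [2, 6])
print(sink.getvalue())  # b'ccc,ggg\n'
```

&lt;p&gt;in python the slices still allocate, so this only illustrates the bookkeeping. the compiled versions can point into the read buffer without copying.&lt;/p&gt;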
&lt;p&gt;let's try &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.go"&gt;go&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; go build -o select_go select.go

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./select_go &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

12927f314ca6e9eb

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./select_go &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m2.832s
user    0m2.559s
sys     0m0.312s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;faster than cut. this is progress.&lt;/p&gt;
&lt;p&gt;let's try &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.rs"&gt;rust&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; rustc -O -o select_rust select.rs

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./select_rust &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

12927f314ca6e9eb

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./select_rust &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m2.602s
user    0m2.491s
sys     0m0.110s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;pretty much the same. let's try &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.c"&gt;c&lt;/a&gt;. we'll grab a few header-only dependencies for &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/csv.h"&gt;csv parsing&lt;/a&gt; and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/write_simple.h"&gt;buffered writing&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; gcc -Iutils -O3 -flto -march=native -mtune=native -o select_c select.c

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./select_c &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

12927f314ca6e9eb

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./select_c &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m2.716s
user    0m2.569s
sys     0m0.120s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;so &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.rs"&gt;rust&lt;/a&gt;, &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.go"&gt;go&lt;/a&gt;, and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.c"&gt;c&lt;/a&gt; are very similar. we may have established a baseline when working with csv.&lt;/p&gt;
&lt;p&gt;let's try a similar optimization with &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select_inlined.py"&gt;pypy&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; pypy select_inlined.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

12927f314ca6e9eb

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; pypy select_inlined.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m4.491s
user    0m4.293s
sys     0m0.170s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;not bad.&lt;/p&gt;
&lt;p&gt;let's try using &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/psv/psv.go"&gt;protobuf&lt;/a&gt; and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/psv/select.go"&gt;go&lt;/a&gt;. we'll call the data format psv.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; (cd psv &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; protoc -I=row --go_out=row row/row.proto)

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; (cd psv &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; go build -o psv psv.go)

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; (cd psv &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; go build -o &lt;span class="pl-k"&gt;select&lt;/span&gt; &lt;span class="pl-smi"&gt;select.go)&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./psv/psv &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/tmp/data.psv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./psv/select &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.psv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

12927f314ca6e9eb  stdin

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./psv/select &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.psv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m10.424s
user    0m10.465s
sys     0m0.251s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;interesting. slower than naive python and csv.&lt;/p&gt;
&lt;p&gt;is reading and writing data to some format a majority of the work?&lt;/p&gt;
&lt;p&gt;let's think about our optimized code from before. our representation of a row is 3 pieces of data: a byte array of content, an array of column start positions, and an array of column sizes. writing a row as csv was easy, but &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/csv.h"&gt;reading&lt;/a&gt; was hard.&lt;/p&gt;
&lt;p&gt;what if we made it easier? all we want is an array of bytes and two int arrays.&lt;/p&gt;
&lt;p&gt;let's let a row written as bytes be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the max zero-based column index&lt;/li&gt;
&lt;li&gt;the column sizes&lt;/li&gt;
&lt;li&gt;the column data&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;|&lt;/span&gt; u16:max &lt;span class="pl-k"&gt;|&lt;/span&gt; u16:size &lt;span class="pl-k"&gt;|&lt;/span&gt; ... &lt;span class="pl-k"&gt;|&lt;/span&gt; u8[]:column &lt;span class="pl-k"&gt;|&lt;/span&gt; ... &lt;span class="pl-k"&gt;|&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;this should be easy to write, and more importantly easy to read. we can read max, which tells us how many sizes to read. from the sizes we can reconstruct the offsets and the size of the row data. we can then read the row data, and access the columns by offset and size.&lt;/p&gt;
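&lt;p&gt;a minimal python sketch of this row layout might look like the following. the function names dump_row and load_row are made up for illustration, and little-endian byte order is an assumption, since the layout above doesn't specify one.&lt;/p&gt;

```python
def dump_row(columns):
    # columns is a non-empty list of bytes
    # layout: u16 max index, one u16 size per column, then the column data
    out = [(len(columns) - 1).to_bytes(2, 'little')]
    out += [len(column).to_bytes(2, 'little') for column in columns]
    out += columns
    return b''.join(out)


def load_row(data):
    # read max, then the sizes, then slice each column out of the row data
    max_index = int.from_bytes(data[0:2], 'little')
    count = max_index + 1
    sizes = [int.from_bytes(data[2 + 2 * i:4 + 2 * i], 'little') for i in range(count)]
    offset = 2 + 2 * count
    columns = []
    for size in sizes:
        columns.append(data[offset:offset + size])
        offset += size
    return columns


row = [b'epigram', b'', b'Madeleine']
encoded = dump_row(row)
print(len(encoded))              # 2 + 3 * 2 + 16 bytes of column data = 24
print(load_row(encoded) == row)  # True
```

&lt;p&gt;notice that loading never scans for a delimiter. every position is computed from the sizes.&lt;/p&gt;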
&lt;p&gt;our optimized code also buffered reads and writes into large chunks.&lt;/p&gt;
&lt;p&gt;let's let a chunk of rows written as bytes be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;size&lt;/li&gt;
&lt;li&gt;data&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;|&lt;/span&gt; i32:size &lt;span class="pl-k"&gt;|&lt;/span&gt; u8[]:row &lt;span class="pl-k"&gt;|&lt;/span&gt; ... &lt;span class="pl-k"&gt;|&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's constrain a chunk to only contain complete rows and be smaller than some maximum size.&lt;/p&gt;
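&lt;p&gt;chunking can be sketched in python as well. this assumes rows are already encoded as bytes, treats the maximum as an inclusive limit, and again assumes little-endian sizes. write_chunks, flush, and read_chunks are illustrative names, not the linked headers.&lt;/p&gt;

```python
import io


def write_chunks(rows, writer, max_size):
    # rows are already encoded as bytes; each chunk holds only complete rows
    pending = []
    pending_size = 0
    for row in rows:
        if len(row) > max_size:
            raise ValueError('row is larger than the maximum chunk size')
        if pending_size + len(row) > max_size:
            flush(pending, writer)
            pending, pending_size = [], 0
        pending.append(row)
        pending_size += len(row)
    if pending:
        flush(pending, writer)


def flush(pending, writer):
    # a chunk is an i32 size followed by that many bytes of row data
    data = b''.join(pending)
    writer.write(len(data).to_bytes(4, 'little', signed=True))
    writer.write(data)


def read_chunks(reader):
    # yield the row data of each chunk until the stream is exhausted
    while True:
        header = reader.read(4)
        if not header:
            return
        size = int.from_bytes(header, 'little', signed=True)
        yield reader.read(size)


sink = io.BytesIO()
write_chunks([b'aaaa', b'bbbb', b'cc'], sink, max_size=8)
print(list(read_chunks(io.BytesIO(sink.getvalue()))))  # [b'aaaabbbb', b'cc']
```

&lt;p&gt;because a chunk never splits a row, a reader can hand a whole chunk to the row loader without carrying partial rows between reads.&lt;/p&gt;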
&lt;p&gt;we'll call this format bsv. we'll implement buffered &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/read.h"&gt;reading&lt;/a&gt; and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/write.h"&gt;writing&lt;/a&gt; of chunks, as well as &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/load.h"&gt;loading&lt;/a&gt; and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/dump.h"&gt;dumping&lt;/a&gt; of &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/utils/row.h"&gt;rows&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;let's implement our transformation using bsv in &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/select.c"&gt;c&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/bsv bsv/bsv.c

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./bsv/bsv &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/tmp/data.bsv

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/select bsv/select.c

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./bsv/select &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

12927f314ca6e9eb

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./bsv/select &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m0.479s
user    0m0.339s
sys     0m0.140s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;we've processed the same data, and system time has been fairly consistent, but user time has varied significantly.&lt;/p&gt;
&lt;p&gt;let's try a second transformation where we reverse the columns of every row.&lt;/p&gt;
&lt;p&gt;we'll implement it with csv in &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/reverse.py"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/reverse_inlined.py"&gt;pypy&lt;/a&gt; and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/reverse.c"&gt;c&lt;/a&gt;, then with bsv in &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/reverse.c"&gt;c&lt;/a&gt;.&lt;/p&gt;
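&lt;p&gt;for reference, a naive version of this transformation could look something like the sketch below. it is an illustration in the style of select.py, not necessarily the linked reverse.py.&lt;/p&gt;

```python
import io


def reverse(reader, writer):
    # split each row on commas and write the columns back in reverse order
    for line in reader:
        columns = line.rstrip('\n').split(',')
        writer.write(','.join(reversed(columns)) + '\n')


source = io.StringIO('a,b,c\nd,e,f\n')
sink = io.StringIO()
reverse(source, sink)
print(sink.getvalue())
```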
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; python reverse.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

e221974c95d356f9

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python reverse.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m13.915s
user    0m13.743s
sys     0m0.170s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; pypy3 reverse_inlined.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

e221974c95d356f9

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; pypy3 reverse_inlined.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m6.141s
user    0m5.880s
sys     0m0.220s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; gcc -Iutils -O3 -flto -march=native -mtune=native -o reverse reverse.c

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./reverse &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

e221974c95d356f9

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./reverse &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m2.890s
user    0m2.719s
sys     0m0.170s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/reverse bsv/reverse.c

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./bsv/reverse &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; xxhsum

e221974c95d356f9

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./bsv/reverse &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;/dev/null

real    0m1.052s
user    0m0.891s
sys     0m0.161s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;let's try a third transformation where we count every row where the first character of the first column is "f".&lt;/p&gt;
&lt;p&gt;we'll implement it with csv in &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/count.py"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/count_inlined.py"&gt;pypy&lt;/a&gt; and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/count.c"&gt;c&lt;/a&gt;, then with bsv in &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/count.c"&gt;c&lt;/a&gt;.&lt;/p&gt;
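&lt;p&gt;a naive sketch of this transformation, assuming the goal is a single count of matching rows. this is an illustration, not necessarily the linked count.py.&lt;/p&gt;

```python
import io


def count(reader):
    # the first character of the first column is also the first character of the line,
    # so there is no need to split the row at all
    total = 0
    for line in reader:
        if line.startswith('f'):
            total += 1
    return total


print(count(io.StringIO('foo,a\nbar,f\nfig,b\n')))  # 2
```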
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; python count.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv

467002

real    0m6.385s
user    0m6.223s
sys     0m0.160s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; pypy3 count_inlined.py &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv

467002

real    0m3.147s
user    0m2.938s
sys     0m0.180s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; gcc -Iutils -O3 -flto -march=native -mtune=native -o count count.c

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; ./count &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.csv

467002

real    0m2.367s
user    0m2.245s
sys     0m0.121s

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; gcc -Iutils -O3 -flto -march=native -mtune=native -o bsv/count bsv/count.c

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;time&lt;/span&gt; bsv/count &lt;span class="pl-k"&gt;&amp;lt;&lt;/span&gt;/tmp/data.bsv

467002

real    0m0.260s
user    0m0.135s
sys     0m0.125s&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;in transformations 2 and 3 we again see significant variance in user time.&lt;/p&gt;
&lt;p&gt;let's put our user time results in a table.&lt;/p&gt;
&lt;p&gt;first we have our select transformation, which outputs 25% of its input.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;th&gt;user seconds&lt;/th&gt;
&lt;th&gt;gigabytes / second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;psv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/psv/select.go"&gt;go&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;10.4&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.py"&gt;python&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;9.7&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select_inlined.py"&gt;pypy&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;4.3&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.go"&gt;go&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.6&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.6&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.rs"&gt;rust&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.5&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bsv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/select.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;3.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
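&lt;p&gt;the gigabytes / second column is consistent with dividing roughly 1 gigabyte of input by user seconds and rounding to one decimal. a quick sketch of that arithmetic, where treating the input as 1.0 gigabytes is an assumption:&lt;/p&gt;

```python
# assumption: the input is treated as roughly 1 gigabyte
input_gigabytes = 1.0

# user seconds copied from the select table above
user_seconds = {
    'psv go': 10.4,
    'csv python': 9.7,
    'csv pypy': 4.3,
    'csv go': 2.6,
    'csv c': 2.6,
    'csv rust': 2.5,
    'bsv c': 0.3,
}


def throughput(seconds):
    # gigabytes processed per second of user time, rounded to one decimal
    return round(input_gigabytes / seconds, 1)


for name, seconds in user_seconds.items():
    print(name, throughput(seconds))
```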
&lt;p&gt;second we have our reverse transformation, which outputs 100% of its input.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;th&gt;user seconds&lt;/th&gt;
&lt;th&gt;gigabytes / second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/reverse.py"&gt;python&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;13.7&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/reverse_inlined.py"&gt;pypy&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;5.8&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/reverse.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.7&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bsv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/reverse.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;third we have our count transformation, which outputs &amp;lt;0.001% of its input.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;th&gt;user seconds&lt;/th&gt;
&lt;th&gt;gigabytes / second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/count.py"&gt;python&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;6.2&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/count_inlined.py"&gt;pypy&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.9&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/count.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.2&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bsv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/count.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;interesting. let's take a closer look at the csv and bsv results for c based on the ratio of inputs to outputs.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;inputs / outputs&lt;/th&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;th&gt;user seconds&lt;/th&gt;
&lt;th&gt;gigabytes / second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 / 1&lt;/td&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/reverse.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.7&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 / 1&lt;/td&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.6&lt;/td&gt;
&lt;td&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000 / 1&lt;/td&gt;
&lt;td&gt;csv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/count.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2.2&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;inputs / outputs&lt;/th&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;language&lt;/th&gt;
&lt;th&gt;user seconds&lt;/th&gt;
&lt;th&gt;gigabytes / second&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 / 1&lt;/td&gt;
&lt;td&gt;bsv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/reverse.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.9&lt;/td&gt;
&lt;td&gt;1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 / 1&lt;/td&gt;
&lt;td&gt;bsv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/select.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;3.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000 / 1&lt;/td&gt;
&lt;td&gt;bsv&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/bsv/count.c"&gt;c&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;now this is interesting. when dealing with csv, the ratio of inputs to outputs has almost no impact on performance. when dealing with bsv, user time drops by roughly 3x at each step. this suggests that for csv, parsing the input dominates, while for bsv, writing the output dominates. this raises an interesting question: how can we optimize output? for simplicity, the bsv code is outputting csv. it may be worth experimenting with other output formats, but we'll skip that for now.&lt;/p&gt;
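&lt;p&gt;the step to step change can be checked directly from the user seconds in the two tables above. a small sketch, with step_ratios as an illustrative name:&lt;/p&gt;

```python
# user seconds for reverse, select, and count, in order of shrinking output
csv_seconds = [2.7, 2.6, 2.2]
bsv_seconds = [0.9, 0.3, 0.1]


def step_ratios(seconds):
    # how many times faster each step is than the one before it
    return [round(a / b, 1) for a, b in zip(seconds, seconds[1:])]


print(step_ratios(csv_seconds))  # [1.0, 1.2]
print(step_ratios(bsv_seconds))  # [3.0, 3.0]
```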
&lt;p&gt;do we have enough information to establish a baseline? perhaps.&lt;/p&gt;
&lt;p&gt;we've seen &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.py"&gt;python&lt;/a&gt; process csv and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/psv/select.go"&gt;go&lt;/a&gt; process protobuf at 100 megabytes / second.&lt;/p&gt;
&lt;p&gt;we've seen &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.c"&gt;c&lt;/a&gt;, &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.go"&gt;go&lt;/a&gt;, and &lt;a href="https://github.com/nathants/posts/blob/004/004_data_processing_performance_with_python_go_rust_and_c/select.rs"&gt;rust&lt;/a&gt; process csv at 400 megabytes / second.&lt;/p&gt;
&lt;p&gt;we've seen &lt;a href="https://github.com/nathants/posts/tree/004/004_data_processing_performance_with_python_go_rust_and_c/bsv"&gt;c&lt;/a&gt; process bsv at 1-10 gigabytes / second.&lt;/p&gt;
&lt;p&gt;why don't we start with the following baseline. we'll think of it as napkin math.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;category&lt;/th&gt;
&lt;th&gt;rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;slow&lt;/td&gt;
&lt;td&gt;  &amp;lt;=100 megabytes / second / cpu core&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decent&lt;/td&gt;
&lt;td&gt;    ~500 megabytes / second / cpu core&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fast&lt;/td&gt;
&lt;td&gt;&amp;gt;=1000 megabytes / second / cpu core&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;as we do data processing, either by configuring and using off the shelf software, or by building bespoke systems, we can keep these rates in mind.&lt;/p&gt;
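&lt;p&gt;as a rough sketch of how these rates can be applied, the snippet below turns the table into an estimate of wall clock time. the workload size and core count are hypothetical inputs, and perfect scaling across cores is assumed, which real systems rarely achieve.&lt;/p&gt;

```python
# napkin math using the baseline rates from the table above.
# the dataset size and core count passed in are hypothetical inputs.
RATES_MB_PER_SECOND_PER_CORE = {
    'slow': 100,
    'decent': 500,
    'fast': 1000,
}

def estimate_seconds(total_megabytes, cores, category):
    """estimate wall clock seconds, assuming perfect scaling across cores."""
    rate = RATES_MB_PER_SECOND_PER_CORE[category]
    return total_megabytes / (rate * cores)

if __name__ == '__main__':
    # hypothetical workload: 100 gigabytes on a 16 core machine
    for category in RATES_MB_PER_SECOND_PER_CORE:
        seconds = estimate_seconds(100_000, 16, category)
        print(f'{category}: {seconds:.1f} seconds')
```

&lt;p&gt;if a measured job lands far above the estimate for its category, that is a hint to look for a bottleneck before adding more machines.&lt;/p&gt;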
&lt;p&gt;if you are interested in bsv, you can find it &lt;a href="https://github.com/nathants/bsv"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;for further experimentation with go, rust, and c, look &lt;a href="https://github.com/nathants/bsv/tree/master/experiments"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;for examples of applying bsv to distributed compute, look &lt;a href="https://github.com/nathants/s4/tree/go/examples/nyc_taxi_bsv"&gt;here&lt;/a&gt;.&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/posts/data-processing-performance-with-python-go-rust-and-c</guid>
    </item>
    <item>
      <title>s4</title>
      <link>https://nathants.com/projects/s4</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;s3 is awesome, but it can be expensive and slow, and it doesn't expose data local compute or efficient shuffle.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;an s3 cli compatible storage cluster that is cheap and fast, with data local compute and efficient shuffle.&lt;/p&gt;
&lt;p&gt;data local compute maps arbitrary commands over immutable keys in 1:1, n:1 and 1:n operations.&lt;/p&gt;
&lt;p&gt;data shuffle is implicit in 1:n mappings.&lt;/p&gt;
&lt;p&gt;server placement is based on the hash of basename or a numeric prefix.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;key&lt;/th&gt;
&lt;th&gt;method&lt;/th&gt;
&lt;th&gt;placement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;s4://bucket/dir/name.txt&lt;/td&gt;
&lt;td&gt;int(hash("name.txt"))&lt;/td&gt;
&lt;td&gt;?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s4://bucket/dir/000_bucket0.txt&lt;/td&gt;
&lt;td&gt;int("000")&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;s4://bucket/dir/000&lt;/td&gt;
&lt;td&gt;int("000")&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
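&lt;p&gt;the placement rule in the table can be sketched roughly as follows. this is an illustration, not the s4 implementation: blake2b stands in for whatever hash s4 actually uses, a numeric prefix is assumed to be the text before the first underscore of the basename when it is all digits, and the resulting integer is assumed to be taken modulo the number of servers.&lt;/p&gt;

```python
import hashlib
import os

def placement(key, num_servers):
    """pick a server index for a key, mirroring the table above.

    assumptions: blake2b is a stand-in hash, the numeric prefix is the
    text before the first underscore of the basename when it is all
    digits, and the integer is taken modulo the ring size.
    """
    basename = os.path.basename(key)
    prefix = basename.split('_')[0]
    if prefix.isdecimal():
        # numeric prefix: the caller chooses the server explicitly
        number = int(prefix)
    else:
        # otherwise: a stable hash of the basename chooses the server
        digest = hashlib.blake2b(basename.encode(), digest_size=8).digest()
        number = int.from_bytes(digest, 'big')
    return number % num_servers
```

&lt;p&gt;under these assumptions, both &lt;code&gt;s4://bucket/dir/000_bucket0.txt&lt;/code&gt; and &lt;code&gt;s4://bucket/dir/000&lt;/code&gt; land on server 0, while &lt;code&gt;name.txt&lt;/code&gt; lands wherever its hash falls.&lt;/p&gt;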
&lt;p&gt;keys are strongly consistent and cannot be updated unless first deleted.&lt;/p&gt;
&lt;h2 id="when"&gt;&lt;a class="heading-link" href="#when"&gt;when&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;use this for efficiently processing ephemeral data.&lt;/p&gt;
&lt;p&gt;keep durable inputs, outputs, and checkpoints in s3.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a ring of servers stores files on disk.&lt;/p&gt;
&lt;p&gt;a metadata controller on each server orchestrates out of process operations for data transfer and local compute.&lt;/p&gt;
&lt;p&gt;a cli client coordinates cluster activity.&lt;/p&gt;
&lt;h2 id="non-goals"&gt;&lt;a class="heading-link" href="#non-goals"&gt;non goals&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;high availability. every key lives on one and only one server.&lt;/p&gt;
&lt;p&gt;high durability. data lives on a single disk, and is as durable as that disk.&lt;/p&gt;
&lt;p&gt;security. data transfers are checked for integrity, but not encrypted. service access is unauthenticated. secure the network with &lt;a href="https://www.wireguard.com/" rel="nofollow"&gt;wireguard&lt;/a&gt; if needed.&lt;/p&gt;
&lt;p&gt;fine granularity. data should be medium to coarse granularity.&lt;/p&gt;
&lt;p&gt;safety for all inputs. service access should be considered to be at the level of root ssh. any user input should be escaped for shell.&lt;/p&gt;
&lt;p&gt;cluster resizing. clusters should be short lived and data ephemeral. instead of resizing create a new cluster.&lt;/p&gt;
&lt;p&gt;pagination of list results. data layout and partitioning must be considered.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;go install:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;go install github.com/nathants/s4/cmd/s4@latest
go install github.com/nathants/s4/cmd/s4_server@latest
sudo mv -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin/s4 /usr/local/bin/s4
sudo mv -f &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin/s4_server /usr/local/bin/s4-server&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;git clone:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/nathants/s4
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; s4
git checkout go
make -j
sudo mv -fv bin/s4 bin/s4-server /usr/local/bin/&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="test"&gt;&lt;a class="heading-link" href="#test"&gt;test&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; tox&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="automatic-deployment"&gt;&lt;a class="heading-link" href="#automatic-deployment"&gt;automatic deployment&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; s4
name=s4-cluster
bash scripts/new_cluster.sh &lt;span class="pl-smi"&gt;$name&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="manual-deployment"&gt;&lt;a class="heading-link" href="#manual-deployment"&gt;manual deployment&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;deploy&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ssh &lt;span class="pl-smi"&gt;$server1&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;curl -s https://raw.githubusercontent.com/nathants/s4/go/scripts/install_archlinux.sh | bash&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;
ssh &lt;span class="pl-smi"&gt;$server2&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;curl -s https://raw.githubusercontent.com/nathants/s4/go/scripts/install_archlinux.sh | bash&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;configure&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$server1&lt;/span&gt;:8080 &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="pl-k"&gt;~&lt;/span&gt;/.s4.conf
&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$server2&lt;/span&gt;:8080 &lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/.s4.conf
scp &lt;span class="pl-k"&gt;~&lt;/span&gt;/.s4.conf &lt;span class="pl-smi"&gt;$server1&lt;/span&gt;:
scp &lt;span class="pl-k"&gt;~&lt;/span&gt;/.s4.conf &lt;span class="pl-smi"&gt;$server2&lt;/span&gt;:&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;start&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ssh &lt;span class="pl-smi"&gt;$server1&lt;/span&gt; s4-server
ssh &lt;span class="pl-smi"&gt;$server2&lt;/span&gt; s4-server&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;echo&lt;/span&gt; hello world &lt;span class="pl-k"&gt;|&lt;/span&gt; s4 cp - s4://bucket/data.txt
s4 cp s4://bucket/data.txt -
s4 ls s4://bucket --recursive
s4 --help&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="examples"&gt;&lt;a class="heading-link" href="#examples"&gt;examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/s4/blob/go/examples/nyc_taxi_bsv"&gt;structured analysis of nyc taxi data with bsv and hive&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/s4/blob/go/examples/nyc_taxi_python"&gt;adhoc exploration of nyc taxi data with python&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="related-projects"&gt;&lt;a class="heading-link" href="#related-projects"&gt;related projects&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/bsv"&gt;bsv&lt;/a&gt; - a simple and efficient data format for easily manipulating chunks of rows of columns while minimizing allocations and copies.&lt;/p&gt;
&lt;h2 id="related-posts"&gt;&lt;a class="heading-link" href="#related-posts"&gt;related posts&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/optimizing-a-bsv-data-processing-pipeline" rel="nofollow"&gt;optimizing a bsv data processing pipeline&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/performant-batch-processing-with-bsv-s4-and-presto" rel="nofollow"&gt;performant batch processing with bsv, s4, and presto&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/discovering-a-baseline-for-data-processing-performance" rel="nofollow"&gt;discovering a baseline for data processing performance&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/refactoring-common-distributed-data-patterns-into-s4" rel="nofollow"&gt;refactoring common distributed data patterns into s4&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/scaling-python-data-processing-horizontally" rel="nofollow"&gt;scaling python data processing horizontally&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/scaling-python-data-processing-vertically" rel="nofollow"&gt;scaling python data processing vertically&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="api"&gt;&lt;a class="heading-link" href="#api"&gt;api&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-rm"&gt;s4 rm&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;delete data from s4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-eval"&gt;s4 eval&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;eval a bash cmd with key data as stdin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-ls"&gt;s4 ls&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;list keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-cp"&gt;s4 cp&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;copy data to or from s4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-map"&gt;s4 map&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;process data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-map-to-n"&gt;s4 map-to-n&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;shuffle data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-map-from-n"&gt;s4 map-from-n&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;merge shuffled data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-config"&gt;s4 config&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;list the server addresses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#s4-health"&gt;s4 health&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;health check every server&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id="usage-1"&gt;&lt;a class="heading-link" href="#usage-1"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="s4-rm"&gt;&lt;a class="heading-link" href="#s4-rm"&gt;s4 rm&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 rm [-h] [-r] prefix

    delete data from s4.

    - recursive to delete directories.


positional arguments:
  prefix           -

optional arguments:
  -h       show this help message and exit
  -r       False
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="s4-eval"&gt;&lt;a class="heading-link" href="#s4-eval"&gt;s4 eval&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 eval [-h] key cmd

    eval a bash cmd with key data as stdin


positional arguments:
  key         -
  cmd         -

optional arguments:
  -h  show this help message and exit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="s4-ls"&gt;&lt;a class="heading-link" href="#s4-ls"&gt;s4 ls&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 ls [-h] [-r] [prefix]

    list keys


positional arguments:
  prefix           -

optional arguments:
  -h, --help       show this help message and exit
  -r, --recursive  False
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="s4-cp"&gt;&lt;a class="heading-link" href="#s4-cp"&gt;s4 cp&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 cp [-h] [-r] src dst

    copy data to or from s4.

    - paths can be:
      - remote:       "s4://bucket/key.txt"
      - local:        "./dir/key.txt"
      - stdin/stdout: "-"
    - use recursive to copy directories.
    - keys cannot be updated, but can be deleted and recreated.
    - note: to copy from s4, the local machine must be reachable by the cluster, otherwise use `s4 eval`.


positional arguments:
  src              -
  dst              -

optional arguments:
  -h       show this help message and exit
  -r       False
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="s4-map"&gt;&lt;a class="heading-link" href="#s4-map"&gt;s4 map&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 map [-h] indir outdir cmd

    process data.

    - map a bash cmd 1:1 over every key in indir putting result in outdir.
    - cmd receives data via stdin and returns data via stdout.
    - every key in indir will create a key with the same name in outdir.
    - indir will be listed recursively to find keys to map.


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="s4-map-to-n"&gt;&lt;a class="heading-link" href="#s4-map-to-n"&gt;s4 map-to-n&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 map-to-n [-h] indir outdir cmd

    shuffle data.

    - map a bash cmd 1:n over every key in indir putting results in outdir.
    - cmd receives data via stdin, writes files to disk, and returns file paths via stdout.
    - every key in indir will create a directory with the same name in outdir.
    - outdir directories contain zero or more files output by cmd.
    - cmd runs in a tempdir which is deleted on completion.


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit
&lt;/code&gt;&lt;/pre&gt;
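&lt;p&gt;a cmd for map-to-n might look something like the sketch below. it follows the contract from the help text: data arrives on stdin, files are written to the working directory, and file paths are returned on stdout. the csv layout, the choice of crc32, the bucket count, and the file naming scheme are all assumptions made for illustration.&lt;/p&gt;

```python
import zlib

def shuffle(lines, num_buckets):
    """partition lines into files named by numeric prefix, returning the paths.

    assumptions: rows are csv and the first column is the partition key.
    crc32 is an arbitrary stable hash chosen for illustration.
    """
    files = {}
    try:
        for line in lines:
            key = line.split(',')[0]
            bucket = zlib.crc32(key.encode()) % num_buckets
            path = f'{bucket:03d}_bucket.csv'  # hypothetical naming scheme
            if path not in files:
                files[path] = open(path, 'w')
            files[path].write(line)
    finally:
        for f in files.values():
            f.close()
    return sorted(files)

def main(stdin, stdout):
    # cmd contract from the help text: data on stdin, file paths on stdout
    for path in shuffle(stdin, num_buckets=4):
        stdout.write(path + '\n')
```

&lt;p&gt;a script wrapping this would call &lt;code&gt;main(sys.stdin, sys.stdout)&lt;/code&gt;. because every row with the same key gets the same numeric prefix, a later map-from-n step can gather those files together.&lt;/p&gt;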
&lt;h3 id="s4-map-from-n"&gt;&lt;a class="heading-link" href="#s4-map-from-n"&gt;s4 map-from-n&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 map-from-n [-h] indir outdir cmd

    merge shuffled data.

    - map a bash cmd n:1 over every key in indir putting result in outdir.
    - indir will be listed recursively to find keys to map.
    - cmd receives file paths via stdin and returns data via stdout.
    - each cmd receives all keys with the same name or numeric prefix
    - output name is that name


positional arguments:
  indir       -
  outdir      -
  cmd         -

optional arguments:
  -h  show this help message and exit
&lt;/code&gt;&lt;/pre&gt;
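&lt;p&gt;a cmd for map-from-n might look something like the sketch below. it follows the contract from the help text: file paths arrive on stdin and data is returned on stdout. counting rows per key is only an example aggregation, and the csv layout with the key in the first column is an assumption.&lt;/p&gt;

```python
import collections

def merge(paths):
    """count rows per key across all files named by paths.

    assumption: each file holds csv rows whose first column is the key.
    """
    counts = collections.Counter()
    for path in paths:
        with open(path) as f:
            for line in f:
                counts[line.rstrip('\n').split(',')[0]] += 1
    return counts

def main(stdin, stdout):
    # cmd contract from the help text: file paths on stdin, data on stdout
    paths = [line.strip() for line in stdin if line.strip()]
    for key, count in sorted(merge(paths).items()):
        stdout.write(f'{key},{count}\n')
```

&lt;p&gt;a script wrapping this would call &lt;code&gt;main(sys.stdin, sys.stdout)&lt;/code&gt;.&lt;/p&gt;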
&lt;h3 id="s4-config"&gt;&lt;a class="heading-link" href="#s4-config"&gt;s4 config&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 config [-h]

    list the server addresses


optional arguments:
  -h  show this help message and exit
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="s4-health"&gt;&lt;a class="heading-link" href="#s4-health"&gt;s4 health&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;usage: s4 health [-h]

    health check every server


optional arguments:
  -h  show this help message and exit
&lt;/code&gt;&lt;/pre&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/s4</guid>
    </item>
    <item>
      <title>runclj</title>
      <link>https://nathants.com/projects/runclj</link>
      <description>
                
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;tooling around single file clojurescript programs running on node or the browser.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;runclj ./rotate_the_logs.cljs&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;transparently converts a single clojurescript file into a temporary shadow-cljs project in &lt;code&gt;.shadow-cljs/&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="installation"&gt;&lt;a class="heading-link" href="#installation"&gt;installation&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;get the dependencies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;bash&lt;/li&gt;
&lt;li&gt;python3&lt;/li&gt;
&lt;li&gt;java&lt;/li&gt;
&lt;li&gt;node&lt;/li&gt;
&lt;li&gt;npm&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;then:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/nathants/runclj
sudo mv runclj/bin/&lt;span class="pl-k"&gt;*&lt;/span&gt; runclj/bin/.shadow-cljs /usr/local/bin&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="examples"&gt;&lt;a class="heading-link" href="#examples"&gt;examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/runclj/blob/master/examples/hello.cljs"&gt;hello.cljs&lt;/a&gt; - node hello world&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/runclj/blob/master/examples/shell.cljs"&gt;shell.cljs&lt;/a&gt; - node with subprocess, user prompts, and core.async&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/runclj/blob/master/examples/server.cljs"&gt;server.cljs&lt;/a&gt; - node express server&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/runclj/blob/master/examples/client.cljs"&gt;client.cljs&lt;/a&gt; - client app with material ui (&lt;a href="https://nathants.com/client.cljs/" rel="nofollow"&gt;demo&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/runclj/blob/master/examples/macros.cljs"&gt;macros.cljs&lt;/a&gt; - node hello world with macros&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;runclj program.cljs&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;declare clojure and npm dependencies with meta-data at the start of the file.&lt;/p&gt;
&lt;div class="highlight highlight-source-clojure"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#!&lt;/span&gt;/usr/bin/env runclj&lt;/span&gt;
^{&lt;span class="pl-c1"&gt;:runclj&lt;/span&gt; {&lt;span class="pl-c1"&gt;:npm&lt;/span&gt; [[express &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;4.16.3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;]]
           &lt;span class="pl-c1"&gt;:deps&lt;/span&gt; [[prismatic/schema &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;1.1.3&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;]]}}
(&lt;span class="pl-k"&gt;ns&lt;/span&gt; &lt;span class="pl-e"&gt;main&lt;/span&gt;
  (&lt;span class="pl-c1"&gt;:require&lt;/span&gt; [schema.core &lt;span class="pl-c1"&gt;:as&lt;/span&gt; schema &lt;span class="pl-c1"&gt;:include-macros&lt;/span&gt; &lt;span class="pl-c1"&gt;true&lt;/span&gt;]))&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;run the program.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;runclj program.cljs&lt;/code&gt;&lt;/p&gt;
&lt;h2 id="browser-dev-workflow"&gt;&lt;a class="heading-link" href="#browser-dev-workflow"&gt;browser dev workflow&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;runclj-watch examples/client.cljs&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;open browser to &lt;code&gt;localhost:8000&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;edit &lt;code&gt;client.cljs&lt;/code&gt; and browser auto reloads&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;optionally connect to repl on &lt;code&gt;localhost:3333&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="server-dev-workflow"&gt;&lt;a class="heading-link" href="#server-dev-workflow"&gt;server dev workflow&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;runclj examples/server.cljs&lt;/code&gt;&lt;/p&gt;
&lt;h2 id="deployment"&gt;&lt;a class="heading-link" href="#deployment"&gt;deployment&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;runclj-tar program.cljs &lt;span class="pl-k"&gt;|&lt;/span&gt; ssh server tar xf -&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then either install &lt;code&gt;runclj&lt;/code&gt; and run normally with &lt;code&gt;runclj program.cljs&lt;/code&gt;,&lt;/p&gt;
&lt;p&gt;or run manually:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;browser: &lt;code&gt;cd .shadow-cljs/program.cljs/public &amp;amp;&amp;amp; python3 -m http.server&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;server: &lt;code&gt;cd .shadow-cljs/program.cljs/ &amp;amp;&amp;amp; node public/main.js&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="outgrowing-runclj"&gt;&lt;a class="heading-link" href="#outgrowing-runclj"&gt;outgrowing runclj&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;if your project grows too large for a single file, copy the generated shadow-cljs project and continue working on that directly.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;runclj program.cljs&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;cp -r .shadow-cljs/program.cljs program&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;alternatively lift and shift to a project template like &lt;a href="https://github.com/nathants/aws-gocljs"&gt;aws-gocljs&lt;/a&gt;.&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/runclj</guid>
    </item>
    <item>
      <title>render</title>
      <link>https://nathants.com/projects/render</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;minimal web content should be easy.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a tool for rendering a single github markdown file to a single html file.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;python3 -m pip install git+https://github.com/nathants/render&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;render readme.md \
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;render&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;render github markdown to html&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;github.com/nathants/render&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;https://github.com/nathants/render&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; index.html

python3 -m http.server &lt;span class="pl-k"&gt;&amp;amp;&lt;/span&gt;

firefox localhost:8000&lt;/pre&gt;&lt;/div&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; render -h
usage: render [-h] [-f FOOTER] [-c CSS] github-markdown title subtitle header-link-name header-link-href

    render github markdown to html


positional arguments:
  github-markdown       -
  title                 -
  subtitle              -
  header-link-name      -
  header-link-href      -

optional arguments:
  -h, --help            show this help message and exit
  -f FOOTER, --footer FOOTER
                        ''
  -c CSS, --css CSS     ''
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="examples"&gt;&lt;a class="heading-link" href="#examples"&gt;examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;all of &lt;a href="https://nathants.com" rel="nofollow"&gt;nathants.com&lt;/a&gt; is created with render.&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/render</guid>
    </item>
    <item>
      <title>py-webengine</title>
      <link>https://nathants.com/projects/py-webengine</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;browser testing is annoying, brittle, and slow.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;easy and fun browser testing from python with pyqt6-webengine.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;execute javascript.&lt;/p&gt;
&lt;p&gt;inspect network requests.&lt;/p&gt;
&lt;p&gt;send native mouse and keyboard input.&lt;/p&gt;
&lt;p&gt;wait for values to show up on screen.&lt;/p&gt;
&lt;p&gt;make assertions.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;mac&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install python3
brew install qt6
python3 -m pip install git+https://github.com/nathants/py-webengine&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;linux&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;apt-get install -y qt6-webengine-dev                                &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; debian/ubuntu&lt;/span&gt;
pacman -S qt6-webengine                                             &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; arch&lt;/span&gt;
python3 -m pip install git+https://github.com/nathants/py-webengine &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; all platforms&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;docker&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker build -t py-webengine &lt;span class="pl-c1"&gt;.&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="api"&gt;&lt;a class="heading-link" href="#api"&gt;api&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-s1"&gt;webengine&lt;/span&gt;.&lt;span class="pl-v"&gt;Thread&lt;/span&gt;:

    &lt;span class="pl-s1"&gt;action_delay_seconds&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;.01&lt;/span&gt; &lt;span class="pl-c"&gt;# seconds between browser actions&lt;/span&gt;

    &lt;span class="pl-s1"&gt;timeout_seconds&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;10&lt;/span&gt; &lt;span class="pl-c"&gt;# maximum seconds to wait_attr()&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;js&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;code&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"execute javascript and return the result as a string"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;click&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;selector&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"send native mouse input at the center of the first element matching selector"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;type&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;value&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"send native keyboard input, one character at a time"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;enter&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"send native keyboard input enter"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;attr&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;selector&lt;/span&gt;, &lt;span class="pl-s1"&gt;attr&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"return the list of the attribute for all elements matching selector"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;wait_attr&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;selector&lt;/span&gt;, &lt;span class="pl-s1"&gt;attr&lt;/span&gt;, &lt;span class="pl-s1"&gt;value&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"wait for the attribute of all elements matching selector to have the given value"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;load&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;url&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"load url"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;screenshot&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;, &lt;span class="pl-s1"&gt;path&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"save a png or jpg at path"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;run&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"run the main method"&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;):
        &lt;span class="pl-s"&gt;"implement this method as your test"&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;subclass &lt;code&gt;webengine.Thread&lt;/code&gt; and implement &lt;code&gt;main()&lt;/code&gt;:&lt;/p&gt;
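&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;time&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;webengine&lt;/span&gt;

&lt;span class="pl-s1"&gt;host&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;'http://localhost:8000'&lt;/span&gt;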

&lt;span class="pl-k"&gt;class&lt;/span&gt; &lt;span class="pl-v"&gt;Main&lt;/span&gt;(&lt;span class="pl-s1"&gt;webengine&lt;/span&gt;.&lt;span class="pl-v"&gt;Thread&lt;/span&gt;):

    &lt;span class="pl-s1"&gt;action_delay_seconds&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;.025&lt;/span&gt;

    &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;(&lt;span class="pl-s1"&gt;self&lt;/span&gt;):

        &lt;span class="pl-c"&gt;# wait for http server to come up and the site to load properly&lt;/span&gt;
        &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;_&lt;/span&gt; &lt;span class="pl-c1"&gt;in&lt;/span&gt; &lt;span class="pl-en"&gt;range&lt;/span&gt;(&lt;span class="pl-c1"&gt;100&lt;/span&gt;):
            &lt;span class="pl-k"&gt;try&lt;/span&gt;:
                &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-en"&gt;load&lt;/span&gt;(&lt;span class="pl-s1"&gt;host&lt;/span&gt;)
                &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-en"&gt;wait_attr&lt;/span&gt;(&lt;span class="pl-s"&gt;'a'&lt;/span&gt;, &lt;span class="pl-s"&gt;'innerText'&lt;/span&gt;, [&lt;span class="pl-s"&gt;'home'&lt;/span&gt;, &lt;span class="pl-s"&gt;'files'&lt;/span&gt;, &lt;span class="pl-s"&gt;'api'&lt;/span&gt;, &lt;span class="pl-s"&gt;'websocket'&lt;/span&gt;])
            &lt;span class="pl-k"&gt;except&lt;/span&gt; &lt;span class="pl-v"&gt;Exception&lt;/span&gt;:
                &lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s"&gt;'wait for site to be ready'&lt;/span&gt;)
                &lt;span class="pl-s1"&gt;time&lt;/span&gt;.&lt;span class="pl-en"&gt;sleep&lt;/span&gt;(&lt;span class="pl-c1"&gt;.1&lt;/span&gt;)
            &lt;span class="pl-k"&gt;else&lt;/span&gt;:
                &lt;span class="pl-k"&gt;break&lt;/span&gt;
        &lt;span class="pl-k"&gt;else&lt;/span&gt;:
            &lt;span class="pl-k"&gt;assert&lt;/span&gt; &lt;span class="pl-c1"&gt;False&lt;/span&gt;

        &lt;span class="pl-c"&gt;# load the site&lt;/span&gt;
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-en"&gt;load&lt;/span&gt;(&lt;span class="pl-s1"&gt;host&lt;/span&gt;)

        &lt;span class="pl-c"&gt;# click on files and check contents&lt;/span&gt;
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-en"&gt;click&lt;/span&gt;(&lt;span class="pl-s"&gt;'a#files'&lt;/span&gt;)
        &lt;span class="pl-s1"&gt;self&lt;/span&gt;.&lt;span class="pl-en"&gt;wait_attr&lt;/span&gt;(&lt;span class="pl-s"&gt;"#content p"&lt;/span&gt;, &lt;span class="pl-s"&gt;'innerText'&lt;/span&gt;, [&lt;span class="pl-s"&gt;"files"&lt;/span&gt;])&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then invoke your test:&lt;/p&gt;
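&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;sys&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;pytest&lt;/span&gt;

&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;test&lt;/span&gt;():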
    &lt;span class="pl-c"&gt;# build your webapp&lt;/span&gt;
    &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;check_call&lt;/span&gt;(&lt;span class="pl-s"&gt;'gunzip --force --keep index.html.gz'&lt;/span&gt;, &lt;span class="pl-s1"&gt;shell&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
    &lt;span class="pl-c"&gt;# run your webapp&lt;/span&gt;
    &lt;span class="pl-s1"&gt;server&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-v"&gt;Popen&lt;/span&gt;(&lt;span class="pl-s"&gt;'python3 -m http.server'&lt;/span&gt;, &lt;span class="pl-s1"&gt;shell&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;)
    &lt;span class="pl-k"&gt;try&lt;/span&gt;:
        &lt;span class="pl-c"&gt;# run webengine&lt;/span&gt;
        &lt;span class="pl-s1"&gt;webengine&lt;/span&gt;.&lt;span class="pl-en"&gt;run_thread&lt;/span&gt;(&lt;span class="pl-v"&gt;Main&lt;/span&gt;, &lt;span class="pl-s1"&gt;devtools&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'horizontal'&lt;/span&gt;)
    &lt;span class="pl-k"&gt;finally&lt;/span&gt;:
        &lt;span class="pl-c"&gt;# stop webapp&lt;/span&gt;
        &lt;span class="pl-s1"&gt;server&lt;/span&gt;.&lt;span class="pl-en"&gt;terminate&lt;/span&gt;()

&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;__name__&lt;/span&gt; &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s"&gt;'__main__'&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;chdir&lt;/span&gt;(&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;dirname&lt;/span&gt;(&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-s1"&gt;path&lt;/span&gt;.&lt;span class="pl-en"&gt;abspath&lt;/span&gt;(&lt;span class="pl-s1"&gt;__file__&lt;/span&gt;)))
    &lt;span class="pl-s1"&gt;sys&lt;/span&gt;.&lt;span class="pl-en"&gt;exit&lt;/span&gt;(&lt;span class="pl-s1"&gt;pytest&lt;/span&gt;.&lt;span class="pl-en"&gt;main&lt;/span&gt;([&lt;span class="pl-s"&gt;'test.py'&lt;/span&gt;, &lt;span class="pl-s"&gt;'-svvx'&lt;/span&gt;, &lt;span class="pl-s"&gt;'--tb'&lt;/span&gt;, &lt;span class="pl-s"&gt;'native'&lt;/span&gt;]))&lt;/pre&gt;&lt;/div&gt;
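&lt;p&gt;the readiness loop at the top of &lt;code&gt;main()&lt;/code&gt; is a general retry pattern. a minimal sketch of factoring it out, assuming a hypothetical helper named &lt;code&gt;retry&lt;/code&gt; that is not part of py-webengine:&lt;/p&gt;

```python
import time


def retry(action, attempts=100, delay_seconds=0.1, sleep=time.sleep):
    """call action() until it stops raising, at most `attempts` times.

    hypothetical helper for illustration, not part of py-webengine.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as error:
            # not ready yet: remember the failure, wait, then try again
            last_error = error
            sleep(delay_seconds)
    raise AssertionError(f'gave up after {attempts} attempts') from last_error


# inside main(), the readiness loop could then read:
#
#     def ready():
#         self.load(host)
#         self.wait_attr('a', 'innerText', ['home', 'files', 'api', 'websocket'])
#
#     retry(ready)
```

&lt;p&gt;catching &lt;code&gt;Exception&lt;/code&gt; instead of using a bare &lt;code&gt;except&lt;/code&gt; keeps ctrl-c working while the loop waits.&lt;/p&gt;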
&lt;p&gt;to leave the browser open, insert somewhere in your test:&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-s1"&gt;time&lt;/span&gt;.&lt;span class="pl-en"&gt;sleep&lt;/span&gt;(&lt;span class="pl-c1"&gt;1000&lt;/span&gt;)&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;to drop into a python repl, first install &lt;a href="https://github.com/gotcha/ipdb"&gt;ipdb&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;python3 -m pip install ipdb&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;then insert somewhere in your test:&lt;/p&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;ipdb&lt;/span&gt;; &lt;span class="pl-s1"&gt;ipdb&lt;/span&gt;.&lt;span class="pl-en"&gt;set_trace&lt;/span&gt;()&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;run x11 docker:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run \
    -h &lt;span class="pl-smi"&gt;$HOSTNAME&lt;/span&gt; \
    -e XAUTHORITY=/code/.Xauthority \
    -v &lt;span class="pl-smi"&gt;$HOME&lt;/span&gt;/.Xauthority:/code/.Xauthority \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -e DISPLAY \
    --ipc host \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;pwd&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/example:/example \
    py-webengine \
    sh -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;python3 /example/test.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;run headless docker:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;docker run \
    -it \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;pwd&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/example:/example \
    py-webengine \
    sh -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;xvfb-run python3 /example/test.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;docker example:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; docker run -it --rm py-webengine

&lt;span class="pl-c1"&gt;wait&lt;/span&gt; for: a innerText [&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;home&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;files&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;api&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;websocket&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;]
&lt;span class="pl-c1"&gt;wait&lt;/span&gt; for: a href [&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;http://localhost:8000/#/home&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;http://localhost:8000/#/files&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;http://localhost:8000/#/api&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;, &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;http://localhost:8000/#/websocket&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;]
&lt;span class="pl-c1"&gt;wait&lt;/span&gt; for: &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;content innerText ['home']&lt;/span&gt;
&lt;span class="pl-c1"&gt;wait&lt;/span&gt; for: &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;content p innerText ['files']&lt;/span&gt;
&lt;span class="pl-c1"&gt;wait&lt;/span&gt; for: &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;content p innerText 'predicate(x)'&lt;/span&gt;
&lt;span class="pl-c1"&gt;wait&lt;/span&gt; for: &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;content p innerText ['a', 'b', 'c', 'Enter']&lt;/span&gt;
&lt;span class="pl-c1"&gt;wait&lt;/span&gt; for: &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;content innerText ['home']&lt;/span&gt;
PASSED&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;see the &lt;a href="https://github.com/nathants/py-webengine/blob/master/example/"&gt;example&lt;/a&gt; for detailed usage.&lt;/p&gt;
&lt;p&gt;in the example we will test the frontend from &lt;a href="https://github.com/nathants/aws-gocljs"&gt;aws-gocljs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;a live demo of that site is &lt;a href="https://gocljs.nathants.com" rel="nofollow"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="demo"&gt;&lt;a class="heading-link" href="#demo"&gt;demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/py-webengine/raw/master/demo.gif"&gt;&lt;img src="https://github.com/nathants/py-webengine/raw/master/demo.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/py-webengine</guid>
    </item>
    <item>
      <title>py-web</title>
      <link>https://nathants.com/projects/py-web</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;almost always, http servers should be simple and easy.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;an http library wrapping &lt;a href="http://www.tornadoweb.org/en/latest/" rel="nofollow"&gt;tornado&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/nathants/py-web
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; py-web
pip install -r requirements.txt &lt;span class="pl-c1"&gt;.&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="http-example"&gt;&lt;a class="heading-link" href="#http-example"&gt;http example&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;#!/usr/bin/env python3&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;logging&lt;/span&gt;
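&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;httpserver&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;ioloop&lt;/span&gt;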
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;

&lt;span class="pl-s1"&gt;logging&lt;/span&gt;.&lt;span class="pl-en"&gt;basicConfig&lt;/span&gt;(&lt;span class="pl-s1"&gt;level&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'INFO'&lt;/span&gt;)

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;handler&lt;/span&gt;(&lt;span class="pl-s1"&gt;request&lt;/span&gt;: &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Request&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Response&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;request&lt;/span&gt;[&lt;span class="pl-s"&gt;'query'&lt;/span&gt;].&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s"&gt;'size'&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;))
    &lt;span class="pl-s1"&gt;token&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;request&lt;/span&gt;[&lt;span class="pl-s"&gt;'kwargs'&lt;/span&gt;][&lt;span class="pl-s"&gt;'token'&lt;/span&gt;]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; {&lt;span class="pl-s"&gt;'code'&lt;/span&gt;: &lt;span class="pl-c1"&gt;200&lt;/span&gt;, &lt;span class="pl-s"&gt;'body'&lt;/span&gt;: &lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;token&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; size: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;}

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;fallback_handler&lt;/span&gt;(&lt;span class="pl-s1"&gt;request&lt;/span&gt;: &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Request&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Response&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;route&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;request&lt;/span&gt;[&lt;span class="pl-s"&gt;'args'&lt;/span&gt;][&lt;span class="pl-c1"&gt;0&lt;/span&gt;]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; {&lt;span class="pl-s"&gt;'code'&lt;/span&gt;: &lt;span class="pl-c1"&gt;200&lt;/span&gt;, &lt;span class="pl-s"&gt;'body'&lt;/span&gt;: &lt;span class="pl-s"&gt;f'no such route: /&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;route&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;, try: /hello/xyz?size=123'&lt;/span&gt;}

&lt;span class="pl-s1"&gt;routes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [(&lt;span class="pl-s"&gt;'/hello/:token'&lt;/span&gt;, {&lt;span class="pl-s"&gt;'get'&lt;/span&gt;: &lt;span class="pl-s1"&gt;handler&lt;/span&gt;}),
          (&lt;span class="pl-s"&gt;'/(.*)'&lt;/span&gt;,         {&lt;span class="pl-s"&gt;'get'&lt;/span&gt;: &lt;span class="pl-s1"&gt;fallback_handler&lt;/span&gt;})]

&lt;span class="pl-s1"&gt;app&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-en"&gt;app&lt;/span&gt;(&lt;span class="pl-s1"&gt;routes&lt;/span&gt;)
&lt;span class="pl-s1"&gt;server&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;httpserver&lt;/span&gt;.&lt;span class="pl-v"&gt;HTTPServer&lt;/span&gt;(&lt;span class="pl-s1"&gt;app&lt;/span&gt;)
&lt;span class="pl-s1"&gt;server&lt;/span&gt;.&lt;span class="pl-en"&gt;bind&lt;/span&gt;(&lt;span class="pl-c1"&gt;8080&lt;/span&gt;)
&lt;span class="pl-s1"&gt;server&lt;/span&gt;.&lt;span class="pl-en"&gt;start&lt;/span&gt;(&lt;span class="pl-c1"&gt;0&lt;/span&gt;)
&lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;ioloop&lt;/span&gt;.&lt;span class="pl-v"&gt;IOLoop&lt;/span&gt;.&lt;span class="pl-en"&gt;current&lt;/span&gt;().&lt;span class="pl-en"&gt;start&lt;/span&gt;()&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;$ curl localhost:8080/hello/world&lt;span class="pl-k"&gt;?&lt;/span&gt;size=3
world size: 3&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="https-example"&gt;&lt;a class="heading-link" href="#https-example"&gt;https example&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-python"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;#!/usr/bin/env python3&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;logging&lt;/span&gt;
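&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;httpserver&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;ioloop&lt;/span&gt;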
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;ssl&lt;/span&gt;
&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;

&lt;span class="pl-s1"&gt;check_call&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;lambda&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;a&lt;/span&gt;: &lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-en"&gt;check_call&lt;/span&gt;(&lt;span class="pl-s"&gt;' '&lt;/span&gt;.&lt;span class="pl-en"&gt;join&lt;/span&gt;(&lt;span class="pl-en"&gt;map&lt;/span&gt;(&lt;span class="pl-s1"&gt;str&lt;/span&gt;, &lt;span class="pl-s1"&gt;a&lt;/span&gt;)), &lt;span class="pl-s1"&gt;shell&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;True&lt;/span&gt;, &lt;span class="pl-s1"&gt;executable&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'/bin/bash'&lt;/span&gt;, &lt;span class="pl-s1"&gt;stderr&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;subprocess&lt;/span&gt;.&lt;span class="pl-v"&gt;STDOUT&lt;/span&gt;)

&lt;span class="pl-s1"&gt;logging&lt;/span&gt;.&lt;span class="pl-en"&gt;basicConfig&lt;/span&gt;(&lt;span class="pl-s1"&gt;level&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s"&gt;'INFO'&lt;/span&gt;)

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;handler&lt;/span&gt;(&lt;span class="pl-s1"&gt;request&lt;/span&gt;: &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Request&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Response&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;size&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;int&lt;/span&gt;(&lt;span class="pl-s1"&gt;request&lt;/span&gt;[&lt;span class="pl-s"&gt;'query'&lt;/span&gt;].&lt;span class="pl-en"&gt;get&lt;/span&gt;(&lt;span class="pl-s"&gt;'size'&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;))
    &lt;span class="pl-s1"&gt;token&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;request&lt;/span&gt;[&lt;span class="pl-s"&gt;'kwargs'&lt;/span&gt;][&lt;span class="pl-s"&gt;'token'&lt;/span&gt;]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; {&lt;span class="pl-s"&gt;'code'&lt;/span&gt;: &lt;span class="pl-c1"&gt;200&lt;/span&gt;, &lt;span class="pl-s"&gt;'body'&lt;/span&gt;: &lt;span class="pl-s"&gt;f'&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;token&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt; size: &lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;size&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;'&lt;/span&gt;}

&lt;span class="pl-k"&gt;async&lt;/span&gt; &lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;fallback_handler&lt;/span&gt;(&lt;span class="pl-s1"&gt;request&lt;/span&gt;: &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Request&lt;/span&gt;) &lt;span class="pl-c1"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-v"&gt;Response&lt;/span&gt;:
    &lt;span class="pl-s1"&gt;route&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;request&lt;/span&gt;[&lt;span class="pl-s"&gt;'args'&lt;/span&gt;][&lt;span class="pl-c1"&gt;0&lt;/span&gt;]
    &lt;span class="pl-k"&gt;return&lt;/span&gt; {&lt;span class="pl-s"&gt;'code'&lt;/span&gt;: &lt;span class="pl-c1"&gt;200&lt;/span&gt;, &lt;span class="pl-s"&gt;'body'&lt;/span&gt;: &lt;span class="pl-s"&gt;f'no such route: /&lt;span class="pl-s1"&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-s1"&gt;route&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;&lt;/span&gt;, try: /hello/XYZ'&lt;/span&gt;}

&lt;span class="pl-en"&gt;check_call&lt;/span&gt;(&lt;span class="pl-s"&gt;'openssl req -x509 -nodes -newkey rsa:4096 -keyout ssl.key -out ssl.crt -days 9999 -subj "/CN=localhost/O=Fake Name/C=US"'&lt;/span&gt;)
&lt;span class="pl-s1"&gt;options&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;ssl&lt;/span&gt;.&lt;span class="pl-en"&gt;create_default_context&lt;/span&gt;(&lt;span class="pl-s1"&gt;ssl&lt;/span&gt;.&lt;span class="pl-v"&gt;Purpose&lt;/span&gt;.&lt;span class="pl-v"&gt;CLIENT_AUTH&lt;/span&gt;)
&lt;span class="pl-s1"&gt;options&lt;/span&gt;.&lt;span class="pl-en"&gt;load_cert_chain&lt;/span&gt;(&lt;span class="pl-s"&gt;'ssl.crt'&lt;/span&gt;, &lt;span class="pl-s"&gt;'ssl.key'&lt;/span&gt;)

&lt;span class="pl-s1"&gt;routes&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; [(&lt;span class="pl-s"&gt;'/hello/:token'&lt;/span&gt;, {&lt;span class="pl-s"&gt;'get'&lt;/span&gt;: &lt;span class="pl-s1"&gt;handler&lt;/span&gt;}),
          (&lt;span class="pl-s"&gt;'/(.*)'&lt;/span&gt;,         {&lt;span class="pl-s"&gt;'get'&lt;/span&gt;: &lt;span class="pl-s1"&gt;fallback_handler&lt;/span&gt;})]

&lt;span class="pl-s1"&gt;app&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;web&lt;/span&gt;.&lt;span class="pl-en"&gt;app&lt;/span&gt;(&lt;span class="pl-s1"&gt;routes&lt;/span&gt;)
&lt;span class="pl-s1"&gt;server&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;httpserver&lt;/span&gt;.&lt;span class="pl-v"&gt;HTTPServer&lt;/span&gt;(&lt;span class="pl-s1"&gt;app&lt;/span&gt;, &lt;span class="pl-s1"&gt;ssl_options&lt;/span&gt;&lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-s1"&gt;options&lt;/span&gt;)
&lt;span class="pl-s1"&gt;server&lt;/span&gt;.&lt;span class="pl-en"&gt;bind&lt;/span&gt;(&lt;span class="pl-c1"&gt;8080&lt;/span&gt;)
&lt;span class="pl-s1"&gt;server&lt;/span&gt;.&lt;span class="pl-en"&gt;start&lt;/span&gt;(&lt;span class="pl-c1"&gt;0&lt;/span&gt;)
&lt;span class="pl-s1"&gt;tornado&lt;/span&gt;.&lt;span class="pl-s1"&gt;ioloop&lt;/span&gt;.&lt;span class="pl-v"&gt;IOLoop&lt;/span&gt;.&lt;span class="pl-en"&gt;current&lt;/span&gt;().&lt;span class="pl-en"&gt;start&lt;/span&gt;()&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;$ curl --cacert ssl.crt https://localhost:8080/hello/world&lt;span class="pl-k"&gt;?&lt;/span&gt;size=3
world size: 3&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/py-web</guid>
    </item>
    <item>
      <title>ptop</title>
      <link>https://nathants.com/projects/ptop</link>
      <description>
                
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a minimal htop alternative&lt;/p&gt;
&lt;h2 id="demo"&gt;&lt;a class="heading-link" href="#demo"&gt;demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/ptop/raw/master/demo.gif"&gt;&lt;img src="https://github.com/nathants/ptop/raw/master/demo.gif" alt="demo" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;python3 -m pip install git+https://github.com/nathants/ptop&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;ptop&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="help"&gt;&lt;a class="heading-link" href="#help"&gt;help&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;hit &lt;code&gt;h&lt;/code&gt; for help&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/ptop/raw/master/help.png"&gt;&lt;img src="https://github.com/nathants/ptop/raw/master/help.png" alt="help" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/ptop</guid>
    </item>
    <item>
      <title>notify</title>
      <link>https://nathants.com/projects/notify</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;sometimes popup messages are for great good.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a fullscreen terminal app, typically launched in a new single-use terminal, to display a popup message.&lt;/p&gt;
&lt;p&gt;an optional y/n prompt will change the exit code.&lt;/p&gt;
&lt;p&gt;an optional delay avoids accidental input.&lt;/p&gt;
&lt;h2 id="demo"&gt;&lt;a class="heading-link" href="#demo"&gt;demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/notify/raw/master/demo.gif"&gt;&lt;img src="https://github.com/nathants/notify/raw/master/demo.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;python3 -m pip install git+https://github.com/nathants/notify&lt;/code&gt;&lt;/p&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; notify -h

usage: notify [-h] [-d DELAY] [-p] msg

    notify the user of a message with a fullscreen popup. hit any key to exit.


positional arguments:
  msg                   message to display

options:
  -h, --help            show this &lt;span class="pl-c1"&gt;help&lt;/span&gt; message and &lt;span class="pl-c1"&gt;exit&lt;/span&gt;
  -d DELAY, --delay DELAY
                        delay seconds before accepting user input &lt;span class="pl-k"&gt;for&lt;/span&gt; prompt (default: 0.5)
  -p, --prompt          prompt the user y/n &lt;span class="pl-k"&gt;then&lt;/span&gt; &lt;span class="pl-c1"&gt;exit&lt;/span&gt; 0/1 (default: False)
&lt;/pre&gt;&lt;/div&gt;
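&lt;p&gt;because &lt;code&gt;--prompt&lt;/code&gt; exits 0 for y and 1 for n, a script can branch on the answer. a minimal sketch, assuming &lt;code&gt;notify&lt;/code&gt; is installed and on the path; the helper name &lt;code&gt;confirmed&lt;/code&gt; and the message text are made up for illustration:&lt;/p&gt;

```python
import subprocess


def confirmed(command):
    """run a y/n prompt command and return True when it exits 0 (yes)."""
    return subprocess.run(command).returncode == 0


# hypothetical usage, assuming notify is installed and on the path:
#
#     if confirmed(['notify', '--prompt', 'deploy to production?']):
#         print('deploying')
#     else:
#         print('cancelled')
```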

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/notify</guid>
    </item>
    <item>
      <title>go-hasdefer</title>
      <link>https://nathants.com/projects/go-hasdefer</link>
      <description>
                
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a go linter to check that all goroutines have a defer statement.&lt;/p&gt;
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;it's too easy to miss panics when they happen inside goroutines, since a goroutine runs outside the defer scope of its caller and the caller's recover never sees them.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;go install github.com/nathants/go-hasdefer@latest&lt;/code&gt;&lt;/p&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; go-hasdefer &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;find test/good/ -name &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*.go&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; go-hasdefer &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;find test/bad/ -name &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*.go&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
missing defer anon func oneliner:        test/bad/bad_onliner.go:4      go &lt;span class="pl-en"&gt;func&lt;/span&gt;() &lt;span class="pl-en"&gt;{}&lt;/span&gt;()
missing defer anon func multiliner:      test/bad/bad_multiliner.go:4   go &lt;span class="pl-en"&gt;func&lt;/span&gt;() {
missing defer top level func multiliner: test/bad/bad_import.go:12 func &lt;span class="pl-en"&gt;Foobar&lt;/span&gt;() {
missing defer top level func oneliner:   test/bad/bad_import.go:6 func (d &lt;span class="pl-k"&gt;*&lt;/span&gt;Data) &lt;span class="pl-en"&gt;Foobar2&lt;/span&gt;() {}
missing defer named func multiliner:     test/bad/bad_imported.go:8     Foobar3 := &lt;span class="pl-en"&gt;func&lt;/span&gt;() {
missing defer top level func multiliner: test/bad/bad_import.go:8 func (d &lt;span class="pl-k"&gt;*&lt;/span&gt;Data) &lt;span class="pl-en"&gt;Foobar4&lt;/span&gt;() {
missing defer named func oneliner:       test/bad/bad_imported.go:12    Foobar4 := &lt;span class="pl-en"&gt;func&lt;/span&gt;() {}&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="notes"&gt;&lt;a class="heading-link" href="#notes"&gt;notes&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;either of these is valid:&lt;/p&gt;
&lt;div class="highlight highlight-source-go"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;go&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;() {
    &lt;span class="pl-k"&gt;defer&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;() {}()
}()&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-go"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;go&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;() {
   &lt;span class="pl-c"&gt;// defer func() {}()&lt;/span&gt;
}()&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;the purpose of this linter is to force you to consider what happens if a goroutine panics, not to force you to include empty defers.&lt;/p&gt;
&lt;p&gt;this is similar to always considering what to do with an err, even if you decide to assign it to &lt;code&gt;_&lt;/code&gt;.&lt;/p&gt;
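&lt;p&gt;as a minimal sketch of one way to act on that decision, the goroutine below recovers a panic and hands it back to the caller as an error. &lt;code&gt;runSafely&lt;/code&gt; is a hypothetical helper written for this note, not part of go-hasdefer.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// runSafely is a hypothetical helper: it runs work in a goroutine and
// reports a panic as an error instead of crashing the process.
func runSafely(work func()) error {
	var err error
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		// defers run last-in first-out: recover first, then signal done.
		defer wg.Done()
		defer func() {
			if r := recover(); r != nil {
				err = fmt.Errorf("goroutine panicked: %v", r)
			}
		}()
		work()
	}()
	wg.Wait()
	return err
}

func main() {
	fmt.Println(runSafely(func() {}))
	fmt.Println(runSafely(func() { panic("boom") }))
}
```

&lt;p&gt;whether to recover, log, or let the process die is the choice the linter asks you to make explicitly.&lt;/p&gt;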

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/go-hasdefer</guid>
    </item>
    <item>
      <title>go-hasdefault</title>
      <link>https://nathants.com/projects/go-hasdefault</link>
      <description>
                
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a go linter to check that all switch statements have a default case.&lt;/p&gt;
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;sometimes a missing default is an error.&lt;/p&gt;
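&lt;p&gt;a small illustration of the kind of switch this is about, assuming a hypothetical &lt;code&gt;portFor&lt;/code&gt; function: without the default case, an unknown input would fall through silently and return zero values.&lt;/p&gt;

```go
package main

import "fmt"

// portFor is a hypothetical example function, not part of go-hasdefault.
func portFor(scheme string) (int, error) {
	switch scheme {
	case "http":
		return 80, nil
	case "https":
		return 443, nil
	default:
		// the default case turns an unexpected value into a visible error.
		return 0, fmt.Errorf("unknown scheme: %q", scheme)
	}
}

func main() {
	fmt.Println(portFor("https"))
	fmt.Println(portFor("ftp"))
}
```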
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;go install github.com/nathants/go-hasdefault@latest&lt;/code&gt;&lt;/p&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; go-hasdefault &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;find test/good/ -name &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*.go&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; go-hasdefault &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;find test/bad/ -name &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*.go&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;
test/bad/bad.go:3: switch statement missing default &lt;span class="pl-k"&gt;case&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/go-hasdefault</guid>
    </item>
    <item>
      <title>go-dynamolock</title>
      <link>https://nathants.com/projects/go-dynamolock</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;locking around dynamodb should be simple and easy.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a minimal go library for locking around dynamodb.&lt;/p&gt;
&lt;p&gt;compared &lt;a href="https://github.com/cirello-io/dynamolock"&gt;to&lt;/a&gt; &lt;a href="https://github.com/Clever/dynamodb-lock-go"&gt;alternatives&lt;/a&gt; it has less code and fewer features.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a record in dynamodb uses a uuid and a timestamp to coordinate callers.&lt;/p&gt;
&lt;p&gt;to lock, a caller finds the uuid missing and adds it.&lt;/p&gt;
&lt;p&gt;while locked, the caller heartbeats the timestamp.&lt;/p&gt;
&lt;p&gt;to unlock, the caller removes the uuid.&lt;/p&gt;
&lt;p&gt;arbitrary data can be stored atomically in the lock record. it is read via lock, and written via unlock.&lt;/p&gt;
&lt;p&gt;manipulation of external state while the lock is held is subject to concurrent updates depending on maxAge, heartbeatInterval, and caller clock drift.&lt;/p&gt;
&lt;p&gt;in practice, a small heartbeatInterval, a large maxAge, and reasonable clock drift should be &lt;a href="https://en.wikipedia.org/wiki/Lease_(computer_science)" rel="nofollow"&gt;safe&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;prefer to store data within the lock when possible.&lt;/p&gt;
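&lt;p&gt;the protocol above can be sketched with an in-memory map standing in for the dynamodb table. this is an illustration under assumptions, not the library's implementation: the &lt;code&gt;table&lt;/code&gt; type, its methods, and &lt;code&gt;scenario&lt;/code&gt; are hypothetical names, and against real dynamodb each check-then-write would presumably need to be a single conditional write.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// record mimics one lock item: the holder's uuid and its last heartbeat.
type record struct {
	UUID      string
	Heartbeat time.Time
}

// table is an in-memory stand-in for the dynamodb table (illustration only).
type table map[string]record

// lock succeeds when the uuid is missing, or when the holder has not
// heartbeated within maxAge and is presumed dead.
func (t table) lock(id, uuid string, now time.Time, maxAge time.Duration) error {
	rec := t[id]
	if rec.UUID != "" {
		if now.Before(rec.Heartbeat.Add(maxAge)) {
			return errors.New("lock is held")
		}
	}
	t[id] = record{UUID: uuid, Heartbeat: now}
	return nil
}

// heartbeat refreshes the timestamp, but only for the current holder.
func (t table) heartbeat(id, uuid string, now time.Time) error {
	if t[id].UUID != uuid {
		return errors.New("lock was lost")
	}
	t[id] = record{UUID: uuid, Heartbeat: now}
	return nil
}

// unlock removes the uuid, but only for the current holder.
func (t table) unlock(id, uuid string) error {
	if t[id].UUID != uuid {
		return errors.New("lock was lost")
	}
	t[id] = record{}
	return nil
}

// scenario replays two callers, a and b, contending for one lock.
func scenario() []error {
	t := table{}
	maxAge := 30 * time.Second
	start := time.Unix(0, 0)
	at := func(seconds int) time.Time {
		return start.Add(time.Duration(seconds) * time.Second)
	}
	return []error{
		t.lock("lock1", "a", at(0), maxAge),  // a acquires the free lock
		t.lock("lock1", "b", at(1), maxAge),  // b is refused, a is alive
		t.heartbeat("lock1", "a", at(2)),     // a keeps the lock fresh
		t.lock("lock1", "b", at(40), maxAge), // a went silent, b takes over
		t.unlock("lock1", "a"),               // a learns it lost the lock
		t.unlock("lock1", "b"),               // b releases normally
	}
}

func main() {
	for i, err := range scenario() {
		fmt.Println(i, err)
	}
}
```

&lt;p&gt;the fourth step is the case the maxAge warning describes: if caller a were only slow rather than dead, a and b could both believe they hold the lock until a's next heartbeat fails.&lt;/p&gt;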
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;go get github.com/nathants/go-dynamolock&lt;/code&gt;&lt;/p&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-go"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;package&lt;/span&gt; main

&lt;span class="pl-k"&gt;import&lt;/span&gt; (
	&lt;span class="pl-s"&gt;"context"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"time"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"github.com/nathants/go-dynamolock"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"github.com/aws/aws-sdk-go/service/dynamodb/dynamodbattribute"&lt;/span&gt;
)


&lt;span class="pl-k"&gt;type&lt;/span&gt; &lt;span class="pl-smi"&gt;Data&lt;/span&gt; &lt;span class="pl-k"&gt;struct&lt;/span&gt; {
    &lt;span class="pl-c1"&gt;Value&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;
}

&lt;span class="pl-k"&gt;func&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;() {
	&lt;span class="pl-s1"&gt;ctx&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s1"&gt;context&lt;/span&gt;.&lt;span class="pl-en"&gt;Background&lt;/span&gt;()

	&lt;span class="pl-c"&gt;// dynamodb table&lt;/span&gt;
	&lt;span class="pl-s1"&gt;table&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s"&gt;"table"&lt;/span&gt;

	&lt;span class="pl-c"&gt;// dynamodb key&lt;/span&gt;
	&lt;span class="pl-s1"&gt;id&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s"&gt;"lock1"&lt;/span&gt;

	&lt;span class="pl-c"&gt;// after a failure to unlock/heartbeat, this much time must pass before lock is available&lt;/span&gt;
	&lt;span class="pl-s1"&gt;maxAge&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s1"&gt;time&lt;/span&gt;.&lt;span class="pl-c1"&gt;Second&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;30&lt;/span&gt;

	&lt;span class="pl-c"&gt;// how often to heartbeat lock timestamp&lt;/span&gt;
	&lt;span class="pl-s1"&gt;heartbeat&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s1"&gt;time&lt;/span&gt;.&lt;span class="pl-c1"&gt;Second&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;

	&lt;span class="pl-c"&gt;// lock and read data&lt;/span&gt;
	&lt;span class="pl-s1"&gt;unlock&lt;/span&gt;, &lt;span class="pl-s1"&gt;item&lt;/span&gt;, &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s1"&gt;dynamolock&lt;/span&gt;.&lt;span class="pl-en"&gt;Lock&lt;/span&gt;(&lt;span class="pl-s1"&gt;ctx&lt;/span&gt;, &lt;span class="pl-s1"&gt;table&lt;/span&gt;, &lt;span class="pl-s1"&gt;id&lt;/span&gt;, &lt;span class="pl-s1"&gt;maxAge&lt;/span&gt;, &lt;span class="pl-s1"&gt;heartbeat&lt;/span&gt;)
	&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;!=&lt;/span&gt; &lt;span class="pl-c1"&gt;nil&lt;/span&gt; {
		&lt;span class="pl-c"&gt;// TODO handle lock contention&lt;/span&gt;
		&lt;span class="pl-en"&gt;panic&lt;/span&gt;(&lt;span class="pl-s1"&gt;err&lt;/span&gt;)
	}
	&lt;span class="pl-s1"&gt;data&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-smi"&gt;Data&lt;/span&gt;{}
	&lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dynamodbattribute&lt;/span&gt;.&lt;span class="pl-en"&gt;UnmarshalMap&lt;/span&gt;(&lt;span class="pl-s1"&gt;item&lt;/span&gt;, &lt;span class="pl-s1"&gt;data&lt;/span&gt;)
	&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;!=&lt;/span&gt; &lt;span class="pl-c1"&gt;nil&lt;/span&gt; {
		&lt;span class="pl-en"&gt;panic&lt;/span&gt;(&lt;span class="pl-s1"&gt;err&lt;/span&gt;)
	}

	&lt;span class="pl-c"&gt;// do work with the lock&lt;/span&gt;
	&lt;span class="pl-s1"&gt;time&lt;/span&gt;.&lt;span class="pl-en"&gt;Sleep&lt;/span&gt;(&lt;span class="pl-s1"&gt;time&lt;/span&gt;.&lt;span class="pl-c1"&gt;Second&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt; &lt;span class="pl-c1"&gt;1&lt;/span&gt;)
	&lt;span class="pl-s1"&gt;data&lt;/span&gt;.&lt;span class="pl-c1"&gt;Value&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s"&gt;"updated"&lt;/span&gt;

	&lt;span class="pl-c"&gt;// unlock and write data&lt;/span&gt;
	&lt;span class="pl-s1"&gt;item&lt;/span&gt;, &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;dynamodbattribute&lt;/span&gt;.&lt;span class="pl-en"&gt;MarshalMap&lt;/span&gt;(&lt;span class="pl-s1"&gt;data&lt;/span&gt;)
	&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;!=&lt;/span&gt; &lt;span class="pl-c1"&gt;nil&lt;/span&gt; {
		&lt;span class="pl-en"&gt;panic&lt;/span&gt;(&lt;span class="pl-s1"&gt;err&lt;/span&gt;)
	}
	&lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;unlock&lt;/span&gt;(&lt;span class="pl-s1"&gt;item&lt;/span&gt;)
	&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;!=&lt;/span&gt; &lt;span class="pl-c1"&gt;nil&lt;/span&gt; {
		&lt;span class="pl-en"&gt;panic&lt;/span&gt;(&lt;span class="pl-s1"&gt;err&lt;/span&gt;)
	}
}&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/go-dynamolock</guid>
    </item>
    <item>
      <title>c-argh</title>
      <link>https://nathants.com/projects/c-argh</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;minimal &lt;a href="https://github.com/nathants/c-argh/blob/master/example.c"&gt;argument parsing&lt;/a&gt; shouldn't require a hundred lines of code.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;header only argument parsing for c inspired by the simplicity of &lt;a href="https://pythonhosted.org/argh/" rel="nofollow"&gt;argh&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="example"&gt;&lt;a class="heading-link" href="#example"&gt;example&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-c"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;&amp;lt;stdbool.h&amp;gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;&amp;lt;string.h&amp;gt;&lt;/span&gt;
&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"argh.h"&lt;/span&gt;

&lt;span class="pl-k"&gt;#define&lt;/span&gt; &lt;span class="pl-c1"&gt;USAGE&lt;/span&gt; "example [-l|--lz4] [-h N|--head N] [-p|--prefix] POS1 ... POSN"

&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;(&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-s1"&gt;argc&lt;/span&gt;, &lt;span class="pl-smi"&gt;char&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;argv&lt;/span&gt;) {
    &lt;span class="pl-smi"&gt;bool&lt;/span&gt; &lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; false;
    &lt;span class="pl-smi"&gt;bool&lt;/span&gt; &lt;span class="pl-s1"&gt;lz4&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; false;
    &lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-s1"&gt;head&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;;
    &lt;span class="pl-smi"&gt;ARGH_PARSE&lt;/span&gt; {
        &lt;span class="pl-en"&gt;ARGH_NEXT&lt;/span&gt;();
        &lt;span class="pl-k"&gt;if&lt;/span&gt;      &lt;span class="pl-en"&gt;ARGH_BOOL&lt;/span&gt;(&lt;span class="pl-s"&gt;"-p"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--prefix"&lt;/span&gt;) { &lt;span class="pl-s1"&gt;prefix&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; true; }
        &lt;span class="pl-smi"&gt;else&lt;/span&gt; &lt;span class="pl-s1"&gt;if&lt;/span&gt; &lt;span class="pl-en"&gt;ARGH_BOOL&lt;/span&gt;(&lt;span class="pl-s"&gt;"-l"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--lz4"&lt;/span&gt;)    { &lt;span class="pl-s1"&gt;lz4&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; true; }
        &lt;span class="pl-smi"&gt;else&lt;/span&gt; &lt;span class="pl-s1"&gt;if&lt;/span&gt; &lt;span class="pl-en"&gt;ARGH_FLAG&lt;/span&gt;(&lt;span class="pl-s"&gt;"-h"&lt;/span&gt;, &lt;span class="pl-s"&gt;"--head"&lt;/span&gt;)   { &lt;span class="pl-s1"&gt;head&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;atol&lt;/span&gt;(&lt;span class="pl-en"&gt;ARGH_VAL&lt;/span&gt;()); }
    }
    &lt;span class="pl-en"&gt;printf&lt;/span&gt;(&lt;span class="pl-s"&gt;"head: %d, prefix: %d, lz4: %d\n"&lt;/span&gt;, &lt;span class="pl-s1"&gt;head&lt;/span&gt;, &lt;span class="pl-s1"&gt;prefix&lt;/span&gt;, &lt;span class="pl-s1"&gt;lz4&lt;/span&gt;);
    &lt;span class="pl-k"&gt;for&lt;/span&gt; (&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt; &lt;span class="pl-c1"&gt;ARGH_ARGC&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;)
        &lt;span class="pl-en"&gt;printf&lt;/span&gt;(&lt;span class="pl-s"&gt;"positional arg %d: %s\n"&lt;/span&gt;, &lt;span class="pl-s1"&gt;i&lt;/span&gt;, &lt;span class="pl-c1"&gt;ARGH_ARGV&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;]);
}&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; make

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./example -lph5 asdf 123
head: 5, prefix: 1, lz4: 1
pos arg 0: asdf
pos arg 1: 123

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./example --lz4 asdf -p 123 --head 5
head: 5, prefix: 1, lz4: 1
pos arg 0: asdf
pos arg 1: 123


&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./example asdf 123 --head 5 --lz4
head: 5, prefix: 0, lz4: 1
pos arg 0: asdf
pos arg 1: 123&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/c-argh</guid>
    </item>
    <item>
      <title>backup</title>
      <link>https://nathants.com/projects/backup</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;backups should be simple and easy.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;easily create immutable, trustless backups with revision history, compression, and file deduplication.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the index, tracked in git, contains filesystem metadata.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the &lt;a href="./examples/index"&gt;index&lt;/a&gt; is a sorted tsv file of: &lt;code&gt;path, tarball, hash, size, mode&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;for every line of metadata in the index, there is one and only one tarball containing a file with that hash.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;duplicate files, by &lt;a href="https://www.blake2.net/" rel="nofollow"&gt;blake2b&lt;/a&gt; hash, are never stored.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the index is encrypted with &lt;a href="https://github.com/spwhitton/git-remote-gcrypt"&gt;git-remote-gcrypt&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the tarballs are split into chunks, compressed with &lt;a href="https://github.com/lz4/lz4"&gt;lz4&lt;/a&gt;, then encrypted with &lt;a href="https://gnupg.org/" rel="nofollow"&gt;gpg&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;all remote storage is handled via &lt;a href="https://rclone.org/" rel="nofollow"&gt;rclone&lt;/a&gt; on any &lt;a href="https://rclone.org/overview/#features" rel="nofollow"&gt;backend&lt;/a&gt; it supports.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the &lt;a href="./examples/ignore"&gt;ignore&lt;/a&gt; file, tracked in git, contains one regex per line of file paths to ignore.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;a clean restore will clone the git repo, checkout a revision, select file paths by regex, gather needed tarball names, fetch tarballs from storage, and extract the selected files.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
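&lt;p&gt;the "select file paths by regex, gather needed tarball names" part of a restore can be sketched against the index format described above. this is a rough illustration, not the project's code (the tools themselves are shell and python): &lt;code&gt;tarballsFor&lt;/code&gt; is a hypothetical function and the sample index rows are made up.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// tarballsFor returns the sorted, unique tarball names needed to restore
// every path in the index that matches pattern.
// index columns, tab separated: path, tarball, hash, size, mode.
func tarballsFor(index, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	for _, line := range strings.Split(index, "\n") {
		if line == "" {
			continue
		}
		cols := strings.Split(line, "\t")
		if len(cols) != 5 {
			return nil, fmt.Errorf("malformed index line: %q", line)
		}
		if re.MatchString(cols[0]) {
			seen[cols[1]] = true
		}
	}
	names := []string{}
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	// made-up index rows, for illustration only
	index := "docs/a.txt\ttarball-1\thash-a\t12\t644\n" +
		"docs/b.txt\ttarball-2\thash-b\t34\t644\n" +
		"pics/c.png\ttarball-1\thash-c\t56\t644\n"
	fmt.Println(tarballsFor(index, "^docs/"))
}
```

&lt;p&gt;because several files can share a tarball, the set of tarballs to fetch is usually smaller than the set of files selected.&lt;/p&gt;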
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;backup-add&lt;/code&gt; - scan the filesystem for changes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-diff&lt;/code&gt; - inspect the uncommitted backup diff.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-ignore&lt;/code&gt; - if needed, edit the ignore regexes, then goto &lt;code&gt;backup-add&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-commit&lt;/code&gt; - commit the backup diff to remote storage.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-find&lt;/code&gt; - search for files in the index by regex at revision.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-restore&lt;/code&gt; - restore files from remote storage by regex at revision.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="dependencies"&gt;&lt;a class="heading-link" href="#dependencies"&gt;dependencies&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;awk&lt;/li&gt;
&lt;li&gt;bash&lt;/li&gt;
&lt;li&gt;cat&lt;/li&gt;
&lt;li&gt;git&lt;/li&gt;
&lt;li&gt;git-remote-gcrypt&lt;/li&gt;
&lt;li&gt;gpg&lt;/li&gt;
&lt;li&gt;grep&lt;/li&gt;
&lt;li&gt;lz4&lt;/li&gt;
&lt;li&gt;python3&lt;/li&gt;
&lt;li&gt;rclone&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="installation"&gt;&lt;a class="heading-link" href="#installation"&gt;installation&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;put &lt;code&gt;bin/&lt;/code&gt; on &lt;code&gt;$PATH&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sudo mv bin/* /usr/local/bin&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="setup"&gt;&lt;a class="heading-link" href="#setup"&gt;setup&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;add some environment variables to your bashrc:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;export BACKUP_ROOT=~&lt;/code&gt; - root directory to backup&lt;/p&gt;
&lt;p&gt;&lt;code&gt;export BACKUP_RCLONE_REMOTE=$REMOTE&lt;/code&gt; - a remote set up with &lt;code&gt;rclone config&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;export BACKUP_DESTINATION=$BUCKET/backups/$(hostname)&lt;/code&gt; - where to rclone data to&lt;/p&gt;
&lt;p&gt;&lt;code&gt;export BACKUP_CHUNK_MEGABYTES=100&lt;/code&gt; - approximate size of each tarball before compression&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;have a gpg key and a gpg.conf that looks like the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; cat ~/.gnupg/gpg.conf

default-key YOUR@EMAIL.COM
default-recipient YOUR@EMAIL.COM

personal-cipher-preferences AES256
personal-digest-preferences SHA512
personal-compress-preferences Uncompressed
default-preference-list SHA512 AES256 Uncompressed
cert-digest-algo SHA512
s2k-cipher-algo AES256
s2k-digest-algo SHA512
s2k-mode 3
s2k-count 65011712
disable-cipher-algo 3DES
weak-digest SHA1
force-mdc
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="api"&gt;&lt;a class="heading-link" href="#api"&gt;api&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;modify backup state:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;backup-add&lt;/code&gt; - scan the filesystem for changes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-commit&lt;/code&gt; - commit the backup diff to remote storage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-ignore&lt;/code&gt; - edit the ignore file in $EDITOR&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-reset&lt;/code&gt; - clear uncommitted backup state&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;view backup state:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;backup-additions-sizes&lt;/code&gt; - show large files in the uncommitted backup diff&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-additions&lt;/code&gt; - inspect the uncommitted backup diff, additions only&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-diff&lt;/code&gt; - inspect the uncommitted backup diff&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-find&lt;/code&gt; - find files by regex at revision&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-index&lt;/code&gt; - view the backup index&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;backup-log&lt;/code&gt; - view the git log&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;restore backup content:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;backup-restore&lt;/code&gt; - restore files from remote storage by regex at revision&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="test"&gt;&lt;a class="heading-link" href="#test"&gt;test&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;export BACKUP_TEST_RCLONE_REMOTE=$REMOTE
export BACKUP_TEST_DESTINATION=$BUCKET/test
tox
&lt;/code&gt;&lt;/pre&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/backup</guid>
    </item>
    <item>
      <title>aws-gocljs</title>
      <link>https://nathants.com/projects/aws-gocljs</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;fullstack web should be easy and fun.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;start with working implementations of everything, then &lt;a href="#sdlc-demo"&gt;tinker and tweak&lt;/a&gt; until your app is complete!&lt;/p&gt;
&lt;p&gt;fast and reliable &lt;a href="https://github.com/nathants/aws-gocljs/tree/master/bin"&gt;automation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;easy browser &lt;a href="https://github.com/nathants/py-webengine"&gt;testing&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a project scaffold for a fullstack webapp on aws with an &lt;a href="https://github.com/nathants/aws-gocljs/tree/master/infra.yaml"&gt;infrastructure set&lt;/a&gt; ready-to-deploy with &lt;a href="https://github.com/nathants/libaws"&gt;libaws&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;the project scaffold contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a go lambda backend.&lt;/li&gt;
&lt;li&gt;a clojurescript and &lt;a href="http://reagent-project.github.io/" rel="nofollow"&gt;react&lt;/a&gt; frontend.&lt;/li&gt;
&lt;li&gt;s3 and dynamodb for state.&lt;/li&gt;
&lt;li&gt;http and websocket apis.&lt;/li&gt;
&lt;li&gt;low latency &lt;a href="https://github.com/nathants/aws-gocljs/tree/master/bin/logs.sh"&gt;logging&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;automated &lt;a href="https://github.com/nathants/aws-gocljs/tree/master/bin"&gt;devops&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;a live demo on aws is &lt;a href="https://gocljs.nathants.com" rel="nofollow"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="lambda-zip"&gt;&lt;a class="heading-link" href="#lambda-zip"&gt;lambda zip&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;the lambda zip contains only 3 files:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ls -lh &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $9, $5}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; column -t

favicon.png    2.7K &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; favicon&lt;/span&gt;
index.html.gz  296K &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; web app&lt;/span&gt;
main           15M  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; go binary&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;the index.html.gz:&lt;/p&gt;
&lt;div class="highlight highlight-text-html-basic"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;&amp;lt;!DOCTYPE html&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;meta&lt;/span&gt; &lt;span class="pl-c1"&gt;charset&lt;/span&gt;="&lt;span class="pl-s"&gt;utf-8&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;meta&lt;/span&gt; &lt;span class="pl-c1"&gt;http-equiv&lt;/span&gt;="&lt;span class="pl-s"&gt;Content-Security-Policy&lt;/span&gt;" &lt;span class="pl-c1"&gt;content&lt;/span&gt;="&lt;span class="pl-s"&gt;script-src 'sha256-${JS_SHA256}'&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;link&lt;/span&gt; &lt;span class="pl-c1"&gt;rel&lt;/span&gt;="&lt;span class="pl-s"&gt;icon&lt;/span&gt;" &lt;span class="pl-c1"&gt;href&lt;/span&gt;="&lt;span class="pl-s"&gt;/favicon.png&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;head&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt; &lt;span class="pl-c1"&gt;id&lt;/span&gt;="&lt;span class="pl-s"&gt;app&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;div&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="pl-kos"&gt;&amp;lt;&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt; &lt;span class="pl-c1"&gt;type&lt;/span&gt;="&lt;span class="pl-s"&gt;text/javascript&lt;/span&gt;"&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="pl-s1"&gt;$&lt;/span&gt;&lt;span class="pl-kos"/&gt;&lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-c1"&gt;JS&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;
        &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;script&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;body&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="pl-kos"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="pl-ent"&gt;html&lt;/span&gt;&lt;span class="pl-kos"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
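&lt;p&gt;the &lt;code&gt;${JS_SHA256}&lt;/code&gt; placeholder in the Content-Security-Policy is a base64 encoded sha256 digest of the inline script, so the browser only runs that exact script. a minimal sketch of computing such a value, assuming a hypothetical &lt;code&gt;cspScriptSrc&lt;/code&gt; helper rather than the project's actual build script:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// cspScriptSrc returns a script-src value that allows exactly one inline script.
// js must be the exact text between the script tags, whitespace included,
// because that is the text the browser hashes.
func cspScriptSrc(js string) string {
	sum := sha256.Sum256([]byte(js))
	return "script-src 'sha256-" + base64.StdEncoding.EncodeToString(sum[:]) + "'"
}

func main() {
	// hypothetical inline script, standing in for the compiled frontend
	fmt.Println(cspScriptSrc("console.log('hello');"))
}
```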
&lt;p&gt;the lambda zip itself:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ls -lh &lt;span class="pl-k"&gt;|&lt;/span&gt; awk &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;{print $9, $5}&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
lambda.zip 4.6M&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="sdlc-demo"&gt;&lt;a class="heading-link" href="#sdlc-demo"&gt;sdlc demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/aws-gocljs/raw/master/demo.gif"&gt;&lt;img src="https://github.com/nathants/aws-gocljs/raw/master/demo.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="dependencies"&gt;&lt;a class="heading-link" href="#dependencies"&gt;dependencies&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;use the included &lt;a href="./Dockerfile"&gt;Dockerfile&lt;/a&gt; or install the following dependencies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;npm&lt;/li&gt;
&lt;li&gt;jdk&lt;/li&gt;
&lt;li&gt;go&lt;/li&gt;
&lt;li&gt;bash&lt;/li&gt;
&lt;li&gt;&lt;a href="https://formulae.brew.sh/formula/entr" rel="nofollow"&gt;entr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nathants/libaws"&gt;libaws&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="aws-prerequisites"&gt;&lt;a class="heading-link" href="#aws-prerequisites"&gt;aws prerequisites&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;aws &lt;a href="https://console.aws.amazon.com/route53/v2/hostedzones" rel="nofollow"&gt;route53&lt;/a&gt; has the domain or its parent from env.sh&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;aws &lt;a href="https://us-west-2.console.aws.amazon.com/acm/home" rel="nofollow"&gt;acm&lt;/a&gt; has a wildcard cert for the domain or its parent from env.sh&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;git clone https://github.com/nathants/aws-gocljs
&lt;span class="pl-c1"&gt;cd&lt;/span&gt; aws-gocljs
cp env.sh.template env.sh &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; update values&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;bash bin/check.sh         &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; lint&lt;/span&gt;
bash bin/preview.sh       &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; preview changes to aws infra&lt;/span&gt;
bash bin/ensure.sh        &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; ensure aws infra&lt;/span&gt;
bash bin/dev.sh           &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; iterate on backend and frontend&lt;/span&gt;
bash bin/logs.sh          &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; tail the logs&lt;/span&gt;
bash bin/delete.sh        &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; delete aws infra&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage-with-bad-upload-bandwidth"&gt;&lt;a class="heading-link" href="#usage-with-bad-upload-bandwidth"&gt;usage with bad upload bandwidth:&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; bash bin/dev.sh         # this needs upload bandwidth&lt;/span&gt;
bash bin/dev_frontend.sh  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; iterate on localhost frontend&lt;/span&gt;
bash bin/relay.sh         &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; iterate on backend via ec2 relay&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/aws-gocljs</guid>
    </item>
    <item>
      <title>aws-exec</title>
      <link>https://nathants.com/projects/aws-exec</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;building services on lambda should be easy and fun.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a project scaffold for a backend service on aws with an &lt;a href="https://github.com/nathants/aws-exec/blob/master/infra.yaml"&gt;infrastructure set&lt;/a&gt; ready-to-deploy with &lt;a href="https://github.com/nathants/libaws"&gt;libaws&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;the project scaffold makes it easy to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;authenticate callers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;implement fast synchronous apis that return all results immediately.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;implement slow asynchronous apis with streaming logs, exit code, and 15 minutes max duration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;use the &lt;a href="#web-demo"&gt;web&lt;/a&gt; admin interface, even from a &lt;a href="#mobile-demo"&gt;phone&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;use the &lt;a href="#cli-demo"&gt;cli&lt;/a&gt; admin interface, executing locally or on lambda.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;use the &lt;a href="#api-demo"&gt;api&lt;/a&gt; interface, calling efficiently from other backend services.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;synchronous apis are normal http on lambda.&lt;/p&gt;
&lt;p&gt;asynchronous apis are an http post that triggers an async lambda which invokes a command via &lt;a href="https://github.com/nathants/aws-exec/tree/master/cmd/rpc/rpc.go"&gt;rpc&lt;/a&gt; or &lt;a href="https://github.com/nathants/aws-exec/tree/master/cmd/exec/exec.go"&gt;subprocess&lt;/a&gt; and stores the results in s3.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;each invocation creates 3 objects in s3:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;log: all stdout and stderr, updated in its entirety every second.&lt;/li&gt;
&lt;li&gt;exit: the exit code of the command, written once.&lt;/li&gt;
&lt;li&gt;size: the size in bytes of the log after the final update, written once, written last.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;objects are stored in either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;aws-exec private s3.&lt;/li&gt;
&lt;li&gt;presigned s3 put urls provided by the caller.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;to follow invocation status, the caller:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;polls the log object with increasing range-start.&lt;/li&gt;
&lt;li&gt;stops when the size object exists and range-start equals size.&lt;/li&gt;
&lt;li&gt;returns the exit object.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
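&lt;p&gt;the follow loop above can be sketched as below. this is a minimal illustration, not the aws-exec implementation: the &lt;code&gt;store&lt;/code&gt; map is a hypothetical in-memory stand-in for the three s3 objects, &lt;code&gt;follow&lt;/code&gt; and &lt;code&gt;tick&lt;/code&gt; are made-up names, and a real caller would use ranged s3 gets and sleep between polls.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
)

// store is a hypothetical in-memory stand-in for the log, exit, and size objects.
type store map[string]string

// follow reads the log from an increasing offset and stops once the size
// object exists and the offset has reached it, then returns the exit code.
// tick runs between polls so a demo can simulate the writer making progress.
func follow(s store, tick func()) (string, int, error) {
	offset := 0
	logs := ""
	for {
		current := s["log"]
		if len(current) > offset {
			logs += current[offset:] // stands in for a ranged get starting at offset
			offset = len(current)
		}
		if size, ok := s["size"]; ok {
			n, err := strconv.Atoi(size)
			if err != nil {
				return logs, 0, err
			}
			if offset == n {
				code, err := strconv.Atoi(s["exit"])
				return logs, code, err
			}
		}
		tick()
	}
}

func main() {
	s := store{}
	// the simulated writer: log grows, exit is written once, size is written last.
	steps := []func(){
		func() { s["log"] = "hello\n" },
		func() { s["log"] = "hello\nworld\n" },
		func() { s["exit"] = "0" },
		func() { s["size"] = "12" },
	}
	i := 0
	tick := func() {
		if len(steps) > i {
			steps[i]()
			i++
		}
	}
	logs, code, err := follow(s, tick)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q exit=%d\n", logs, code)
}
```

&lt;p&gt;because size is written last, a reader that sees it can trust that the log is final and that exit already exists.&lt;/p&gt;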
&lt;p&gt;there are three ways to invoke an asynchronous api:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="#api-demo"&gt;api&lt;/a&gt; invoke via &lt;a href="https://github.com/nathants/aws-exec/tree/master/cmd/rpc/rpc.go"&gt;rpc&lt;/a&gt;, this is faster.&lt;/li&gt;
&lt;li&gt;
&lt;a href="#cli-demo"&gt;cli&lt;/a&gt; invoke via &lt;a href="https://github.com/nathants/aws-exec/tree/master/cmd/exec/exec.go"&gt;subprocess&lt;/a&gt;, this is slower.&lt;/li&gt;
&lt;li&gt;
&lt;a href="#web-demo"&gt;web&lt;/a&gt; invoke via &lt;a href="https://github.com/nathants/aws-exec/tree/master/cmd/exec/exec.go"&gt;subprocess&lt;/a&gt;, this is slower.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="add-a-new-synchronous-functionality"&gt;&lt;a class="heading-link" href="#add-a-new-synchronous-functionality"&gt;add a new synchronous functionality&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;add to &lt;a href="https://github.com/nathants/aws-exec/tree/master/backend/backend.go#L353"&gt;api/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;duplicate the &lt;a href="https://github.com/nathants/aws-exec/tree/master/backend/backend.go#L140"&gt;httpExecGet&lt;/a&gt; or &lt;a href="https://github.com/nathants/aws-exec/tree/master/backend/backend.go#L224"&gt;httpExecPost&lt;/a&gt; handler and modify it to introduce new functionality.&lt;/p&gt;
&lt;h2 id="add-a-new-asynchronous-functionality"&gt;&lt;a class="heading-link" href="#add-a-new-asynchronous-functionality"&gt;add a new asynchronous functionality&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;add to &lt;a href="https://github.com/nathants/aws-exec/tree/master/cmd"&gt;cmd/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;duplicate the &lt;a href="https://github.com/nathants/aws-exec/tree/master/cmd/listdir/listdir.go"&gt;listdir&lt;/a&gt; command and modify it to introduce new functionality.&lt;/p&gt;
&lt;h2 id="web-demo"&gt;&lt;a class="heading-link" href="#web-demo"&gt;web demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/aws-exec/raw/master/gif/web.gif"&gt;&lt;img src="https://github.com/nathants/aws-exec/raw/master/gif/web.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="cli-demo"&gt;&lt;a class="heading-link" href="#cli-demo"&gt;cli demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/aws-exec/raw/master/gif/cli.gif"&gt;&lt;img src="https://github.com/nathants/aws-exec/raw/master/gif/cli.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="api-demo"&gt;&lt;a class="heading-link" href="#api-demo"&gt;api demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/aws-exec/raw/master/gif/api.gif"&gt;&lt;img src="https://github.com/nathants/aws-exec/raw/master/gif/api.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="mobile-demo"&gt;&lt;a class="heading-link" href="#mobile-demo"&gt;mobile demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/aws-exec/raw/master/gif/mobile.gif"&gt;&lt;img src="https://github.com/nathants/aws-exec/raw/master/gif/mobile.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="dependencies"&gt;&lt;a class="heading-link" href="#dependencies"&gt;dependencies&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;use the included &lt;a href="./Dockerfile"&gt;Dockerfile&lt;/a&gt; or install the following dependencies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;npm&lt;/li&gt;
&lt;li&gt;jdk&lt;/li&gt;
&lt;li&gt;go&lt;/li&gt;
&lt;li&gt;bash&lt;/li&gt;
&lt;li&gt;&lt;a href="https://formulae.brew.sh/formula/entr" rel="nofollow"&gt;entr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nathants/libaws"&gt;libaws&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="aws-prerequisites"&gt;&lt;a class="heading-link" href="#aws-prerequisites"&gt;aws prerequisites&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;aws &lt;a href="https://console.aws.amazon.com/route53/v2/hostedzones" rel="nofollow"&gt;route53&lt;/a&gt; has the domain or its parent from env.sh&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;aws &lt;a href="https://us-west-2.console.aws.amazon.com/acm/home" rel="nofollow"&gt;acm&lt;/a&gt; has a wildcard cert for the domain or its parent from env.sh&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;go install github.com/nathants/libaws@latest
&lt;span class="pl-k"&gt;export&lt;/span&gt; PATH=&lt;span class="pl-smi"&gt;$PATH&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin

cp env.sh.template env.sh &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; update values&lt;/span&gt;
bash bin/check.sh         &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; lint&lt;/span&gt;
bash bin/preview.sh       &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; preview changes to aws infra&lt;/span&gt;
bash bin/ensure.sh        &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; ensure aws infra&lt;/span&gt;
bash bin/dev.sh           &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; iterate on backend and frontend&lt;/span&gt;
bash bin/logs.sh          &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; tail the logs&lt;/span&gt;
bash bin/delete.sh        &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; delete aws infra&lt;/span&gt;
bash bin/cli.sh -h        &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; interact with the service via the cli&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage-with-bad-upload-bandwidth"&gt;&lt;a class="heading-link" href="#usage-with-bad-upload-bandwidth"&gt;usage with bad upload bandwidth:&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; bash bin/dev.sh         # this needs upload bandwidth&lt;/span&gt;
bash bin/dev_frontend.sh  &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; iterate on localhost frontend&lt;/span&gt;
bash bin/relay.sh         &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; iterate on backend via ec2 relay&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage-with-docker"&gt;&lt;a class="heading-link" href="#usage-with-docker"&gt;usage with docker&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;cp env.sh.template env.sh &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; update values&lt;/span&gt;
docker build -t aws-exec:latest &lt;span class="pl-c1"&gt;.&lt;/span&gt;
docker run -it --rm \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;pwd&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;:/code \
    -e AWS_DEFAULT_REGION \
    -e AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY \
    aws-exec:latest \
    bash -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;        cd /code&lt;/span&gt;
&lt;span class="pl-s"&gt;        bash bin/ensure.sh&lt;/span&gt;
&lt;span class="pl-s"&gt;    &lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="create-auth"&gt;&lt;a class="heading-link" href="#create-auth"&gt;create auth&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;bash bin/cli.sh auth-new test-user&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="install-and-use-cli"&gt;&lt;a class="heading-link" href="#install-and-use-cli"&gt;install and use cli&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;go install github.com/nathants/aws-exec@latest
&lt;span class="pl-k"&gt;export&lt;/span&gt; PATH=&lt;span class="pl-smi"&gt;$PATH&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin

&lt;span class="pl-k"&gt;export&lt;/span&gt; AUTH=&lt;span class="pl-smi"&gt;$AUTH&lt;/span&gt;
&lt;span class="pl-k"&gt;export&lt;/span&gt; PROJECT_DOMAIN=&lt;span class="pl-smi"&gt;$DOMAIN&lt;/span&gt;
aws-exec &lt;span class="pl-c1"&gt;exec&lt;/span&gt; -- whoami&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="install-and-use-api"&gt;&lt;a class="heading-link" href="#install-and-use-api"&gt;install and use api&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;go get github.com/nathants/aws-exec@latest&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-go"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;package&lt;/span&gt; cmd

&lt;span class="pl-k"&gt;import&lt;/span&gt; (
	&lt;span class="pl-s"&gt;"context"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"encoding/json"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"fmt"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"os"&lt;/span&gt;

	awsexec &lt;span class="pl-s"&gt;"github.com/nathants/aws-exec/exec"&lt;/span&gt;
)

&lt;span class="pl-k"&gt;func&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;() {
	&lt;span class="pl-s1"&gt;val&lt;/span&gt;, &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s1"&gt;json&lt;/span&gt;.&lt;span class="pl-en"&gt;Marshal&lt;/span&gt;(&lt;span class="pl-k"&gt;map&lt;/span&gt;[&lt;span class="pl-smi"&gt;string&lt;/span&gt;]&lt;span class="pl-k"&gt;interface&lt;/span&gt;{}{
		&lt;span class="pl-s"&gt;"path"&lt;/span&gt;: &lt;span class="pl-s"&gt;"."&lt;/span&gt;,
	})
	&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;!=&lt;/span&gt; &lt;span class="pl-c1"&gt;nil&lt;/span&gt; {
	    &lt;span class="pl-en"&gt;panic&lt;/span&gt;(&lt;span class="pl-s1"&gt;err&lt;/span&gt;)
	}
	&lt;span class="pl-s1"&gt;exitCode&lt;/span&gt;, &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-s1"&gt;awsexec&lt;/span&gt;.&lt;span class="pl-en"&gt;Exec&lt;/span&gt;(&lt;span class="pl-s1"&gt;context&lt;/span&gt;.&lt;span class="pl-en"&gt;Background&lt;/span&gt;(), &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;awsexec.&lt;span class="pl-smi"&gt;Args&lt;/span&gt;{
		&lt;span class="pl-c1"&gt;Url&lt;/span&gt;:     &lt;span class="pl-s"&gt;"https://%s"&lt;/span&gt; &lt;span class="pl-c1"&gt;+&lt;/span&gt; &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;Getenv&lt;/span&gt;(&lt;span class="pl-s"&gt;"PROJECT_DOMAIN"&lt;/span&gt;),
		&lt;span class="pl-c1"&gt;Auth&lt;/span&gt;:    &lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;Getenv&lt;/span&gt;(&lt;span class="pl-s"&gt;"AUTH"&lt;/span&gt;),
		&lt;span class="pl-c1"&gt;RpcName&lt;/span&gt;: &lt;span class="pl-s"&gt;"listdir"&lt;/span&gt;,
		&lt;span class="pl-c1"&gt;RpcArgs&lt;/span&gt;: &lt;span class="pl-en"&gt;string&lt;/span&gt;(&lt;span class="pl-s1"&gt;val&lt;/span&gt;),
		&lt;span class="pl-c1"&gt;LogDataCallback&lt;/span&gt;: &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;logs&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;) {
			&lt;span class="pl-s1"&gt;fmt&lt;/span&gt;.&lt;span class="pl-en"&gt;Print&lt;/span&gt;(&lt;span class="pl-s1"&gt;logs&lt;/span&gt;)
		},
	})
	&lt;span class="pl-k"&gt;if&lt;/span&gt; &lt;span class="pl-s1"&gt;err&lt;/span&gt; &lt;span class="pl-c1"&gt;!=&lt;/span&gt; &lt;span class="pl-c1"&gt;nil&lt;/span&gt; {
		&lt;span class="pl-en"&gt;panic&lt;/span&gt;(&lt;span class="pl-s1"&gt;err&lt;/span&gt;)
	}
	&lt;span class="pl-s1"&gt;os&lt;/span&gt;.&lt;span class="pl-en"&gt;Exit&lt;/span&gt;(&lt;span class="pl-s1"&gt;exitCode&lt;/span&gt;)
}&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/aws-exec</guid>
    </item>
    <item>
      <title>aws-ensure-route53</title>
      <link>https://nathants.com/projects/aws-ensure-route53</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;managing your dns in route53 across multiple accounts should be easy.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;tooling to make managing your dns in route53 across multiple accounts simple and easy using &lt;a href="https://github.com/nathants/libaws"&gt;libaws&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;go install github.com/nathants/libaws@latest

export PATH=$PATH:$(go env GOPATH)/bin
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or use the &lt;a href="./Dockerfile"&gt;dockerfile&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="bootstrap"&gt;&lt;a class="heading-link" href="#bootstrap"&gt;bootstrap&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;clone this repo, and set up a new private remote. you will version your dns data here. you probably don't want this on public github.&lt;/p&gt;
&lt;p&gt;set up your credentials using: &lt;code&gt;libaws creds-add -h&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;you can now list all your credentials with: &lt;code&gt;libaws creds-ls&lt;/code&gt;&lt;/p&gt;
&lt;h3 id="initialize"&gt;&lt;a class="heading-link" href="#initialize"&gt;initialize&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;pull all your dns records across all accounts with: &lt;code&gt;bash bin/pull.sh&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;commit this initial data.&lt;/p&gt;
&lt;p&gt;your repo now looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; tree
├── accounts
│   ├── work-prod
│   │   └── dns.txt
│   ├── work-staging
│   │   └── dns.txt
│   ├── work-scratch
│   │   └── dns.txt
│   ├── personal-prod
│   │   └── dns.txt
│   └── personal-scratch
│       └── dns.txt
└── bin
    ├── ensure_all.sh
    ├── ensure.sh
    ├── preview_all.sh
    ├── preview.sh
    └── pull.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;the dns.txt files contain entries created by &lt;a href="https://github.com/nathants/libaws/blob/master/cmd/route53/ls.go"&gt;route53-ls&lt;/a&gt; that look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;example.com example.com       Type=A     TTL=60 Value=1.1.1.1 Value=2.2.2.2
example.com cname.example.com Type=CNAME TTL=60 Value=about.us-west-2.domain.example.com
example.com alias.example.com Type=Alias        Value=d-XXX.execute-api.us-west-2.amazonaws.com     HostedZoneId=XXX
&lt;/code&gt;&lt;/pre&gt;
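&lt;p&gt;to make the line format concrete, here is a minimal sketch of a parser for one dns.txt entry. it is an illustration only, not code from libaws: the &lt;code&gt;record&lt;/code&gt; type and &lt;code&gt;parseRecord&lt;/code&gt; function are hypothetical names, and the sketch assumes the layout shown above: zone, then name, then whitespace separated key=value pairs where Value may repeat.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// record is a hypothetical in-memory form of one dns.txt line.
type record struct {
	Zone   string
	Name   string
	Attrs  map[string]string // single valued keys such as Type, TTL, HostedZoneId
	Values []string          // Value= may appear more than once
}

// parseRecord splits a line on whitespace, so column alignment does not matter.
func parseRecord(line string) (record, error) {
	fields := strings.Fields(line)
	if !(len(fields) > 2) {
		return record{}, fmt.Errorf("expected zone, name, and attrs: %q", line)
	}
	r := record{Zone: fields[0], Name: fields[1], Attrs: map[string]string{}}
	for _, field := range fields[2:] {
		key, val, ok := strings.Cut(field, "=")
		if !ok {
			return record{}, fmt.Errorf("expected key=value: %q", field)
		}
		if key == "Value" {
			r.Values = append(r.Values, val)
		} else {
			r.Attrs[key] = val
		}
	}
	return r, nil
}

func main() {
	lines := []string{
		"example.com example.com       Type=A     TTL=60 Value=1.1.1.1 Value=2.2.2.2",
		"example.com cname.example.com Type=CNAME TTL=60 Value=about.us-west-2.domain.example.com",
	}
	for _, line := range lines {
		r, err := parseRecord(line)
		if err != nil {
			panic(err)
		}
		fmt.Println(r.Name, r.Attrs["Type"], r.Values)
	}
}
```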
&lt;h3 id="update"&gt;&lt;a class="heading-link" href="#update"&gt;update&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;you can now modify or add entries to these files, and deploy them.&lt;/p&gt;
&lt;p&gt;you could make a change like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; git diff
diff --git a/accounts/work-prod/dns.txt b/accounts/work-prod/dns.txt
index 4b959e4..67415b7 100644
--- a/accounts/work-prod/dns.txt
+++ b/accounts/work-prod/dns.txt
@@ -1,4 +1,4 @@
-example.com foo.example.com Type=CNAME TTL=300 Value=bar
+example.com foo.example.com Type=CNAME TTL=300 Value=barr
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="preview"&gt;&lt;a class="heading-link" href="#preview"&gt;preview&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;to preview those changes, use &lt;code&gt;bash bin/preview_all.sh&lt;/code&gt; or &lt;code&gt;bash bin/preview.sh work-prod&lt;/code&gt;, which looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; bash bin/preview_all.sh
preview dns: work-prod
lib/route53.go:258: preview: route53 update Values for foo.example.com: ["bar"] =&amp;gt; ["barr"]
preview dns: work-staging
preview dns: work-scratch
preview dns: personal-prod
preview dns: personal-scratch
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;an account with no output under its &lt;code&gt;preview dns:&lt;/code&gt; line has no changes.&lt;/p&gt;
&lt;h3 id="deploy"&gt;&lt;a class="heading-link" href="#deploy"&gt;deploy&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;to deploy those changes using &lt;a href="https://github.com/nathants/libaws/blob/master/cmd/route53/ensure_record.go"&gt;route53-ensure-record&lt;/a&gt;, use &lt;code&gt;bash bin/ensure_all.sh&lt;/code&gt; or &lt;code&gt;bash bin/ensure.sh work-prod&lt;/code&gt;, which looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; bash bin/ensure_all.sh
ensure dns: work-prod
lib/route53.go:258: route53 update Values for foo.example.com: ["bar"] =&amp;gt; ["barr"]
lib/route53.go:284: route53 updated record: foo.example.com
ensure dns: work-staging
ensure dns: work-scratch
ensure dns: personal-prod
ensure dns: personal-scratch
&lt;/code&gt;&lt;/pre&gt;
&lt;h3 id="delete"&gt;&lt;a class="heading-link" href="#delete"&gt;delete&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;like all the &lt;a href="https://github.com/nathants/libaws/search?q=ensure&amp;amp;type=code"&gt;ensure&lt;/a&gt; functions in &lt;a href="https://github.com/nathants/libaws"&gt;libaws&lt;/a&gt;, &lt;code&gt;ensure&lt;/code&gt; creates or updates infrastructure as needed, but does not remove it.&lt;/p&gt;
&lt;p&gt;to delete a record, remove it from its &lt;code&gt;dns.txt&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; git diff
diff --git a/accounts/work-prod/dns.txt b/accounts/work-prod/dns.txt
index 4b959e4..dd68522 100644
--- a/accounts/work-prod/dns.txt
+++ b/accounts/work-prod/dns.txt
@@ -1,4 +1,3 @@
-example.com foo.example.com Type=CNAME TTL=300 Value=barr
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;then preview the delete using &lt;a href="https://github.com/nathants/libaws/blob/master/cmd/route53/rm_record.go"&gt;route53-rm-record&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; libaws route53-rm-record --preview example.com foo.example.com Type=CNAME TTL=300 Value=barr
lib/route53.go:85: preview: route53 deleted record foo.example.com: TTL=300 Type=CNAME Value=barr
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;then perform the delete using &lt;a href="https://github.com/nathants/libaws/blob/master/cmd/route53/rm_record.go"&gt;route53-rm-record&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; libaws route53-rm-record example.com foo.example.com Type=CNAME TTL=300 Value=barr
lib/route53.go:85: route53 deleted record foo.example.com: TTL=300 Type=CNAME Value=barr
&lt;/code&gt;&lt;/pre&gt;
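&lt;p&gt;since ensure never removes records, it can help to list which entries disappeared between two versions of a dns.txt before running the delete by hand. the sketch below is hypothetical and not part of this repo: &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;removed&lt;/code&gt; are made-up helpers, and it assumes a record is identified by zone, name, and Type, so that a changed value counts as an update rather than a delete.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// key identifies a record by zone, name, and Type, ignoring TTL and values.
func key(line string) string {
	fields := strings.Fields(line)
	if !(len(fields) > 1) {
		return ""
	}
	k := fields[0] + " " + fields[1]
	for _, field := range fields[2:] {
		if strings.HasPrefix(field, "Type=") {
			k += " " + field
		}
	}
	return k
}

// removed returns the lines of before whose key no longer appears in after.
func removed(before, after string) []string {
	keep := map[string]bool{}
	for _, line := range strings.Split(after, "\n") {
		keep[key(line)] = true
	}
	var out []string
	for _, line := range strings.Split(before, "\n") {
		k := key(line)
		if k != "" {
			if !keep[k] {
				out = append(out, strings.Join(strings.Fields(line), " "))
			}
		}
	}
	return out
}

func main() {
	before := "example.com foo.example.com Type=CNAME TTL=300 Value=barr\n" +
		"example.com example.com Type=A TTL=60 Value=1.1.1.1"
	after := "example.com example.com Type=A TTL=60 Value=2.2.2.2"
	for _, line := range removed(before, after) {
		// print the preview command rather than running it.
		fmt.Println("libaws route53-rm-record --preview " + line)
	}
}
```

&lt;p&gt;the output is only a list of preview commands to review; nothing is deleted until you run route53-rm-record without the preview flag.&lt;/p&gt;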
&lt;h3 id="monitoring"&gt;&lt;a class="heading-link" href="#monitoring"&gt;monitoring&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;add &lt;code&gt;pull.sh&lt;/code&gt; to your crontab to keep track of changes to your dns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0 15 * * * bash -c 'cd ~/repos/aws-ensure-route53 &amp;amp;&amp;amp; bash bin/pull.sh'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;when you notice uncommitted changes in &lt;code&gt;git status&lt;/code&gt;, you can either commit them, or investigate them. foo likely should not be barr.&lt;/p&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/aws-ensure-route53</guid>
    </item>
    <item>
      <title>agr</title>
      <link>https://nathants.com/projects/agr</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;repo wide search and replace should be easier.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;like ack or ag, but for search and replace.&lt;/p&gt;
&lt;p&gt;can climb to the root of a git repo before running.&lt;/p&gt;
&lt;p&gt;shows a preview of the replacements to be made.&lt;/p&gt;
&lt;p&gt;prompts to continue globally or at each change site.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;python3 -m pip install git+https://github.com/nathants/agr&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;agr &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;(\w+)_factory&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;\1_factory_factory&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="dependencies"&gt;&lt;a class="heading-link" href="#dependencies"&gt;dependencies&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ggreer/the_silver_searcher"&gt;silver-search (ag)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.org" rel="nofollow"&gt;python3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="examples"&gt;&lt;a class="heading-link" href="#examples"&gt;examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;$ agr '(\w+)_factory' '\1_factory_factory'
&amp;gt; lib.py:14: def poodle_factory_factory(): =&amp;gt; def poodle_factory_factory_factory():
&amp;gt; main.py:3: def dog_factory_factory(): =&amp;gt; def dog_factory_factory_factory():
&amp;gt; proceed? y/n
&lt;/code&gt;&lt;/pre&gt;
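&lt;p&gt;the preview above follows from greedy matching: &lt;code&gt;\w+&lt;/code&gt; also matches underscores, so it takes as much of the identifier as it can while still leaving a final &lt;code&gt;_factory&lt;/code&gt; to match. the sketch below reproduces that substitution with go's regexp package rather than agr itself; &lt;code&gt;rewrite&lt;/code&gt; is a hypothetical helper, and go writes the backreference as &lt;code&gt;${1}&lt;/code&gt; where agr's replacement uses &lt;code&gt;\1&lt;/code&gt;.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern is the same regex passed to agr in the example above.
var pattern = regexp.MustCompile(`(\w+)_factory`)

// rewrite applies the substitution to one line. the braces in ${1} are
// required, otherwise go would look for a group named "1_factory_factory".
func rewrite(line string) string {
	return pattern.ReplaceAllString(line, "${1}_factory_factory")
}

func main() {
	fmt.Println(rewrite("def poodle_factory_factory():")) // def poodle_factory_factory_factory():
	fmt.Println(rewrite("def dog_factory():"))            // def dog_factory_factory():
	fmt.Println(rewrite("def cat():"))                    // def cat():
}
```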
&lt;pre&gt;&lt;code&gt;$ agr '(\w+)_factory' --delete
&amp;gt; lib.py:14: def poodle_factory_factory(): =&amp;gt; DELETED!
&amp;gt; main.py:3: def dog_factory_factory(): =&amp;gt; DELETED!
&amp;gt; proceed? y/n
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt; agr -h
usage: agr [-h] [-d] [-p] [-s] [-u] [-n] [-y] [-e] pattern [replacement]

positional arguments:
  pattern             regex to match
  replacement         replacement for matches (default: -)

optional arguments:
  -h, --help          show this help message and exit
  -d, --delete        rather than substitute replacement, delete the matched line (default: False)
  -p, --preview       show diffs and then exit without prompting for commit (default: False)
  -s, --short         show shorter diffs (default: False)
  -u, --unrestricted  process all files, not just code files (default: False)
  -n, --no-climb      no climbing upwards until a .git dir is found (default: False)
  -y, --yes           commit without prompting (default: False)
  -e, --each          prompt for y/n at each change site (default: False)
&lt;/code&gt;&lt;/pre&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/agr</guid>
    </item>
    <item>
      <title>go-libsodium</title>
      <link>https://nathants.com/projects/go-libsodium</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;libsodium should be easy.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a minimal cgo interface to the following libsodium constructs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://doc.libsodium.org/secret-key_cryptography/secretbox" rel="nofollow"&gt;crypt_box_easy&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://doc.libsodium.org/public-key_cryptography/sealed_boxes" rel="nofollow"&gt;crypto_box_seal&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://doc.libsodium.org/public-key_cryptography/public-key_signatures" rel="nofollow"&gt;crypto_sign&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://doc.libsodium.org/secret-key_cryptography/secretstream" rel="nofollow"&gt;crypto_stream&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre lang="go"&gt;&lt;code&gt;
func Init()

func StreamKeygen() (key []byte, err error)

func StreamEncrypt(key []byte, plainText io.Reader, cipherText io.Writer) error

func StreamDecrypt(key []byte, cipherText io.Reader, plainText io.Writer) error

func StreamEncryptRecipients(publicKeys [][]byte, plainText io.Reader, cipherText io.Writer) error

func StreamDecryptRecipients(secretKey []byte, cipherText io.Reader, plainText io.Writer) error

func BoxKeypair() (publicKey, secretKey []byte, err error)

func BoxSealedEncrypt(plainText, recipientPublicKey []byte) (cipherText []byte, err error)

func BoxSealedDecrypt(cipherText, recipientSecretKey []byte) (plainText []byte, err error)

func BoxEasyEncrypt(plainText, recipientPublicKey, senderSecretKey []byte) (cipherText []byte, err error)

func BoxEasyDecrypt(cipherText, senderPublicKey, recipientSecretKey []byte) (plainText []byte, err error)

func SignKeypair() (publicKey, secretKey []byte, err error)

func Sign(plainText, signerSecretKey []byte) (signedText []byte, err error)

func SignVerify(signedText, plainText, signerPublicKey []byte) error

&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre lang="bash"&gt;&lt;code&gt;brew install         go     libsodium     # homebrew
sudo pacman -S       go     libsodium     # arch
sudo apk add         go     libsodium-dev # alpine
sudo apt-get install golang libsodium-dev # ubuntu/debian
&lt;/code&gt;&lt;/pre&gt;
&lt;pre lang="bash"&gt;&lt;code&gt;go get github.com/nathants/go-libsodium
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre lang="go"&gt;&lt;code&gt;package main

import (
	"bytes"
	"fmt"

	"github.com/nathants/go-libsodium"
)

func Stream() {
	libsodium.Init()
	key, err := libsodium.StreamKeygen()
	if err != nil {
		panic(err)
	}
	value := []byte("hello world")
	var cipher bytes.Buffer
	err = libsodium.StreamEncrypt(key, bytes.NewReader(value), &amp;amp;cipher)
	if err != nil {
		panic(err)
	}
	var plain bytes.Buffer
	err = libsodium.StreamDecrypt(key, bytes.NewReader(cipher.Bytes()), &amp;amp;plain)
	if err != nil {
		panic(err)
	}
	fmt.Println("stream", bytes.Equal(value, plain.Bytes()))
}

func StreamRecipients() {
	libsodium.Init()
	pk1, sk1, err := libsodium.BoxKeypair()
	if err != nil {
		panic(err)
	}
	pk2, sk2, err := libsodium.BoxKeypair()
	if err != nil {
		panic(err)
	}
	value := []byte("hello world")
	var cipher bytes.Buffer
	err = libsodium.StreamEncryptRecipients([][]byte{pk1, pk2}, bytes.NewReader(value), &amp;amp;cipher)
	if err != nil {
		panic(err)
	}
	var plain bytes.Buffer
	err = libsodium.StreamDecryptRecipients(sk1, bytes.NewReader(cipher.Bytes()), &amp;amp;plain)
	if err != nil {
		panic(err)
	}
	fmt.Println("recipient1", bytes.Equal(value, plain.Bytes()))
	plain.Reset()
	err = libsodium.StreamDecryptRecipients(sk2, bytes.NewReader(cipher.Bytes()), &amp;amp;plain)
	if err != nil {
		panic(err)
	}
	fmt.Println("recipient2", bytes.Equal(value, plain.Bytes()))
}

func BoxSeal() {
	libsodium.Init()
	value := []byte("hello world")
	pk, sk, err := libsodium.BoxKeypair()
	if err != nil {
		panic(err)
	}
	cipher, err := libsodium.BoxSealedEncrypt(value, pk)
	if err != nil {
		panic(err)
	}
	plain, err := libsodium.BoxSealedDecrypt(cipher, sk)
	if err != nil {
		panic(err)
	}
	fmt.Println("seal", bytes.Equal(value, plain))
}

func BoxEasy() {
	libsodium.Init()
	value := []byte("hello world")
	pk1, sk1, err := libsodium.BoxKeypair()
	if err != nil {
		panic(err)
	}
	pk2, sk2, err := libsodium.BoxKeypair()
	if err != nil {
		panic(err)
	}
	cipher, err := libsodium.BoxEasyEncrypt(value, pk2, sk1)
	if err != nil {
		panic(err)
	}
	plain, err := libsodium.BoxEasyDecrypt(cipher, pk1, sk2)
	if err != nil {
		panic(err)
	}
	fmt.Println("easy", bytes.Equal(value, plain))
}

func Sign() {
	libsodium.Init()
	value := []byte("hello world")
	pk, sk, err := libsodium.SignKeypair()
	if err != nil {
		panic(err)
	}
	signature, err := libsodium.Sign(value, sk)
	if err != nil {
		panic(err)
	}
	err = libsodium.SignVerify(signature, value, pk)
	if err != nil {
		panic(err)
	}
	fmt.Println("signature")
}

func main() {
	Stream()
	StreamRecipients()
	BoxSeal()
	BoxEasy()
	Sign()
}
&lt;/code&gt;&lt;/pre&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/go-libsodium</guid>
    </item>
    <item>
      <title>git-remote-aws</title>
      <link>https://nathants.com/projects/git-remote-aws</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;encrypted git hosting should be easy.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;encrypted git &lt;a href="https://git-scm.com/docs/git-bundle" rel="nofollow"&gt;bundles&lt;/a&gt; are stored in s3.&lt;/p&gt;
&lt;p&gt;compare and swap against dynamodb updates an ordered list of bundles. this enables multiple writers to safely collaborate on a single remote.&lt;/p&gt;
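&lt;p&gt;a minimal sketch of the compare and swap idea, using an in-memory map in place of dynamodb. the &lt;code&gt;store&lt;/code&gt; type, &lt;code&gt;compareAndSwap&lt;/code&gt; method, and key names are hypothetical and are not taken from the git-remote-aws source. a real implementation would presumably rely on a dynamodb conditional write for atomicity, which this sketch does not model.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for the dynamodb table.
// it maps a remote id to the name of its current bundle list.
type store map[string]string

var errConflict = errors.New("conflict: remote was updated by another writer")

// compareAndSwap sets id to next only if the current value equals expected.
// this sketch is not safe for concurrent use, it only shows the logic.
func (s store) compareAndSwap(id, expected, next string) error {
	if s[id] != expected {
		return errConflict
	}
	s[id] = next
	return nil
}

func main() {
	s := store{}
	// writer a and writer b both read the same starting state.
	seenByA := s["bucket/myrepo"]
	seenByB := s["bucket/myrepo"]
	// writer a pushes first and succeeds.
	errA := s.compareAndSwap("bucket/myrepo", seenByA, "bundles_aaa")
	// writer b pushes with a stale view and is rejected.
	errB := s.compareAndSwap("bucket/myrepo", seenByB, "bundles_bbb")
	fmt.Println(errA == nil)         // prints true
	fmt.Println(errB == errConflict) // prints true
	fmt.Println(s["bucket/myrepo"])  // prints bundles_aaa
}
```

&lt;p&gt;the losing writer has to fetch the new state and retry, which is how an ordered list stays consistent with multiple writers.&lt;/p&gt;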
&lt;p&gt;each remote can hold one and only one branch.&lt;/p&gt;
&lt;p&gt;bundles in s3 are immutable, and force push is not allowed.&lt;/p&gt;
&lt;p&gt;bundles are encrypted with libsodium &lt;a href="https://doc.libsodium.org/secret-key_cryptography/secretstream" rel="nofollow"&gt;secretstream&lt;/a&gt;. user keys are libsodium box &lt;a href="https://doc.libsodium.org/public-key_cryptography/authenticated_encryption#key-pair-generation" rel="nofollow"&gt;keypairs&lt;/a&gt;. authorized user public keys are added to a &lt;code&gt;.publickeys&lt;/code&gt; file in the git repository. to add or remove authorized users, update the &lt;code&gt;.publickeys&lt;/code&gt; file, then either create and push to a new remote, or delete the s3 data and recreate the existing remote.&lt;/p&gt;
&lt;p&gt;metadata is stored unencrypted:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;branch name&lt;/li&gt;
&lt;li&gt;remote name&lt;/li&gt;
&lt;li&gt;git hash for the start and end of each bundle&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;data is stored encrypted:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;git bundles&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;both git sha1 and sha256 hashing algorithms are supported.&lt;/p&gt;
&lt;p&gt;private s3 buckets and dynamodb tables are created on demand if they do not already exist.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a custom git remote adding support for remotes like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git remote add origin aws://${s3_bucket}+${dynamo_table}/${remote_name}&lt;/code&gt;&lt;/p&gt;
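&lt;p&gt;a minimal sketch of how a remote url of this shape could be split into its three parts. &lt;code&gt;parseRemote&lt;/code&gt; is a hypothetical helper for illustration, not the parser used by git-remote-aws. it assumes the bucket name never contains a &lt;code&gt;+&lt;/code&gt;.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// parseRemote splits aws://BUCKET+TABLE/NAME into bucket, table, and name.
// hypothetical helper, assumes the bucket name contains no "+".
func parseRemote(url string) (string, string, string, error) {
	if !strings.HasPrefix(url, "aws://") {
		return "", "", "", fmt.Errorf("not an aws remote: %s", url)
	}
	rest := strings.TrimPrefix(url, "aws://")
	// everything before the first "/" is bucket+table, the rest is the name.
	location, name, ok := strings.Cut(rest, "/")
	if !ok || name == "" {
		return "", "", "", fmt.Errorf("missing remote name: %s", url)
	}
	bucket, table, ok := strings.Cut(location, "+")
	if !ok || bucket == "" || table == "" {
		return "", "", "", fmt.Errorf("expected bucket+table: %s", url)
	}
	return bucket, table, name, nil
}

func main() {
	bucket, table, name, err := parseRemote("aws://mybucket+mytable/myrepo")
	if err != nil {
		panic(err)
	}
	fmt.Println(bucket, table, name) // prints mybucket mytable myrepo
}
```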
&lt;p&gt;the git remote binary provides a keygen for libsodium box &lt;a href="https://doc.libsodium.org/public-key_cryptography/authenticated_encryption#key-pair-generation" rel="nofollow"&gt;keypairs&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;git-remote-aws --keygen ~/.git-remote-aws/publickey ~/.git-remote-aws/secretkey&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;the default path for your secret key is &lt;code&gt;~/.git-remote-aws/secretkey&lt;/code&gt;. this can be changed via the environment variable &lt;code&gt;GIT_REMOTE_AWS_SECRETKEY&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;install go and libsodium from your package manager:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;brew install         go     libsodium     &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; homebrew&lt;/span&gt;
sudo pacman -S       go     libsodium     &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; arch&lt;/span&gt;
sudo apk add         go     libsodium-dev &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; alpine&lt;/span&gt;
sudo apt-get install golang libsodium-dev &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; ubuntu/debian&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;install the binary and update PATH:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;go install github.com/nathants/git-remote-aws@latest

&lt;span class="pl-k"&gt;export&lt;/span&gt; PATH=&lt;span class="pl-smi"&gt;$PATH&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git init

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git remote add origin aws://&lt;span class="pl-smi"&gt;${bucket}&lt;/span&gt;+&lt;span class="pl-smi"&gt;${table}&lt;/span&gt;/myrepo

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; mkdir -p &lt;span class="pl-k"&gt;~&lt;/span&gt;/.git-remote-aws

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git-remote-aws --keygen &lt;span class="pl-k"&gt;~&lt;/span&gt;/.git-remote-aws/publickey &lt;span class="pl-k"&gt;~&lt;/span&gt;/.git-remote-aws/secretkey

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; cat &lt;span class="pl-k"&gt;~&lt;/span&gt;/.git-remote-aws/publickey &lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .publickeys

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git add &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git commit -m init

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git push -u origin master

creating private s3 bucket: &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;
lib/s3.go:329: created bucket: &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;
lib/s3.go:367: created bucket tags for: &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;
lib/s3.go:415: created public access block &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;: private
lib/s3.go:657: created encryption &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;: &lt;span class="pl-c1"&gt;true&lt;/span&gt;
lib/s3.go:688: put bucket metrics for: &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;
created private s3 bucket: &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;
creating private dynamodb table: &lt;span class="pl-smi"&gt;$table&lt;/span&gt;
lib/dynamodb.go:481: created table: &lt;span class="pl-smi"&gt;$table&lt;/span&gt;
lib/dynamodb.go:974: waiting &lt;span class="pl-k"&gt;for&lt;/span&gt; table active: &lt;span class="pl-smi"&gt;$table&lt;/span&gt;
lib/dynamodb.go:974: waiting &lt;span class="pl-k"&gt;for&lt;/span&gt; table active: &lt;span class="pl-smi"&gt;$table&lt;/span&gt;
created private dynamodb table: &lt;span class="pl-smi"&gt;$table&lt;/span&gt;
get dynamodb://&lt;span class="pl-smi"&gt;$table&lt;/span&gt;/&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo
get dynamodb://&lt;span class="pl-smi"&gt;$table&lt;/span&gt;/&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo
get s3://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/
git bundle: 0000000000000000000000000000000000000000..daf8ea23a2aa082a3eeffacbdda04917d14916cc
put s3://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo/0000000000000000000000000000000000000000..daf8ea23a2aa082a3eeffacbdda04917d14916cc
put s3://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo/bundles_daf8ea23a2aa082a3eeffacbdda04917d14916cc
put dynamodb://&lt;span class="pl-smi"&gt;$table&lt;/span&gt;/&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo
To aws://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;+&lt;span class="pl-smi"&gt;$table&lt;/span&gt;/myrepo
 &lt;span class="pl-k"&gt;*&lt;/span&gt; [new branch]      master -&lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; master

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws s3-ls &lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/ -r

770 myrepo/0000000000000000000000000000000000000000..daf8ea23a2aa082a3eeffacbdda04917d14916cc
 82 myrepo/bundles_daf8ea23a2aa082a3eeffacbdda04917d14916cc

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws dynamodb-item-scan &lt;span class="pl-smi"&gt;$table&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; jq &lt;span class="pl-c1"&gt;.&lt;/span&gt;

{
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;branch&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;master&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;bundles&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;myrepo/bundles_daf8ea23a2aa082a3eeffacbdda04917d14916cc&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;id&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;uid&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: null,
  &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;unix&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt;: 0
}

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;mktemp -d&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git clone aws://&lt;span class="pl-smi"&gt;${bucket}&lt;/span&gt;+&lt;span class="pl-smi"&gt;${table}&lt;/span&gt;/myrepo

Cloning into &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;myrepo&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;...
get dynamodb://&lt;span class="pl-smi"&gt;$table&lt;/span&gt;/&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo
get s3://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo/bundles_daf8ea23a2aa082a3eeffacbdda04917d14916cc
get dynamodb://&lt;span class="pl-smi"&gt;$table&lt;/span&gt;/&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo
get s3://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo/bundles_daf8ea23a2aa082a3eeffacbdda04917d14916cc
get s3://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo/0000000000000000000000000000000000000000..daf8ea23a2aa082a3eeffacbdda04917d14916cc
git unbundle: 0000000000000000000000000000000000000000..daf8ea23a2aa082a3eeffacbdda04917d14916cc
get dynamodb://&lt;span class="pl-smi"&gt;$table&lt;/span&gt;/&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo
get s3://&lt;span class="pl-smi"&gt;$bucket&lt;/span&gt;/myrepo/bundles_daf8ea23a2aa082a3eeffacbdda04917d14916cc
&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/git-remote-aws</guid>
    </item>
    <item>
      <title>libaws</title>
      <link>https://nathants.com/projects/libaws</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;aws is amazing, but it's hard to see the forest for the trees.&lt;/p&gt;
&lt;p&gt;aws should:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;have fewer knobs&lt;/li&gt;
&lt;li&gt;have sane defaults&lt;/li&gt;
&lt;li&gt;be easy to use&lt;/li&gt;
&lt;li&gt;be hard to screw up&lt;/li&gt;
&lt;li&gt;be fast&lt;/li&gt;
&lt;li&gt;be fun&lt;/li&gt;
&lt;li&gt;have a &lt;a href="#tldr"&gt;tldr&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;it should be easy for a &lt;a href="#lambda"&gt;lambda&lt;/a&gt; to react to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;docker push to &lt;a href="#ecr"&gt;ecr&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="#s3-1"&gt;s3&lt;/a&gt; put object&lt;/li&gt;
&lt;li&gt;
&lt;a href="#dynamodb-1"&gt;dynamodb&lt;/a&gt; put item&lt;/li&gt;
&lt;li&gt;
&lt;a href="#sqs-1"&gt;sqs&lt;/a&gt; send message&lt;/li&gt;
&lt;li&gt;
&lt;a href="#schedule"&gt;time&lt;/a&gt; passing&lt;/li&gt;
&lt;li&gt;
&lt;a href="#api"&gt;http&lt;/a&gt; requests&lt;/li&gt;
&lt;li&gt;
&lt;a href="#websocket"&gt;websocket&lt;/a&gt; messages&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;it should be easy to create:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#vpc"&gt;vpcs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#security-group"&gt;security groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#instance-profile"&gt;instance profiles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#keypair"&gt;keypairs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="#define-an-infrastructure-set"&gt;declare&lt;/a&gt; and &lt;a href="#ensure-the-infrastructure-set"&gt;deploy&lt;/a&gt; groups of related aws infrastructure as &lt;a href="#infrastructure-set"&gt;infrastructure sets&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;that contain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#lambda"&gt;lambdas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="#s3"&gt;s3&lt;/a&gt; buckets&lt;/li&gt;
&lt;li&gt;
&lt;a href="#dynamodb"&gt;dynamodb&lt;/a&gt; tables&lt;/li&gt;
&lt;li&gt;
&lt;a href="#sqs"&gt;sqs&lt;/a&gt; queues&lt;/li&gt;
&lt;li&gt;&lt;a href="#vpc"&gt;vpcs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#security-group"&gt;security groups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#instance-profile"&gt;instance profiles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#keypair"&gt;keypairs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;that react to lambda &lt;a href="#trigger"&gt;triggers&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;http &lt;a href="#api"&gt;apis&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="#websocket"&gt;websocket&lt;/a&gt; messages&lt;/li&gt;
&lt;li&gt;
&lt;a href="#s3-1"&gt;s3&lt;/a&gt; bucket writes&lt;/li&gt;
&lt;li&gt;
&lt;a href="#dynamodb-1"&gt;dynamodb&lt;/a&gt; table writes&lt;/li&gt;
&lt;li&gt;
&lt;a href="#sqs-1"&gt;sqs&lt;/a&gt; queue puts&lt;/li&gt;
&lt;li&gt;cron &lt;a href="#schedule"&gt;schedules&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="#ecr"&gt;ecr&lt;/a&gt; docker pushes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a simpler way to &lt;a href="#infrayaml"&gt;declare&lt;/a&gt; aws infrastructure that is easy to &lt;a href="#typical-usage"&gt;use&lt;/a&gt; and &lt;a href="#extending"&gt;extend&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;there are two ways to use it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="#infrayaml"&gt;yaml&lt;/a&gt; and the &lt;a href="#explore-the-cli"&gt;cli&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/libaws/blob/master/lib/infra.go#L52"&gt;go structs&lt;/a&gt; and the &lt;a href="#explore-the-go-api"&gt;go api&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;the primary entrypoints are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="#ensure-the-infrastructure-set"&gt;infra-ensure&lt;/a&gt;: deploy an infrastructure set.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-ensure ./infra.yaml --preview
libaws infra-ensure ./infra.yaml&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="#view-the-infrastructure-set"&gt;infra-ls&lt;/a&gt;: view infrastructure sets.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-ls&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="#quickly-update-lambda-code"&gt;infra-ensure --quick&lt;/a&gt;: quickly update lambda code.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-ensure ./infra.yaml --quick LAMBDA_NAME&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="#delete-the-infrastructure-set"&gt;infra-rm&lt;/a&gt;: remove an infrastructure set.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-rm ./infra.yaml --preview
libaws infra-rm ./infra.yaml&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;infra-ensure&lt;/code&gt; is a &lt;a href="#tradeoffs"&gt;positive assertion&lt;/a&gt;. it asserts that some named infrastructure exists, and is configured correctly, creating or updating it if needed.&lt;/p&gt;
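&lt;p&gt;a minimal sketch of what a positive assertion means in practice, using an in-memory map in place of aws. the &lt;code&gt;bucket&lt;/code&gt; type and &lt;code&gt;ensureBucket&lt;/code&gt; function are hypothetical and are not the libaws api. the point is only that running ensure twice with the same input changes nothing the second time.&lt;/p&gt;

```go
package main

import "fmt"

// bucket is a hypothetical, simplified piece of infrastructure.
type bucket struct {
	acl string
}

// ensureBucket asserts that a bucket called name exists with the wanted
// config. it creates or updates only when needed and reports what it did.
func ensureBucket(state map[string]bucket, name string, want bucket) string {
	have, exists := state[name]
	switch {
	case !exists:
		state[name] = want
		return "created"
	case have != want:
		state[name] = want
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	state := map[string]bucket{}
	fmt.Println(ensureBucket(state, "test-bucket", bucket{acl: "private"})) // prints created
	fmt.Println(ensureBucket(state, "test-bucket", bucket{acl: "private"})) // prints unchanged
}
```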
&lt;p&gt;many other entrypoints exist, and can be explored by type. they fall into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;mutate aws state:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep ensure &lt;span class="pl-k"&gt;|&lt;/span&gt; wc -l
19

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep new &lt;span class="pl-k"&gt;|&lt;/span&gt; wc -l
1

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep rm &lt;span class="pl-k"&gt;|&lt;/span&gt; wc -l
26
&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;view aws state:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep ls &lt;span class="pl-k"&gt;|&lt;/span&gt; wc -l
33

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep describe &lt;span class="pl-k"&gt;|&lt;/span&gt; wc -l
6

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep get &lt;span class="pl-k"&gt;|&lt;/span&gt; wc -l
16

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep scan &lt;span class="pl-k"&gt;|&lt;/span&gt; wc -l
1&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="aws-sdk-pulumi-terraform-cloudformation-and-serverless"&gt;&lt;a class="heading-link" href="#aws-sdk-pulumi-terraform-cloudformation-and-serverless"&gt;aws sdk, pulumi, terraform, cloudformation, and serverless&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;compared to the full aws api, systems declared as &lt;a href="#infrastructure-set"&gt;infrastructure sets&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python"&gt;have&lt;/a&gt; &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go"&gt;simpler&lt;/a&gt; &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker"&gt;examples&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;have &lt;a href="#typical-usage"&gt;fewer&lt;/a&gt; &lt;a href="#infrayaml"&gt;knobs&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;are easier to use.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;are harder to screw up.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;are almost always enough, and easy to &lt;a href="#extending"&gt;extend&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;are more fun.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;if you want to use the full aws api, there are many great tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/sdk-for-go/" rel="nofollow"&gt;aws sdk for go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pulumi.com/" rel="nofollow"&gt;pulumi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.terraform.io/" rel="nofollow"&gt;terraform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/cloudformation/" rel="nofollow"&gt;cloudformation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.serverless.com/" rel="nofollow"&gt;serverless&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="readme-index"&gt;&lt;a class="heading-link" href="#readme-index"&gt;readme index&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="#install"&gt;install&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#cli"&gt;cli&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#go-api"&gt;go api&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="#tldr"&gt;tldr&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#define-an-infrastructure-set"&gt;define an infrastructure set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ensure-the-infrastructure-set"&gt;ensure the infrastructure set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#view-the-infrastructure-set"&gt;view the infrastructure set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#trigger-the-infrastructure-set"&gt;trigger the infrastructure set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#quickly-update-lambda-code"&gt;quickly update lambda code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#delete-the-infrastructure-set"&gt;delete the infrastructure set&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="#usage"&gt;usage&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#explore-the-cli"&gt;explore the cli&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#explore-a-cli-entrypoint"&gt;explore a cli entrypoint&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#explore-the-go-api"&gt;explore the go api&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#explore-simple-examples"&gt;explore simple examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#explore-complex-examples"&gt;explore complex examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#explore-external-examples"&gt;explore external examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#infrastructure-set"&gt;infrastructure set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#typical-usage"&gt;typical usage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#design"&gt;design&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#tradeoffs"&gt;tradeoffs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="#infrayaml"&gt;infra.yaml&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#environment-variable-substitution"&gt;environment variable substitution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#name"&gt;name&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#s3"&gt;s3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#dynamodb"&gt;dynamodb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sqs"&gt;sqs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#keypair"&gt;keypair&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="#vpc"&gt;vpc&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#security-group"&gt;security group&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#instance-profile"&gt;instance profile&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="#lambda"&gt;lambda&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#entrypoint"&gt;entrypoint&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#attr"&gt;attr&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#policy"&gt;policy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#allow"&gt;allow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#env"&gt;env&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#include"&gt;include&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#require"&gt;require&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="#trigger"&gt;trigger&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#api"&gt;api&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#websocket"&gt;websocket&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#s3-1"&gt;s3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#dynamodb-1"&gt;dynamodb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sqs-1"&gt;sqs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#schedule"&gt;schedule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ecr"&gt;ecr&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#bash-completion"&gt;bash completion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#extending"&gt;extending&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#testing"&gt;testing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="cli"&gt;&lt;a class="heading-link" href="#cli"&gt;cli&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;go install github.com/nathants/libaws@latest

&lt;span class="pl-k"&gt;export&lt;/span&gt; PATH=&lt;span class="pl-smi"&gt;$PATH&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="go-api"&gt;&lt;a class="heading-link" href="#go-api"&gt;go api&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;go get github.com/nathants/libaws@latest&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="tldr"&gt;&lt;a class="heading-link" href="#tldr"&gt;tldr&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="define-an-infrastructure-set"&gt;&lt;a class="heading-link" href="#define-an-infrastructure-set"&gt;define an infrastructure set&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;cd&lt;/span&gt; examples/simple/go/s3 &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; tree
&lt;span class="pl-c1"&gt;.&lt;/span&gt;
├── infra.yaml
└── main.go&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;test-infraset-${uid}&lt;/span&gt;

&lt;span class="pl-ent"&gt;s3&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-bucket-${uid}&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;acl=private&lt;/span&gt;

&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda-${uid}&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;entrypoint&lt;/span&gt;: &lt;span class="pl-s"&gt;main.go&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;concurrency=0&lt;/span&gt;
      - &lt;span class="pl-s"&gt;memory=128&lt;/span&gt;
      - &lt;span class="pl-s"&gt;timeout=60&lt;/span&gt;
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;AWSLambdaBasicExecutionRole&lt;/span&gt;
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-c1"&gt;s3&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;test-bucket-${uid}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-go"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;package&lt;/span&gt; main

&lt;span class="pl-k"&gt;import&lt;/span&gt; (
	&lt;span class="pl-s"&gt;"context"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"fmt"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"github.com/aws/aws-lambda-go/events"&lt;/span&gt;
	&lt;span class="pl-s"&gt;"github.com/aws/aws-lambda-go/lambda"&lt;/span&gt;
)

&lt;span class="pl-k"&gt;func&lt;/span&gt; &lt;span class="pl-en"&gt;handleRequest&lt;/span&gt;(&lt;span class="pl-s1"&gt;_&lt;/span&gt; context.&lt;span class="pl-smi"&gt;Context&lt;/span&gt;, &lt;span class="pl-s1"&gt;e&lt;/span&gt; events.&lt;span class="pl-smi"&gt;S3Event&lt;/span&gt;) (events.&lt;span class="pl-smi"&gt;APIGatewayProxyResponse&lt;/span&gt;, &lt;span class="pl-smi"&gt;error&lt;/span&gt;) {
	&lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-s1"&gt;_&lt;/span&gt;, &lt;span class="pl-s1"&gt;record&lt;/span&gt; &lt;span class="pl-c1"&gt;:=&lt;/span&gt; &lt;span class="pl-k"&gt;range&lt;/span&gt; &lt;span class="pl-s1"&gt;e&lt;/span&gt;.&lt;span class="pl-c1"&gt;Records&lt;/span&gt; {
		&lt;span class="pl-s1"&gt;fmt&lt;/span&gt;.&lt;span class="pl-en"&gt;Println&lt;/span&gt;(&lt;span class="pl-s1"&gt;record&lt;/span&gt;.&lt;span class="pl-c1"&gt;S3&lt;/span&gt;.&lt;span class="pl-c1"&gt;Object&lt;/span&gt;.&lt;span class="pl-c1"&gt;Key&lt;/span&gt;)
	}
	&lt;span class="pl-k"&gt;return&lt;/span&gt; events.&lt;span class="pl-smi"&gt;APIGatewayProxyResponse&lt;/span&gt;{&lt;span class="pl-c1"&gt;StatusCode&lt;/span&gt;: &lt;span class="pl-c1"&gt;200&lt;/span&gt;}, &lt;span class="pl-c1"&gt;nil&lt;/span&gt;
}

&lt;span class="pl-k"&gt;func&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;() {
	&lt;span class="pl-s1"&gt;lambda&lt;/span&gt;.&lt;span class="pl-en"&gt;Start&lt;/span&gt;(&lt;span class="pl-s1"&gt;handleRequest&lt;/span&gt;)
}&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="ensure-the-infrastructure-set"&gt;&lt;a class="heading-link" href="#ensure-the-infrastructure-set"&gt;ensure the infrastructure set&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/libaws/raw/master/gif/ensure.gif"&gt;&lt;img src="https://github.com/nathants/libaws/raw/master/gif/ensure.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="view-the-infrastructure-set"&gt;&lt;a class="heading-link" href="#view-the-infrastructure-set"&gt;view the infrastructure set&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;depth-based colors by &lt;a href="https://gist.github.com/nathants/1955b2c3130b7d1a00c8420ad6231639"&gt;yaml&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/libaws/raw/master/gif/ls.gif"&gt;&lt;img src="https://github.com/nathants/libaws/raw/master/gif/ls.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="trigger-the-infrastructure-set"&gt;&lt;a class="heading-link" href="#trigger-the-infrastructure-set"&gt;trigger the infrastructure set&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/libaws/raw/master/gif/trigger.gif"&gt;&lt;img src="https://github.com/nathants/libaws/raw/master/gif/trigger.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="quickly-update-lambda-code"&gt;&lt;a class="heading-link" href="#quickly-update-lambda-code"&gt;quickly update lambda code&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/libaws/raw/master/gif/update.gif"&gt;&lt;img src="https://github.com/nathants/libaws/raw/master/gif/update.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3 id="delete-the-infrastructure-set"&gt;&lt;a class="heading-link" href="#delete-the-infrastructure-set"&gt;delete the infrastructure set&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/libaws/raw/master/gif/rm.gif"&gt;&lt;img src="https://github.com/nathants/libaws/raw/master/gif/rm.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;h3 id="explore-the-cli"&gt;&lt;a class="heading-link" href="#explore-the-cli"&gt;explore the cli&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws -h &lt;span class="pl-k"&gt;|&lt;/span&gt; grep ensure &lt;span class="pl-k"&gt;|&lt;/span&gt; head

codecommit-ensure             - ensure a codecommit repository
dynamodb-ensure               - ensure a dynamodb table
ec2-ensure-keypair            - ensure a keypair
ec2-ensure-sg                 - ensure a sg
ecr-ensure                    - ensure ecr image
iam-ensure-ec2-spot-roles     - ensure iam ec2 spot roles that are needed to use ec2 spot
iam-ensure-instance-profile   - ensure an iam instance-profile
iam-ensure-role               - ensure an iam role
iam-ensure-user-api           - ensure an iam user with api key
iam-ensure-user-login         - ensure an iam user with login&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="explore-a-cli-entrypoint"&gt;&lt;a class="heading-link" href="#explore-a-cli-entrypoint"&gt;explore a cli entrypoint&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; libaws s3-ensure -h

ensure a s3 bucket

example:
 - libaws s3-ensure test-bucket acl=public versioning=true

optional attrs:
 - acl=VALUE        (values = public &lt;span class="pl-k"&gt;|&lt;/span&gt; private, default = private)
 - versioning=VALUE (values = &lt;span class="pl-c1"&gt;true&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; false,     default = false)
 - metrics=VALUE    (values = &lt;span class="pl-c1"&gt;true&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; false,     default = true)
 - cors=VALUE       (values = &lt;span class="pl-c1"&gt;true&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; false,     default = false)
 - ttldays=VALUE    (values = 0 &lt;span class="pl-k"&gt;|&lt;/span&gt; n,            default = 0)

setting &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cors=true&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; uses &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;*&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; allowed origins. to specify one or more explicit origins, &lt;span class="pl-k"&gt;do&lt;/span&gt; this instead:
 - corsorigin=http://localhost:8080
 - corsorigin=https://example.com

Usage: s3-ensure [--preview] NAME [ATTR [ATTR ...]]

Positional arguments:
  NAME
  ATTR

Options:
  --preview, &lt;span class="pl-k"&gt;-p&lt;/span&gt;
  --help, &lt;span class="pl-k"&gt;-h&lt;/span&gt;             display this &lt;span class="pl-c1"&gt;help&lt;/span&gt; and &lt;span class="pl-c1"&gt;exit&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="explore-the-go-api"&gt;&lt;a class="heading-link" href="#explore-the-go-api"&gt;explore the go api&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;div class="highlight highlight-source-go"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;package&lt;/span&gt; main

&lt;span class="pl-k"&gt;import&lt;/span&gt; (
	&lt;span class="pl-s"&gt;"github.com/nathants/libaws/lib"&lt;/span&gt;
)

&lt;span class="pl-k"&gt;func&lt;/span&gt; &lt;span class="pl-s1"&gt;main&lt;/span&gt;() {
    &lt;span class="pl-s1"&gt;lib&lt;/span&gt;. (&lt;span class="pl-smi"&gt;TAB&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt;&lt;span class="pl-c1"&gt;&amp;gt;&lt;/span&gt;)
&lt;span class="pl-s1"/&gt;      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span 
class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;AcmClient&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;() &lt;span class="pl-c1"&gt;*&lt;/span&gt;acm.&lt;span class="pl-smi"&gt;ACM&lt;/span&gt; (&lt;span class="pl-s1"&gt;Function&lt;/span&gt;)                                          &lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;AcmClientExplicit&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;accessKeyID&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;, &lt;span class="pl-s1"&gt;accessKeySecret&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;, &lt;span class="pl-s1"&gt;region&lt;/span&gt; &lt;span class="pl-s1"&gt;stri&lt;/span&gt;&lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;AcmListCertificates&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;ctx&lt;/span&gt; context.&lt;span class="pl-smi"&gt;Context&lt;/span&gt;) ([]&lt;span class="pl-c1"&gt;*&lt;/span&gt;acm.&lt;span class="pl-smi"&gt;CertificateSummary&lt;/span&gt;, &lt;span class="pl-s1"&gt;erro&lt;/span&gt;&lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;Api&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;ctx&lt;/span&gt; context.&lt;span class="pl-smi"&gt;Context&lt;/span&gt;, &lt;span class="pl-s1"&gt;name&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;) (&lt;span class="pl-c1"&gt;*&lt;/span&gt;apigatewayv2.&lt;span class="pl-smi"&gt;Api&lt;/span&gt;, &lt;span class="pl-smi"&gt;error&lt;/span&gt;) (&lt;span class="pl-s1"&gt;Functio&lt;/span&gt;&lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;ApiClient&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;() &lt;span class="pl-c1"&gt;*&lt;/span&gt;apigatewayv2.&lt;span class="pl-smi"&gt;ApiGatewayV2&lt;/span&gt; (&lt;span class="pl-s1"&gt;Function&lt;/span&gt;)                        &lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;ApiClientExplicit&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;accessKeyID&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;, &lt;span class="pl-s1"&gt;accessKeySecret&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;, &lt;span class="pl-s1"&gt;region&lt;/span&gt; &lt;span class="pl-s1"&gt;stri&lt;/span&gt;&lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;ApiList&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;ctx&lt;/span&gt; context.&lt;span class="pl-smi"&gt;Context&lt;/span&gt;) ([]&lt;span class="pl-c1"&gt;*&lt;/span&gt;apigatewayv2.&lt;span class="pl-smi"&gt;Api&lt;/span&gt;, &lt;span class="pl-smi"&gt;error&lt;/span&gt;) (&lt;span class="pl-s1"&gt;Function&lt;/span&gt;)     &lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;ApiListDomains&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;ctx&lt;/span&gt; context.&lt;span class="pl-smi"&gt;Context&lt;/span&gt;) ([]&lt;span class="pl-c1"&gt;*&lt;/span&gt;apigatewayv2.&lt;span class="pl-smi"&gt;DomainName&lt;/span&gt;, &lt;span class="pl-smi"&gt;error&lt;/span&gt;) (&lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;ApiUrl&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;ctx&lt;/span&gt; context.&lt;span class="pl-smi"&gt;Context&lt;/span&gt;, &lt;span class="pl-s1"&gt;name&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;) (&lt;span class="pl-smi"&gt;string&lt;/span&gt;, &lt;span class="pl-smi"&gt;error&lt;/span&gt;) (&lt;span class="pl-s1"&gt;Function&lt;/span&gt;)      &lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"&gt;f&lt;/span&gt; &lt;span class="pl-s1"&gt;ApiUrlDomain&lt;/span&gt; &lt;span class="pl-k"&gt;func&lt;/span&gt;(&lt;span class="pl-s1"&gt;ctx&lt;/span&gt; context.&lt;span class="pl-smi"&gt;Context&lt;/span&gt;, &lt;span class="pl-s1"&gt;name&lt;/span&gt; &lt;span class="pl-smi"&gt;string&lt;/span&gt;) (&lt;span class="pl-smi"&gt;string&lt;/span&gt;, &lt;span class="pl-smi"&gt;error&lt;/span&gt;) (&lt;span class="pl-s1"&gt;Function&lt;/span&gt;)&lt;span class="pl-c1"&gt;|&lt;/span&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-c1"&gt;...&lt;/span&gt;                                                                             &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-s1"/&gt;
      &lt;span class="pl-c1"&gt;|&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span 
class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;-&lt;/span&gt;&lt;span class="pl-c1"&gt;|&lt;/span&gt;
}&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="explore-simple-examples"&gt;&lt;a class="heading-link" href="#explore-simple-examples"&gt;explore simple examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;api: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/api"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/api"&gt;go&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker/api"&gt;docker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;dynamodb: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/dynamodb"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/dynamodb"&gt;go&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker/dynamodb"&gt;docker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ecr: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/ecr"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/ecr"&gt;go&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker/ecr"&gt;docker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;includes: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/includes"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/includes"&gt;go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;s3: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/s3"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/s3"&gt;go&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker/s3"&gt;docker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;schedule: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/schedule"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/schedule"&gt;go&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker/schedule"&gt;docker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;sqs: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/sqs"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/sqs"&gt;go&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker/sqs"&gt;docker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;websocket: &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/python/websocket"&gt;python&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/go/websocket"&gt;go&lt;/a&gt;, &lt;a href="https://github.com/nathants/libaws/tree/master/examples/simple/docker/websocket"&gt;docker&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="explore-complex-examples"&gt;&lt;a class="heading-link" href="#explore-complex-examples"&gt;explore complex examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/nathants/libaws/tree/master/examples/complex/s3-ec2"&gt;s3-ec2&lt;/a&gt;:
&lt;ul&gt;
&lt;li&gt;write to s3 in-bucket&lt;/li&gt;
&lt;li&gt;which triggers lambda&lt;/li&gt;
&lt;li&gt;which launches ec2 spot&lt;/li&gt;
&lt;li&gt;which reads from in-bucket, writes to out-bucket, and terminates&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="explore-external-examples"&gt;&lt;a class="heading-link" href="#explore-external-examples"&gt;explore external examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/aws-gocljs"&gt;aws-gocljs&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/aws-exec"&gt;aws-exec&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/aws-ensure-route53"&gt;aws-ensure-route53&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="infrastructure-set"&gt;&lt;a class="heading-link" href="#infrastructure-set"&gt;infrastructure set&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;an infrastructure set is defined by &lt;a href="#infrayaml"&gt;yaml&lt;/a&gt; or a &lt;a href="https://github.com/nathants/libaws/blob/master/lib/infra.go#L52"&gt;go struct&lt;/a&gt; and contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;stateful infrastructure:
&lt;ul&gt;
&lt;li&gt;&lt;a href="#s3"&gt;s3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#dynamodb"&gt;dynamodb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sqs"&gt;sqs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;ec2 infrastructure:
&lt;ul&gt;
&lt;li&gt;&lt;a href="#keypair"&gt;keypairs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#instance-profile"&gt;instance profiles&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="#vpc"&gt;vpcs&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#security-group"&gt;security groups&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="#lambda"&gt;lambdas&lt;/a&gt;:
&lt;ul&gt;
&lt;li&gt;
&lt;a href="#trigger"&gt;triggers&lt;/a&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;a href="#api"&gt;api&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#websocket"&gt;websocket&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#s3-1"&gt;s3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#dynamodb-1"&gt;dynamodb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#sqs-1"&gt;sqs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#schedule"&gt;schedule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#ecr"&gt;ecr&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="typical-usage"&gt;&lt;a class="heading-link" href="#typical-usage"&gt;typical usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;use &lt;a href="#ensure-the-infrastructure-set"&gt;infra-ensure&lt;/a&gt; to deploy an infrastructure set.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-ensure ./infra.yaml --preview
libaws infra-ensure ./infra.yaml&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;use &lt;a href="#view-the-infrastructure-set"&gt;infra-ls&lt;/a&gt; to view infrastructure sets.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-ls&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;use &lt;a href="#quickly-update-lambda-code"&gt;infra-ensure --quick LAMBDA_NAME&lt;/a&gt; to quickly update lambda code.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-ensure ./infra.yaml --quick LAMBDA_NAME&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;use &lt;a href="#delete-the-infrastructure-set"&gt;infra-rm&lt;/a&gt; to remove an infrastructure set.&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;libaws infra-rm ./infra.yaml --preview
libaws infra-rm ./infra.yaml&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="design"&gt;&lt;a class="heading-link" href="#design"&gt;design&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;there is no implicit coordination.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if you aren't already serializing your infrastructure mutations, lock around &lt;a href="https://github.com/nathants/go-dynamolock"&gt;dynamodb&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;there are only two state locations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;aws.&lt;/li&gt;
&lt;li&gt;your code.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;aws infrastructure is uniquely identified by name.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;all aws infrastructure shares a private namespace scoped to account/region. use good names.&lt;/li&gt;
&lt;li&gt;except s3, which shares a public namespace scoped to earth. use better names.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;mutative operations manipulate aws state.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mutative operations are idempotent. if they fail due to a transient error, run them again.&lt;/li&gt;
&lt;li&gt;mutative operations support &lt;code&gt;--preview&lt;/code&gt;. no output means no changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;ensure&lt;/code&gt; operations are mutative; they create or update infrastructure.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;rm&lt;/code&gt; operations are mutative; they delete infrastructure.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;get&lt;/code&gt;, &lt;code&gt;scan&lt;/code&gt;, and &lt;code&gt;describe&lt;/code&gt; operations are non-mutative.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;multiple infrastructure sets can be deployed into the same account/region.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
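&lt;p&gt;as a rough illustration of the idempotency and &lt;code&gt;--preview&lt;/code&gt; points above, here is a minimal go sketch. it is not libaws code: &lt;code&gt;State&lt;/code&gt; and &lt;code&gt;Ensure&lt;/code&gt; are hypothetical names, and an in-memory map stands in for aws.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical in-memory stand-in for aws: resource name to attributes.
type State map[string]map[string]string

// Ensure asserts that the named resource exists with the wanted attributes.
// It returns a description of each change. With preview set, nothing is mutated.
func Ensure(state State, name string, want map[string]string, preview bool) []string {
	var changes []string
	have, exists := state[name]
	if !exists {
		changes = append(changes, "create "+name)
		have = map[string]string{}
		if !preview {
			state[name] = have
		}
	}
	// sort keys so the reported changes come out in a stable order
	keys := make([]string, 0, len(want))
	for k := range want {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if have[k] != want[k] {
			changes = append(changes, fmt.Sprintf("update %s %s=%s", name, k, want[k]))
			if !preview {
				have[k] = want[k]
			}
		}
	}
	return changes
}

func main() {
	state := State{}
	want := map[string]string{"acl": "private", "versioning": "true"}
	fmt.Println(len(Ensure(state, "test-bucket", want, true)))  // preview: 3 changes reported, state untouched
	fmt.Println(len(Ensure(state, "test-bucket", want, false))) // apply: the same 3 changes
	fmt.Println(len(Ensure(state, "test-bucket", want, false))) // rerun: 0 changes
}
```

&lt;p&gt;the third call returning nothing is the "no output means no changes" property: rerunning after a transient failure only applies whatever is still missing.&lt;/p&gt;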
&lt;h2 id="tradeoffs"&gt;&lt;a class="heading-link" href="#tradeoffs"&gt;tradeoffs&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;no attempt is made to avoid vendor lock-in.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;migrating between cloud providers will always be non-trivial.&lt;/li&gt;
&lt;li&gt;attempting to mitigate future migrations has more cost than benefit in the typical case.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;ensure&lt;/code&gt; operations are positive assertions. they assert that some named infrastructure exists, and is configured correctly, creating or updating it if needed.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;positive assertions &lt;strong&gt;CANNOT&lt;/strong&gt; remove top level infrastructure, but &lt;strong&gt;CAN&lt;/strong&gt; remove configuration from it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;removing a &lt;code&gt;trigger&lt;/code&gt;, &lt;code&gt;policy&lt;/code&gt;, or &lt;code&gt;allow&lt;/code&gt; &lt;strong&gt;WILL&lt;/strong&gt; remove that from the &lt;code&gt;lambda&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;removing a &lt;code&gt;policy&lt;/code&gt; or &lt;code&gt;allow&lt;/code&gt; &lt;strong&gt;WILL&lt;/strong&gt; remove that from the &lt;code&gt;instance-profile&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;removing a &lt;code&gt;security-group&lt;/code&gt; &lt;strong&gt;WILL&lt;/strong&gt; remove that from the &lt;code&gt;vpc&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;removing a &lt;code&gt;rule&lt;/code&gt; &lt;strong&gt;WILL&lt;/strong&gt; remove that from the &lt;code&gt;security-group&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;removing an &lt;code&gt;attr&lt;/code&gt; &lt;strong&gt;WILL&lt;/strong&gt; remove that from a &lt;code&gt;sqs&lt;/code&gt;, &lt;code&gt;s3&lt;/code&gt;, &lt;code&gt;dynamodb&lt;/code&gt;, or &lt;code&gt;lambda&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;removing a &lt;code&gt;keypair&lt;/code&gt;, &lt;code&gt;vpc&lt;/code&gt;, &lt;code&gt;instance-profile&lt;/code&gt;, &lt;code&gt;sqs&lt;/code&gt;, &lt;code&gt;s3&lt;/code&gt;, &lt;code&gt;dynamodb&lt;/code&gt;, or &lt;code&gt;lambda&lt;/code&gt; &lt;strong&gt;WON'T&lt;/strong&gt; remove that from the account/region.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the operator decides &lt;strong&gt;IF&lt;/strong&gt; and &lt;strong&gt;WHEN&lt;/strong&gt; top level infrastructure should be deleted, then uses an &lt;code&gt;rm&lt;/code&gt; operation to do so.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;as a convenience, &lt;code&gt;infra-rm&lt;/code&gt; will remove &lt;strong&gt;ALL&lt;/strong&gt; infrastructure &lt;strong&gt;CURRENTLY&lt;/strong&gt; declared in an &lt;code&gt;infra.yaml&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;when using &lt;code&gt;ensure&lt;/code&gt; operations, no output means no changes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;for large infrastructure sets, this can mean a minute or two without output if no changes are needed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;to see a lot of output instead of none, set this environment variable:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;export&lt;/span&gt; DEBUG=yes&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;infra-ls&lt;/code&gt; is designed to list aws accounts managed with &lt;code&gt;infra-ensure&lt;/code&gt;. it will not work well in other scenarios.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
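&lt;p&gt;the positive assertion tradeoff can be sketched in go. this is a simplified model, not how libaws is implemented: &lt;code&gt;Infra&lt;/code&gt;, &lt;code&gt;Ensure&lt;/code&gt;, and &lt;code&gt;Rm&lt;/code&gt; are hypothetical names, and the lambda names are made up for illustration.&lt;/p&gt;

```go
package main

import "fmt"

// Infra is a hypothetical, simplified infrastructure set: lambda name to trigger types.
type Infra map[string][]string

// Ensure makes every declared lambda match its declaration exactly,
// so a trigger dropped from the declaration is dropped from the lambda.
// Lambdas missing from the declaration are left alone.
func Ensure(aws Infra, declared Infra) {
	for name, triggers := range declared {
		aws[name] = append([]string{}, triggers...)
	}
}

// Rm deletes only the lambdas currently declared.
func Rm(aws Infra, declared Infra) {
	for name := range declared {
		delete(aws, name)
	}
}

func main() {
	aws := Infra{}
	Ensure(aws, Infra{"resize": {"s3", "sqs"}, "report": {"schedule"}})
	// drop the sqs trigger, and drop the report lambda, from the declaration
	declared := Infra{"resize": {"s3"}}
	Ensure(aws, declared)
	fmt.Println(aws["resize"]) // [s3]
	fmt.Println(aws["report"]) // [schedule], still deployed
	Rm(aws, declared)
	fmt.Println(len(aws)) // 1, only report remains
}
```

&lt;p&gt;in this model the undeclared lambda survives both calls, which is why deleting top level infrastructure stays an explicit operator decision.&lt;/p&gt;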
&lt;h2 id="infrayaml"&gt;&lt;a class="heading-link" href="#infrayaml"&gt;infra.yaml&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;use an &lt;code&gt;infra.yaml&lt;/code&gt; file to declare an infrastructure set. the schema is as follows:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;
&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;entrypoint&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;:     &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;:      &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:       &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;require&lt;/span&gt;:    &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;env&lt;/span&gt;:        &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;include&lt;/span&gt;:    &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;: &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
&lt;span class="pl-ent"&gt;s3&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;: &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
&lt;span class="pl-ent"&gt;dynamodb&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;key&lt;/span&gt;:  &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;: &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
&lt;span class="pl-ent"&gt;sqs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;: &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
&lt;span class="pl-ent"&gt;vpc&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;security-group&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;rule&lt;/span&gt;: &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
&lt;span class="pl-ent"&gt;keypair&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;pubkey-content&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;
&lt;span class="pl-ent"&gt;instance-profile&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;: &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;: &lt;span class="pl-s"&gt;[VALUE ...]&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="environment-variable-substitution"&gt;&lt;a class="heading-link" href="#environment-variable-substitution"&gt;environment variable substitution&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;anywhere in &lt;code&gt;infra.yaml&lt;/code&gt; you can substitute environment variables from the caller's environment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;example:
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;s3&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-bucket-${uid}&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;versioning=${versioning}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
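&lt;p&gt;as a sketch of what substitution produces: assuming the caller's environment has &lt;code&gt;uid=1234&lt;/code&gt; and &lt;code&gt;versioning=true&lt;/code&gt; (both values are hypothetical), the example above would be read as if it were written like this:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-c"&gt;# effective document after substitution, assuming uid=1234 and versioning=true&lt;/span&gt;
&lt;span class="pl-ent"&gt;s3&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-bucket-1234&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;versioning=true&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;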
&lt;p&gt;the following variables are defined during deployment, and are useful in &lt;code&gt;allow&lt;/code&gt; declarations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;${API_ID}&lt;/code&gt; the id of the apigateway v2 api created by an &lt;code&gt;api&lt;/code&gt; trigger.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;${WEBSOCKET_ID}&lt;/code&gt; the id of the apigateway v2 websocket created by a &lt;code&gt;websocket&lt;/code&gt; trigger.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
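&lt;p&gt;a minimal sketch of using one of these variables in an &lt;code&gt;allow&lt;/code&gt; declaration. the lambda name, the domain, and the arn pattern are illustrative assumptions rather than something stated above; the intent is to let a websocket lambda manage connections on its own api only:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;entrypoint&lt;/span&gt;: &lt;span class="pl-s"&gt;main.go&lt;/span&gt;
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;websocket&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;dns=ws.example.com&lt;/span&gt;
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;:
      &lt;span class="pl-c"&gt;# assumed arn pattern: api-id/stage/method/path, scoped to this websocket api&lt;/span&gt;
      - &lt;span class="pl-s"&gt;execute-api:ManageConnections arn:aws:execute-api:*:*:${WEBSOCKET_ID}/*/*/*&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;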
&lt;h3 id="name"&gt;&lt;a class="heading-link" href="#name"&gt;name&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines the name of the infrastructure set.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;name&lt;/span&gt;: &lt;span class="pl-s"&gt;test-infraset&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="s3"&gt;&lt;a class="heading-link" href="#s3"&gt;s3&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-s3-bucket.html" rel="nofollow"&gt;s3&lt;/a&gt; bucket:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the following &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/s3/ensure.go"&gt;attributes&lt;/a&gt; can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;acl=VALUE&lt;/code&gt;, values: &lt;code&gt;public | private&lt;/code&gt;, default: &lt;code&gt;private&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;versioning=VALUE&lt;/code&gt;, values: &lt;code&gt;true | false&lt;/code&gt;, default: &lt;code&gt;false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metrics=VALUE&lt;/code&gt;, values: &lt;code&gt;true | false&lt;/code&gt;, default: &lt;code&gt;true&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cors=VALUE&lt;/code&gt;, values: &lt;code&gt;true | false&lt;/code&gt;, default: &lt;code&gt;false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ttldays=VALUE&lt;/code&gt;, values: &lt;code&gt;0 | n&lt;/code&gt;, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;setting &lt;code&gt;cors=true&lt;/code&gt; uses &lt;code&gt;*&lt;/code&gt; for allowed origins. to specify one or more explicit origins, do this instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;corsorigin=http://localhost:8080&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;corsorigin=https://example.com&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;s3&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;s3&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-bucket&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;versioning=true&lt;/span&gt;
      - &lt;span class="pl-s"&gt;acl=public&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
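&lt;p&gt;a sketch combining the cors and ttl attributes described above. the bucket name is hypothetical, and this assumes that &lt;code&gt;ttldays&lt;/code&gt; expires objects after that many days, which the name suggests but the list above does not spell out:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;s3&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-bucket&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      &lt;span class="pl-c"&gt;# explicit origins instead of cors=true, one attr per origin&lt;/span&gt;
      - &lt;span class="pl-s"&gt;corsorigin=http://localhost:8080&lt;/span&gt;
      - &lt;span class="pl-s"&gt;corsorigin=https://example.com&lt;/span&gt;
      &lt;span class="pl-c"&gt;# assumption: objects expire after 14 days&lt;/span&gt;
      - &lt;span class="pl-s"&gt;ttldays=14&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;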
&lt;h3 id="dynamodb"&gt;&lt;a class="heading-link" href="#dynamodb"&gt;dynamodb&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines a &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-dynamodb-table.html" rel="nofollow"&gt;dynamodb&lt;/a&gt; table:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;specify key as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;NAME:ATTR_TYPE:KEY_TYPE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the following &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-dynamodb-table.html" rel="nofollow"&gt;attributes&lt;/a&gt; can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;read=VALUE&lt;/code&gt;, provisioned read capacity, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write=VALUE&lt;/code&gt;, provisioned write capacity, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;on global indices the following &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-dynamodb-gsi.html" rel="nofollow"&gt;attributes&lt;/a&gt; can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;projection=VALUE&lt;/code&gt;, attribute projection type, default: &lt;code&gt;ALL&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;read=VALUE&lt;/code&gt;, provisioned read capacity, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write=VALUE&lt;/code&gt;, provisioned write capacity, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;on local indices the following &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-dynamodb-lsi.html" rel="nofollow"&gt;attributes&lt;/a&gt; can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;projection=VALUE&lt;/code&gt;, attribute projection type, default: &lt;code&gt;ALL&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;dynamodb&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;NAME:ATTR_TYPE:KEY_TYPE&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;
    &lt;span class="pl-ent"&gt;global-index&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;NAME:ATTR_TYPE:KEY_TYPE&lt;/span&gt;
        &lt;span class="pl-ent"&gt;non-key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;NAME&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;
    &lt;span class="pl-ent"&gt;local-index&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;NAME:ATTR_TYPE:KEY_TYPE&lt;/span&gt;
        &lt;span class="pl-ent"&gt;non-key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;NAME&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;dynamodb&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;stream-table&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;userid:s:hash&lt;/span&gt;
      - &lt;span class="pl-s"&gt;timestamp:n:range&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;stream=keys_only&lt;/span&gt;
  &lt;span class="pl-ent"&gt;auth-table&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;id:s:hash&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;write=50&lt;/span&gt;
      - &lt;span class="pl-s"&gt;read=150&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example global secondary index:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;dynamodb&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-table&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;id:s:hash&lt;/span&gt;
    &lt;span class="pl-ent"&gt;global-index&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;test-index&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;hometown:s:hash&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example local secondary index:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;dynamodb&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-table&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;id:s:hash&lt;/span&gt;
      - &lt;span class="pl-s"&gt;timestamp:n:range&lt;/span&gt;
    &lt;span class="pl-ent"&gt;local-index&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;test-index&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;id:s:hash&lt;/span&gt;
          - &lt;span class="pl-s"&gt;hometown:s:range&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
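&lt;p&gt;the schema above also allows &lt;code&gt;non-key&lt;/code&gt; and &lt;code&gt;attr&lt;/code&gt; on an index, which the examples do not exercise. a sketch of a global index that projects one extra attribute and has its own capacity; the attribute name &lt;code&gt;email&lt;/code&gt; is hypothetical, and &lt;code&gt;projection=INCLUDE&lt;/code&gt; is assumed to be accepted because it is one of dynamodb's projection types alongside the default &lt;code&gt;ALL&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;dynamodb&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-table&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;id:s:hash&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;read=10&lt;/span&gt;
      - &lt;span class="pl-s"&gt;write=10&lt;/span&gt;
    &lt;span class="pl-ent"&gt;global-index&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;test-index&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;hometown:s:hash&lt;/span&gt;
        &lt;span class="pl-c"&gt;# hypothetical non-key attribute copied into the index&lt;/span&gt;
        &lt;span class="pl-ent"&gt;non-key&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;email&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          &lt;span class="pl-c"&gt;# assumption: INCLUDE projects only the keys plus the non-key list&lt;/span&gt;
          - &lt;span class="pl-s"&gt;projection=INCLUDE&lt;/span&gt;
          - &lt;span class="pl-s"&gt;read=5&lt;/span&gt;
          - &lt;span class="pl-s"&gt;write=5&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;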
&lt;h3 id="sqs"&gt;&lt;a class="heading-link" href="#sqs"&gt;sqs&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sqs-queue.html" rel="nofollow"&gt;sqs&lt;/a&gt; queue:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the following &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sqs-queue.html#aws-resource-sqs-queue-syntax" rel="nofollow"&gt;attributes&lt;/a&gt; can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;delay=VALUE&lt;/code&gt;, delay seconds, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;size=VALUE&lt;/code&gt;, maximum message size bytes, default: &lt;code&gt;262144&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retention=VALUE&lt;/code&gt;, message retention period seconds, default: &lt;code&gt;345600&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wait=VALUE&lt;/code&gt;, receive wait time seconds, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timeout=VALUE&lt;/code&gt;, visibility timeout seconds, default: &lt;code&gt;30&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;sqs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;sqs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-queue&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;delay=20&lt;/span&gt;
      - &lt;span class="pl-s"&gt;timeout=300&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
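&lt;p&gt;a sketch of a queue tuned for long polling, using only the attributes listed above. the queue name is hypothetical; the values are the sqs service maximums for receive wait time (20 seconds) and retention (14 days):&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;sqs&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-queue&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      &lt;span class="pl-c"&gt;# long polling: a receive call waits up to 20 seconds for a message&lt;/span&gt;
      - &lt;span class="pl-s"&gt;wait=20&lt;/span&gt;
      &lt;span class="pl-c"&gt;# keep unconsumed messages for 14 days&lt;/span&gt;
      - &lt;span class="pl-s"&gt;retention=1209600&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;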
&lt;h3 id="keypair"&gt;&lt;a class="heading-link" href="#keypair"&gt;keypair&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines an ec2 &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ec2-keypair.html" rel="nofollow"&gt;keypair&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;keypair&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;pubkey-content&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;keypair&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-keypair&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;pubkey-content&lt;/span&gt;: &lt;span class="pl-s"&gt;ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICVp11Z99AySWfbLrMBewZluh7cwLlkjifGH5u22RXor&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="vpc"&gt;&lt;a class="heading-link" href="#vpc"&gt;vpc&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines a default-like &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ec2-vpc.html" rel="nofollow"&gt;vpc&lt;/a&gt; with an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ec2-internetgateway.html" rel="nofollow"&gt;internet gateway&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-support" rel="nofollow"&gt;public access&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;vpc&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;: &lt;span class="pl-s"&gt;{}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;vpc&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-vpc&lt;/span&gt;: &lt;span class="pl-s"&gt;{}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="security-group"&gt;&lt;a class="heading-link" href="#security-group"&gt;security group&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines a &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ec2-security-group.html" rel="nofollow"&gt;security group&lt;/a&gt; on a vpc.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;vpc&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;security-group&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;rule&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;PROTO:PORT:SOURCE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;vpc&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-vpc&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;security-group&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;test-sg&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;rule&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;tcp:22:0.0.0.0/0&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
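&lt;p&gt;since &lt;code&gt;rule&lt;/code&gt; is a list, a group can carry several &lt;code&gt;PROTO:PORT:SOURCE&lt;/code&gt; entries. a sketch that opens https to everyone but limits ssh to one network; the cidr &lt;code&gt;203.0.113.0/24&lt;/code&gt; is a documentation-only placeholder, not a real office range:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;vpc&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-vpc&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;security-group&lt;/span&gt;:
      &lt;span class="pl-ent"&gt;test-sg&lt;/span&gt;:
        &lt;span class="pl-ent"&gt;rule&lt;/span&gt;:
          &lt;span class="pl-c"&gt;# https from anywhere&lt;/span&gt;
          - &lt;span class="pl-s"&gt;tcp:443:0.0.0.0/0&lt;/span&gt;
          &lt;span class="pl-c"&gt;# ssh only from a placeholder cidr&lt;/span&gt;
          - &lt;span class="pl-s"&gt;tcp:22:203.0.113.0/24&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;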
&lt;h3 id="instance-profile"&gt;&lt;a class="heading-link" href="#instance-profile"&gt;instance profile&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines an ec2 &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-iam-instanceprofile.html" rel="nofollow"&gt;instance profile&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;instance-profile&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;SERVICE:ACTION ARN&lt;/span&gt;
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;instance-profile&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-profile&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;s3:* *&lt;/span&gt;
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;AWSLambdaBasicExecutionRole&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="lambda"&gt;&lt;a class="heading-link" href="#lambda"&gt;lambda&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;defines a &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html" rel="nofollow"&gt;lambda&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;: &lt;span class="pl-s"&gt;{}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;: &lt;span class="pl-s"&gt;{}&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="entrypoint"&gt;&lt;a class="heading-link" href="#entrypoint"&gt;entrypoint&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines the code of the lambda. it is one of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;a python file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;a go file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;an ecr container uri.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;entrypoint&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;entrypoint&lt;/span&gt;: &lt;span class="pl-s"&gt;main.go&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
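&lt;p&gt;the example above covers a go file. a sketch of the container form follows; the account id, region, and image name are placeholders, and whether a tag or a digest is expected after the image name is an assumption not confirmed by the text above:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-c"&gt;# placeholder ecr container uri, replace with a real image in your account&lt;/span&gt;
    &lt;span class="pl-ent"&gt;entrypoint&lt;/span&gt;: &lt;span class="pl-s"&gt;123456789012.dkr.ecr.us-east-1.amazonaws.com/test-image:latest&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;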
&lt;h4 id="attr"&gt;&lt;a class="heading-link" href="#attr"&gt;attr&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines lambda attributes. the following can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;concurrency&lt;/code&gt; defines the reserved concurrent executions, default: &lt;code&gt;0&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;memory&lt;/code&gt; defines lambda ram in megabytes, default: &lt;code&gt;128&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;timeout&lt;/code&gt; defines the lambda timeout in seconds, default: &lt;code&gt;300&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;logs-ttl-days&lt;/code&gt; defines the ttl days for cloudwatch logs, default: &lt;code&gt;7&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;KEY=VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;concurrency=100&lt;/span&gt;
      - &lt;span class="pl-s"&gt;memory=256&lt;/span&gt;
      - &lt;span class="pl-s"&gt;timeout=60&lt;/span&gt;
      - &lt;span class="pl-s"&gt;logs-ttl-days=1&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="policy"&gt;&lt;a class="heading-link" href="#policy"&gt;policy&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines policies on the lambda's iam role.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;AWSLambdaBasicExecutionRole&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="allow"&gt;&lt;a class="heading-link" href="#allow"&gt;allow&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines allow statements on the lambda's iam role.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;SERVICE:ACTION ARN&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;s3:* *&lt;/span&gt;
      - &lt;span class="pl-s"&gt;dynamodb:* arn:aws:dynamodb:*:*:table/test-table&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="env"&gt;&lt;a class="heading-link" href="#env"&gt;env&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines environment variables on the lambda:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;KEY=VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;kind=production&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="include"&gt;&lt;a class="heading-link" href="#include"&gt;include&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines extra content to include in the lambda zip:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;this is ignored when &lt;code&gt;entrypoint&lt;/code&gt; is an ecr container uri.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;include&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;include&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;./cacerts.crt&lt;/span&gt;
      - &lt;span class="pl-s"&gt;../frontend/public/*&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="require"&gt;&lt;a class="heading-link" href="#require"&gt;require&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines dependencies to install with pip in the virtualenv zip.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;this is ignored unless the &lt;code&gt;entrypoint&lt;/code&gt; is a python file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;require&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;require&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;fastapi==0.76.0&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
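&lt;p&gt;the keys described so far compose on a single lambda. a sketch of a python lambda that reuses the values from the examples above; the file name &lt;code&gt;main.py&lt;/code&gt; is hypothetical:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-c"&gt;# a python entrypoint, so require is not ignored&lt;/span&gt;
    &lt;span class="pl-ent"&gt;entrypoint&lt;/span&gt;: &lt;span class="pl-s"&gt;main.py&lt;/span&gt;
    &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;memory=256&lt;/span&gt;
      - &lt;span class="pl-s"&gt;timeout=60&lt;/span&gt;
    &lt;span class="pl-ent"&gt;policy&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;AWSLambdaBasicExecutionRole&lt;/span&gt;
    &lt;span class="pl-ent"&gt;allow&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;dynamodb:* arn:aws:dynamodb:*:*:table/test-table&lt;/span&gt;
    &lt;span class="pl-ent"&gt;env&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;kind=production&lt;/span&gt;
    &lt;span class="pl-ent"&gt;require&lt;/span&gt;:
      - &lt;span class="pl-s"&gt;fastapi==0.76.0&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;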
&lt;h4 id="trigger"&gt;&lt;a class="heading-link" href="#trigger"&gt;trigger&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;p&gt;defines triggers for the lambda:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;dynamodb&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;test-table&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id="trigger-types"&gt;&lt;a class="heading-link" href="#trigger-types"&gt;trigger types&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h4&gt;
&lt;h5 id="api"&gt;&lt;a class="heading-link" href="#api"&gt;api&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;defines an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-apigatewayv2-api.html" rel="nofollow"&gt;apigateway v2&lt;/a&gt; http api:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;add a custom domain with attr: &lt;code&gt;domain=api.example.com&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;add a custom domain and update route53 with attr: &lt;code&gt;dns=api.example.com&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;this domain, or its parent domain, must already exist as a hosted zone in &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/route53/ls.go"&gt;route53&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;this domain, or its parent domain, must already have an &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/acm/ls.go"&gt;acm&lt;/a&gt; certificate with subdomain wildcard.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;api&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;api&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;dns=api.example.com&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="websocket"&gt;&lt;a class="heading-link" href="#websocket"&gt;websocket&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;defines an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-apigatewayv2-api.html" rel="nofollow"&gt;apigateway v2&lt;/a&gt; websocket api:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;add a custom domain with attr: &lt;code&gt;domain=ws.example.com&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;add a custom domain and update route53 with attr: &lt;code&gt;dns=ws.example.com&lt;/code&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;this domain, or its parent domain, must already exist as a hosted zone in &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/route53/ls.go"&gt;route53&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;this domain, or its parent domain, must already have an &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/acm/ls.go"&gt;acm&lt;/a&gt; certificate with subdomain wildcard.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;websocket&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;websocket&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;dns=ws.example.com&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="s3-1"&gt;&lt;a class="heading-link" href="#s3-1"&gt;s3&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;defines an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-s3-bucket-notificationconfig.html" rel="nofollow"&gt;s3 trigger&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the only attribute must be the bucket name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;object creation and deletion invoke the trigger.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-c1"&gt;s3&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-c1"&gt;s3&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;test-bucket&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="dynamodb-1"&gt;&lt;a class="heading-link" href="#dynamodb-1"&gt;dynamodb&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;defines a &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-eventsourcemapping.html" rel="nofollow"&gt;dynamodb trigger&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the first attribute must be the table name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the following trigger &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-eventsourcemapping.html" rel="nofollow"&gt;attributes&lt;/a&gt; can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch=VALUE&lt;/code&gt;, maximum batch size, default: &lt;code&gt;100&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parallel=VALUE&lt;/code&gt;, parallelization factor, default: &lt;code&gt;1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retry=VALUE&lt;/code&gt;, maximum retry attempts, default: &lt;code&gt;-1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;window=VALUE&lt;/code&gt;, maximum batching window in seconds, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;start=VALUE&lt;/code&gt;, starting position&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;dynamodb&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;dynamodb&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;test-table&lt;/span&gt;
          - &lt;span class="pl-s"&gt;start=trim_horizon&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
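&lt;p&gt;as a rough illustration of how the attribute list above reads, here is a minimal python sketch. it is not taken from libaws: &lt;code&gt;parse_dynamodb_attrs&lt;/code&gt; and &lt;code&gt;DEFAULTS&lt;/code&gt; are hypothetical names, values are kept as strings for simplicity, and rejecting unknown keys is an assumption.&lt;/p&gt;

```python
# defaults as listed above; "start" has no documented default, so it is left unset
DEFAULTS = {"batch": "100", "parallel": "1", "retry": "-1", "window": "0"}


def parse_dynamodb_attrs(attrs):
    """hypothetical helper: split a trigger attr list into a table name and options."""
    if not attrs:
        raise ValueError("the first attribute must be the table name")
    table = attrs[0]
    options = dict(DEFAULTS)
    for attr in attrs[1:]:
        # every attribute after the table name is expected to look like key=value
        key, sep, value = attr.partition("=")
        if not sep or key not in (*DEFAULTS, "start"):
            raise ValueError(f"unknown attribute: {attr}")
        options[key] = value
    return table, options
```

&lt;p&gt;with the attrs from the example above, &lt;code&gt;parse_dynamodb_attrs(["test-table", "start=trim_horizon"])&lt;/code&gt; would return the table name &lt;code&gt;test-table&lt;/code&gt; and the listed defaults plus &lt;code&gt;start&lt;/code&gt; set to &lt;code&gt;trim_horizon&lt;/code&gt;.&lt;/p&gt;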
&lt;h5 id="sqs-1"&gt;&lt;a class="heading-link" href="#sqs-1"&gt;sqs&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;defines an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-eventsourcemapping.html" rel="nofollow"&gt;sqs trigger&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the first attribute must be the queue name.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;the following trigger &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-eventsourcemapping.html" rel="nofollow"&gt;attributes&lt;/a&gt; can be defined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch=VALUE&lt;/code&gt;, maximum batch size, default: &lt;code&gt;10&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;window=VALUE&lt;/code&gt;, maximum batching window in seconds, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;sqs&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;sqs&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;test-queue&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="schedule"&gt;&lt;a class="heading-link" href="#schedule"&gt;schedule&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;defines a &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-events-rule.html" rel="nofollow"&gt;schedule trigger&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;the only attribute must be the &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/services-cloudwatchevents-expressions.html" rel="nofollow"&gt;schedule expression&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;schedule&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;VALUE&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;schedule&lt;/span&gt;
        &lt;span class="pl-ent"&gt;attr&lt;/span&gt;:
          - &lt;span class="pl-s"&gt;rate(24 hours)&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id="ecr"&gt;&lt;a class="heading-link" href="#ecr"&gt;ecr&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h5&gt;
&lt;p&gt;defines an &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-events-rule.html" rel="nofollow"&gt;ecr trigger&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;successful &lt;a href="https://github.com/nathants/libaws/blob/163533034af790187e56d4e267a797d8131f1307/lib/lambda.go#L153"&gt;image actions&lt;/a&gt; to any ecr repository will invoke the trigger.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;schema:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;VALUE&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;ecr&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;example:&lt;/p&gt;
&lt;div class="highlight highlight-source-yaml"&gt;&lt;pre&gt;&lt;span class="pl-ent"&gt;lambda&lt;/span&gt;:
  &lt;span class="pl-ent"&gt;test-lambda&lt;/span&gt;:
    &lt;span class="pl-ent"&gt;trigger&lt;/span&gt;:
      - &lt;span class="pl-ent"&gt;type&lt;/span&gt;: &lt;span class="pl-s"&gt;ecr&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="bash-completion"&gt;&lt;a class="heading-link" href="#bash-completion"&gt;bash completion&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;source completions.d/libaws.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;h2 id="extending"&gt;&lt;a class="heading-link" href="#extending"&gt;extending&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;drop down to the &lt;a href="https://pkg.go.dev/github.com/aws/aws-sdk-go/service" rel="nofollow"&gt;aws go sdk&lt;/a&gt; and implement what you need.&lt;/p&gt;
&lt;p&gt;extend an &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/sqs/ensure.go"&gt;existing&lt;/a&gt; &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/s3/ensure.go"&gt;mutative&lt;/a&gt; &lt;a href="https://github.com/nathants/libaws/tree/master/cmd/dynamodb/ensure.go"&gt;operation&lt;/a&gt; or add a new one.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;make sure that mutative operations are &lt;strong&gt;IDEMPOTENT&lt;/strong&gt; and can be &lt;strong&gt;PREVIEWED&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;you will find examples in &lt;a href="https://github.com/nathants/libaws/tree/master/cmd"&gt;cmd/&lt;/a&gt; and &lt;a href="https://github.com/nathants/libaws/tree/master/lib"&gt;lib/&lt;/a&gt; that can provide a good place to start.&lt;/p&gt;
&lt;p&gt;you can reuse many existing operations like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/libaws/tree/master/lib/iam.go"&gt;lib/iam.go&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/libaws/tree/master/lib/lambda.go"&gt;lib/lambda.go&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/libaws/tree/master/lib/ec2.go"&gt;lib/ec2.go&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;alternatively, lift and shift to &lt;a href="https://www.pulumi.com/" rel="nofollow"&gt;other&lt;/a&gt; &lt;a href="https://www.terraform.io/" rel="nofollow"&gt;infrastructure&lt;/a&gt; &lt;a href="https://aws.amazon.com/cloudformation/" rel="nofollow"&gt;automation&lt;/a&gt; &lt;a href="https://www.serverless.com/" rel="nofollow"&gt;tooling&lt;/a&gt;. &lt;code&gt;ls&lt;/code&gt; and &lt;code&gt;describe&lt;/code&gt; operations will give you all the information you need.&lt;/p&gt;
&lt;h2 id="testing"&gt;&lt;a class="heading-link" href="#testing"&gt;testing&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;run all integration tests against aws with &lt;a href="https://tox.wiki/en/latest/" rel="nofollow"&gt;tox&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;export&lt;/span&gt; LIBAWS_TEST_ACCOUNT=&lt;span class="pl-smi"&gt;$ACCOUNT_NUM&lt;/span&gt;

tox&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;run one integration test against aws with &lt;a href="https://tox.wiki/en/latest/" rel="nofollow"&gt;tox&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;export&lt;/span&gt; LIBAWS_TEST_ACCOUNT=&lt;span class="pl-smi"&gt;$ACCOUNT_NUM&lt;/span&gt;

tox -- bash -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cd examples/simple/python/api/ &amp;amp;&amp;amp; python test.py&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/libaws</guid>
    </item>
    <item>
      <title>bsv</title>
      <link>https://nathants.com/projects/bsv</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;it should be simple and easy to process data at the speed of sequential io.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a simple and efficient &lt;a href="https://github.com/nathants/bsv/blob/master/util/load.h"&gt;data&lt;/a&gt; &lt;a href="https://github.com/nathants/bsv/blob/master/util/dump.h"&gt;format&lt;/a&gt; for easily manipulating chunks of rows of columns while minimizing allocations and copies.&lt;/p&gt;
&lt;p&gt;minimal cli &lt;a href="#tools"&gt;tools&lt;/a&gt; for rapidly composing performant data flow pipelines.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;column: 0-65536 bytes.&lt;/p&gt;
&lt;p&gt;row: 0-65536 columns.&lt;/p&gt;
&lt;p&gt;chunk: up to 5MB containing 1 or more complete rows.&lt;/p&gt;
&lt;p&gt;note: row data cannot exceed chunk size.&lt;/p&gt;
&lt;h2 id="layout"&gt;&lt;a class="heading-link" href="#layout"&gt;layout&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/util/read.h"&gt;chunk&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;|&lt;/span&gt; i32:size &lt;span class="pl-k"&gt;|&lt;/span&gt; u8[]:row &lt;span class="pl-k"&gt;|&lt;/span&gt; ... &lt;span class="pl-k"&gt;|&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/util/load.h"&gt;row&lt;/a&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;|&lt;/span&gt; u16:max &lt;span class="pl-k"&gt;|&lt;/span&gt; u16:size &lt;span class="pl-k"&gt;|&lt;/span&gt; ... &lt;span class="pl-k"&gt;|&lt;/span&gt; u8[]:column &lt;span class="pl-k"&gt;|&lt;/span&gt; ... &lt;span class="pl-k"&gt;|&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;note: column bytes are always followed by a single null byte.&lt;/p&gt;
&lt;p&gt;note: max is the maximum zero based index into the row.&lt;/p&gt;
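&lt;p&gt;the layout above can be sketched in python as a worked example. this is only an illustration under stated assumptions, not the bsv implementation: &lt;code&gt;pack_row&lt;/code&gt; and &lt;code&gt;pack_chunk&lt;/code&gt; are hypothetical names, each u16 size is assumed to exclude the trailing null byte, the i32 chunk size is assumed to count the row bytes that follow it, and integers are written little endian.&lt;/p&gt;

```python
def pack_row(columns):
    """pack a list of byte strings as: u16 max, one u16 size per column, then column bytes."""
    if not columns:
        raise ValueError("this sketch needs at least one column")
    # max is the maximum zero based index into the row
    max_index = len(columns) - 1
    header = max_index.to_bytes(2, "little")
    # assumption: each size counts the column bytes only, not the null byte
    sizes = b"".join(len(column).to_bytes(2, "little") for column in columns)
    # column bytes are always followed by a single null byte
    body = b"".join(column + b"\x00" for column in columns)
    return header + sizes + body


def pack_chunk(rows):
    """prefix already packed rows with an i32 size."""
    payload = b"".join(rows)
    # assumption: size is the number of row bytes that follow
    return len(payload).to_bytes(4, "little", signed=True) + payload
```

&lt;p&gt;for a row with the columns &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;bc&lt;/code&gt;, this sketch produces 2 bytes for max, 4 bytes of sizes, and 5 bytes of column data, so the chunk size prefix is 11.&lt;/p&gt;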
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; curl https://raw.githubusercontent.com/nathants/bsv/master/scripts/install_archlinux.sh &lt;span class="pl-k"&gt;|&lt;/span&gt; bash&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; git clone https://github.com/nathants/bsv
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;cd&lt;/span&gt; bsv
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; make -j
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; sudo mv -fv bin/&lt;span class="pl-k"&gt;*&lt;/span&gt; /usr/local/bin&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;note: for best pipeline performance increase maximum pipe size&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; sudo sysctl fs.pipe-max-size=5242880&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="test"&gt;&lt;a class="heading-link" href="#test"&gt;test&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; tox&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; docker build -t bsv:debian -f Dockerfile.debian &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; docker run -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;pwd&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;:/code --rm -it bsv:debian bash -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cd /code &amp;amp;&amp;amp; py.test -vvx --tb native -n auto test/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; docker build -t bsv:alpine -f Dockerfile.alpine &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; docker run -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;pwd&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;:/code --rm -it bsv:alpine bash -c &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;cd /code &amp;amp;&amp;amp; py.test -vvx --tb native -n auto test/&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;increase the number of generated test cases with the environment variable: &lt;code&gt;TEST_FACTOR=5&lt;/code&gt;&lt;/p&gt;
&lt;h2 id="example"&gt;&lt;a class="heading-link" href="#example"&gt;example&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;add &lt;code&gt;bsumall.c&lt;/code&gt; to &lt;code&gt;bsv/src/&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-c"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"util.h"&lt;/span&gt;
&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"load.h"&lt;/span&gt;
&lt;span class="pl-k"&gt;#include&lt;/span&gt; &lt;span class="pl-s"&gt;"dump.h"&lt;/span&gt;

&lt;span class="pl-k"&gt;#define&lt;/span&gt; &lt;span class="pl-c1"&gt;DESCRIPTION&lt;/span&gt; "sum columns of u16 as i64\n\n"
&lt;span class="pl-k"&gt;#define&lt;/span&gt; &lt;span class="pl-c1"&gt;USAGE&lt;/span&gt; "... | bsumall \n\n"
&lt;span class="pl-k"&gt;#define&lt;/span&gt; &lt;span class="pl-c1"&gt;EXAMPLE&lt;/span&gt; "&amp;gt;&amp;gt; echo '\n1,2\n3,4\n' | bsv | bschema a:u16,a:u16 | bsumall i64 | bschema i64:a,i64:a | csv\n4,6\n"

&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-en"&gt;main&lt;/span&gt;(&lt;span class="pl-smi"&gt;int&lt;/span&gt; &lt;span class="pl-s1"&gt;argc&lt;/span&gt;, &lt;span class="pl-smi"&gt;char&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;&lt;span class="pl-s1"&gt;argv&lt;/span&gt;) {

    &lt;span class="pl-c"&gt;// setup state&lt;/span&gt;
    &lt;span class="pl-en"&gt;SETUP&lt;/span&gt;();
    &lt;span class="pl-smi"&gt;readbuf_t&lt;/span&gt; &lt;span class="pl-s1"&gt;rbuf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;rbuf_init&lt;/span&gt;((&lt;span class="pl-smi"&gt;FILE&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;[]){&lt;span class="pl-s1"&gt;stdin&lt;/span&gt;}, &lt;span class="pl-c1"&gt;1&lt;/span&gt;, false);
    &lt;span class="pl-smi"&gt;writebuf_t&lt;/span&gt; &lt;span class="pl-s1"&gt;wbuf&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-en"&gt;wbuf_init&lt;/span&gt;((&lt;span class="pl-smi"&gt;FILE&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;[]){&lt;span class="pl-s1"&gt;stdout&lt;/span&gt;}, &lt;span class="pl-c1"&gt;1&lt;/span&gt;, false);
    &lt;span class="pl-smi"&gt;i64&lt;/span&gt; &lt;span class="pl-s1"&gt;sums&lt;/span&gt;[&lt;span class="pl-c1"&gt;MAX_COLUMNS&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; {&lt;span class="pl-c1"&gt;0&lt;/span&gt;};
    &lt;span class="pl-smi"&gt;row_t&lt;/span&gt; &lt;span class="pl-s1"&gt;row&lt;/span&gt;;

    &lt;span class="pl-c"&gt;// process input row by row&lt;/span&gt;
    &lt;span class="pl-k"&gt;while&lt;/span&gt; (&lt;span class="pl-c1"&gt;1&lt;/span&gt;) {
        &lt;span class="pl-en"&gt;load_next&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;rbuf&lt;/span&gt;, &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;row&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;);
        &lt;span class="pl-k"&gt;if&lt;/span&gt; (&lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;stop&lt;/span&gt;)
            &lt;span class="pl-k"&gt;break&lt;/span&gt;;
        &lt;span class="pl-k"&gt;for&lt;/span&gt; (&lt;span class="pl-smi"&gt;i32&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &amp;lt;= &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;max&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;) {
            &lt;span class="pl-en"&gt;ASSERT&lt;/span&gt;(&lt;span class="pl-k"&gt;sizeof&lt;/span&gt;(&lt;span class="pl-s1"&gt;u16&lt;/span&gt;) &lt;span class="pl-c1"&gt;==&lt;/span&gt; &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;], &lt;span class="pl-s"&gt;"fatal: bad data\n"&lt;/span&gt;);
            &lt;span class="pl-s1"&gt;sums&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;] &lt;span class="pl-c1"&gt;+=&lt;/span&gt; &lt;span class="pl-c1"&gt;*&lt;/span&gt;(&lt;span class="pl-smi"&gt;u16&lt;/span&gt;&lt;span class="pl-c1"&gt;*&lt;/span&gt;)&lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;];
        }
    }

    &lt;span class="pl-c"&gt;// generate output row&lt;/span&gt;
    &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;max&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;-1&lt;/span&gt;;
    &lt;span class="pl-k"&gt;for&lt;/span&gt; (&lt;span class="pl-smi"&gt;i32&lt;/span&gt; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;0&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;lt;&lt;/span&gt; &lt;span class="pl-c1"&gt;MAX_COLUMNS&lt;/span&gt;; &lt;span class="pl-s1"&gt;i&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;) {
        &lt;span class="pl-k"&gt;if&lt;/span&gt; (!&lt;span class="pl-s1"&gt;sums&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;])
            &lt;span class="pl-k"&gt;break&lt;/span&gt;;
        &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;sizes&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-k"&gt;sizeof&lt;/span&gt;(&lt;span class="pl-s1"&gt;i64&lt;/span&gt;);
        &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;columns&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;] &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;sums&lt;/span&gt;[&lt;span class="pl-s1"&gt;i&lt;/span&gt;];
        &lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;max&lt;/span&gt;&lt;span class="pl-c1"&gt;++&lt;/span&gt;;
    }

    &lt;span class="pl-c"&gt;// dump output&lt;/span&gt;
    &lt;span class="pl-k"&gt;if&lt;/span&gt; (&lt;span class="pl-s1"&gt;row&lt;/span&gt;.&lt;span class="pl-c1"&gt;max&lt;/span&gt; &amp;gt;= &lt;span class="pl-c1"&gt;0&lt;/span&gt;)
        &lt;span class="pl-en"&gt;dump&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;wbuf&lt;/span&gt;, &lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;row&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;);
    &lt;span class="pl-en"&gt;dump_flush&lt;/span&gt;(&lt;span class="pl-c1"&gt;&amp;amp;&lt;/span&gt;&lt;span class="pl-s1"&gt;wbuf&lt;/span&gt;, &lt;span class="pl-c1"&gt;0&lt;/span&gt;);
}&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;build and run:&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ./scripts/makefile.sh

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; make bsumall

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bsumall -h
sum columns of u16 as i64

usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bsumall

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;1,2&lt;/span&gt;
&lt;span class="pl-s"&gt;3,4&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema a:u16,a:u16 &lt;span class="pl-k"&gt;|&lt;/span&gt; bsumall i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema i64:a,i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
4,6&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="non-goals"&gt;&lt;a class="heading-link" href="#non-goals"&gt;non goals&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;support of hardware other than little endian.&lt;/p&gt;
&lt;p&gt;types and schemas as a part of the data format.&lt;/p&gt;
&lt;h2 id="testing-methodology"&gt;&lt;a class="heading-link" href="#testing-methodology"&gt;testing methodology&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://hypothesis.readthedocs.io/en/latest/" rel="nofollow"&gt;quickcheck&lt;/a&gt; style &lt;a href="https://github.com/nathants/bsv/blob/master/test"&gt;testing&lt;/a&gt; with python implementations to verify correct behavior for arbitrary inputs and varying buffer sizes.&lt;/p&gt;
&lt;h2 id="experiments"&gt;&lt;a class="heading-link" href="#experiments"&gt;experiments&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/experiments/"&gt;performance&lt;/a&gt; experiments and alternate implementations.&lt;/p&gt;
&lt;h2 id="related-projects"&gt;&lt;a class="heading-link" href="#related-projects"&gt;related projects&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/s4"&gt;s4&lt;/a&gt; - a storage cluster that is cheap and fast, with data local compute and efficient shuffle.&lt;/p&gt;
&lt;h2 id="related-posts"&gt;&lt;a class="heading-link" href="#related-posts"&gt;related posts&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/optimizing-a-bsv-data-processing-pipeline" rel="nofollow"&gt;optimizing a bsv data processing pipeline&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/performant-batch-processing-with-bsv-s4-and-presto" rel="nofollow"&gt;performant batch processing with bsv, s4, and presto&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/discovering-a-baseline-for-data-processing-performance" rel="nofollow"&gt;discovering a baseline for data processing performance&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/refactoring-common-distributed-data-patterns-into-s4" rel="nofollow"&gt;refactoring common distributed data patterns into s4&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/scaling-python-data-processing-horizontally" rel="nofollow"&gt;scaling python data processing horizontally&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://nathants.com/posts/scaling-python-data-processing-vertically" rel="nofollow"&gt;scaling python data processing vertically&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="more-examples"&gt;&lt;a class="heading-link" href="#more-examples"&gt;more examples&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/s4/blob/master/examples/nyc_taxi_bsv"&gt;structured analysis of nyc taxi data with bsv and hive&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="tools"&gt;&lt;a class="heading-link" href="#tools"&gt;tools&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bcat"&gt;bcat&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;cat some bsv files to csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bcombine"&gt;bcombine&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;prepend a new column by combining values from existing columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bcounteach"&gt;bcounteach&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;count as i64 each contiguous identical row by the first column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bcounteach-hash"&gt;bcounteach-hash&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;count as i64 by hash of the first column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bcountrows"&gt;bcountrows&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;count rows as i64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bcut"&gt;bcut&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;select some columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bdedupe"&gt;bdedupe&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;dedupe identical contiguous rows by the first column, keeping the first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bdedupe-hash"&gt;bdedupe-hash&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;dedupe rows by hash of the first column, keeping the first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bdropuntil"&gt;bdropuntil&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;for sorted input, drop until the first column is gte to VALUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bhead"&gt;bhead&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;keep the first n rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#blz4"&gt;blz4&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;compress bsv data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#blz4d"&gt;blz4d&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;decompress bsv data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bmerge"&gt;bmerge&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;merge sorted files from stdin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bpartition"&gt;bpartition&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;split into multiple files by consistent hash of the first column value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bquantile-merge"&gt;bquantile-merge&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;merge ddsketches and output quantile value pairs as f64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bquantile-sketch"&gt;bquantile-sketch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;collapse the first column into a single row ddsketch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bschema"&gt;bschema&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;validate and convert row data with a schema of columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bsort"&gt;bsort&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;timsort rows by the first column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bsplit"&gt;bsplit&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;split a stream into multiple files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bsum"&gt;bsum&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;sum the first column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bsumeach"&gt;bsumeach&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;sum the second column of each contiguous identical row by the first column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bsumeach-hash"&gt;bsumeach-hash&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;sum as i64 the second column by hash of the first column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bsv"&gt;bsv&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;convert csv to bsv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#btake"&gt;btake&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;take while the first column is VALUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#btakeuntil"&gt;btakeuntil&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;for sorted input, take rows until the first column is greater than or equal to VALUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#btopn"&gt;btopn&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;accumulate the top n rows in a heap by first column value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bunzip"&gt;bunzip&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;split a multi column input into single column outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#bzip"&gt;bzip&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;combine single column inputs into a multi column output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#csv"&gt;csv&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;convert bsv to csv&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="#xxh3"&gt;xxh3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;xxh3_64 hash stdin&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="bcat"&gt;&lt;a class="heading-link" href="#bcat"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bcat.c"&gt;bcat&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;cat some bsv files to csv&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: bcat [-l&lt;span class="pl-k"&gt;|&lt;/span&gt;--lz4] [-p&lt;span class="pl-k"&gt;|&lt;/span&gt;--prefix] [-h N&lt;span class="pl-k"&gt;|&lt;/span&gt;--head N] FILE1 ... FILEN&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-k"&gt;for&lt;/span&gt; &lt;span class="pl-smi"&gt;char&lt;/span&gt; &lt;span class="pl-k"&gt;in&lt;/span&gt; a a b b c c&lt;span class="pl-k"&gt;;&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
     &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-smi"&gt;$char&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /tmp/&lt;span class="pl-smi"&gt;$char&lt;/span&gt;
   &lt;span class="pl-k"&gt;done&lt;/span&gt;

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; bcat --head 1 --prefix /tmp/{a,b,c}
/tmp/a:a
/tmp/b:b
/tmp/c:c&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bcombine"&gt;&lt;a class="heading-link" href="#bcombine"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bcombine.c"&gt;bcombine&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;prepend a new column by combining values from existing columns&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bcombine COL1,...,COLN&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; a,b,c &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcombine 3,2 &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
c:b,a,b,c&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bcounteach"&gt;&lt;a class="heading-link" href="#bcounteach"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bcounteach.c"&gt;bcounteach&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;count as i64 each contiguous identical row by the first column&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bcounteach&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcounteach &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema &lt;span class="pl-k"&gt;*&lt;/span&gt;,i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,2
b,3
a,1&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bcounteach-hash"&gt;&lt;a class="heading-link" href="#bcounteach-hash"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bcounteach_hash.c"&gt;bcounteach-hash&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;count as i64 by hash of the first column&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bcounteach-hash&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcounteach-hash &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema &lt;span class="pl-k"&gt;*&lt;/span&gt;,i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; bsort &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,3
b,3&lt;/pre&gt;&lt;/div&gt;
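&lt;p&gt;the difference between the contiguous and hash variants can be sketched with standard unix tools on plain newline-delimited text. this is only an analogue, not how bsv is implemented: count_contiguous and count_by_key are hypothetical helper names, and the awk associative array merely stands in for a hash table. values are assumed to contain no whitespace.&lt;/p&gt;

```shell
# hypothetical helpers, not part of bsv: coreutils/awk analogues on plain text

count_contiguous() {
    # like bcounteach: one output row per run of identical adjacent lines
    uniq -c | awk '{print $2 "," $1}'
}

count_by_key() {
    # like bcounteach-hash: one output row per distinct value, wherever it appears
    # the final sort only makes the output order deterministic
    awk '{n[$0]++} END {for (k in n) print k "," n[k]}' | sort
}

printf 'a\na\nb\nb\nb\na\n' | count_contiguous
# a,2
# b,3
# a,1

printf 'a\na\nb\nb\nb\na\n' | count_by_key
# a,3
# b,3
```

&lt;p&gt;the contiguous form only needs to remember the previous row, so it suits input that is already sorted or grouped. the keyed form holds one entry per distinct value in memory.&lt;/p&gt;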
&lt;h3 id="bcountrows"&gt;&lt;a class="heading-link" href="#bcountrows"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bcountrows.c"&gt;bcountrows&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;count rows as i64&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bcountrows&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;1&lt;/span&gt;
&lt;span class="pl-s"&gt;2&lt;/span&gt;
&lt;span class="pl-s"&gt;3&lt;/span&gt;
&lt;span class="pl-s"&gt;4&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcountrows &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
4&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bcut"&gt;&lt;a class="heading-link" href="#bcut"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bcut.c"&gt;bcut&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;select some columns&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bcut COL1,...,COLN&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; a,b,c &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcut 3,3,3,2,2,1 &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
c,c,c,b,b,a&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bdedupe"&gt;&lt;a class="heading-link" href="#bdedupe"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bdedupe.c"&gt;bdedupe&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;dedupe identical contiguous rows by the first column, keeping the first&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bdedupe&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bdedupe &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a
b
a&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bdedupe-hash"&gt;&lt;a class="heading-link" href="#bdedupe-hash"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bdedupe_hash.c"&gt;bdedupe-hash&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;dedupe rows by hash of the first column, keeping the first&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bdedupe-hash&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bdedupe-hash &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a
b&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bdropuntil"&gt;&lt;a class="heading-link" href="#bdropuntil"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bdropuntil.c"&gt;bdropuntil&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;for sorted input, drop rows until the first column is greater than or equal to VALUE&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bdropuntil VALUE [TYPE]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;c&lt;/span&gt;
&lt;span class="pl-s"&gt;d&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bdropuntil c &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
c
d&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bhead"&gt;&lt;a class="heading-link" href="#bhead"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bhead.c"&gt;bhead&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;keep the first n rows&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bhead N&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;c&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; btail 2 &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a
b&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="blz4"&gt;&lt;a class="heading-link" href="#blz4"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/blz4.c"&gt;blz4&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;compress bsv data&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; blz4&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; a,b,c &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; blz4 &lt;span class="pl-k"&gt;|&lt;/span&gt; blz4d &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,b,c&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="blz4d"&gt;&lt;a class="heading-link" href="#blz4d"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/blz4d.c"&gt;blz4d&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;decompress bsv data&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; blz4d&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; a,b,c &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; blz4 &lt;span class="pl-k"&gt;|&lt;/span&gt; blz4d &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,b,c&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bmerge"&gt;&lt;a class="heading-link" href="#bmerge"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bmerge.c"&gt;bmerge&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;merge sorted files from stdin&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: &lt;span class="pl-c1"&gt;echo&lt;/span&gt; FILE1 ... FILEN &lt;span class="pl-k"&gt;|&lt;/span&gt; bmerge [TYPE] [-r&lt;span class="pl-k"&gt;|&lt;/span&gt;--reversed] [-l&lt;span class="pl-k"&gt;|&lt;/span&gt;--lz4]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; -e &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;c&lt;/span&gt;
&lt;span class="pl-s"&gt;e&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; a.bsv
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; -e &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;d&lt;/span&gt;
&lt;span class="pl-s"&gt;f&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;&amp;gt;&lt;/span&gt; b.bsv
&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; a.bsv b.bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bmerge
a
b
c
d
e
f&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bpartition"&gt;&lt;a class="heading-link" href="#bpartition"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bpartition.c"&gt;bpartition&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;split into multiple files by consistent hash of the first column value&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bpartition NUM_BUCKETS [PREFIX] [-l&lt;span class="pl-k"&gt;|&lt;/span&gt;--lz4]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;c&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bpartition 10 prefix
prefix03
prefix06&lt;/pre&gt;&lt;/div&gt;
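&lt;p&gt;the idea behind hash partitioning is that a value's bucket depends only on the value, so identical keys always land in the same file. a minimal sketch, assuming cksum as a stand-in hash (the hash bsv actually uses is not stated here) and a hypothetical helper named bucket_of:&lt;/p&gt;

```shell
# hypothetical helper, not part of bsv: bucket_of VALUE NUM_BUCKETS
bucket_of() {
    # cksum (a CRC) stands in for the hash; this is an assumption, not bsv's hash
    crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    # reduce the hash to a bucket index in the range 0..NUM_BUCKETS-1
    echo $(( crc % $2 ))
}

# the repeated value a is assigned the same bucket both times
for value in a b c a; do
    echo "$value $(bucket_of "$value" 10)"
done
```

&lt;p&gt;because the mapping is deterministic, separate processes can partition different inputs independently and rows with the same key still end up in the same bucket number.&lt;/p&gt;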
&lt;h3 id="bquantile-merge"&gt;&lt;a class="heading-link" href="#bquantile-merge"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bquantile_merge.c"&gt;bquantile-merge&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;merge ddsketches and output quantile value pairs as f64&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bquantile-merge QUANTILES&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; seq 1 100 &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema a:i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bquantile-sketch i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bquantile-merge .2,.5,.7 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema f64:a,f64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
0.2,19.88667024086646
0.5,49.90296094906742
0.7,70.11183939140405&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bquantile-sketch"&gt;&lt;a class="heading-link" href="#bquantile-sketch"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bquantile_sketch.c"&gt;bquantile-sketch&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;collapse the first column into a single row ddsketch&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bquantile-sketch TYPE [-a&lt;span class="pl-k"&gt;|&lt;/span&gt;--alpha] [-b&lt;span class="pl-k"&gt;|&lt;/span&gt;--max-bins] [-m&lt;span class="pl-k"&gt;|&lt;/span&gt;--min-value]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; seq 1 100 &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema a:i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bquantile-sketch i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bquantile-merge .2,.5,.7 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema f64:a,f64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
0.2,19.88667024086646
0.5,49.90296094906742
0.7,70.11183939140405&lt;/pre&gt;&lt;/div&gt;
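&lt;p&gt;ddsketch is an approximate quantile structure with a relative error bound, which is why the values above are close to, but not exactly, 20, 50 and 70. for comparison, a rough exact reference on plain text can be sketched as below. exact_quantile is a hypothetical helper, not part of bsv, and unlike a sketch it has to sort and hold the whole input.&lt;/p&gt;

```shell
# hypothetical helper, not part of bsv: exact_quantile Q
exact_quantile() {
    # sort numerically, then print the floor(Q * count)-th smallest value
    # ranks are 1-based, so a computed rank of 0 is clamped to 1
    sort -n | awk -v q="$1" '{v[NR] = $1} END {i = int(q * NR); if (i == 0) i = 1; print v[i]}'
}

seq 1 100 | exact_quantile 0.5
# 50
```

&lt;p&gt;the sketch trades this exactness for a small fixed-size summary that can be merged across inputs, which is what bquantile-merge consumes.&lt;/p&gt;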
&lt;h3 id="bschema"&gt;&lt;a class="heading-link" href="#bschema"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bschema.c"&gt;bschema&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;validate and convert row data with a schema of columns&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema SCHEMA [--filter]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;  --filter remove bad rows instead of erroring

  example schemas:
    &lt;span class="pl-k"&gt;*&lt;/span&gt;,&lt;span class="pl-k"&gt;*&lt;/span&gt;,&lt;span class="pl-k"&gt;*&lt;/span&gt;             = 3 columns of any size
    8,&lt;span class="pl-k"&gt;*&lt;/span&gt;               = a column with 8 bytes followed by a column of any size
    8,&lt;span class="pl-k"&gt;*&lt;/span&gt;,...           = same as above, but ignore any trailing columns
    a:u16,a:i32,a:f64 = convert ascii to numerics
    u16:a,i32:a,f64:a = convert numerics to ascii
    4&lt;span class="pl-k"&gt;*&lt;/span&gt;,&lt;span class="pl-k"&gt;*&lt;/span&gt;4             = keep the first 4 bytes of column 1 and the last 4 of column 2

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; aa,bbb,cccc &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema 2,3,4 &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
aa,bbb,cccc&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bsort"&gt;&lt;a class="heading-link" href="#bsort"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bsort.c"&gt;bsort&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;timsort rows by the first column&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bsort [-r&lt;span class="pl-k"&gt;|&lt;/span&gt;--reversed] [TYPE]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;3&lt;/span&gt;
&lt;span class="pl-s"&gt;2&lt;/span&gt;
&lt;span class="pl-s"&gt;1&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema a:i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bsort i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
1
2
3&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bsplit"&gt;&lt;a class="heading-link" href="#bsplit"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bsplit.c"&gt;bsplit&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;split a stream into multiple files&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bsplit PREFIX [chunks_per_file&lt;span class="pl-k"&gt;=&lt;/span&gt;1]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; -n a,b,c &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bsplit prefix
prefix_0000000000&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bsum"&gt;&lt;a class="heading-link" href="#bsum"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bsum.c"&gt;bsum&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;sum the first column&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bsum TYPE&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;1&lt;/span&gt;
&lt;span class="pl-s"&gt;2&lt;/span&gt;
&lt;span class="pl-s"&gt;3&lt;/span&gt;
&lt;span class="pl-s"&gt;4&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema a:i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bsum i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
10&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bsumeach"&gt;&lt;a class="heading-link" href="#bsumeach"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bsumeach.c"&gt;bsumeach&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;sum the second column of each contiguous identical row by the first column&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bsumeach TYPE&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a,1&lt;/span&gt;
&lt;span class="pl-s"&gt;a,2&lt;/span&gt;
&lt;span class="pl-s"&gt;b,3&lt;/span&gt;
&lt;span class="pl-s"&gt;b,4&lt;/span&gt;
&lt;span class="pl-s"&gt;b,5&lt;/span&gt;
&lt;span class="pl-s"&gt;a,6&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema &lt;span class="pl-k"&gt;*&lt;/span&gt;,a:i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bsumeach i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema &lt;span class="pl-k"&gt;*&lt;/span&gt;,i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,3
b,12
a,6&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bsumeach-hash"&gt;&lt;a class="heading-link" href="#bsumeach-hash"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bsumeach_hash.c"&gt;bsumeach-hash&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;sum as i64 the second column by hash of the first column&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bsumeach-hash i64&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a,1&lt;/span&gt;
&lt;span class="pl-s"&gt;a,2&lt;/span&gt;
&lt;span class="pl-s"&gt;b,3&lt;/span&gt;
&lt;span class="pl-s"&gt;b,4&lt;/span&gt;
&lt;span class="pl-s"&gt;b,5&lt;/span&gt;
&lt;span class="pl-s"&gt;a,6&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema &lt;span class="pl-k"&gt;*&lt;/span&gt;,a:i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bsumeach-hash i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema &lt;span class="pl-k"&gt;*&lt;/span&gt;,i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,3
b,12
a,6&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bsv-1"&gt;&lt;a class="heading-link" href="#bsv-1"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bsv.c"&gt;bsv&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;convert csv to bsv&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; a,b,c &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bcut 3,2,1 &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
c,b,a&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="btake"&gt;&lt;a class="heading-link" href="#btake"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/btake.c"&gt;btake&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;take while the first column is VALUE&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; btake VALUE&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;c&lt;/span&gt;
&lt;span class="pl-s"&gt;d&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bdropntil c &lt;span class="pl-k"&gt;|&lt;/span&gt; btake c &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
c&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="btakeuntil"&gt;&lt;a class="heading-link" href="#btakeuntil"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/btakeuntil.c"&gt;btakeuntil&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;for sorted input, take rows until the first column is greater than or equal to VALUE&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; btakeuntil VALUE [TYPE]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a&lt;/span&gt;
&lt;span class="pl-s"&gt;b&lt;/span&gt;
&lt;span class="pl-s"&gt;c&lt;/span&gt;
&lt;span class="pl-s"&gt;d&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; btakeuntil c &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a
b&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="btopn"&gt;&lt;a class="heading-link" href="#btopn"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/btopn.c"&gt;btopn&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;accumulate the top n rows in a heap by first column value&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; btopn N [TYPE] [-r&lt;span class="pl-k"&gt;|&lt;/span&gt;--reversed]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;1&lt;/span&gt;
&lt;span class="pl-s"&gt;3&lt;/span&gt;
&lt;span class="pl-s"&gt;2&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema a:i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; btopn 2 i64 &lt;span class="pl-k"&gt;|&lt;/span&gt; bschema i64:a &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
3
2&lt;/pre&gt;&lt;/div&gt;
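&lt;p&gt;on plain text the same result can be had by sorting everything and keeping the head, as in the sketch below. top_n is a hypothetical helper, not part of bsv. the point of accumulating in a heap instead is that only n rows need to be held at a time, while a full sort has to process the entire input before emitting anything.&lt;/p&gt;

```shell
# hypothetical helper, not part of bsv: top_n N
top_n() {
    # sort numerically in descending order, then keep the first N lines
    sort -rn | head -n "$1"
}

printf '1\n3\n2\n' | top_n 2
# 3
# 2
```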
&lt;h3 id="bunzip"&gt;&lt;a class="heading-link" href="#bunzip"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bunzip.c"&gt;bunzip&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;split a multi column input into single column outputs&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; bunzip PREFIX [-l&lt;span class="pl-k"&gt;|&lt;/span&gt;--lz4]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a,b,c&lt;/span&gt;
&lt;span class="pl-s"&gt;1,2,3&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bunzip col &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; col_1 col_3 &lt;span class="pl-k"&gt;|&lt;/span&gt; bzip &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,c
1,3&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="bzip"&gt;&lt;a class="heading-link" href="#bzip"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/bzip.c"&gt;bzip&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;combine single column inputs into a multi column output&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ls column_&lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bzip [COL1,...COLN] [-l&lt;span class="pl-k"&gt;|&lt;/span&gt;--lz4]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;
&lt;span class="pl-s"&gt;a,b,c&lt;/span&gt;
&lt;span class="pl-s"&gt;1,2,3&lt;/span&gt;
&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; bunzip column &lt;span class="pl-k"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ls column_&lt;span class="pl-k"&gt;*&lt;/span&gt; &lt;span class="pl-k"&gt;|&lt;/span&gt; bzip 1,3 &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,c
1,3&lt;/pre&gt;&lt;/div&gt;
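&lt;p&gt;the relationship between &lt;code&gt;bunzip&lt;/code&gt; and &lt;code&gt;bzip&lt;/code&gt; can be sketched with in-memory lists in python. this is a rough model only: the real tools work on bsv files on disk, and the names &lt;code&gt;unzip&lt;/code&gt;, &lt;code&gt;zip_columns&lt;/code&gt;, and &lt;code&gt;select&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
def unzip(rows):
    """Split multi column rows into one list per column."""
    return [list(column) for column in zip(*rows)]


def zip_columns(columns, select=None):
    """Combine single column lists back into rows.

    select is an optional list of 1-based column numbers, like the 1,3 in the example above.
    """
    if select is not None:
        columns = [columns[i - 1] for i in select]
    return [list(row) for row in zip(*columns)]


rows = [["a", "b", "c"], ["1", "2", "3"]]
columns = unzip(rows)
print(zip_columns(columns, select=[1, 3]))  # [['a', 'c'], ['1', '3']]
```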
&lt;h3 id="csv"&gt;&lt;a class="heading-link" href="#csv"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/csv.c"&gt;csv&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;convert bsv to csv&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; csv&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; a,b,c &lt;span class="pl-k"&gt;|&lt;/span&gt; bsv &lt;span class="pl-k"&gt;|&lt;/span&gt; csv
a,b,c&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id="xxh3"&gt;&lt;a class="heading-link" href="#xxh3"/&gt;&lt;a href="https://github.com/nathants/bsv/blob/master/src/xxh3.c"&gt;xxh3&lt;/a&gt;&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/h3&gt;
&lt;p&gt;xxh3_64 hash stdin&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;usage: ... &lt;span class="pl-k"&gt;|&lt;/span&gt; xxh3 [--stream&lt;span class="pl-k"&gt;|&lt;/span&gt;--int]&lt;/pre&gt;&lt;/div&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;  --stream pass stdin through to stdout with &lt;span class="pl-c1"&gt;hash&lt;/span&gt; on stderr

  --int output &lt;span class="pl-c1"&gt;hash&lt;/span&gt; as int not hex

&lt;span class="pl-k"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pl-c1"&gt;echo&lt;/span&gt; abc &lt;span class="pl-k"&gt;|&lt;/span&gt; xxh3
079364cbfdf9f4cb&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/bsv</guid>
    </item>
    <item>
      <title>tiny-snitch</title>
      <link>https://nathants.com/projects/tiny-snitch</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;it should be easy to monitor and control inbound and outbound connections.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;an interactive firewall for inbound and outbound connections.&lt;/p&gt;
&lt;p&gt;the rules are global, but the prompt always shows the pid/path/args of the program requesting a new rule.&lt;/p&gt;
&lt;p&gt;based on the excellent &lt;a href="https://github.com/evilsocket/opensnitch"&gt;opensnitch&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="demo"&gt;&lt;a class="heading-link" href="#demo"&gt;demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/tiny-snitch/raw/master/docs/demo.gif"&gt;&lt;img src="https://github.com/nathants/tiny-snitch/raw/master/docs/demo.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/tiny-snitch/raw/master/docs/prompt.png"&gt;&lt;img src="https://github.com/nathants/tiny-snitch/raw/master/docs/prompt.png" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/tiny-snitch/raw/master/docs/prompt_legend.png"&gt;&lt;img src="https://github.com/nathants/tiny-snitch/raw/master/docs/prompt_legend.png" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/tiny-snitch/raw/master/docs/prompt_help.png"&gt;&lt;img src="https://github.com/nathants/tiny-snitch/raw/master/docs/prompt_help.png" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;a split screen monitoring setup for a second monitor using &lt;a href="https://github.com/nathants/ptop"&gt;ptop&lt;/a&gt;, &lt;a href="https://gist.github.com/nathants/336bc5e501ad174aeeb7986f2b0633e4"&gt;color&lt;/a&gt;, &lt;a href="https://gist.github.com/nathants/741b066af9faa15f3ed50ed6cf677d67"&gt;pys&lt;/a&gt;, and a &lt;a href="https://gist.github.com/nathants/daa1aa0dee88bc6dc8710c82965b4704"&gt;oneliner&lt;/a&gt; to tail tiny-snitch logs into a small and colorful format.&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/tiny-snitch/raw/master/docs/demo.png"&gt;&lt;img src="https://github.com/nathants/tiny-snitch/raw/master/docs/demo.png" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="dependencies"&gt;&lt;a class="heading-link" href="#dependencies"&gt;dependencies&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;there are two components with separate dependencies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;tiny-snitch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://archlinux.org/packages/community/x86_64/go/" rel="nofollow"&gt;go&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.archlinux.org/packages/extra/x86_64/libnetfilter_queue/" rel="nofollow"&gt;libnetfilter_queue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://archlinux.org/packages/extra/x86_64/nftables/" rel="nofollow"&gt;nftables&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tiny-snitch-prompt:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.python.org/" rel="nofollow"&gt;python3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/PyQt5/" rel="nofollow"&gt;pyqt5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="install"&gt;&lt;a class="heading-link" href="#install"&gt;install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;set up nftables with: &lt;code&gt;sudo nft -f nftables.conf&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;build with: &lt;code&gt;make&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;put &lt;code&gt;tiny-snitch/&lt;/code&gt; on your &lt;code&gt;$PATH&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id="usage"&gt;&lt;a class="heading-link" href="#usage"&gt;usage&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;tiny-snitch should be launched with &lt;code&gt;sudo -E&lt;/code&gt;, so the qt5 prompt can use your DISPLAY.&lt;/p&gt;
&lt;p&gt;either run it in a background terminal: &lt;code&gt;sudo -E tiny-snitch&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;or automatically run it with cron: &lt;code&gt;* * * * * sudo -E auto-restart tiny-snitch 2&amp;gt;&amp;amp;1 | rotate-logs /tmp/tiny-snitch.log&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://gist.github.com/nathants/dc5d43c1e57b9bbb3a654491df93e4d6"&gt;auto-restart&lt;/a&gt; and &lt;a href="https://gist.github.com/nathants/72968aaa7d9ab7c008fe32e399426d2c"&gt;rotate-logs&lt;/a&gt; are not required.&lt;/p&gt;
&lt;h2 id="rules"&gt;&lt;a class="heading-link" href="#rules"&gt;rules&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;permanent rules are stored in &lt;code&gt;/etc/tiny-snitch.rules&lt;/code&gt; and &lt;code&gt;/etc/tiny-snitch.adblock&lt;/code&gt;. edit those files and &lt;code&gt;tiny-snitch&lt;/code&gt; will reload.&lt;/p&gt;
&lt;p&gt;some example rules:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;action address port proto&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;allow google.com             443 tcp
deny  *.google-analytics.com *   tcp
allow 172.17.*.*             *   tcp
allow 172.17.*.*             *   udp
&lt;/code&gt;&lt;/pre&gt;
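&lt;p&gt;a minimal python sketch of how rules in this format could be matched against a connection. it is not how tiny-snitch itself is implemented: the shell style wildcard matching via &lt;code&gt;fnmatch&lt;/code&gt; and the first-match-wins ordering are assumptions, and &lt;code&gt;parse_rules&lt;/code&gt; and &lt;code&gt;decide&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
from fnmatch import fnmatchcase

RULES = """
allow google.com             443 tcp
deny  *.google-analytics.com *   tcp
allow 172.17.*.*             *   tcp
allow 172.17.*.*             *   udp
"""


def parse_rules(text):
    """Parse lines of 'action address port proto' into tuples, skipping other lines."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4:
            rules.append(tuple(fields))
    return rules


def decide(rules, address, port, proto):
    """Return the action of the first matching rule, or None when no rule matches."""
    for action, rule_address, rule_port, rule_proto in rules:
        if (fnmatchcase(address, rule_address)
                and rule_port in ("*", str(port))
                and rule_proto == proto):
            return action
    return None  # assumption: an unmatched connection would lead to a prompt


rules = parse_rules(RULES)
print(decide(rules, "google.com", 443, "tcp"))                # allow
print(decide(rules, "www.google-analytics.com", 443, "tcp"))  # deny
print(decide(rules, "example.com", 443, "tcp"))               # None
```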
&lt;p&gt;temporary rules can be added by appending lines to &lt;code&gt;/tmp/tiny-snitch.temp&lt;/code&gt;, which will be loaded and then truncated.&lt;/p&gt;
&lt;p&gt;some example temporary rules:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;duration action address port proto&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1-hour   allow google.com             443 tcp
9-minute deny  *.google-analytics.com *   tcp
24-hour  allow 172.17.*.*             *   tcp
1-minute allow 172.17.*.*             *   udp
&lt;/code&gt;&lt;/pre&gt;
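&lt;p&gt;a small python sketch of parsing one temporary rule line, with the duration as the first field as in the example lines above. only the &lt;code&gt;minute&lt;/code&gt; and &lt;code&gt;hour&lt;/code&gt; units that appear in those examples are handled; whether other units exist is not covered here, and &lt;code&gt;parse_temporary_rule&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
# assumption: only the units shown in the example rules
UNIT_SECONDS = {"minute": 60, "hour": 60 * 60}


def parse_temporary_rule(line):
    """Split one temporary rule line into its fields, with the duration in seconds."""
    duration, action, address, port, proto = line.split()
    count, unit = duration.split("-")
    return {
        "seconds": int(count) * UNIT_SECONDS[unit],
        "action": action,
        "address": address,
        "port": port,
        "proto": proto,
    }


rule = parse_temporary_rule("9-minute deny  *.google-analytics.com *   tcp")
print(rule["seconds"], rule["action"], rule["address"])  # 540 deny *.google-analytics.com
```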

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/tiny-snitch</guid>
    </item>
    <item>
      <title>mighty-snitch</title>
      <link>https://nathants.com/projects/mighty-snitch</link>
      <description>
                
&lt;h2 id="why"&gt;&lt;a class="heading-link" href="#why"&gt;why&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;noticing and preventing network requests should be easy.&lt;/p&gt;
&lt;h2 id="how"&gt;&lt;a class="heading-link" href="#how"&gt;how&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;interactively filter network requests with rules and visual prompts.&lt;/p&gt;
&lt;h2 id="what"&gt;&lt;a class="heading-link" href="#what"&gt;what&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;a &lt;a href="https://www.kernel.org/doc/html/latest/security/lsm.html" rel="nofollow"&gt;linux security module&lt;/a&gt; communicates via &lt;a href="https://man7.org/linux/man-pages/man7/netlink.7.html" rel="nofollow"&gt;netlink&lt;/a&gt; with the userspace &lt;a href="https://github.com/nathants/mighty-snitch/blob/master/snitch/snitch.c"&gt;snitch&lt;/a&gt; on each &lt;a href="https://man7.org/linux/man-pages/man3/sendmsg.3p.html" rel="nofollow"&gt;sendmsg&lt;/a&gt;/&lt;a href="https://man7.org/linux/man-pages/man3/recvmsg.3p.html" rel="nofollow"&gt;recvmsg&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;snitch decides whether to allow or deny the network request.&lt;/p&gt;
&lt;p&gt;rules are checked. if no rule exists, a visual prompt is displayed to the user.&lt;/p&gt;
&lt;p&gt;finally snitch responds to the kernel and the request is allowed or denied.&lt;/p&gt;
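&lt;p&gt;the decision flow above can be sketched in a few lines of python. this is an illustration only, since snitch itself is written in c: the request tuple, the &lt;code&gt;rules&lt;/code&gt; dict, and the &lt;code&gt;prompt&lt;/code&gt; callable are all hypothetical stand-ins.&lt;/p&gt;

```python
def decide(request, rules, prompt):
    """Return "allow" or "deny" for a request.

    rules maps a request to an action; prompt asks the user when no rule exists.
    """
    action = rules.get(request)
    if action is None:
        action = prompt(request)  # no rule exists, so fall back to the prompt
    return action


rules = {("/usr/bin/curl", "example.com", 443, "tcp"): "allow"}


def deny_everything(request):
    """Stand-in for the visual prompt: always answers deny."""
    return "deny"


print(decide(("/usr/bin/curl", "example.com", 443, "tcp"), rules, deny_everything))  # allow
print(decide(("/usr/bin/curl", "example.org", 443, "tcp"), rules, deny_everything))  # deny
```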
&lt;h2 id="demo"&gt;&lt;a class="heading-link" href="#demo"&gt;demo&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/mighty-snitch/raw/master/demo.gif"&gt;&lt;img src="https://github.com/nathants/mighty-snitch/raw/master/demo.gif" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a target="_blank" rel="noopener noreferrer" href="https://github.com/nathants/mighty-snitch/raw/master/mobile.jpg"&gt;&lt;img src="https://github.com/nathants/mighty-snitch/raw/master/mobile.jpg" alt="" style="max-width: 100%;"/&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="hardware"&gt;&lt;a class="heading-link" href="#hardware"&gt;hardware&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;the primary test environments are &lt;a href="https://alpinelinux.org/" rel="nofollow"&gt;alpine&lt;/a&gt;, &lt;a href="https://archlinux.org/" rel="nofollow"&gt;arch&lt;/a&gt;, and &lt;a href="https://postmarketos.org/" rel="nofollow"&gt;postmarketos&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;the primary test devices are &lt;a href="https://www.lenovo.com/us/en/c/laptops/thinkpad/thinkpadx1" rel="nofollow"&gt;thinkpad x1&lt;/a&gt;, &lt;a href="https://rog.asus.com/motherboards/rog-strix/rog-strix-x670e-i-gaming-wifi-model/" rel="nofollow"&gt;rog x670e-i&lt;/a&gt;, and &lt;a href="https://www.oneplus.com/6t" rel="nofollow"&gt;oneplus 6t&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="prior-art"&gt;&lt;a class="heading-link" href="#prior-art"&gt;prior art&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://www.obdev.at/products/littlesnitch/index.html" rel="nofollow"&gt;little-snitch&lt;/a&gt; which introduced me to this concept.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/evilsocket/opensnitch"&gt;open-snitch&lt;/a&gt; which introduced me to &lt;a href="https://www.netfilter.org/projects/libnetfilter_queue/" rel="nofollow"&gt;nfq&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/nathants/tiny-snitch"&gt;tiny-snitch&lt;/a&gt; which helped me understand what is possible with &lt;a href="https://www.netfilter.org/projects/libnetfilter_queue/" rel="nofollow"&gt;nfq&lt;/a&gt; and &lt;a href="https://github.com/iovisor/bpftrace"&gt;bpftrace&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/argussecurity/ulsm"&gt;uslm&lt;/a&gt; which helped me understand what is possible with &lt;a href="https://www.kernel.org/doc/html/latest/security/lsm.html" rel="nofollow"&gt;lsm&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="design"&gt;&lt;a class="heading-link" href="#design"&gt;design&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;mighty-snitch uses &lt;a href="https://www.kernel.org/doc/html/latest/security/lsm.html" rel="nofollow"&gt;lsm&lt;/a&gt; instead of &lt;a href="https://www.netfilter.org/projects/libnetfilter_queue/" rel="nofollow"&gt;nfq&lt;/a&gt; to filter network requests.&lt;/p&gt;
&lt;p&gt;the primary advantage is that it has direct access to the pid, executable, and commandline of the process making the request.&lt;/p&gt;
&lt;p&gt;the primary disadvantage is that it requires a custom kernel.&lt;/p&gt;
&lt;p&gt;the visual prompt is a terminal &lt;a href="https://github.com/nathants/mighty-snitch/blob/master/snitch-prompt/snitch-prompt"&gt;application&lt;/a&gt; which responds to keyboard input. a new terminal is launched for each prompt and exits after y/n are pressed. &lt;a href="https://st.suckless.org/" rel="nofollow"&gt;st&lt;/a&gt; is used on x86_64 and &lt;a href="https://codeberg.org/dnkl/foot" rel="nofollow"&gt;foot&lt;/a&gt; is used on arm64, though any terminal should work.&lt;/p&gt;
&lt;p&gt;the system fails closed. when snitch isn't running, network requests are not possible.&lt;/p&gt;
&lt;p&gt;dns packets received on udp 53 are read via &lt;a href="https://www.netfilter.org/projects/libnetfilter_queue/" rel="nofollow"&gt;nfq&lt;/a&gt; so that rules can specify domains in addition to ipv4 addresses.&lt;/p&gt;
&lt;h2 id="constraints"&gt;&lt;a class="heading-link" href="#constraints"&gt;constraints&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;the following are simplifying constraints. other configurations should be possible.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;ipv6 is disabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;io_uring is disabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;nftables rules are replaced when snitch starts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;iptables rules should be empty.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;all other lsm are disabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;kernel commandline parameters for lsm are ignored.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="rules"&gt;&lt;a class="heading-link" href="#rules"&gt;rules&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;snitch creates a rules file: &lt;code&gt;~/.snitch.rules&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;when this file is edited, snitch reloads the rules.&lt;/p&gt;
&lt;p&gt;typically rules are created by choosing the &lt;code&gt;forever&lt;/code&gt; duration in the visual prompt, but can also be directly added to the rules file.&lt;/p&gt;
&lt;p&gt;address can be a wildcard up to three subdomains.&lt;/p&gt;
&lt;p&gt;commandline can be a wildcard.&lt;/p&gt;
&lt;p&gt;here are the rules for firefox to deny all the unprompted connections it makes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;send  deny  /usr/lib/firefox/firefox  content-signature-2.cdn.mozilla.net    443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  content-signature-2.cdn.mozilla.net    80   tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  contile.services.mozilla.com           443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  firefox.settings.services.mozilla.com  443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  firefox.settings.services.mozilla.com  443  udp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  getpocket.cdn.mozilla.net              443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  location.services.mozilla.com          443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  mozilla.cloudflare-dns.com             443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  normandy.cdn.mozilla.net               443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  push.services.mozilla.com              443  tcp  /usr/lib/firefox/firefox
send  deny  /usr/lib/firefox/firefox  shavar.services.mozilla.com            443  tcp  /usr/lib/firefox/firefox
&lt;/code&gt;&lt;/pre&gt;
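&lt;p&gt;a python sketch of reading one line of this rules file. the field names are guesses from the example above (direction, action, executable, address, port, proto, commandline), not taken from the snitch source. &lt;code&gt;fnmatch&lt;/code&gt; stands in for the wildcard matching, and the three subdomain limit is not enforced here.&lt;/p&gt;

```python
from collections import namedtuple
from fnmatch import fnmatchcase

# assumption: these seven field names are inferred from the example rules
Rule = namedtuple("Rule", "direction action exe address port proto cmdline")


def parse_rule(line):
    """Split a rule line into seven fields; the commandline keeps any inner spaces."""
    return Rule(*line.strip().split(None, 6))


def matches(rule, exe, address, port, proto, cmdline):
    """Check a request against one rule, with wildcards in address and commandline."""
    return (rule.exe == exe
            and fnmatchcase(address, rule.address)
            and rule.port == str(port)
            and rule.proto == proto
            and fnmatchcase(cmdline, rule.cmdline))


firefox = "/usr/lib/firefox/firefox"
rule = parse_rule("send  deny  " + firefox + "  push.services.mozilla.com  443  tcp  " + firefox)
print(rule.action)                                                            # deny
print(matches(rule, firefox, "push.services.mozilla.com", 443, "tcp", firefox))  # True
```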
&lt;h2 id="install-alpine"&gt;&lt;a class="heading-link" href="#install-alpine"&gt;install alpine&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;copy latest wget urls from: &lt;a href="https://github.com/nathants/mighty-snitch/releases"&gt;https://github.com/nathants/mighty-snitch/releases&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
wget linux-edge-&lt;span class="pl-k"&gt;*&lt;/span&gt;.apk
wget linux-edge-dev-&lt;span class="pl-k"&gt;*&lt;/span&gt;.apk
wget me@nathants.com-&lt;span class="pl-k"&gt;*&lt;/span&gt;.rsa.pub
sudo mv &lt;span class="pl-k"&gt;*&lt;/span&gt;.pub /etc/apk/keys/
sudo apk add &lt;span class="pl-k"&gt;*&lt;/span&gt;.apk
sudo reboot

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;
git clone https://github.com/nathants/mighty-snitch

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch-prompt
sudo pip install &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch
bash snitch.sh&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="install-postmarketos"&gt;&lt;a class="heading-link" href="#install-postmarketos"&gt;install postmarketos&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;copy latest wget urls from: &lt;a href="https://github.com/nathants/mighty-snitch/releases"&gt;https://github.com/nathants/mighty-snitch/releases&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
wget linux-postmarketos-qcom-sdm845-&lt;span class="pl-k"&gt;*&lt;/span&gt;.apk
wget pmos@local-&lt;span class="pl-k"&gt;*&lt;/span&gt;.rsa.pub
sudo mv &lt;span class="pl-k"&gt;*&lt;/span&gt;.pub /etc/apk/keys/
sudo apk add &lt;span class="pl-k"&gt;*&lt;/span&gt;.apk
sudo reboot

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;
git clone https://github.com/nathants/mighty-snitch

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch-prompt
sudo pip install &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch
bash snitch.sh&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="install-arch"&gt;&lt;a class="heading-link" href="#install-arch"&gt;install arch&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;copy latest wget urls from: &lt;a href="https://github.com/nathants/mighty-snitch/releases"&gt;https://github.com/nathants/mighty-snitch/releases&lt;/a&gt;&lt;/p&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; /tmp
wget linux-&lt;span class="pl-k"&gt;*&lt;/span&gt;.zst
wget linux-docs-&lt;span class="pl-k"&gt;*&lt;/span&gt;.zst
wget linux-headers-&lt;span class="pl-k"&gt;*&lt;/span&gt;.zst
sudo pacman -U &lt;span class="pl-k"&gt;*&lt;/span&gt;.zst
sudo reboot

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;
git clone https://github.com/nathants/mighty-snitch

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch-prompt
sudo pip install &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch
bash snitch.sh&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="build-alpine-on-aws-and-install"&gt;&lt;a class="heading-link" href="#build-alpine-on-aws-and-install"&gt;build alpine on aws and install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sudo apk add go
go install github.com/nathants/libaws@latest
&lt;span class="pl-k"&gt;export&lt;/span&gt; PATH=&lt;span class="pl-smi"&gt;$PATH&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin

&lt;span class="pl-k"&gt;export&lt;/span&gt; MIGHTY_SNITCH_S3_BUCKET=&lt;span class="pl-smi"&gt;$NAME&lt;/span&gt;
&lt;span class="pl-k"&gt;export&lt;/span&gt; MIGHTY_SNITCH_AWS_ACCOUNT=&lt;span class="pl-smi"&gt;$ACCOUNT_NUMBER&lt;/span&gt;
&lt;span class="pl-k"&gt;export&lt;/span&gt; MIGHTY_SNITCH_PUBKEY_CONTENT=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-k"&gt;~&lt;/span&gt;/.ssh/id_ed25519.pub&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;
git clone https://github.com/nathants/mighty-snitch

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/kernel/alpine
bash build.sh
sudo mv /tmp/abuild/&lt;span class="pl-k"&gt;*&lt;/span&gt;.pub /etc/apk/keys/
sudo apk add /tmp/packages/&lt;span class="pl-k"&gt;*&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;/&lt;span class="pl-k"&gt;*&lt;/span&gt;.apk
sudo reboot

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch-prompt
sudo pip install &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch
bash snitch.sh&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="build-alpine-on-aws-and-install-1"&gt;&lt;a class="heading-link" href="#build-alpine-on-aws-and-install-1"&gt;build alpine on aws and install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;sudo apk add go
go install github.com/nathants/libaws@latest
&lt;span class="pl-k"&gt;export&lt;/span&gt; PATH=&lt;span class="pl-smi"&gt;$PATH&lt;/span&gt;:&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;go env GOPATH&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;/bin

&lt;span class="pl-k"&gt;export&lt;/span&gt; MIGHTY_SNITCH_S3_BUCKET=&lt;span class="pl-smi"&gt;$NAME&lt;/span&gt;
&lt;span class="pl-k"&gt;export&lt;/span&gt; MIGHTY_SNITCH_AWS_ACCOUNT=&lt;span class="pl-smi"&gt;$ACCOUNT_NUMBER&lt;/span&gt;
&lt;span class="pl-k"&gt;export&lt;/span&gt; MIGHTY_SNITCH_PUBKEY_CONTENT=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;$(&lt;/span&gt;cat &lt;span class="pl-k"&gt;~&lt;/span&gt;/.ssh/id_ed25519.pub&lt;span class="pl-pds"&gt;)&lt;/span&gt;&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;
git clone https://github.com/nathants/mighty-snitch

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/kernel/alpine-sdm845
bash build.sh
sudo mv /tmp/&lt;span class="pl-k"&gt;*&lt;/span&gt;.pub /etc/apk/keys/
sudo apk add /tmp/&lt;span class="pl-k"&gt;*&lt;/span&gt;.apk
sudo reboot

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch-prompt
sudo pip install &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch
bash snitch.sh&lt;/pre&gt;&lt;/div&gt;
&lt;h2 id="build-arch-and-install"&gt;&lt;a class="heading-link" href="#build-arch-and-install"&gt;build arch and install&lt;span aria-hidden="true" class="octicon octicon-link"/&gt;&lt;/a&gt;&lt;/h2&gt;
&lt;div class="highlight highlight-source-shell"&gt;&lt;pre&gt;&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;
git clone https://github.com/nathants/mighty-snitch

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/kernel/arch
makepkg -sCf
sudo pacman -U &lt;span class="pl-k"&gt;*&lt;/span&gt;.zst

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch-prompt
sudo pip install &lt;span class="pl-c1"&gt;.&lt;/span&gt;

&lt;span class="pl-c1"&gt;cd&lt;/span&gt; &lt;span class="pl-k"&gt;~&lt;/span&gt;/mighty-snitch/snitch
bash snitch.sh&lt;/pre&gt;&lt;/div&gt;

                      
            </description>
      <guid isPermaLink="false">https://nathants.com/projects/mighty-snitch</guid>
    </item>
  </channel>
</rss>

