Perform self joins¶
If you’re a relational data guru, you may have wondered how it’s possible to join tables with themselves, because joins clauses involve column references back to the original table.
Consider the SQL
SELECT t1.key, sum(t1.value - t2.value) AS metric
FROM my_table t1
JOIN my_table t2
ON t1.key = t2.subkey
GROUP BY 1
Here, we have an unambiguous way to refer to each of the tables through aliasing.
Let’s consider the TPC-H database, and support we want to compute year-over-year change in total order amounts by region using joins.
>>> region = con.table('tpch_region')
>>> nation = con.table('tpch_nation')
>>> customer = con.table('tpch_customer')
>>> orders = con.table('tpch_orders')
>>> orders.limit(5)
o_orderkey o_custkey o_orderstatus o_totalprice o_orderdate \
0 1 36901 O 173665.47 1996-01-02
1 2 78002 O 46929.18 1996-12-01
2 3 123314 F 193846.25 1993-10-14
3 4 136777 O 32151.78 1995-10-11
4 5 44485 F 144659.20 1994-07-30
o_orderpriority o_clerk o_shippriority \
0 5-LOW Clerk#000000951 0
1 1-URGENT Clerk#000000880 0
2 5-LOW Clerk#000000955 0
3 5-LOW Clerk#000000124 0
4 5-LOW Clerk#000000925 0
o_comment
0 nstructions sleep furiously among
1 foxes. pending accounts at the pending, silen...
2 sly final accounts boost. carefully regular id...
3 sits. slyly regular warthogs cajole. regular, ...
4 quickly. bold deposits sleep slyly. packages u...
First, let’s join all the things and select the fields we care about:
>>> fields_of_interest = [region.r_name.name('region'),
... nation.n_name.name('nation'),
... orders.o_totalprice.name('amount'),
... orders.o_orderdate.cast('timestamp').name('odate') # these are strings
... ]
>>> joined_all = (region.join(nation, region.r_regionkey == nation.n_regionkey)
... .join(customer, customer.c_nationkey == nation.n_nationkey)
... .join(orders, orders.o_custkey == customer.c_custkey)
... [fields_of_interest])
Okay, great, let’s have a look:
>>> joined_all.limit(5)
region nation amount odate
0 AMERICA UNITED STATES 160843.35 1992-06-22
1 MIDDLE EAST IRAN 78307.91 1996-04-19
2 EUROPE FRANCE 103237.90 1994-10-12
3 EUROPE FRANCE 201463.59 1997-09-12
4 ASIA JAPAN 166098.86 1995-09-12
Sweet, now let’s aggregate by year and region:
>>> year = joined_all.odate.year().name('year')
>>> total = joined_all.amount.sum().cast('float').name('total')
>>> annual_amounts = (joined_all
... .group_by(['region', year])
... .aggregate(total))
>>> annual_amounts.limit(5)
region year total
0 EUROPE 1994 6.979473e+09
1 EUROPE 1996 7.015421e+09
2 ASIA 1997 6.910663e+09
3 ASIA 1998 4.058824e+09
4 EUROPE 1992 6.926705e+09
Looking good so far. Now, we need to join this table on itself, by subtracting 1 from one of the year columns.
We do this by creating a “joinable” view of a table that is considered a
distinct object within Ibis. To do this, use the view
function:
>>> current = annual_amounts
>>> prior = annual_amounts.view()
>>> yoy_change = (current.total - prior.total).name('yoy_change')
>>> results = (current.join(prior, ((current.region == prior.region) &
... (current.year == (prior.year - 1))))
... [current.region, current.year, yoy_change])
>>> df = results.to_pandas()
>>> df['yoy_pretty'] = df.yoy_change.map(lambda x: '$%.2fmm' % (x / 1000000.))
If you’re being fastidious and want to consider the first year occurring in the dataset for each region to have 0 for the prior year, you will instead need to do an outer join and treat nulls in the prior side of the join as zero:
>>> yoy_change = (current.total - prior.total.zeroifnull()).name('yoy_change')
>>> results = (current.outer_join(prior, ((current.region == prior.region) &
... (current.year == (prior.year - 1))))
... [current.region, current.year, current.total,
... prior.total.zeroifnull().name('prior_total'),
... yoy_change])
>>> results.limit(5)
region year total prior_total yoy_change
0 ASIA 1998 4.058824e+09 0.000000e+00 4.058824e+09
1 AFRICA 1994 6.837587e+09 6.908429e+09 -7.084172e+07
2 AMERICA 1996 6.883057e+09 6.922465e+09 -3.940791e+07
3 AFRICA 1996 6.878112e+09 6.848983e+09 2.912979e+07
4 AFRICA 1992 6.873319e+09 6.859733e+09 1.358699e+07