Analytics Tools

Setup

In [1]:

!curl -LsS -o geography.db 'https://storage.googleapis.com/ibis-tutorial-data/geography.db'

In [2]:

import os
import tempfile

import ibis

ibis.options.interactive = True

connection = ibis.sqlite.connect(
    'geography.db'
)

Frequency tables

Ibis provides the value_counts API, just like pandas, for computing a frequency table for a table column or array expression. You may have already seen it used earlier in the tutorial.

In [3]:

countries = connection.table('countries')
countries.continent.value_counts()

Out[3]:

┏━━━━━━━━━━━┳━━━━━━━┓
┃ continent ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━┩
│ string    │ int64 │
├───────────┼───────┤
│ AF        │    58 │
│ AN        │     5 │
│ AS        │    51 │
│ EU        │    54 │
│ NA        │    42 │
│ OC        │    28 │
│ SA        │    14 │
└───────────┴───────┘
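
To make the idea concrete, here is a minimal plain-Python sketch of what a frequency table computes: one count per distinct value. It is only an illustration of the concept, not how Ibis evaluates value_counts, and the sample continent codes are made up rather than taken from geography.db.

```python
from collections import Counter

# Hypothetical sample of continent codes, standing in for countries.continent.
continents = ["EU", "AS", "EU", "AF", "AS", "EU"]

# Count how often each distinct value occurs, like GROUP BY with COUNT(*).
frequency = Counter(continents)

# Print the table sorted by continent code.
for continent, count in sorted(frequency.items()):
    print(continent, count)
```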

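The same table can be built by hand with group_by and aggregate, which lets you customize it, for example by adding the total population per continent: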

In [4]:

freq = countries.group_by(countries.continent).aggregate(
    [
        countries.count().name('# countries'),
        countries.population.sum().name('total population'),
    ]
)
freq

Out[4]:

┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ continent ┃ # countries ┃ total population ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ string    │ int64       │ int64            │
├───────────┼─────────────┼──────────────────┤
│ AF        │          58 │       1021238685 │
│ AN        │           5 │              170 │
│ AS        │          51 │       4130584841 │
│ EU        │          54 │        750724554 │
│ NA        │          42 │        540204371 │
│ OC        │          28 │         36067549 │
│ SA        │          14 │        400143568 │
└───────────┴─────────────┴──────────────────┘
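
As a rough illustration of what the grouped aggregation above computes, the following plain-Python sketch keeps a row count and a running population sum per group. The rows are invented sample data, and the loop only mirrors the semantics; Ibis compiles the expression for the backend instead of looping in Python.

```python
from collections import defaultdict

# Hypothetical (continent, population) rows standing in for the countries table.
rows = [("EU", 10), ("EU", 20), ("AS", 100), ("AF", 5), ("AS", 50)]

# One accumulator per group, holding both aggregates.
totals = defaultdict(lambda: {"# countries": 0, "total population": 0})
for continent, population in rows:
    totals[continent]["# countries"] += 1
    totals[continent]["total population"] += population

for continent in sorted(totals):
    print(continent, totals[continent])
```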

Binning and histograms

Numeric array expressions (columns with numeric type and other array expressions) have bucket and histogram methods which produce different kinds of binning. These produce category values (the computed bins) that can be used in grouping and other analytics.

Some backends implement the .summary() method, which can be used to see the general distribution of a column.

Let's have a look at a few examples. Suppose we want to split the countries into population buckets of our choosing:

In [5]:

buckets = [0, 1e6, 1e7, 1e8, 1e9]

The bucket function creates a bucketed category from the populations:

In [6]:

bucketed = countries.population.bucket(buckets).name('bucket')

Let's have a look at the value counts:

In [7]:

bucketed.value_counts()

Out[7]:

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ bucket               ┃ count ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ category             │ int64 │
├──────────────────────┼───────┤
│ ∅                    │     2 │
│ 0.0                  │    93 │
│ 1.0                  │    76 │
│ 2.0                  │    72 │
│ 3.0                  │     9 │
└──────────────────────┴───────┘

The five edges we wrote down define 4 buckets numbered 0 through 3. The ∅ row is a NULL: values that fall outside every bucket get no bucket number, so don't worry too much about that. Since the bucketing ends at 1e9 (one billion), the 2 NULL values are the countries with a population over one billion. These can be included in the bucketing with include_over:

In [8]:

bucketed = countries.population.bucket(buckets, include_over=True).name(
    'bucket'
)
bucketed.value_counts()

Out[8]:

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ bucket               ┃ count ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ category             │ int64 │
├──────────────────────┼───────┤
│ 0                    │    93 │
│ 1                    │    76 │
│ 2                    │    72 │
│ 3                    │     9 │
│ 4                    │     2 │
└──────────────────────┴───────┘
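
The mapping from a value to a bucket number can be sketched in plain Python. This is a hedged illustration, not the Ibis implementation: the helper bucket_index is hypothetical, the populations are made up, and it assumes left-closed intervals with the final edge counted in the last bucket, so check the bucket documentation for the exact options.

```python
from bisect import bisect_right

def bucket_index(value, edges, include_over=False):
    """Return the bucket number for value, or None if it falls in no bucket.

    Hypothetical helper; assumes intervals are closed on the left and that
    a value equal to the last edge belongs to the last bucket.
    """
    if value is None:
        return None
    n_buckets = len(edges) - 1
    if value < edges[0]:
        return None  # below the first edge: no bucket in this sketch
    if value > edges[-1]:
        # Values past the last edge get an extra bucket only on request.
        return n_buckets if include_over else None
    if value == edges[-1]:
        return n_buckets - 1
    # Index i such that edges[i] <= value < edges[i + 1].
    return bisect_right(edges, value) - 1

edges = [0, 1e6, 1e7, 1e8, 1e9]
populations = [170, 5_000_000, 83_000_000, 1_400_000_000]  # made-up values

without_over = [bucket_index(p, edges) for p in populations]
with_over = [bucket_index(p, edges, include_over=True) for p in populations]
print(without_over)  # [0, 1, 2, None]
print(with_over)     # [0, 1, 2, 4]
```

Five edges give four regular buckets (0 through 3); with include_over the overflow values land in a fifth bucket numbered 4 instead of being NULL.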

The bucketed object here has a special category type:

In [9]:

bucketed.type()

Out[9]:

Category(cardinality=5)

Category values can have either a known or an unknown cardinality. In this case, there are either 4 or 5 buckets, depending on whether include_over was passed to the bucket function.

Labels can be assigned to the buckets at any time using the label function:

In [10]:

bucket_counts = bucketed.value_counts()

labeled_bucket = bucket_counts.bucket.label(
    ['< 1M', '> 1M', '> 10M', '> 100M', '> 1B']
).name('bucket_name')

expr = bucket_counts[labeled_bucket, bucket_counts].order_by('bucket')
expr

Out[10]:

┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ bucket_name ┃ bucket               ┃ count ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ string      │ category             │ int64 │
├─────────────┼──────────────────────┼───────┤
│ < 1M        │ 0                    │    93 │
│ > 1M        │ 1                    │    76 │
│ > 10M       │ 2                    │    72 │
│ > 100M      │ 3                    │     9 │
│ > 1B        │ 4                    │     2 │
└─────────────┴──────────────────────┴───────┘
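
Conceptually, labeling is a positional lookup: bucket number i maps to the i-th label. The sketch below shows that idea in plain Python. The helper label_for is hypothetical and is not part of Ibis; it passes missing buckets through as None.

```python
# Same labels as in the cell above, one per bucket number 0 through 4.
labels = ['< 1M', '> 1M', '> 10M', '> 100M', '> 1B']

def label_for(bucket, labels):
    """Map a bucket number to its label; a missing bucket stays missing."""
    return None if bucket is None else labels[bucket]

bucket_numbers = [0, 3, None, 4]  # made-up bucket numbers
named = [label_for(b, labels) for b in bucket_numbers]
print(named)  # ['< 1M', '> 100M', None, '> 1B']
```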

Nice, huh?

Some backends implement histogram(num_bins), a linear equivalent of bucket that splits the value range into num_bins bins of fixed size.
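
Fixed-size binning can be sketched in plain Python as follows. This shows the general technique only: the helper fixed_width_bins is hypothetical, the values are made up, and placing the maximum in the last bin is an assumption of this sketch rather than a statement about any particular backend.

```python
import math

def fixed_width_bins(values, num_bins):
    """Assign each value a bin number in 0..num_bins-1 using equal-width bins.

    Hypothetical helper; the maximum is clamped into the last bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # degenerate range: everything in bin 0
    width = (hi - lo) / num_bins
    return [
        min(int(math.floor((v - lo) / width)), num_bins - 1) for v in values
    ]

values = [0, 10, 25, 50, 75, 100]  # made-up values
bins = fixed_width_bins(values, 4)  # range 0..100 split into bins of width 25
print(bins)  # [0, 0, 1, 2, 3, 3]
```

Unlike bucket, where you choose the edges yourself, the edges here follow from the minimum, the maximum, and the number of bins.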

Last update: January 5, 2023