
A Shallow Dive into Pandas' BlockManager

Thomas J. Fan
scikit-learn core developer
@thomasjpfan
This talk on GitHub: thomasjpfan/scipy-2020-lightning-talk-pandas-blockmanager

Pandas users?

Press Y for Yes

Press N for No

Let's create a DataFrame

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'int_1': np.arange(1000000, dtype=int),
    'float_1': np.arange(1000000, dtype=float),
    'int_2': np.arange(10, 1000010, dtype=int),
    'bool_1': np.ones(1000000, dtype=bool),
    'int_3': np.arange(20, 1000020, dtype=int),
    'float_2': np.arange(10, 1000010, dtype=float),
})
df.head(2)
   int_1  float_1  int_2  bool_1  int_3  float_2
0      0        0     10    True     20       10
1      1        1     11    True     21       11

How is this stored?

   int_1  float_1  int_2  bool_1  int_3  float_2
0      0        0     10    True     20       10
1      1        1     11    True     21       11

Pandas API For BlockManager

>>> df._data
BlockManager
Items: Index(['int_1', 'float_1', 'int_2', 'bool_1', 'int_3', 'float_2'], dtype='object')
Axis 1: RangeIndex(start=0, stop=1000000, step=1)
FloatBlock: slice(1, 9, 4), 2 x 1000000, dtype: float64
IntBlock: slice(0, 6, 2), 3 x 1000000, dtype: int64
BoolBlock: slice(3, 4, 1), 1 x 1000000, dtype: bool
>>> df._data.nblocks
3
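
`_data` is a private attribute, and newer pandas releases expose the same object as `_mgr`. The sketch below is an illustration rather than part of the talk: it looks up whichever attribute exists and summarizes each block. The helper names `get_manager` and `block_summary` are hypothetical, and the internals they touch (`blocks`, `mgr_locs`) are not a stable API and may differ between versions.

```python
import numpy as np
import pandas as pd


def get_manager(df):
    """Return the internal BlockManager (private API, may change between versions)."""
    # Newer pandas names the attribute `_mgr`; the talk uses the older `_data`.
    mgr = getattr(df, "_mgr", None)
    return mgr if mgr is not None else df._data


def block_summary(df):
    """List (dtype, number of columns) for each block, sorted by dtype name."""
    mgr = get_manager(df)
    return sorted((str(blk.dtype), len(blk.mgr_locs)) for blk in mgr.blocks)


n = 1_000  # smaller than the talk's 1,000,000 rows to keep the example quick
small = pd.DataFrame({
    "int_1": np.arange(n, dtype="int64"),
    "float_1": np.arange(n, dtype="float64"),
    "int_2": np.arange(10, n + 10, dtype="int64"),
    "bool_1": np.ones(n, dtype=bool),
    "int_3": np.arange(20, n + 20, dtype="int64"),
    "float_2": np.arange(10, n + 10, dtype="float64"),
})

summary = block_summary(small)
for dtype, n_columns in summary:
    print(f"{dtype}: {n_columns} column(s)")
```

Columns that share a dtype are expected to be grouped into the same block, which is what the `slice(...)` entries in the repr above describe.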

Query the dataframe

A view!

>>> df['int_1'].values
array([ 0, 1, 2, ..., 999997, 999998, 999999])

Underlying numpy array

>>> df['int_1'].values.base
array([[ 0, 1, 2, ..., 999997, 999998, 999999],
[ 10, 11, 12, ..., 1000007, 1000008, 1000009],
[ 20, 21, 22, ..., 1000017, 1000018, 1000019]])
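
What "a view" and `.base` mean can be shown with plain NumPy, without touching pandas internals. In this minimal sketch the 2D array `block` is only a stand-in for the integer block; the variable names are made up for illustration.

```python
import numpy as np

# Stand-in for the integer block: three "columns" stored as rows of one 2D array.
block = np.vstack([
    np.arange(0, 5),
    np.arange(10, 15),
    np.arange(20, 25),
])

column = block[0]  # basic indexing returns a view, not a copy

print(column.base is block)             # the view's base is the 2D array
print(np.shares_memory(column, block))  # both names refer to the same buffer

# Writing through the view changes the block as well.
column[0] = -1
print(block[0, 0])
```

This is why `df['int_1'].values.base` above shows all three integer columns: the single column is a row of the block's 2D array.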

Add a new column

Recall

>>> df._data.nblocks
3

Add new column

df['int_4'] = df['int_1'].values

How many blocks now?

>>> df._data.nblocks
4
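
The same effect can be reproduced on a small frame. This is a sketch under the assumption that the block count is reachable through the private `_mgr` attribute (or `_data` in older releases); the helper `nblocks` is hypothetical and not a pandas function.

```python
import numpy as np
import pandas as pd


def nblocks(df):
    """Number of internal blocks (private API: `_mgr` in newer pandas, `_data` in older)."""
    mgr = getattr(df, "_mgr", None)
    if mgr is None:
        mgr = df._data
    return mgr.nblocks


frame = pd.DataFrame({
    "a": np.arange(5, dtype="int64"),
    "b": np.arange(5, dtype="float64"),
})
before = nblocks(frame)

# Assigning a new column appends a block instead of growing the existing int block.
frame["c"] = np.arange(5, dtype="int64")
after = nblocks(frame)

print(before, after)
```

Appending a separate block is cheap because the existing 2D integer array does not have to be reallocated to make room for one more row.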

Add another column for fun

>>> %time df['int_5'] = df['int_1'].values

How long does it take?

  1. 1.3 seconds
  2. 6.1 milliseconds

Press 1 for 1.3 seconds

Press 2 for 6.1 milliseconds

Wall time: 6.18 ms
>>> df._data.nblocks
5

Add 94 more columns

%%time
for i in range(94):
    df[f'int_{i + 6}'] = df['int_1'].values
# Wall time: 560 ms
>>> df._data.nblocks
100
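
One common way to avoid many single-column assignments is to build the new columns first and attach them in one step. The talk does not cover this; the sketch below is only an illustration on a smaller frame, and the names `base`, `new_columns`, and `wide` are made up.

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({"int_1": np.arange(1_000, dtype="int64")})

# Collect the new columns first instead of assigning them one at a time.
new_columns = pd.DataFrame(
    {f"int_{i + 6}": base["int_1"].to_numpy() for i in range(94)},
    index=base.index,
)

# A single concat along the column axis adds all 94 columns in one step.
wide = pd.concat([base, new_columns], axis=1)
print(wide.shape)
```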

Add one more int column

%time df['int_100'] = df['int_1'].values

How long does it take?

A. 1.3 seconds
B. 6.1 milliseconds

Press A for 1.3 seconds

Press B for 6.1 milliseconds

Wall time: 1.3 s

What happened? (Consolidation!)

>>> df._data.nblocks
3
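
Consolidation merges blocks of the same dtype into one 2D array, which means copying every column involved; that copy is the likely source of the 1.3 s above. Below is a minimal sketch, assuming that a deep `DataFrame.copy()` consolidates as a side effect. That is an implementation detail, and the exact trigger for automatic consolidation depends on the pandas version. The helper `nblocks` is hypothetical.

```python
import numpy as np
import pandas as pd


def nblocks(df):
    """Number of internal blocks (private API: `_mgr` in newer pandas, `_data` in older)."""
    mgr = getattr(df, "_mgr", None)
    if mgr is None:
        mgr = df._data
    return mgr.nblocks


frame = pd.DataFrame({"int_0": np.arange(1_000, dtype="int64")})
for i in range(1, 11):
    frame[f"int_{i}"] = np.arange(1_000, dtype="int64")  # each assignment adds a block

fragmented = nblocks(frame)

# A deep copy has to copy the data anyway; pandas is assumed to merge
# same-dtype blocks while doing so.
merged = frame.copy()
consolidated = nblocks(merged)

print(fragmented, consolidated)
```

The data is identical before and after; only the internal layout changes.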

Add another column

>>> %time df.loc[:, "int_102"] = 1
Wall time: 2.03 ms

How many blocks now?

>>> df._data.nblocks
4

Using loc?

How about?

%time df.loc[0:10, "int_102"] = 2

How long does it take?

  1. 1.3 seconds
  2. 2 milliseconds

Press 1 for 1.3 seconds

Press 2 for 2 milliseconds

Wall time: 1.3 s
>>> df._data.nblocks
3

Conclusion

Pandas users?

Press Y for Yes

Press N for No
