Homepage

Pandas is a python library for manipulating tabular data. It is probably the most popular library of its kind.

Pandas is primarily written in C and Cython source. It has been criticised for its lack of multiprocessing support.

Useful Snippets

Batch Iterate Over Pandas DF

It may be useful to batch a dataframe into groups or chunks for processing (or parallel processing).

import numpy as np
from tqdm.auto import tqdm
 
results = []
 
for i, group in tqdm(df.groupby( np.arange(len(df)) // rows_per_chunk)):
    results.append(group)

Progress over Pandas Dataframes

Pandas integrates with the tqdm progress library for python via hooks:

from tqdm.auto import tqdm
    	  
tqdm.pandas()
df.progress_apply(somefunc)

Stratified Sampling

We can use this groupby trick to separate data into groups by label and sample by the given amount. Then we can generate the test_df by taking the original df and removing rows that are in the new df.

train_df = shuffled.groupby('label', group_keys=False)[['text','label']].apply(lambda x: x.sample(frac=0.7, random_state=42), include_groups=True)
test_df = shuffled[~shuffled.index.isin(train_df.index)]

Pandas to_sql and custom data types

We might need to map a column containing some complex Python data type onto a SQL native data type. For example, whilst working with Google Cloud SQL Vectors I need to be able to insert vector values into a table I’ve pre-created using the guidance here.

A full write up for the Google Cloud Vectors is given here.

Batch Inserts

By default Pandas does not do batch inserts when you use to_sql which means it can take a really long time (one query per insert).