Pandas is a python library for manipulating tabular data. It is probably the most popular library of its kind.
Pandas is primarily written in C and Cython source. It has been criticised for its lack of multiprocessing support.
Useful Snippets
Batch Iterate Over Pandas DF
It may be useful to batch a dataframe into groups or chunks for processing (or parallel processing).
import numpy as np
from tqdm.auto import tqdm
results = []
for i, group in tqdm(df.groupby( np.arange(len(df)) // rows_per_chunk)):
results.append(group)
Progress over Pandas Dataframes
Pandas integrates with the tqdm progress library for python via hooks:
from tqdm.auto import tqdm
tqdm.pandas()
df.progress_apply(somefunc)
Stratified Sampling
We can use this groupby trick to separate data into groups by label and sample by the given amount. Then we can generate the test_df by taking the original df and removing rows that are in the new df.
train_df = shuffled.groupby('label', group_keys=False)[['text','label']].apply(lambda x: x.sample(frac=0.7, random_state=42), include_groups=True)
test_df = shuffled[~shuffled.index.isin(train_df.index)]
Pandas to_sql
and custom data types
We might need to map a column containing some complex Python data type onto a SQL native data type. For example, whilst working with Google Cloud SQL Vectors I need to be able to insert vector values into a table I’ve pre-created using the guidance here.
A full write up for the Google Cloud Vectors is given here.
Batch Inserts
By default Pandas does not do batch inserts when you use to_sql
which means it can take a really long time (one query per insert).