Issue
I am trying to randomly sample a relatively large dataset (around 90 million data points).
I would like to sample the dataset based on column "a" (which has roughly 100k unique values), with each value of "a" having a different sample size n.
I know something like this exists:
df.groupby("a").sample(n=1, random_state=1)
But this does not take the different n values into account.
My next thought was filtering df by each unique value of "a" in a loop and sampling after filtering (f is the current unique value of "a"; m is the number of samples for that value):
filter_df = df.loc[(df['a'] == f)]
filter_df = filter_df.sample(n=m, random_state=6)
To add another layer of potential complication: if the requested sample size exceeds the total number of rows in a group of "a", I would like to use replace=True;
otherwise replace=False,
so that I am picking as many unique rows as possible.
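For context, pandas enforces this distinction: asking for more rows than a frame has without replace=True raises an error, which is why the flag has to be chosen per group. A tiny illustration on toy data (not from the original post):

import pandas as pd

df_small = pd.DataFrame({"x": ["a", "b"]})           # only 2 rows
df_small.sample(n=3, replace=True, random_state=0)   # OK: rows may repeat
# df_small.sample(n=3)  # raises ValueError (sample larger than population
#                       # with replace=False)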
So, within an enumerated for loop (over e.g. four separate 100k-long lists, including column "a"'s unique values; kk is an integer holding the row count of the current group), this was used:
if kk >= m:
    filter_df = filter_df.sample(n=m, random_state=6)
    print("sampling okay")
    test["flag"] = "ok"
else:
    filter_df = filter_df.sample(n=m, random_state=6, replace=True)
Then pd.concat was used at the end to combine the results.
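Put together, the loop approach looks roughly like this (a reconstruction of the fragments above; df and counts, a mapping from each unique value of "a" to its sample size m, are assumed):

import pandas as pd

pieces = []
for f, m in counts.items():              # counts: {value of "a": sample size m}
    filter_df = df.loc[df["a"] == f]
    kk = len(filter_df)                  # rows available in this group
    if kk >= m:                          # enough rows: keep them unique
        pieces.append(filter_df.sample(n=m, random_state=6))
    else:                                # too few rows: allow repeats
        pieces.append(filter_df.sample(n=m, random_state=6, replace=True))

result = pd.concat(pieces)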
This is the basic idea, and the code works fine; however, the performance is sub-optimal. I was wondering if there is a potential vectorized solution I could use.
Just for clarity, the number of samples m per group comes from the count column of the first dataframe:
a | count |
---|---|
1 | 1 |
2 | 3 |
3 | 2 |
The original df I am trying to sample from:
a | x |
---|---|
1 | a |
1 | b |
1 | c |
2 | d |
2 | e |
3 | f |
3 | g |
Desired sampled output (of course the sampled x will be "random"):
a | x |
---|---|
1 | a |
2 | d |
2 | e |
2 | e |
3 | f |
3 | g |
Solution
Make a dictionary from the first dataframe and create a custom grouping function:
def get_sample(df, dct, random_state):
    a = df["a"].iat[0]                  # group key of this chunk
    n = dct.get(a)
    if n is None:                       # no sample size requested for this group
        return
    # sample with replacement only when the group has fewer rows than n
    return df.sample(n=n, random_state=random_state, replace=len(df) < n)

dct = df1.set_index("a")["count"].to_dict()
out = df2.groupby("a", group_keys=False).apply(get_sample, dct=dct, random_state=6)
print(out)
Prints:
a x
0 1 a
3 2 d
4 2 e
4 2 e
5 3 f
6 3 g
Input dataframes:
# df1
a count
0 1 1
1 2 3
2 3 2
# df2
a x
0 1 a
1 1 b
2 1 c
3 2 d
4 2 e
5 3 f
6 3 g
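As a follow-up (my addition, not part of the original answer): groupby(...).apply still calls a Python function once per group, which can be slow with ~100k groups. A more vectorized sketch, assuming the df1/df2 above and a unique index on df2 — shuffle once globally, keep the first n rows of each group via cumcount, then top up only the deficit groups with replacement:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "count": [1, 3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 3, 3], "x": list("abcdefg")})

rng = np.random.default_rng(6)
want = df1.set_index("a")["count"]

# 1) One global shuffle, then keep the first n rows of each group:
#    an exact without-replacement sample of min(n, group size) rows.
shuf = df2.sample(frac=1, random_state=6)
base = shuf[shuf.groupby("a").cumcount() < shuf["a"].map(want).fillna(0)]

# 2) Groups whose requested n exceeds the group size still need extra
#    rows, drawn with replacement (only these groups are looped over).
sizes = df2["a"].value_counts()
deficit = (want - sizes.reindex(want.index, fill_value=0)).clip(lower=0)
deficit = deficit[deficit > 0]
if len(deficit):
    pos = df2.groupby("a").indices        # a -> integer row positions
    extra_idx = np.concatenate(
        [rng.choice(pos[a], size=k) for a, k in deficit.items()]
    )
    base = pd.concat([base, df2.iloc[extra_idx]])

print(base.sort_values("a"))

The only remaining Python-level loop runs over the deficit groups (those whose requested n exceeds the group size), which should be a small fraction of the total.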
Answered By - Andrej Kesely