Issue
I am trying to randomly sample a relatively large dataset (around 90 million data points).
I would like to sample the dataset based on column "a" (which has roughly 100k unique values), with each value of "a" having a different sample size n.
I know something like this exists:
df.groupby("a").sample(n=1, random_state=1)
But this does not take the different n values into account.
My next thought was filtering df by each unique value of "a" in a loop and sampling after filtering (f is the current unique value of "a"; m is the number of samples for that value):
filter_df = df.loc[(df['a'] == f)]
filter_df = filter_df.sample(n=m, random_state=6)
To add another layer of potential complication: if the requested sample size exceeds the total number of rows in a group of "a", I would like to use replace=True;
otherwise replace=False,
so that I am picking as many unique rows as possible.
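For context, pandas enforces this distinction: asking for more rows than a frame has without replace=True raises an error, which is why the flag has to be chosen per group. A tiny illustration on toy data (not from the original post):

import pandas as pd

df_small = pd.DataFrame({"x": ["a", "b"]})           # only 2 rows
df_small.sample(n=3, replace=True, random_state=0)   # OK: rows may repeat
# df_small.sample(n=3)  # raises ValueError (sample larger than population
#                       # with replace=False)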
So, within an enumerated for loop (over e.g. four separate 100k-long lists, including column "a"'s unique values; kk is an integer holding the row count of the current group), this was used:
if kk >= m:
    filter_df = filter_df.sample(n=m, random_state=6)
    print("sampling okay")
    test["flag"] = "ok"
else:
    filter_df = filter_df.sample(n=m, random_state=6, replace=True)
Then pd.concat was used at the end to combine the results.
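Put together, the loop approach looks roughly like this (a reconstruction of the fragments above; df and counts, a mapping from each unique value of "a" to its sample size m, are assumed):

import pandas as pd

pieces = []
for f, m in counts.items():              # counts: {value of "a": sample size m}
    filter_df = df.loc[df["a"] == f]
    kk = len(filter_df)                  # rows available in this group
    if kk >= m:                          # enough rows: keep them unique
        pieces.append(filter_df.sample(n=m, random_state=6))
    else:                                # too few rows: allow repeats
        pieces.append(filter_df.sample(n=m, random_state=6, replace=True))

result = pd.concat(pieces)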
This is the basic idea, and the code works fine; however, the performance is sub-optimal. I was wondering if there is a potential vectorized solution I could use.
Just for clarity, the number of samples m per group comes from the count column of the first dataframe:
a | count |
---|---|
1 | 1 |
2 | 3 |
3 | 2 |
The original df I am trying to sample from:
a | x |
---|---|
1 | a |
1 | b |
1 | c |
2 | d |
2 | e |
3 | f |
3 | g |
Desired sampled output (of course the sampled x will be "random"):
a | x |
---|---|
1 | a |
2 | d |
2 | e |
2 | e |
3 | f |
3 | g |
Solution
Make a dictionary from the first dataframe and create a custom grouping function:
def get_sample(df, dct, random_state):
    a = df["a"].iat[0]                  # group key of this chunk
    n = dct.get(a)
    if n is None:                       # no sample size requested for this group
        return
    # sample with replacement only when the group has fewer rows than n
    return df.sample(n=n, random_state=random_state, replace=len(df) < n)

dct = df1.set_index("a")["count"].to_dict()
out = df2.groupby("a", group_keys=False).apply(get_sample, dct=dct, random_state=6)
print(out)
Prints:
a x
0 1 a
3 2 d
4 2 e
4 2 e
5 3 f
6 3 g
Input dataframes:
# df1
a count
0 1 1
1 2 3
2 3 2
# df2
a x
0 1 a
1 1 b
2 1 c
3 2 d
4 2 e
5 3 f
6 3 g
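As a follow-up (my addition, not part of the original answer): groupby(...).apply still calls a Python function once per group, which can be slow with ~100k groups. A more vectorized sketch, assuming the df1/df2 above and a unique index on df2 — shuffle once globally, keep the first n rows of each group via cumcount, then top up only the deficit groups with replacement:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "count": [1, 3, 2]})
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 3, 3], "x": list("abcdefg")})

rng = np.random.default_rng(6)
want = df1.set_index("a")["count"]

# 1) One global shuffle, then keep the first n rows of each group:
#    an exact without-replacement sample of min(n, group size) rows.
shuf = df2.sample(frac=1, random_state=6)
base = shuf[shuf.groupby("a").cumcount() < shuf["a"].map(want).fillna(0)]

# 2) Groups whose requested n exceeds the group size still need extra
#    rows, drawn with replacement (only these groups are looped over).
sizes = df2["a"].value_counts()
deficit = (want - sizes.reindex(want.index, fill_value=0)).clip(lower=0)
deficit = deficit[deficit > 0]
if len(deficit):
    pos = df2.groupby("a").indices        # a -> integer row positions
    extra_idx = np.concatenate(
        [rng.choice(pos[a], size=k) for a, k in deficit.items()]
    )
    base = pd.concat([base, df2.iloc[extra_idx]])

print(base.sort_values("a"))

The only remaining Python-level loop runs over the deficit groups (those whose requested n exceeds the group size), which should be a small fraction of the total.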
Answered By - Andrej Kesely