Issue
I have a dataframe with the following structure:
        Time  QuantityMeasured  Value
0         t1                 A      7
1         t1                 B      2
2         t1                 C      8
3         t1                 D      9
4         t1                 E      5
...      ...               ...    ...
18482     tn                 A      5
18483     tn                 C      3
18484     tn                 E      4
18485     tn                 B      5
18486     tn                 D      1
I am pulling this data from a MySQL database, and I don't have the means to change how the data is stored. I need the data restructured into the following format (as numpy arrays):
list_of_time = ['t1', ..., 'tn']
list_of_A = [7, ..., 5]
list_of_B = [2, ..., 5]
list_of_C = [8, ..., 3]
list_of_D = [9, ..., 1]
I am not sure how many quantities measured there may be in a dataset, but I know A-D are present and they are the only ones I need. The set of quantities measured is consistent within a dataset, but may not be consistent across datasets, other than A-D always being there. In case it's relevant, there are probably around 3-4 quantities measured per timestamp that I do not need. I am not 100% sure that the quantities measured appear in the same order for every timestamp in a given dataset, but so far that has appeared to be the case.
Given a dataset, I have been able to do this restructuring in ~0.18-0.22 seconds on average. I first tried using a for loop to build a list of dictionaries in the following format:
[{Time: x, A: 7, B: 2, ...}, ..., {Time: x, A: 5, C: 3, ...}]
I then loop back through that list to append to the individual lists that I want. I have tried some pandas functions, such as groupby and apply, but those seemed to perform worse. I also tried the pivot function, like so:
pivot_df = data.pivot(index='Time', columns='QuantityMeasured', values='Value')
time = pivot_df.index.tolist()
A = pivot_df['A'].tolist()
B = pivot_df['B'].tolist()
C = pivot_df['C'].tolist()
D = pivot_df['D'].tolist()
And while this is faster, it's only ~0.03 seconds faster on average. Am I just running into a limit of how fast I can reasonably parse this data? I would ideally like to reduce the time by an order of magnitude, but even just 2x as fast would be more or less sufficient for my use case.
Any help here would be appreciated, even if it's just contextualizing what performance I should expect. Is it reasonable to expect to unpivot 18.5k data points and create the arrays in 0.02 seconds in Python?
Solution
If you only need A-D in the final lists, selecting those rows before pivoting could improve performance, because the reshaped dataframe would be much smaller, especially if there are many more categories of QuantityMeasured. This is similar to FILTER in Excel.
agg_df = (
    df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
    .pivot(index='QuantityMeasured', columns='Time', values='Value')
)
time = agg_df.columns.tolist()
list_of_A, list_of_B, list_of_C, list_of_D = agg_df.values.tolist()
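Since the question asks for numpy arrays rather than plain lists, a minimal variant of the last two lines (assuming the same agg_df as above) is to skip tolist() and unpack agg_df.values directly; pivot sorts the index, so the rows come out in A, B, C, D order, each as a 1-D numpy array:

# agg_df.values is a 4 x n numpy array; unpacking it yields one row per quantity
time = agg_df.columns.to_numpy()
list_of_A, list_of_B, list_of_C, list_of_D = agg_df.values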
Another way to pivot is to use set_index and unstack consecutively (which is actually what pivot does under the hood), so in some cases this is a bit faster.
agg_df = (
    df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
    .set_index(['QuantityMeasured', 'Time'])['Value']
    .unstack()
)
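To check which variant is faster on your own data, one option is to time both pipelines with timeit; this is just an illustrative sketch (the helper function names are made up, and df is the ~18.5k-row frame from the question):

import timeit

def with_pivot():
    # filter to A-D, then pivot directly
    return (df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
              .pivot(index='QuantityMeasured', columns='Time', values='Value'))

def with_unstack():
    # filter to A-D, then set_index + unstack
    return (df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
              .set_index(['QuantityMeasured', 'Time'])['Value'].unstack())

print('pivot:  ', timeit.timeit(with_pivot, number=100))
print('unstack:', timeit.timeit(with_unstack, number=100))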
But in any case, you won't get an order-of-magnitude speedup here; this is about as good as it gets.
In my opinion, a bigger speedup would come from doing the pivoting in SQL. A query like the following would pull an already-pivoted table from the database, which can then be assigned to variables as lists.
SELECT
    Time,
    SUM(CASE WHEN QuantityMeasured = 'A' THEN Value ELSE 0 END) AS A,
    SUM(CASE WHEN QuantityMeasured = 'B' THEN Value ELSE 0 END) AS B,
    SUM(CASE WHEN QuantityMeasured = 'C' THEN Value ELSE 0 END) AS C,
    SUM(CASE WHEN QuantityMeasured = 'D' THEN Value ELSE 0 END) AS D
FROM df
GROUP BY Time;
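For example, a minimal sketch of reading that pivoted result straight into numpy arrays with pandas (assuming a database connection object named conn, e.g. a SQLAlchemy engine, and with df in the query replaced by the real table name):

import pandas as pd

query = """
SELECT
    Time,
    SUM(CASE WHEN QuantityMeasured = 'A' THEN Value ELSE 0 END) AS A,
    SUM(CASE WHEN QuantityMeasured = 'B' THEN Value ELSE 0 END) AS B,
    SUM(CASE WHEN QuantityMeasured = 'C' THEN Value ELSE 0 END) AS C,
    SUM(CASE WHEN QuantityMeasured = 'D' THEN Value ELSE 0 END) AS D
FROM df
GROUP BY Time;
"""

# the database does the unpivoting; pandas just receives one row per timestamp
pivoted = pd.read_sql(query, conn)
list_of_time = pivoted['Time'].to_numpy()
list_of_A = pivoted['A'].to_numpy()
list_of_B = pivoted['B'].to_numpy()
list_of_C = pivoted['C'].to_numpy()
list_of_D = pivoted['D'].to_numpy()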
Answered By - cottontail