Issue
I have a dataframe with the following structure:
        Time  QuantityMeasured  Value
0         t1                 A      7
1         t1                 B      2
2         t1                 C      8
3         t1                 D      9
4         t1                 E      5
...      ...               ...    ...
18482     tn                 A      5
18483     tn                 C      3
18484     tn                 E      4
18485     tn                 B      5
18486     tn                 D      1
I am pulling this data from a MySQL database, and I don't have the means to change how the data is stored. I need the data restructured into the following format (as numpy arrays):
list_of_time = ['t1', ..., 'tn']
list_of_A = [7, ..., 5]
list_of_B = [2, ..., 5]
list_of_C = [8, ..., 3]
list_of_D = [9, ..., 1]
I am not sure how many quantities measured there may be in a dataset, but I know A-D are present and they are the only ones I need. The set of quantities measured is consistent within a dataset, but may not be consistent across datasets, other than A-D always being there. In case it's relevant, there are probably around 3-4 quantities measured per timestamp that I do not need. I am not 100% sure that the quantities measured appear in the same order for every timestamp in a given dataset, but so far that has appeared to be the case.
Given a dataset, I have been able to do this restructuring in ~0.18-0.22 seconds on average. I first tried using a for loop to build a list of dictionaries in the following format:
[{Time: x, A: 7, B: 2, ...}, ..., {Time: x, A: 5, C: 3, ...}]
I then loop back through that list to append to the individual lists that I want. I have tried some pandas functions, such as groupby and apply, but those seemed to perform worse. I also tried the pivot function, like so:
pivot_df = data.pivot(index='Time', columns='QuantityMeasured', values='Value')
time = pivot_df.index.tolist()
A = pivot_df['A'].tolist()
B = pivot_df['B'].tolist()
C = pivot_df['C'].tolist()
D = pivot_df['D'].tolist()
And while this is faster, it's only ~0.03 seconds faster on average. Am I just running into a limit of how fast I can reasonably parse this data? I would ideally like to reduce the time by an order of magnitude, but even just 2x as fast would be more or less sufficient for my use case.
Any help here would be appreciated, even if it's just contextualizing what performance I should expect. Is it reasonable to expect to unpivot 18.5k data points and create the arrays in 0.02 seconds in Python?
Solution
If you only need A-D in the final lists, selecting those rows before pivoting could improve performance, because the reshaped dataframe would be much smaller, especially if there are many more categories of QuantityMeasured. This is similar to FILTER in Excel.
agg_df = (
    df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
    .pivot(index='QuantityMeasured', columns='Time', values='Value')
)
time = agg_df.columns.tolist()
list_of_A, list_of_B, list_of_C, list_of_D = agg_df.values.tolist()
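Since the question asks for numpy arrays rather than plain lists, a minimal variant of the last two lines (assuming the same agg_df as above) is to skip tolist() and unpack agg_df.values directly; pivot sorts the index, so the rows come out in A, B, C, D order, each as a 1-D numpy array:

# agg_df.values is a 4 x n numpy array; unpacking it yields one row per quantity
time = agg_df.columns.to_numpy()
list_of_A, list_of_B, list_of_C, list_of_D = agg_df.values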
Another way to pivot is to use set_index and unstack consecutively (which is actually what pivot does under the hood), so in some cases this is a bit faster.
agg_df = (
    df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
    .set_index(['QuantityMeasured', 'Time'])['Value']
    .unstack()
)
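To check which variant is faster on your own data, one option is to time both pipelines with timeit; this is just an illustrative sketch (the helper function names are made up, and df is the ~18.5k-row frame from the question):

import timeit

def with_pivot():
    # filter to A-D, then pivot directly
    return (df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
              .pivot(index='QuantityMeasured', columns='Time', values='Value'))

def with_unstack():
    # filter to A-D, then set_index + unstack
    return (df.query("QuantityMeasured in ['A', 'B', 'C', 'D']")
              .set_index(['QuantityMeasured', 'Time'])['Value'].unstack())

print('pivot:  ', timeit.timeit(with_pivot, number=100))
print('unstack:', timeit.timeit(with_unstack, number=100))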
But in any case, you won't get an order-of-magnitude speedup here; this is about as good as it gets.
In my opinion, a bigger speedup would come from doing the pivoting in SQL. A query like the following would pull an already-pivoted table from the database, which can then be assigned to variables as lists.
SELECT
    Time,
    SUM(CASE WHEN QuantityMeasured = 'A' THEN Value ELSE 0 END) AS A,
    SUM(CASE WHEN QuantityMeasured = 'B' THEN Value ELSE 0 END) AS B,
    SUM(CASE WHEN QuantityMeasured = 'C' THEN Value ELSE 0 END) AS C,
    SUM(CASE WHEN QuantityMeasured = 'D' THEN Value ELSE 0 END) AS D
FROM df
GROUP BY Time;
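For example, a minimal sketch of reading that pivoted result straight into numpy arrays with pandas (assuming a database connection object named conn, e.g. a SQLAlchemy engine, and with df in the query replaced by the real table name):

import pandas as pd

query = """
SELECT
    Time,
    SUM(CASE WHEN QuantityMeasured = 'A' THEN Value ELSE 0 END) AS A,
    SUM(CASE WHEN QuantityMeasured = 'B' THEN Value ELSE 0 END) AS B,
    SUM(CASE WHEN QuantityMeasured = 'C' THEN Value ELSE 0 END) AS C,
    SUM(CASE WHEN QuantityMeasured = 'D' THEN Value ELSE 0 END) AS D
FROM df
GROUP BY Time;
"""

# the database does the unpivoting; pandas just receives one row per timestamp
pivoted = pd.read_sql(query, conn)
list_of_time = pivoted['Time'].to_numpy()
list_of_A = pivoted['A'].to_numpy()
list_of_B = pivoted['B'].to_numpy()
list_of_C = pivoted['C'].to_numpy()
list_of_D = pivoted['D'].to_numpy()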
Answered By - cottontail