Issue
I'm trying to reduce the size of ~300 csv files (about a billion rows) by replacing lengthy fields with shorter, categorical values.
I'm making use of pandas, and I've iterated through each of the files to build an array that includes all of the unique values I'm trying to replace. I can't just use pandas.factorize on each file individually, because I need (for example) '3001958145' to map to the same value in file1.csv as in file244.csv. I've created an array of replacement values by simply building another array of incremented integers.
In [1]: toreplace = data['col1'].unique()
Out[1]: array([1000339602, 1000339606, 1000339626, ..., 3001958145, 3001958397,
3001958547], dtype=int64)
In [2]: replacewith = range(0, len(toreplace))
Out[2]: [0, 1, 2,...]
Now, how do I go about efficiently swapping in my 'replacewith' variable for each corresponding 'toreplace' value for each of the files I need to iterate through?
Given how capable pandas is at dealing with categories, I assume there must be a method out there that can accomplish this that I simply can't find. The function I wrote does work (it takes a pandas.factorize result rather than the arrangement I described above), but it relies on the replace function and iterates through the series, so it's quite slow.
def powerreplace(pdseries, factorized):
    # Slow: does one full .replace() pass over the series per unique value.
    i = 0
    uniques = pdseries.unique()
    for unique in uniques:
        print '%i/%i' % (i, len(uniques))
        i = i + 1
        pdseries.replace(to_replace=unique,
                         value=np.where(factorized[1] == unique)[0][0],
                         inplace=True)
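For comparison, the substitution the question describes can be expressed as a single dictionary lookup via Series.map, which is typically far faster than one replace pass per unique value. A minimal sketch, using made-up stand-in values rather than the real global array:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the global array of unique values built by
# scanning all the files; the real 'toreplace' would come from that scan.
toreplace = np.array([1000339602, 1000339606, 3001958145])
mapping = {v: i for i, v in enumerate(toreplace)}  # value -> small int

# One vectorized lookup replaces the whole column at once.
s = pd.Series([3001958145, 1000339606, 1000339602])
codes = s.map(mapping)
print(codes.tolist())  # [2, 1, 0]
```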
Can anyone recommend a better way to go about doing this?
Solution
This requires at least pandas 0.15.0 (the .astype
syntax is a bit friendlier in 0.16.0, so better to use that). Here are the docs for categoricals.
Imports
In [101]: import pandas as pd
In [102]: import string
In [103]: import numpy as np
In [104]: np.random.seed(1234)
In [105]: pd.set_option('max_rows',10)
Create a sample set of unique values
In [106]: uniques = np.array(list(string.ascii_letters))
In [107]: len(uniques)
Out[107]: 52
Create some data
In [109]: df1 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques)/2+5,size=1000000))})
In [110]: df1.head()
Out[110]:
A
0 p
1 t
2 g
3 v
4 m
In [111]: df1.A.nunique()
Out[111]: 31
In [112]: df2 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques),size=1000000))})
In [113]: df2.head()
Out[113]:
A
0 I
1 j
2 b
3 A
4 m
In [114]: df2.A.nunique()
Out[114]: 52
So we now have 2 frames that we want to categorize; the first frame happens to have fewer than the full set of categories. This is on purpose; you don't have to know the complete set upfront.
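The incremental idea, independent of any particular pandas version, is just to keep a running, ordered list of categories and append whatever new values each file introduces. A small sketch with hypothetical per-file values:

```python
import pandas as pd

# Running, ordered list of categories across files; order must be stable
# so every file factorizes to the same codes.
cats = []
for values in (['a', 'c'], ['c', 'b', 'a'], ['d']):  # stand-ins for per-file columns
    seen = set(cats)
    # pd.unique preserves order of first appearance within the file.
    cats.extend(v for v in pd.unique(pd.Series(values)) if v not in seen)
print(cats)  # ['a', 'c', 'b', 'd']
```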
Convert the A columns to B columns that are a Categorical
In [116]: df1['B'] = df1['A'].astype('category')
In [118]: i = df1['B'].cat.categories
In [124]: i
Out[124]: Index([u'A', u'B', u'C', u'D', u'E', u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z'], dtype='object')
If we are iteratively processing these frames, we start with the categories from the first one. For each successive frame, we append the symmetric difference with the existing set. This keeps the categories in the same order, so when we factorize we get the same numbering scheme.
In [119]: cats = i.tolist() + i.sym_diff(df2['A'].astype('category').cat.categories).tolist()
We have now gained back the original set
In [120]: (np.array(sorted(cats)) == sorted(uniques)).all()
Out[120]: True
Set the next frame's B column to be a Categorical, BUT specify the categories, so that when it is factorized the same values are used
In [121]: df2['B'] = df2['A'].astype('category',categories=cats)
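Note that in later pandas releases this .astype('category', categories=...) form was removed (and Index.sym_diff was renamed symmetric_difference); the modern equivalent is an explicit CategoricalDtype. A sketch of the same step with made-up categories:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# An explicit dtype carrying the shared, ordered category list plays the
# role of the old categories= keyword.
cats = ['a', 'b', 'c', 'd']
dtype = CategoricalDtype(categories=cats)

s = pd.Series(['c', 'a', 'd']).astype(dtype)
print(s.cat.codes.tolist())  # [2, 0, 3] -- positions within cats
```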
To prove it, we select the codes (the factorized map) from each. These codes match; df2 has an additional code (as Z is in the 2nd frame but not the first).
In [122]: df1[df1['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[122]: array([30, 0, 5])
In [123]: df2[df2['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[123]: array([ 0, 30, 5, 51])
You can then simply store the codes in lieu of the object-dtyped data.
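Extracting the codes and reversing the process later is straightforward; as long as you keep the shared category list, the strings can be reconstructed. A small sketch with hypothetical values (using the modern CategoricalDtype API):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Shared category list, assumed to be saved alongside the data files.
dtype = CategoricalDtype(categories=['x', 'y', 'z'])
s = pd.Series(['z', 'x', 'z']).astype(dtype)

# The integer codes are what you would store in place of the strings;
# from_codes rebuilds the original values from codes + categories.
codes = s.cat.codes
restored = pd.Categorical.from_codes(codes, dtype=dtype)
print(codes.tolist(), list(restored))  # [2, 0, 2] ['z', 'x', 'z']
```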
Note that it is actually quite efficient to serialize these to HDF5, as Categoricals are natively stored; see here
Note that this is a pretty memory-efficient way of storing the data. The object-dtype usage shown in [154] actually understates the true cost: it counts only the pointers, while the actual string values live on the heap, so real usage grows MUCH higher the longer the strings get. The figure in [155], by contrast, is ALL of the memory used.
In [153]: df2.dtypes
Out[153]:
A object
B category
dtype: object
In [154]: df2.A.to_frame().memory_usage()
Out[154]:
A 8000000
dtype: int64
In [155]: df2.B.to_frame().memory_usage()
Out[155]:
B 1000416
dtype: int64
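The comparison above can be reproduced at a smaller scale; the exact byte counts below depend on platform and pandas version, but the shape of the result does not:

```python
import numpy as np
import pandas as pd

# An object column costs 8 bytes per row just for the pointers (the
# strings themselves live on the heap), while a categorical stores one
# small-int code per row plus the category values once.
n = 100000
values = np.random.choice(list('abcdef'), size=n)
obj_col = pd.Series(values, dtype=object)
cat_col = obj_col.astype('category')

obj_bytes = obj_col.memory_usage(index=False)
cat_bytes = cat_col.memory_usage(index=False)
print(obj_bytes, cat_bytes)  # the categorical is several times smaller
```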
Answered By - Jeff