Issue
I'm trying to read JSON files into dataframes:
df = pd.read_json('test.log', lines=True)
However, the files contain int64 values and pandas raises:
ValueError: Value is too big
I tried setting precise_float to True, but this didn't solve it.
It works when I do it line by line:
import json

df = pd.DataFrame()
with open('test.log') as f:
    for line in f:
        data = json.loads(line)
        df = df.append(data, ignore_index=True)
However, this is very slow: even for files of around 50k lines it takes a very long time.
Is there a way I can tell read_json() to use int64 for certain columns?
Solution
After updating pandas to a newer version (tested with 1.0.3), this workaround by artdgn can be applied to overwrite the loads() function in pandas.io.json._json, which is ultimately used when pd.read_json() is called.
Copying the workaround here in case the original link stops working:
import pandas as pd

# Option 1: monkeypatch using the standard python json module
import json
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)

# Option 2: monkeypatch using the faster simplejson module
import simplejson
pd.io.json._json.loads = lambda s, *a, **kw: simplejson.loads(s)

# Option 3: normalising (unnesting) at the same time (for nested jsons)
pd.io.json._json.loads = lambda s, *a, **kw: pd.json_normalize(simplejson.loads(s))
After overwriting the loads() function with one of the three methods described by artdgn, read_json() also works with int64 values.
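Putting it together, a minimal sketch (assuming test.log contains one JSON object per line, as in the question):
import json
import pandas as pd

# replace pandas' internal JSON loader with the standard json module
pd.io.json._json.loads = lambda s, *a, **kw: json.loads(s)

# the original call from the question now parses the large integers
df = pd.read_json('test.log', lines=True)
print(df.dtypes)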
Answered By - Marcel