Issue
I'm trying to extract data from several thousand XML files and combine it into a single DataFrame.
The code I have so far is for a single XML extraction.
from lxml import etree
import pandas as pd
serial = ["S1.xml"]
content = serial.encode('utf-8')
doc = etree.XML(content)
targets = doc.xpath('/reiXmlPrenos')
data = []
for target in targets:
    data.append(target.xpath("./@A")[0])
    data.append(target.xpath("./@z")[0])
columns = ['A', 'Z']
pd.DataFrame([data],columns=columns)
The XML file looks like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qf>255340</Qf>
<Qp>597451</Qp>
<CO2>126660</CO2>
<A>2362.8</A>
<Ht>0.336</Ht>
<f0>0.59</f0>
<z>0.105891</z>
</reiXmlPrenos>
I'd like the final DataFrame to look like this:
          A      z
S1.xml    2362   0.105891
S2.xml    ...    ...
...
The error that I'm getting is:
line 16, in <module>
content = serial.encode('utf-8')
AttributeError: 'list' object has no attribute 'encode'
Can you please point out the error I'm making, and then expand the code so that it loads all the XML files in the same folder?
Solution
The error occurs because serial is a list of file names, and a list has no encode method. Encoding a single file name would not help either, since that yields the name as bytes rather than the file's contents, so each file has to be opened and read. Note also that A and z are child elements of reiXmlPrenos, not attributes, so the XPath expressions should be ./A and ./z rather than ./@A and ./@z.
from lxml import etree
import pandas as pd

serial = ["tmp.xml", "S2.xml"]
columns = ["file", 'A', 'Z']
all_data = []

for item in serial:
    # each row starts with the file name
    data = [item]
    # read the raw bytes; lxml honours the encoding given in the XML declaration
    with open(item, 'rb') as file:
        content = file.read()
    doc = etree.XML(content)
    # add a predicate to make sure A and z exist
    targets = doc.xpath('/reiXmlPrenos[A and z]')
    for target in targets:
        # A and z are child elements, so take their text content
        data.append(target.xpath("./A")[0].text)
        data.append(target.xpath("./z")[0].text)
    all_data.append(data)

df = pd.DataFrame(all_data, columns=columns)
print(df)
Result
      file       A         Z
0  tmp.xml  2362.8  0.105891
1   S2.xml  2362.8  0.105891
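The question also asks for every XML file in a folder to be loaded, whereas the list above is written out by hand. Below is a minimal sketch of one way to gather the files, assuming they all sit directly in one directory and end in .xml. It swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages. The function name collect_rows and its parameters are hypothetical, introduced only for illustration.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def collect_rows(folder, tags=("A", "z")):
    """Return one dict per XML file in `folder` that contains every tag in `tags`."""
    rows = []
    # sorted() keeps the row order stable between runs
    for path in sorted(Path(folder).glob("*.xml")):
        root = ET.parse(path).getroot()
        # findtext returns None when the child element is missing
        values = {tag: root.findtext(tag) for tag in tags}
        # skip files that lack any of the requested elements
        if any(value is None for value in values.values()):
            continue
        rows.append({"file": path.name, **values})
    return rows
```

Assuming pandas is installed, pd.DataFrame(collect_rows(".")).set_index("file") should then give a frame indexed by file name, as in the layout the question asks for. The values are still strings at that point; df.astype(float) converts them if numeric columns are needed.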
Answered By - LMC