Issue
This may seem like a strange way to deal with a CSV file, but it is a study task: I need to open the CSV file as text, read its lines, create a list, and then create a pandas DataFrame from that list.
import pandas as pd

with open('file.csv', 'r') as f:
    lst = f.readlines()

for idx, line in enumerate(lst):
    lst[idx] = line.strip('\n')

header = lst[0].replace('"', '').split(",")

for idx, line in enumerate(lst[1:]):
    lst[idx] = line.split(',')

df = pd.DataFrame(data = lst, columns = header)
ValueError: 5 columns passed, passed data had 39 columns
It crashes because pd.DataFrame adds (?) a bunch of Nones at the end of each row. I checked this by running the code without specifying 'columns'. Please help me understand where these Nones come from.
Solution
The issue you're encountering is related to how you're processing the CSV file lines and subsequently trying to construct a Pandas DataFrame. Let's break down the steps and see where the problem might be:
- Reading the File: You correctly open and read the lines from the CSV file, storing them in a list.
- Stripping Newline Characters: You remove the newline characters from each line. This is also done correctly.
- Processing the Header: You process the header correctly. Removing the double quotes (") is only needed if the header actually contains them.
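- Processing the Data Rows: This is where the problem originates. You iterate over lst[1:], but enumerate starts counting at 0, so each split row is written to lst[idx], one position before the line it came from. The header in lst[0] is overwritten by the first data row, and the last element of lst is never touched, so it stays an unsplit string. When the DataFrame is built, pandas most likely treats that leftover string as a sequence of single characters, so that one row appears to have as many columns as the line has characters (39 here), and the shorter rows are padded with None to match.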
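The index shift in the last step can be reproduced without pandas. Below is a minimal sketch, assuming a hypothetical three-line file whose lines have already been stripped of newlines; the contents and the name `fixed` are illustrative only and not taken from the question's data.

```python
# Hypothetical file contents (header plus two data rows), newlines already removed.
lst = ['a,b', '1,2', '3,4']

# Same loop as in the question: the slice starts at row 1, but idx starts at 0.
for idx, line in enumerate(lst[1:]):
    lst[idx] = line.split(',')

# The header is overwritten and the last element is still an unsplit string.
print(lst)  # [['1', '2'], ['3', '4'], '3,4']

# A string is itself a sequence, so it can be read as one "column" per character.
print(list(lst[-1]))  # ['3', ',', '4']

# One possible fix: tell enumerate to start counting at 1.
fixed = ['a,b', '1,2', '3,4']
for idx, line in enumerate(fixed[1:], start=1):
    fixed[idx] = line.split(',')

print(fixed[1:])  # [['1', '2'], ['3', '4']]
```

With the `start=1` variant, the header stays in `fixed[0]` and only `fixed[1:]` should be passed as the data.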
import pandas as pd

with open('file.csv', 'r') as f:
    lines = f.readlines()

# Remove newline characters and strip quotes if needed
lines = [line.strip('\n').replace('"', '') for line in lines]

# Split the header
header = lines[0].split(',')

# Split the data rows
data = [line.split(',') for line in lines[1:]]

# Create the DataFrame
df = pd.DataFrame(data, columns=header)
This script should correctly process the CSV file into a DataFrame. If your CSV contains quoted fields with commas inside, this simple split approach may not work correctly, and you might need to use a CSV parser, like the one built into Pandas (pandas.read_csv()) or Python's csv module.
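For the quoted-field case, here is a rough sketch using Python's csv module, which still reads the file as text and produces a list of rows. The sample contents are made up for illustration, and io.StringIO stands in for the real file; with an actual file you would pass the object returned by open('file.csv', newline='') instead.

```python
import csv
import io

# Hypothetical contents with a comma inside a quoted field.
# io.StringIO is a stand-in for open('file.csv', newline='').
sample = io.StringIO('name,city\n"Doe, Jane",Paris\nBob,"Rome"\n')

reader = csv.reader(sample)
header = next(reader)            # first row is the header
data = [row for row in reader]   # remaining rows, already split into fields

print(header)  # ['name', 'city']
print(data)    # [['Doe, Jane', 'Paris'], ['Bob', 'Rome']]
```

The resulting header and data can then be passed to pd.DataFrame(data, columns=header) in the same way as before.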
Answered By - Aditya Dube