Wednesday, December 6, 2023

[FIXED] Python script to remove erroneous characters from a CSV file

December 06, 2023 csv, parsing, python, python-2.7, python-2.x No comments

Issue

I have an automatically generated CSV file which provides the data I need, but some individual cells have extra characters in the cell.

COLUMN1           COLUMN2          COLUMN3
'Foo325GoodData'  Bar:388GoodData  383GoodData
'Foo123GoodData'  Bar:998GoodData  293GoodData

I need to remove the foo, the bar, the quotes and the colon without altering the 3 digit data.

Here's the code I tried, but it just deletes all of my data:

 import csv
 import string

 input_file = open('data.csv', 'r')
 output_file = open('data_cleaned.csv', 'w')
 data = csv.reader(input_file)
 writer = csv.writer(output_file,quoting=csv.QUOTE_ALL)
 foo = 'foo'
 bar = '"bar:u'

 for line in data:
     line = str(line)
     new_line = str.replace(line,foo,'')
     new_line2 = str.replace(line,bar,'')
     writer.writerow(new_line.split(','))
     writer.writerow(new_line2.split(','))

Per gboffi's instructions below, I tried the following:

cleaner.py < bad.csv > good.csv

 from __future__ import print_function
 from sys import stdin


 q = '"' + "'"
 number = 'foo:'

 print(next(stdin) , end='')

 for line in stdin:

     toks = [tok.strip(q).lstrip(number) for tok in line.split()]

     print(' '.join(toks))

This generates a new CSV, but none of the bad data seems to be getting removed.

Solution

If you want to remove everything that is not a digit or a space, why don't you use a variation on this idea

for line in data:
    print(''.join(c for c in line if c in '0123456789 '))

where we stick together (''.join(...)) only the wanted characters in each of the input line?

Update

The previous answer remains because it's perfect for the requirements the OP originally expressed (see the edit history of the question).

However, given the new requirements the OP has given, they could try to use the following code

$ cat nofubar.py 
from sys import stdin

q = '"' + "'"             # we want to remove the 'q'outes
foobar = 'FooBar:'        # and also Foo, Bar and ':' too...

print(next(stdin), end='') # print the header line to stdout, note end=''

for line in stdin:

    # strip quotes on both sides,
    # strip (all) the characters in foobar * only on the left *
    toks = [tok.strip(q).lstrip(foobar) for tok in line.split()]

    print(' '.join(toks))
$

Let's put this new code to test

$ python nofubar.py << END > good.csv
COLUMN1           COLUMN2          COLUMN3
'Foo325GoodData'  Bar:388GoodData  383GoodData
'Foo123GoodData'  Bar:998GoodData  293GoodData
END
$ cat good.csv 
COLUMN1           COLUMN2          COLUMN3
325GoodData 388GoodData 383GoodData
123GoodData 998GoodData 293GoodData
$

Assuming that the file with extraneous characters is named bad.csv you can connect it to stdin using the input redirection symbol < on the shell command line

$ python nofubar.py < bad.csv > good.csv
$

Answered By - gboffi

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, December 6, 2023

[FIXED] Python script to remove erroneous characters from a CSV file

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels