Issue
I have an automatically generated CSV file which provides the data I need, but some individual cells have extra characters in the cell.
COLUMN1 COLUMN2 COLUMN3
'Foo325GoodData' Bar:388GoodData 383GoodData
'Foo123GoodData' Bar:998GoodData 293GoodData
I need to remove the foo, the bar, the quotes and the colon without altering the 3 digit data.
Here's the code I tried, but it just deletes all of my data:
import csv
import string
input_file = open('data.csv', 'r')
output_file = open('data_cleaned.csv', 'w')
data = csv.reader(input_file)
writer = csv.writer(output_file,quoting=csv.QUOTE_ALL)
foo = 'foo'
bar = '"bar:u'
for line in data:
line = str(line)
new_line = str.replace(line,foo,'')
new_line2 = str.replace(line,bar,'')
writer.writerow(new_line.split(','))
writer.writerow(new_line2.split(','))
Per gboffi's instructions below, I tried the following:
cleaner.py < bad.csv > good.csv
from __future__ import print_function
from sys import stdin
q = '"' + "'"
number = 'foo:'
print(next(stdin) , end='')
for line in stdin:
toks = [tok.strip(q).lstrip(number) for tok in line.split()]
print(' '.join(toks))
This generates a new CSV, but none of the bad data seems to be getting removed.
Solution
If you want to remove everything that is not a digit or a space, why don't you use a variation on this idea
for line in data:
print(''.join(c for c in line if c in '0123456789 '))
where we stick together (''.join(...)
) only the wanted characters in each of the input line?
Update
The previous answer remains because it's perfect for the requirements the OP originally expressed (see the edit history of the question).
However, given the new requirements the OP has given, they could try to use the following code
$ cat nofubar.py
from sys import stdin
q = '"' + "'" # we want to remove the 'q'outes
foobar = 'FooBar:' # and also Foo, Bar and ':' too...
print(next(stdin), end='') # print the header line to stdout, note end=''
for line in stdin:
# strip quotes on both sides,
# strip (all) the characters in foobar * only on the left *
toks = [tok.strip(q).lstrip(foobar) for tok in line.split()]
print(' '.join(toks))
$
Let's put this new code to test
$ python nofubar.py << END > good.csv
COLUMN1 COLUMN2 COLUMN3
'Foo325GoodData' Bar:388GoodData 383GoodData
'Foo123GoodData' Bar:998GoodData 293GoodData
END
$ cat good.csv
COLUMN1 COLUMN2 COLUMN3
325GoodData 388GoodData 383GoodData
123GoodData 998GoodData 293GoodData
$
Assuming that the file with extraneous characters is named bad.csv
you can connect it to stdin
using the input redirection symbol <
on the shell command line
$ python nofubar.py < bad.csv > good.csv
$
Answered By - gboffi
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.