Issue
I have a dataset of millions of text files containing numbers saved as strings, formatted according to a variety of locales. I am trying to guess which symbol is the decimal separator and which is the thousands separator.
This shouldn't be too hard but it seems the question hasn't been asked yet and for posterity it should be asked and answered here.
What I do know is that there is always a decimal separator and that it is always the last non-[0-9] symbol in the string.
As you can see below, a simple numStr.replace(',', '.') to fix the variations in decimal separators will conflict with the possible thousands separators.
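A minimal illustration of that conflict, using one value from the dataset; the variable name naive is only for illustration:

```python
# Blindly turning every comma into a dot leaves two dots in the string,
# so the result is no longer a parseable number.
naive = '10,000.0000'.replace(',', '.')
print(naive)

try:
    float(naive)
except ValueError:
    print('float() rejects it')
```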
I have seen ways of doing it if you know the locale but I do NOT know the locale in this instance.
Dataset:
1.0000 //1.0
1,0000 //1.0
10,000.0000 //10000.0
10.000,0000 //10000.0
1,000,000.0000 // 1000000.0
1.000.000,0000 // 1000000.0
//also possible
1 000 000.0000 //1000000.0 with spaces as thousand separators
Solution
One approach:
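    import re

    with open('numbers') as fhandle:
        for line in fhandle:
            line = line.strip()
            # Whatever is left after deleting the digits is a separator.
            separators = re.sub('[0-9]', '', line)
            # Every separator except the last one is a thousands separator.
            for sep in separators[:-1]:
                line = line.replace(sep, '')
            # The last separator is the decimal separator.
            if separators:
                line = line.replace(separators[-1], '.')
            print(line)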
On your sample input (comments removed), the output is:
1.0000
1.0000
10000.0000
10000.0000
1000000.0000
1000000.0000
1000000.0000
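For reuse outside a file loop, the same idea can be sketched as a function. This is a variant, not the code above: it splits once at the last separator instead of calling replace repeatedly, and the names normalize_number and to_float are hypothetical. It assumes, as the question states, that a decimal separator is always present and that the input has no sign or exponent.

```python
import re


def normalize_number(text):
    """Return text with thousands separators removed and '.' as decimal mark."""
    text = text.strip()
    # Whatever is left after deleting the ASCII digits is a separator.
    separators = re.sub(r'[0-9]', '', text)
    if not separators:
        return text  # digits only, nothing to normalize
    # By the question's assumption, the last separator is the decimal one.
    integer_part, _, fraction_part = text.rpartition(separators[-1])
    # Drop every remaining non-digit from the integer part.
    integer_part = re.sub(r'[^0-9]', '', integer_part)
    return integer_part + '.' + fraction_part


def to_float(text):
    """Convert a number string in an unknown locale to a float."""
    return float(normalize_number(text))


for sample in ['1,0000', '10.000,0000', '1 000 000.0000']:
    print(sample, '->', to_float(sample))
```

Splitting at the last separator also copes with a string that uses the same symbol for both roles, which the replace-based loop does not.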
Update: Handling Unicode
As NeoZenith points out in the comments, with Unicode input the venerable regular expression [0-9] is not reliable, because it matches only the ASCII digits. Use the following instead:
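    import re

    with open('numbers') as fhandle:
        for line in fhandle:
            line = line.strip()
            # \d with re.U removes every Unicode decimal digit, not just 0-9.
            separators = re.sub(r'\d', '', line, flags=re.U)
            # Every separator except the last one is a thousands separator.
            for sep in separators[:-1]:
                line = line.replace(sep, '')
            # The last separator is the decimal separator.
            if separators:
                line = line.replace(separators[-1], '.')
            print(line)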
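In Python 2, without the re.U flag, \d is equivalent to [0-9]. With that flag, \d matches whatever is classified as a decimal digit in the Unicode character properties database. In Python 3, str patterns are Unicode-aware by default, so \d already matches those digits and the flag is redundant but harmless. Alternatively, for handling unusual digit characters, one may want to consider using unicode.translate (str.translate in Python 3).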
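As a rough sketch of the digit-normalization idea in Python 3, the following maps each Unicode decimal digit to its ASCII counterpart before any separator handling. It uses unicodedata.decimal rather than a translate table, and the function name to_ascii_digits is hypothetical:

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with the matching ASCII digit."""
    return ''.join(
        # str.isdecimal() is true for decimal digits in any script;
        # unicodedata.decimal() gives the digit's integer value.
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


# Arabic-Indic digits one to five around a comma, written as escapes.
print(to_ascii_digits('\u0661\u0662\u0663,\u0664\u0665'))
```

Separators and other non-digit characters pass through unchanged, so the result can be fed to the separator-guessing code as usual.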
Answered By - John1024