How to replace all those special characters with white spaces in Python?
I have a list of names of a company . . .
Old Wine pvt
Here, as per the above example . . . I need all the special characters [-,",/,.] in the file myfiles.txt to be replaced with a single white space, and the result saved into another text file.
Can anyone please help me out?
Assuming you mean to change everything non-alphanumeric, you can do this on the command line:
    cat foo.txt | sed "s/[^A-Za-z0-9]/ /g" > bar.txt
Or in Python with the re module:
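    import re

    with open('foo.txt') as f:
        original_string = f.read()

    # Keep letters, digits, newlines, and dots; everything else becomes a space
    new_string = re.sub(r'[^a-zA-Z0-9\n.]', ' ', original_string)

    with open('bar.txt', 'w') as f:
        f.write(new_string)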
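For Python 3, the same idea can be sketched as a pair of small functions. This version replaces only the four characters named in the question rather than everything non-alphanumeric. The function names and the UTF-8 default encoding are assumptions for illustration, not part of the original answer.

```python
import re

# The four characters listed in the question: - " / .
# (a leading '-' inside a character class is taken literally)
SPECIALS = re.compile(r'[-"/.]')


def replace_specials(text):
    """Return text with each listed special character replaced by one space."""
    return SPECIALS.sub(' ', text)


def convert_file(src_path, dst_path, encoding='utf-8'):
    """Read src_path, replace the special characters, write dst_path."""
    with open(src_path, encoding=encoding) as src:
        cleaned = replace_specials(src.read())
    with open(dst_path, 'w', encoding=encoding) as dst:
        dst.write(cleaned)
```

Calling convert_file('myfiles.txt', 'bar.txt') would then write the cleaned copy to bar.txt and leave the input file unchanged.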
    import string

    specials = '-"/.'  # etc.
    # Python 2 API; in Python 3, use str.maketrans instead
    trans = string.maketrans(specials, ' ' * len(specials))

    # for line in file:
    cleanline = line.translate(trans)
    >>> line = "Indo-American pvt/ltd"
    >>> line.translate(trans)
    'Indo American pvt ltd'
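In Python 3, string.maketrans no longer exists; the table is built with the static method str.maketrans instead, and because str is Unicode, non-ASCII letters pass through untouched. A minimal sketch, using a made-up company name as sample input:

```python
# Characters to replace, as in the answer above
specials = '-"/.'

# str.maketrans maps each character in the first argument to the
# character at the same position in the second argument
trans = str.maketrans(specials, ' ' * len(specials))

line = 'Café-Olé pvt/ltd'  # hypothetical sample with non-ASCII letters
cleanline = line.translate(trans)
print(cleanline)
```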
    import re

    strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
    strs = re.sub(r'[?$.!]', '', strs)         # remove particular special characters
    strs = re.sub(r'[^a-zA-Z0-9 ]', '', strs)  # remove everything except letters, digits, and spaces
    strs = re.sub(r'[0-9]', '', strs)          # remove digits
    strs = re.sub(r' +', ' ', strs)            # collapse runs of spaces into one
    print(strs)

    Ans: how much for the maple syrup Thats ricidulous
At first I thought of providing a string.maketrans/translate example, but you may be using UTF-8 encoded strings, in which case the ord()-sorted translation table will blow up in your face, so I thought of another solution:
    conversion = '-"/.'
    text = f.read()  # f is an already-open file object
    newtext = ''
    for c in text:
        newtext += ' ' if c in conversion else c
It's not the fastest way, but easy to grasp and modify.
So if your text is non-ASCII, you could decode conversion and the text strings to unicode, and afterwards re-encode in whichever encoding you want.
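The decode, replace, re-encode round trip described above might look like this in Python 3, where bytes and str are distinct types. The function name, the default encodings, and the sample input are assumptions for illustration. ''.join is used in place of repeated += only because it avoids building many intermediate strings.

```python
CONVERSION = '-"/.'


def clean_bytes(raw, src_encoding='utf-8', dst_encoding='utf-8'):
    """Decode raw bytes, replace special characters with spaces, re-encode."""
    text = raw.decode(src_encoding)
    newtext = ''.join(' ' if c in CONVERSION else c for c in text)
    return newtext.encode(dst_encoding)


raw = 'Müller-Söhne/ltd.'.encode('utf-8')  # hypothetical sample input
result = clean_bytes(raw, dst_encoding='latin-1')
print(result.decode('latin-1'))
```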
While maketrans is the fastest way to do it, I never remember the syntax. Since speed is rarely an issue and I know regular expressions, I would tend to do this:
    >>> line = "-[myfiles.txt] MY company.INC"
    >>> import re
    >>> re.sub(r'[^a-zA-Z0-9]', ' ', line)
    '  myfiles txt  MY company INC'
This has the additional benefit of declaring the characters you accept instead of the ones you reject, which feels easier in this case.
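Since the question asks for a single white space, one possible variation is to add + after the character class so that a run of rejected characters collapses into one space. This is a small extension of the pattern above, not something the original answer does:

```python
import re

line = "-[myfiles.txt] MY company.INC"

# Without '+', every rejected character becomes its own space,
# so adjacent specials leave runs of spaces behind
per_char = re.sub(r'[^a-zA-Z0-9]', ' ', line)

# With '+', each run of rejected characters becomes one space
collapsed = re.sub(r'[^a-zA-Z0-9]+', ' ', line)

print(repr(per_char))   # '  myfiles txt  MY company INC'
print(repr(collapsed))  # ' myfiles txt MY company INC'
```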
Of course, if you are using non-ASCII characters, you'll have to go back to removing the characters you reject. If those are just punctuation signs, you can do:
    >>> import string
    >>> chars = re.escape(string.punctuation)
    >>> re.sub(r'[' + chars + ']', ' ', line)
    '  myfiles txt  MY company INC'
But you'll notice