I am trying to concatenate a number of Hebrew .txt files from a single folder into one file. The encoding is cp1255, for Hebrew. When I specify the encoding, the file opens successfully, but the encoding then fails when I try to write the string to the output file.
If I don't specify the encoding in the open call, the open itself fails (on line 7).
    for f in files:
        data = open(dirLoc + '/' + f, 'r', encoding="cp1255")
        for line in data:
The error I get is the standard:
    UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
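For what it's worth, the error is easy to reproduce in isolation: cp1255 Hebrew letters are single high bytes, which are not valid sequences under most other codecs, so decoding them with the wrong codec (or the platform default) raises exactly this exception. A minimal sketch, assuming Python 3:

```python
# Minimal sketch: "shalom" encoded as cp1255, then decoded with the
# wrong codec. The cp1255 bytes are not valid UTF-8, so decode()
# raises UnicodeDecodeError.
raw = "שלום".encode("cp1255")   # b'\xf9\xec\xe5\xed'
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # reports the offending byte and its position
```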
Edit: Having played around with it some more, the problem definitely seems to be with writing a Hebrew string to the .txt file. This is true even if I resave the source file in a different format (such as ANSI or UTF-8) and change the encoding argument accordingly. Everything works fine with English .txt files.
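The write-side failure can be sketched on its own: if the output file is opened without an explicit encoding, write() encodes with the platform default codec (often cp1252 on Windows), which has no Hebrew characters. A hedched minimal sketch, using ascii as a stand-in for such a wrong default codec:

```python
# Sketch of the write-side failure: encoding Hebrew text with a
# codec that lacks Hebrew characters raises UnicodeEncodeError,
# which is what happens when the output file falls back to the
# platform default codec instead of cp1255 or utf-8.
text = "שלום"
try:
    text.encode("ascii")          # stand-in for a wrong default codec
except UnicodeEncodeError as exc:
    print(exc)
print(text.encode("cp1255"))      # succeeds: b'\xf9\xec\xe5\xed'
```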
Okay, having played around with this for another day, I found a solution, as follows:
    import os
    import codecs

    dirLoc = 'source/folder'
    files = os.listdir(dirLoc)
    # open the output once, with an explicit encoding for writing
    out = codecs.open(dirLoc + '/outPut.txt', 'a+', encoding='utf8')
    for f in files:
        if f.endswith('.txt') and f != 'outPut.txt':
            data = codecs.open(dirLoc + '/' + f, 'r+', encoding='utf8')
            try:
                data1 = data.read()
                try:
                    out.write(data1)
                except UnicodeError:
                    print('file ' + f + ' failed to write')
            except UnicodeError:
                print('file ' + f + ' failed to read')
            data.close()
    out.close()
codecs.open lets me specify the encoding for the write side as well as the read side; note that you have to import codecs in order to use it. The try/except blocks are there because the encoding is still a problem and the occasional file throws an exception; they let me skip any file that fails to read or write instead of aborting the whole run.
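On Python 3 the same thing can be done with the built-in open(), which also accepts an encoding= argument, plus an errors= argument that replaces undecodable bytes instead of aborting the whole file. A minimal sketch under that assumption; the function name and folder layout are my own, not from the original:

```python
import os

def concat_txt_files(dir_loc, out_name="output.txt", encoding="cp1255"):
    # Concatenate every .txt file in dir_loc into one output file.
    # Both the read and the write side get an explicit encoding;
    # errors="replace" turns any undecodable byte into U+FFFD
    # rather than skipping the whole file.
    out_path = os.path.join(dir_loc, out_name)
    with open(out_path, "w", encoding=encoding) as out:
        for name in sorted(os.listdir(dir_loc)):
            if name.endswith(".txt") and name != out_name:
                with open(os.path.join(dir_loc, name), "r",
                          encoding=encoding, errors="replace") as src:
                    out.write(src.read())
    return out_path
```

The with-statements close every file even when an exception occurs, so the explicit close() calls (and the bug they can hide when close() runs on a variable that was never assigned) go away.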