
regex - R - Cannot Read File with Control Character [SUB]

Question:

I've had this issue before, but my previous solution doesn't fix it.

In my text-data, in Notepad++ when I show all characters, a character listed as [SUB] appears.

PREVIOUSLY, I deleted these by doing this...

## Read the file in as Binary

r = readBin( curFile, raw(), file.info(curFile)$size)

## Convert the pesky characters

if ((r[1]==as.raw(0x1a)))

{

## Find it

spot = which(r == as.raw(0x1a) )

r[r == as.raw(0x1a)] = as.raw(0x20)

}

However, this isn't working. It seems like every time I manage to escape an invisible character, within a week, another one causes me a problem. Is there a way to just "clean" a file effectively of all invisible control characters other than the new-lines separating my data entries?

Please let me know. This is maddening already.

Thanks!

I can make a limited CSV file for you all to try. It's the second line, 4th column that causes the crash.

http://www.megafileupload.com/6ead/stackOverflow.csv

The entire code I was using to do this is below....

library(stringr)

############# DO THIS FIRST

folder = "C:\\Twitter_TimeSeries\\Bernie_Practice\\"

## Get the file name of every file in the directory

file.names = dir(folder, pattern=".csv")

## Figure out how many files there are

numFiles = length(file.names)

## Loop through every file

for( i in 1:length(file.names))

{

## Which file are we on?

curFile = paste( folder, file.names[i], sep="" )

## Read the file in as Binary

r = readBin( curFile, raw(), file.info(curFile)$size)

## Convert the pesky characters

if ((r[1]==as.raw(0x1a)))

{

## Find it

spot = which(r == as.raw(0x1a) )

r[r == as.raw(0x1a)] = as.raw(0x20)

}

if ((r[1]==as.raw(0x0a))) {

## Find it

spot = which(r == as.raw(0x0a) )

r[r == as.raw(0x1a)] = as.raw(0x20)

} ## If

## Re-write the file

writeBin(r, curFile)

} ## For

curFile = "stackOverflow.csv"

rawData = read.csv(curFile, stringsAsFactors=FALSE)

Answer:

Try using a regular expression to limit your data to only the allowable characters.

x = read.csv("foo.csv", colClasses="character")

## Keep only digits and '.' in every column, then convert to numeric

## (this assumes your file really represents numeric data)

x[] = lapply(x, function(col) as.numeric(gsub("[^0-9.]", "", col)))
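If the data is not purely numeric, you could instead strip the control bytes themselves before parsing. Note that the check in the question, `r[1]==as.raw(0x1a)`, only looks at the first byte of the file, so a [SUB] anywhere else is never replaced. The following is only a sketch: `clean_control_bytes` and the temporary demo file are made-up names for illustration, and it assumes that replacing every control byte (below 0x20, plus 0x7f) other than LF and CR with a space is acceptable for your data.

```r
## Replace every control byte with a space, except the bytes listed in `keep`
## (by default LF 0x0a and CR 0x0d, so the line breaks between records survive).
## Hypothetical helper, not part of base R.
clean_control_bytes <- function(r, keep = as.raw(c(0x0a, 0x0d))) {
  v <- as.integer(r)
  is_ctrl <- (v < 0x20 | v == 0x7f) & !(v %in% as.integer(keep))
  r[is_ctrl] <- as.raw(0x20)
  r
}

## Demo on a throwaway file containing a [SUB] (0x1a) in the middle of a field
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("a,b\n1,x"), as.raw(0x1a), charToRaw("y\n")), tmp)

## Read as binary, clean all bytes (not just the first one), write back
r <- readBin(tmp, raw(), file.info(tmp)$size)
writeBin(clean_control_bytes(r), tmp)

cleaned <- read.csv(tmp, stringsAsFactors = FALSE)
```

In the question's loop, this would amount to replacing the two `if` blocks with `r = clean_control_bytes(r)` before the `writeBin` call. If your files use tabs as meaningful content, add `0x09` to `keep`.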
