当前位置: 动力学知识库 > 问答 > 编程问答 >

python - Capture warning(s) and skip row(s) for any long or short rows in Pandas dataframe

问题描述:

So I'm writing something that intends to analyze hundreds of different flat files submitted by clients. It uses csv_sniffer to check for headers, footers, delimiters, text separators -- the usual, in other words -- and then applies the resulting dialect to a pd.read_csv.

This works pretty well, bearing in mind that we have to resort to chunks and df.merge for the larger filesizes.

But one of the goals of this project is to get a pretty list of all the rows that have an unexpected number of columns, and NOT LOAD those rows.

While Pandas will by default skip long rows -- rows with more columns than expected -- it writes this warning to the screen and does not appear to use the Python warnings suite; when I throw a read_csv inside catch_warnings, I can't capture the resulting warning.

So the first question is:

  • do I need to capture stdout or stderr to a string array in order to be able to record, parse, and output a reworded version of this warning?

The second question is harder. I want identical behavior for SHORT rows, rows that have too few columns. By default, Pandas jams short rows into the DataFrame and just fills null columns with NaN. This is problematic behavior for a variety of reasons; since I'm doing merge on chunked dataframes, a datatype that suddenly shifts because one row somewhere in the middle of the file is missing two columns in the middle causes merge() to bomb out (Also, because part of our goal is to report on the "quality" of these files, we want to consider these short rows to be errors and report on them in the same manner that we want to report on long ones).

I can't just iterate through rows to see whether the last column is null, because these are data files with unknown values and consequently the last column might well be legitimately empty in a majority of cases.

I'm hoping that Pandas has some feature I'm simply not aware of that notes when a row has the wrong number of field separators, because that's actually what I'm trying to track (for both issues, really):

  • in the event that a row has too many or too few fields. is it possible to capture that row number to a string array and skip loading the row?

It seems like this should be a relatively common request, but I can't find any documentation on such functionality.

分享给朋友:
您可能感兴趣的文章:
随机阅读: