So I'm writing something that intends to analyze hundreds of different flat files submitted by clients. It uses Python's csv.Sniffer to check for headers, footers, delimiters, text separators -- the usual, in other words -- and then applies the resulting dialect to a pandas read_csv() call. This works pretty well, bearing in mind that we have to resort to chunks and df.merge for the larger file sizes.
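For context, the sniff-then-read step looks roughly like this -- a minimal sketch using an in-memory sample rather than a real client file, and assuming a semicolon-delimited layout purely for illustration:

```python
import csv
import io

import pandas as pd

# Stand-in for the first few KB of a client file.
sample_text = "id;name;score\n1;alice;9.5\n2;bob;7.2\n"

# Let csv.Sniffer guess the dialect and whether a header row is present.
sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample_text)
has_header = sniffer.has_header(sample_text)

# Feed the sniffed delimiter/quote character into read_csv.
df = pd.read_csv(
    io.StringIO(sample_text),
    sep=dialect.delimiter,
    quotechar=dialect.quotechar,
    header=0 if has_header else None,
)
```

In the real pipeline the sample would come from reading the head of the file, and the same `sep`/`quotechar` would be reused for the chunked reads.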
But one of the goals of this project is to get a pretty list of all the rows that have an unexpected number of columns, and NOT LOAD those rows.
While Pandas will by default skip long rows -- rows with more columns than expected -- it writes this warning straight to the screen and does not appear to use the Python warnings machinery; when I wrap the call in warnings.catch_warnings(), I can't capture the resulting warning.
So the first question is: how do I capture those long-row warnings, so I can report on the offending rows instead of just watching them scroll by?
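One route I've seen suggested, for sufficiently recent pandas (the callable form of `on_bad_lines` needs pandas >= 1.4 and `engine="python"`), is to skip the warning entirely and hand read_csv a callback that records each long row -- a sketch, not something the post above already uses:

```python
import io

import pandas as pd

# Third data row has four fields where three are expected.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

bad_rows = []

def record_bad_line(fields):
    # Called once per row with too many fields; returning None drops the row.
    bad_rows.append(fields)
    return None

# Callable on_bad_lines requires the python engine.
df = pd.read_csv(io.StringIO(sample), on_bad_lines=record_bad_line, engine="python")
```

After the read, `bad_rows` holds the raw field lists of every long row, which is exactly the "pretty list" material. Note this only fires for long rows -- short rows never reach the callback, which is why the second question below still stands.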
The second question is harder. I want identical behavior for SHORT rows -- rows that have too few columns. By default, Pandas jams short rows into the DataFrame and just fills the missing columns with
NaN. That's problematic for a variety of reasons. Since I'm doing
merge on chunked dataframes, a datatype that suddenly shifts -- because one row somewhere in the middle of the file is missing two columns -- causes merge() to bomb out. Also, because part of our goal is to report on the "quality" of these files, we want to treat these short rows as errors and report on them in the same manner as the long ones.
I can't just iterate through the rows checking whether the last column is null, because these are data files with unknown values, and the last column might well be legitimately empty in a majority of cases.
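Failing a built-in, the workaround I keep circling back to is a pre-scan: run the raw file through csv.reader with the sniffed dialect and count fields per line before pandas ever sees it. A minimal sketch (semicolon delimiter and in-memory sample are assumptions for illustration):

```python
import csv
import io

# Third line is short: two fields where three are expected.
sample = "a;b;c\n1;2;3\n4;5\n6;7;8\n"

expected = None
short_rows = []  # (line_number, field_count) pairs for the error report

reader = csv.reader(io.StringIO(sample), delimiter=";")
for lineno, fields in enumerate(reader, start=1):
    if expected is None:
        expected = len(fields)  # the header row defines the expected width
        continue
    if len(fields) < expected:
        short_rows.append((lineno, len(fields)))
```

The collected line numbers can then be fed to read_csv's `skiprows` (which is 0-indexed, so pass `[n - 1 for n, _ in short_rows]`) so the short rows are excluded from the DataFrame and reported alongside the long ones. The obvious cost is a second pass over each file, which stings on the larger ones.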
I'm hoping that Pandas has some feature I'm simply not aware of that notes when a row has the wrong number of field separators, because that's actually what I'm trying to track (for both issues, really).
It seems like this should be a relatively common request, but I can't find any documentation on such functionality.