perl - Parsing a file by summing up different columns of each row separated by blank line

Question:

I have an input file like the one below:

#

volume stats
start_time 1
length 2
--------
ID
0x00a,1,2,3,4
0x00b,11,12,13,14
0x00c,21,22,23,24

volume stats
start_time 2
length 2
--------
ID
0x00a,31,32,33,34
0x00b,41,42,43,44
0x00c,51,52,53,54

volume stats
start_time 3
length 2
--------
ID
0x00a,61,62,63,64
0x00b,71,72,73,74
0x00c,81,82,83,84

#

I need output in the following format:

1 33 36 39 42
2 123 126 129 132
3 213 216 219 222

#

Below is my code:

#!/usr/bin/perl
use strict;
use warnings;
#use File::Find;

# Define file names and its location
my $input = $ARGV[0];

# Grab the vols stats for different intervals
open (INFILE,"$input") or die "Could not open sample.txt: $!";
my $date_time;
my $length;
my $col_1;
my $col_2;
my $col_3;
my $col_4;
foreach my $line (<INFILE>)
{
    if ($line =~ m/start/)
    {
        my @date_fields = split(/ /,$line);
        $date_time = $date_fields[1];
    }
    if ($line =~ m/length/i)
    {
        my @length_fields = split(/ /,$line);
        $length = $length_fields[1];
    }
    if ($line =~ m/0[xX][0-9a-fA-F]+/)
    {
        my @volume_fields = split(/,/,$line);
        $col_1 += $volume_fields[1];
        $col_2 += $volume_fields[2];
        $col_3 += $volume_fields[3];
        $col_4 += $volume_fields[4];
        #print "$col_1\n";
    }
    if ($line =~ /^$/)
    {
        print "$date_time $col_1 $col_2 $col_3 $col_4\n";
        $col_1=0;$col_2=0;$col_3=0;$col_4=0;
    }
}
close (INFILE);

#

My code's result is:

1
33 36 39 42
2
123 126 129 132

#

Basically, for each time interval, it just sums up each column over all the lines and displays the column sums against that time interval.

Answer:

$/ is your friend here. Try setting it to '' to enable paragraph mode (separating your data by blank lines).

#!/usr/bin/env perl

use strict;
use warnings;

local $/ = ''; 

while ( <> ) {
    my ( $start ) = m/start_time\s+(\d+)/;
    my ( $length ) = m/length\s+(\d+)/;
    my @row_sum; 
    for ( m/(0x.*)/g )  {
        my ( $key, @values ) = split /,/; 
        for my $index ( 0..$#values ) {
           $row_sum[$index] += $values[$index];
        }
    }
    print join ( "\t", $start, @row_sum ), "\n";
}

Output:

1       33      36      39      42
2       123     126     129     132
3       213     216     219     222

NB: the output is tab-separated. You can use sprintf (or printf) if you need more flexible formatting.
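
As a rough illustration of the sprintf option, here is a minimal sketch that pads each field to a fixed width instead of using tabs. The helper name format_row and the widths (4 and 6) are arbitrary choices made for this example, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: left-justify the start time in 4 columns and
# right-justify every column sum in 6 columns.
sub format_row {
    my ($start, @sums) = @_;
    # '%6d' x @sums repeats the field spec once per sum.
    my $format = '%-4s' . ('%6d' x @sums);
    return sprintf($format, $start, @sums);
}

# Rows taken from the expected output in the question.
my @rows = (
    [ 1, 33,  36,  39,  42 ],
    [ 2, 123, 126, 129, 132 ],
    [ 3, 213, 216, 219, 222 ],
);

print format_row(@$_), "\n" for @rows;
```

Because the format string is built from the number of sums, the same helper keeps working if the input gains or loses columns.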

I would also suggest that instead of:

my $input = $ARGV[0]; 
open (my $input_fh, '<', $input) or die "Could not open $input: $!";

You would be better off with:

while ( <> ) { 

Because <> is the magic filehandle in Perl: it opens the files named on the command line and reads them one at a time, and if none are given, it reads STDIN. This is just how grep/sed/awk behave.

So you can still run this as scriptname.pl sample.txt, or pipe data into it with curl http://somewebserver/sample.txt | scriptname.pl, or pass several files at once: scriptname.pl sample.txt anothersample.txt moresample.txt
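
To see the magic filehandle at work without a real command line, the following sketch fills @ARGV by hand with the path of a temporary file and then reads it through <>. The temporary file and the variable names are assumptions made purely for this demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway input file (stands in for sample.txt).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "first\nsecond\n";
close $tmp_fh or die "Cannot close $tmp_path: $!";

# <> reads the files listed in @ARGV; normally the shell fills this in.
@ARGV = ($tmp_path);

my @seen;
while (my $line = <>) {
    chomp $line;
    push @seen, $line;
}

print "read ", scalar(@seen), " lines: @seen\n";
```

With an empty @ARGV the same loop would read STDIN instead, which is what makes the script usable at the end of a pipe.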

Also - if you want to open the file yourself, you're better off using lexical vars and 3 arg open:

open ( my $input_fh, '<', $ARGV[0] ) or die $!; 

And you really shouldn't ever be using 'numbered' variables like $col_1 etc. If the names contain numbers, then an array is almost always better.

Answer:

Basically, a block begins with start_time and ends with a line of whitespace. If instead the end of a block is always guaranteed to be an empty line, you can change the test below.

It helps to use arrays instead of variables with integer suffixes.

When you hit the start of a new block, record the start_time value. When you hit a stat line, update column sums, and when you hit a line of whitespace, print the column sums, and clear them.

This way, you keep your program's memory footprint proportional to the longest line of input as opposed to the largest block of input. In this case, there isn't a huge difference, but, in real life, there can be. Your original program was reading the entire file into memory as a list of lines, which would really cause your program's memory footprint to balloon when used with large input sizes.

#!/usr/bin/env perl

use strict;
use warnings;

my $start_time;
my @cols;

while (my $line = <DATA>) {
    if ( $line =~ /^start_time \s+ ([0-9]+)/x) {
        $start_time = $1;
    }
    elsif ( $line =~ /^0x/ ) {
        my ($id, @vals) = split /,/, $line;
        for my $i (0 .. $#vals) {
            $cols[ $i ] += $vals[ $i ];
        }
    }
    elsif ( !($line =~ /\S/) ) {
        # guard against the possibility of
        # multiple blank/whitespace lines between records
        if ( @cols ) {
            print join("\t", $start_time, @cols), "\n";
            @cols = ();
        }
    }
}

# in case there is no blank/whitespace line after last record
if ( @cols ) {
    print join("\t", $start_time, @cols), "\n";
}

__DATA__
volume stats
start_time  1
length      2
--------
ID
0x00a,1,2,3,4
0x00b,11,12,13,14
0x00c,21,22,23,24

volume stats
start_time  2
length      2
--------
ID
0x00a,31,32,33,34
0x00b,41,42,43,44
0x00c,51,52,53,54

volume stats
start_time  3
length      2
--------
ID
0x00a,61,62,63,64
0x00b,71,72,73,74
0x00c,81,82,83,84

Output:

1       33      36      39      42
2       123     126     129     132
3       213     216     219     222
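
The memory point made in this answer comes down to context: foreach evaluates <$fh> in list context and reads every line before the first iteration, while while (my $line = <$fh>) reads one line per iteration. The sketch below makes that visible by checking eof inside the first iteration. It uses an in-memory filehandle and two helper subs whose names are invented for this example.

```perl
use strict;
use warnings;

# Five short lines held in a string, read through an in-memory filehandle.
my $text = join '', map { "line $_\n" } 1 .. 5;

# Returns 1 if the handle is already at end-of-file during the first
# iteration of a foreach loop, 0 otherwise.
sub foreach_hits_eof_first {
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    for my $line (<$fh>) {
        return eof($fh) ? 1 : 0;
    }
    return;
}

# Same check for a while loop that reads one line at a time.
sub while_hits_eof_first {
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    while (my $line = <$fh>) {
        return eof($fh) ? 1 : 0;
    }
    return;
}

print "foreach at EOF on first iteration: ", foreach_hits_eof_first(), "\n";
print "while   at EOF on first iteration: ", while_hits_eof_first(),   "\n";
```

The foreach version reports that the whole input has already been consumed, which is why it holds every line in memory at once; the while version still has four lines left to read.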

Answer:

When I run your code, I get warnings:

Use of uninitialized value $date_time in concatenation (.) or string

I fixed it by using \s+ instead of / /.
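
To show how the two separators differ, here is a small sketch that splits a start_time line both ways. The sample strings are assumptions: one uses a single space as in the question, the other uses two spaces as in the __DATA__ section of another answer, and both end in a newline as lines read from a file normally do.

```perl
use strict;
use warnings;

my $one_space  = "start_time 1\n";
my $two_spaces = "start_time  1\n";

my @one = split / /,   $one_space;    # second field keeps the trailing newline
my @two = split / /,   $two_spaces;   # second field is the empty string
my @ws  = split /\s+/, $two_spaces;   # second field is the bare number

# Show a field with its newline made visible.
sub show {
    my ($field) = @_;
    $field =~ s/\n/\\n/g;
    return "[$field]";
}

print "split / /   on one space:  ", show($one[1]), "\n";
print "split / /   on two spaces: ", show($two[1]), "\n";
print "split /\\s+/ on two spaces: ", show($ws[1]),  "\n";
```

With / / the field either carries the newline along, which would be consistent with the 1 appearing on a line of its own in the question's output, or comes back empty when the columns are padded with extra spaces. With /\s+/ the trailing newline is treated as a separator too, so the field is just the number.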

I also added a print after your loop in case the file does not end with a blank line.

Here is minimally-changed code to produce your desired output:

use strict;
use warnings;

# Define file names and its location
my $input = $ARGV[0];

# Grab the vols stats for different intervals
open (INFILE,"$input") or die "Could not open sample.txt: $!";
my $date_time;
my $length;
my $col_1;
my $col_2;
my $col_3;
my $col_4;
foreach my $line (<INFILE>)
{
    if ($line =~ m/start/)
        {
            my @date_fields = split(/\s+/,$line);
            $date_time = $date_fields[1];
        }
    if ($line =~ m/length/i)
        {
            my @length_fields = split(/\s+/,$line);
            $length = $length_fields[1];
        }
    if ($line =~ m/0[xX][0-9a-fA-F]+/)
        {
            my @volume_fields = split(/,/,$line);
            $col_1 += $volume_fields[1];
            $col_2 += $volume_fields[2];
            $col_3 += $volume_fields[3];
            $col_4 += $volume_fields[4];
        }
    if ($line =~ /^$/)
        {
            print "$date_time $col_1 $col_2 $col_3 $col_4\n";
            $col_1=0;$col_2=0;$col_3=0;$col_4=0;
        }
}
print "$date_time $col_1 $col_2 $col_3 $col_4\n";
close (INFILE);


__END__

1 33 36 39 42
2 123 126 129 132
3 213 216 219 222