
Modifying a Perl script to use less memory and run faster

Problem description:

I have written a script that compares multiple files and reports the number of occurrences of each paragraph in each file. The script works fine with smaller files, but on large files it hangs and produces no output. I need some help modifying the script so that it can handle all files, even very large ones. My script:

#!/usr/bin/env perl
use strict;
use warnings;
no warnings qw( numeric );

my %seen;

$/ = "";    # paragraph mode: records are separated by blank lines
while (<>) {
    chomp;
    my ($key, $value) = split('\t', $_);   # record text, then "N :file"
    my @lines = split /\n/, $key;
    my $key1  = $lines[1];                 # second line of the record
    $seen{$key1} //= [ $key ];
    push(@{$seen{$key1}}, $value);
}

my $tot;
my $file_count = @ARGV;

while ( my ( $key1, $aref ) = each %seen ) {
    $tot = 0;
    for my $val ( @{ $aref } ) {
        $tot += $val;                      # "N :file" numifies to N
    }
    if ( @{ $aref } >= $file_count ) {
        print join "\t", @{ $aref };
        print "\tcount:" . $tot . "\n\n";
    }
}

I am providing sample files to make the situation easier to understand. data1.txt and data2.txt contain samples of the data I have. I need to sum the occurrences of a read across all files whenever the second line of the read matches, i.e. the output for the two files should look like output.txt:

**data1.txt**

@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt

@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt

@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt

**data2.txt**

@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt

@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt

@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt

**output.txt**

@NS500278
AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC
+
=CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2

@NS500278
CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5

@NS500278
TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG
+
CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4

I was trying to tie my hash to a file, but I could not understand the concept. It would be of great help if someone could explain the solution with a short example. Any help will be appreciated.

Answer:

As others have pointed out in the comments, the real concern here is scale: the hash grows with the input until memory runs out. Tying that monster hash to disk should be your next step in solving the problem. Please consult perltie, and if you want an example, the Perl Cookbook's DB_File script should be useful. If my memory serves, both keys and values in Berkeley DB can hold up to four gigabytes, which should be more than ample for your needs. By the way, a nice side effect of this approach is that you can easily reuse the previous results.
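To make that concrete, here is a minimal sketch of how the script above could be rewritten around a tied hash. It assumes the DB_File module (the Berkeley DB binding mentioned above) and an on-disk file named seen.db, both illustrative choices rather than anything from the original question; and because a tied DB_File hash can only store flat strings, the per-key array reference is replaced by a tab-joined string:

#!/usr/bin/env perl
# Sketch only: DB_File, $DB_BTREE and the file name seen.db are
# illustrative assumptions, not part of the original question or answer.
use strict;
use warnings;
no warnings qw( numeric );
use Fcntl;
use DB_File;

# The hash now lives on disk, so memory use stays roughly constant.
my %seen;
tie %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie seen.db: $!";

my $file_count = @ARGV;

$/ = "";                                    # paragraph mode, as in the original
while (<>) {
    chomp;
    my ($key, $value) = split /\t/, $_, 2;  # record text, then "N :file"
    my $key1 = ( split /\n/, $key )[1];     # second line of the record
    # A tied DB_File hash stores plain strings, not array references,
    # so accumulate everything into one tab-joined string per key.
    $seen{$key1} = exists $seen{$key1}
        ? $seen{$key1} . "\t" . $value
        : $key . "\t" . $value;
}

while ( my ( $key1, $joined ) = each %seen ) {
    my @fields = split /\t/, $joined;         # record text followed by its counts
    my $tot = 0;
    $tot += $_ for @fields[ 1 .. $#fields ];  # "3 :data1.txt" numifies to 3
    if ( @fields >= $file_count ) {           # same test as the original script
        print join( "\t", @fields ), "\tcount:$tot\n\n";
    }
}

untie %seen;

Because the data persists in seen.db between runs, rerunning the script on additional files builds on the earlier counts, which is the reuse alluded to above; delete the file first if you want a fresh count.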
