
perl - Searching for specific duplicate IDs

Question:

I've written a Perl script that reads in two files, compares the IDs in them, and prints only the data rows whose IDs match. The ID file is read into an array, while the data file is read line by line. This all works well, but now I need to extend it. My data file sometimes has rows with a duplicated ID, because the subject has made more than one visit to give samples. I therefore need to find these duplicates and keep only the row with the latest date of visit.

So my data file looks something like this:

 ID DOV Data1 Data2 etc etc

Now I've seen hashes are the way to search for duplicates, however all the fixes I've seen have been to simply remove the duplicates indiscriminately, which isn't what I want.

Any ideas?

Answer:
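# Assumes IDFILE and DATAFILE are already-open filehandles.
# read id file
my %id_hash;
while (<IDFILE>) {
  chomp;
  $id_hash{$_} = 1;   # 1 marks "wanted, no data row seen yet"
}

# read data file
while (<DATAFILE>) {
  chomp;
  my @arr = split ' ', $_;   # split ' ' also ignores leading whitespace
  next unless @arr >= 2;     # need at least ID and DOV
  if (exists $id_hash{$arr[0]}) { # only process if exists in id file
    # and only if this is the first data entry or a later visit
    # (string comparison: assumes DOV sorts as text, e.g. YYYY-MM-DD)
    if ( (not ref $id_hash{$arr[0]}) or ($id_hash{$arr[0]}[1] lt $arr[1]) ) {
      # store all data in an array ref
      $id_hash{$arr[0]} = [ @arr ];
    }
  }
}

for my $id (sort keys %id_hash) {
  next unless ref $id_hash{$id};   # ID never appeared in the data file
  print join(" ", @{$id_hash{$id}}), "\n";
}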
Answer:

This will show the last DOV for each ID, making a lot of assumptions about the input data, so there's a good chance that it won't work out-of-the-box for you. (In particular, if your input data isn't sorted by date, it won't work at all because it just takes the last date seen for each ID. Also, if dates are formatted in a way that includes spaces, such as "Mon Jul 9 15:51:22 CEST 2012", it will only get the date up to the first space ("Mon" in this example).) The point here is just to demonstrate the basic technique, not to provide a full solution.

#!/usr/bin/env perl    

use strict;
use warnings;

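my %visit;
while (<DATA>) {
  # split with no arguments splits $_ on whitespace; keep the first two fields
  my ($id, $date) = split;
  # a later line for the same ID overwrites the earlier one
  $visit{$id} = $date;
}

for my $id (sort keys %visit) {
  print "$id => $visit{$id}\n";
}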

__DATA__
1       2012-01-01
2       2012-01-02
1       2012-02-03
3       2012-02-04
2       2012-03-05
3       2012-03-06
4       2012-04-07
1       2012-04-08
5       2012-05-09
1       2012-05-10
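
If the input is not sorted by date, or the dates are not in a format that sorts as text, one option is to parse each date and overwrite only when the new visit is later. What follows is a minimal sketch, not a full solution: the helper name latest_visits, the sample rows, and the '%d/%m/%Y' format are all assumptions made for illustration, so the format string would need to match the real DOV column. It uses Time::Piece, which ships with Perl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Piece;   # core module; provides strptime

# Hypothetical helper: for each ID, keep the row with the latest visit date,
# whatever order the rows arrive in. Returns { id => [epoch, original_row] }.
# $format is an assumption about how DOV is written in the data file.
sub latest_visits {
  my ($lines, $format) = @_;
  my %latest;
  for my $line (@$lines) {
    my ($id, $dov) = split ' ', $line;
    next unless defined $dov;                          # blank or short line
    my $t = eval { Time::Piece->strptime($dov, $format) };
    next unless defined $t;                            # unparsable date
    my $epoch = $t->epoch;
    if (!exists $latest{$id} or $epoch > $latest{$id}[0]) {
      $latest{$id} = [ $epoch, $line ];
    }
  }
  return \%latest;
}

# Made-up sample rows, deliberately out of date order.
my @rows = (
  "1 03/02/2012 a b",
  "2 05/03/2012 c d",
  "1 10/05/2012 e f",
  "1 08/04/2012 g h",
);
my $latest = latest_visits(\@rows, '%d/%m/%Y');
print "$latest->{$_}[1]\n" for sort keys %$latest;
```

Dates that contain spaces would still be cut off by the whitespace split, so they would need a different way of isolating the date column before parsing.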