Issue
I have a text file with the following structure:
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
Each ID corresponds to a dataset which is analysed by an operator several times, i.e. J1 and J2 are the first and second attempts by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value lies in the column true.
What I would like to do is create 3 new text files comparing the results for J1 vs J2, S1 vs S2, and J1 vs S1. Example output for J1 vs J2:
ID,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
where a1 is measurement a for J1, etc. Another example, for S1 vs S2:
ID,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3
The IDs will not be in alphanumeric order, nor will the operators be clustered for the same ID. I'm not certain how best to approach this task: using Linux tools or a scripting language like Perl/Python.
My initial attempt using Linux tools quickly hit a brick wall.
First, find all unique IDs (sorted). Note that sort has to run before uniq, since uniq only removes adjacent duplicates:
awk -F, '/^WCBP/ {print $1}' file | sort -k 1.5n | uniq > unique_ids
Loop through these IDs and sort the J1, J2 lines:
foreach i (`more unique_ids`)
grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
This gives me the data sorted:
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop:
awk 'BEGIN {RS="\n\n"} {print $1, $3, $10, $4, $11, $5, $12, $6, $13, $7}'
Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better to use a proper scripting language.
Solution
You can use the Perl CSV module Text::CSV to extract the fields, and then store them in a hash, where ID is the main key, the second field (J1, J2, S1, S2) is the secondary key, and the whole row is the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop, as in the sketch further below.
use strict;
use warnings;
use Text::CSV;

my %data;
my $csv = Text::CSV->new({
    binary => 1,  # safety precaution
    eol    => $/, # important when using $csv->print()
});

while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;   # first two fields
    $data{$id}{$J} = $row;  # store line
}
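The answer stops at building the hash. As a hedged sketch of the remaining "trivial" step, here is one way the comparison files might be produced. Everything beyond the hash-building loop is my own illustration: the @ids array (which preserves first-seen ID order, as suggested), the header-skipping check, the list of operator pairs, and the compare_*_vs_*.txt file names are assumptions, not part of the original answer.

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => $/ });

my (%data, @ids);
while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;
    next if $id eq 'ID';               # skip the header line
    push @ids, $id unless $data{$id};  # remember first-seen ID order
    $data{$id}{$J} = $row;
}

# The three requested comparisons; output names are made up for this sketch.
for my $pair ( ['J1','J2'], ['S1','S2'], ['J1','S1'] ) {
    my ($x, $y) = @$pair;
    open my $out, '>', "compare_${x}_vs_${y}.txt" or die $!;
    $csv->print($out, ['ID','a1','a2','b1','b2','c1','c2','d1','d2','true']);
    for my $id (@ids) {
        my ($rx, $ry) = ($data{$id}{$x}, $data{$id}{$y});
        next unless $rx and $ry;       # need both attempts for this ID
        # fields 2..5 are a,b,c,d; interleave them as a1,a2,b1,b2,...
        my @vals = map { ($rx->[$_], $ry->[$_]) } 2 .. 5;
        $csv->print($out, [$id, @vals, $rx->[6]]);   # field 6 is "true"
    }
    close $out;
}

Run as perl compare.pl test.txt; for the J1 vs J2 pair this should emit rows in the ID,a1,a2,...,true layout shown in the question.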
Answered By - TLP