Issue
I have a text file with the following structure:
ID,operator,a,b,c,d,true
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12236,S1,77.2,81.5,69.4,84.1,82.1
WCBP12236,S2,68.0,68.0,53.2,68.5,82.1
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12234,S1,81.8,82.7,67.0,87.5,75.3
WCBP12234,S2,66.6,67.9,53.0,70.7,75.3
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
WCBP12239,S1,86.6,87.8,23.0,23.0,82.1
WCBP12239,S2,86.0,86.9,62.3,89.7,82.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12238,S1,65.9,66.0,40.2,66.5,84.1
WCBP12238,S2,72.7,73.2,52.6,73.9,84.1
Each ID corresponds to a dataset which is analysed by an operator several times, i.e. J1 and J2 are the first and second attempts by operator J. The measures a, b, c and d use 4 slightly different algorithms to measure a value whose true value lies in the column true.
What I would like to do is create 3 new text files comparing the results for J1 vs J2, S1 vs S2, and J1 vs S1. Example output for J1 vs J2:
ID,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,75.7,76.3,80.6,79.6,65.9,61.7,83.2,81.9,82.1
WCBP12234,63.7,68.6,67.7,68.4,72.2,41.4,71.6,68.9,75.3
where a1 is measurement a for J1, etc. Another example, for S1 vs S2:
ID,a1,a2,b1,b2,c1,c2,d1,d2,true
WCBP12236,77.2,68.0,81.5,68.0,69.4,53.2,84.1,68.5,82.1
WCBP12234,81.8,66.6,82.7,67.9,67.0,53.0,87.5,70.7,75.3
The IDs will not be in alphanumeric order, nor will the operators be clustered for the same ID. I'm not certain how best to approach this task: using Linux tools or a scripting language like Perl/Python.
My initial attempt using Linux tools quickly hit a brick wall.
First, find all unique IDs (sorted). Note that sort has to run before uniq, since uniq only removes adjacent duplicates:
awk -F, '/^WCBP/ {print $1}' file | sort -k 1.5n | uniq > unique_ids
Loop through these IDs and sort the J1, J2 lines:
foreach i (`more unique_ids`)
grep $i test.txt | egrep 'J[1-2]' | sort -t',' -k2
end
This gives me the data sorted:
WCBP12234,J1,63.7,67.7,72.2,71.6,75.3
WCBP12234,J2,68.6,68.4,41.4,68.9,75.3
WCBP12236,J1,75.7,80.6,65.9,83.2,82.1
WCBP12236,J2,76.3,79.6,61.7,81.9,82.1
WCBP12238,J1,78.6,79.0,56.2,82.1,84.1
WCBP12238,J2,75.1,75.2,54.3,76.4,84.1
WCBP12239,J1,70.9,71.3,66.0,73.7,82.1
WCBP12239,J2,66.6,72.9,79.5,76.6,82.1
I'm not sure how to rearrange this data to get the desired structure. I tried adding an additional pipe to awk in the foreach loop:
awk 'BEGIN {RS="\n\n"} {print $1, $3, $10, $4, $11, $5, $12, $6, $13, $7}'
Any ideas? I'm sure this can be done in a less cumbersome manner using awk, although it may be better to use a proper scripting language.
Solution
You can use the Perl CSV module Text::CSV to extract the fields, and then store them in a hash, where ID is the main key, the second field (J1, J2, S1, S2) is the secondary key, and the whole row is the value. It should then be trivial to do whatever comparisons you want. If you want to retain the original order of your lines, you can use an array inside the first loop, as in the sketch further below.
use strict;
use warnings;
use Text::CSV;

my %data;
my $csv = Text::CSV->new({
    binary => 1,  # safety precaution
    eol    => $/, # important when using $csv->print()
});

while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;   # first two fields
    $data{$id}{$J} = $row;  # store line
}
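The answer stops at building the hash. As a hedged sketch of the remaining "trivial" step, here is one way the comparison files might be produced. Everything beyond the hash-building loop is my own illustration: the @ids array (which preserves first-seen ID order, as suggested), the header-skipping check, the list of operator pairs, and the compare_*_vs_*.txt file names are assumptions, not part of the original answer.

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => $/ });

my (%data, @ids);
while ( my $row = $csv->getline(*ARGV) ) {
    my ($id, $J) = @$row;
    next if $id eq 'ID';               # skip the header line
    push @ids, $id unless $data{$id};  # remember first-seen ID order
    $data{$id}{$J} = $row;
}

# The three requested comparisons; output names are made up for this sketch.
for my $pair ( ['J1','J2'], ['S1','S2'], ['J1','S1'] ) {
    my ($x, $y) = @$pair;
    open my $out, '>', "compare_${x}_vs_${y}.txt" or die $!;
    $csv->print($out, ['ID','a1','a2','b1','b2','c1','c2','d1','d2','true']);
    for my $id (@ids) {
        my ($rx, $ry) = ($data{$id}{$x}, $data{$id}{$y});
        next unless $rx and $ry;       # need both attempts for this ID
        # fields 2..5 are a,b,c,d; interleave them as a1,a2,b1,b2,...
        my @vals = map { ($rx->[$_], $ry->[$_]) } 2 .. 5;
        $csv->print($out, [$id, @vals, $rx->[6]]);   # field 6 is "true"
    }
    close $out;
}

Run as perl compare.pl test.txt; for the J1 vs J2 pair this should emit rows in the ID,a1,a2,...,true layout shown in the question.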
Answered By - TLP