Handling Missing Data in Inputs
By Tony Lawrence, 2005-05-03
Let's take a common data format, a tab-delimited file. A simplistic Perl program to read such a file might be:
#!/usr/bin/perl
while (<>) {
    # split on tab into the @x array
    @x = split /\t/;
    # print the first three elements
    print "$x[0]\t$x[1]\t$x[2]\n";
}
An equivalent shell script might be:
IFS="(tab here)"
while read a b c d
do
    echo "$a $b $c"
done
The Perl script works, but the shell script doesn't. Here's the output if the input file looks like this:
$ cat t;hexdump -c t
1       2       3       4
1               3       4
        2       3       4
1       2               4
                3       4
0000000   1  \t   2  \t   3  \t   4  \n   1  \t  \t   3  \t   4  \n  \t
0000010   2  \t   3  \t   4  \n   1  \t   2  \t  \t   4  \n  \t  \t   3
0000020  \t   4  \n
0000023
The Perl script produces:
1       2       3
1               3
        2       3
1       2
                3
but the shell script messes up:
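1 2 3
1 3 4
2 3 4
1 2 4
3 4
Because tab is a whitespace character, the shell treats a run of tabs in IFS as a single separator and strips leading ones, so the empty fields vanish and the later fields shift left. If this were a problem with Perl, we'd handle it like this: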
#!/usr/bin/perl
while (<>) {
    # make sure there is at least one space between adjacent tabs;
    # the lookahead also covers runs of three or more tabs
    s/\t(?=\t)/\t /g;
    # split on tab into the @x array
    @x = split /\t/;
    # print the first three elements
    print "$x[0]\t$x[1]\t$x[2]\n";
}
But things can be worse. For example, if we are processing what was once a report format, we may have no delimiters, just empty space. We might see something like this:
Date          Customer             Phone           Terms     Balance

09/04/04      ABCD Corp.                           PPD       0.00
09/04/04      Abba Corp.          555-5555         Net 30    985.00
You can't process that with delimiters, but you can use unpack:
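#!/usr/bin/perl
while (<>) {
    # drop the newline so it can't end up in the last field
    chomp;
    # fixed-width fields of 8, 6, 20, 17, 9 and 12 characters
    @x = unpack("A8A6A20A17A9A12", $_);
    # $x[1], the gap between date and customer, is not printed
    print "$x[0]:$x[2]:$x[3]:$x[4]:$x[5]\n";
}
Which will produce: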
Date:Customer: Phone:Terms: Balance
::::
09/04/04:ABCD Corp.::PPD: 0.00
09/04/04:Abba Corp.:555-5555:Net 30: 985.00
Comma-separated value files can be annoying if they also contain commas within quoted fields. You can't use a plain split because of that. There are at least two ways to handle it: either use the Text::ParseWords module:
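#!/usr/bin/perl
use Text::ParseWords;
while (<>) {
    # drop the newline so the last field doesn't carry it
    chomp;
    # split on commas outside quotes; the 0 means the quotes are removed
    @x = quotewords(",", 0, $_);
    foreach (@x) {
        print " $_";
    }
    print "\n";
}
Or (assuming the data is regular enough), replace commas not inside quotes with a different delimiter and then split on that. I think ParseWords is easier.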
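The second approach, swapping the unquoted commas for another delimiter and then splitting, might be sketched as below. This is only an illustration under stated assumptions: the helper name split_csv_line and the sample line are made up, the data is assumed to contain no tabs, and quoted fields are assumed not to contain escaped quotes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn every comma outside double quotes into a tab,
# then split on tabs. Assumes no tabs in the data and no escaped quotes.
sub split_csv_line {
    my ($line) = @_;
    # A quoted run is matched first and put back unchanged; any comma that
    # is matched on its own must therefore be outside quotes.
    $line =~ s/("[^"]*")|,/defined($1) ? $1 : "\t"/ge;
    # The -1 limit keeps trailing empty fields instead of dropping them.
    my @fields = split /\t/, $line, -1;
    # Remove the surrounding quotes from quoted fields.
    s/^"(.*)"$/$1/ for @fields;
    return @fields;
}

my $sample = 'Abba Corp.,"Smith, John",,985.00';   # made-up sample line
print join("|", split_csv_line($sample)), "\n";
# prints: Abba Corp.|Smith, John||985.00
```

Note that the empty third field survives, which is the point when the input may have missing data.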
But sometimes none of that is going to work either. I'm working on a project right now where the input data can have up to three fields, but any of the three can be missing, and there are no delimiters or padding to show where a missing field would have been. The only way to determine what we have is to know that field one, if present, is alphabetic, field two is a whole integer, and field three always has a decimal point. So
ABC 982.00
8
15.45
means that I have fields 1 and 3 on line 1, only field 2 on line 2, and only field 3 on line 3. It's actually much worse than this; there are other fields, some of which are always present and some of which are not, and it is quite a challenge to normalize this stuff so that the data can be massaged. The way to handle it is to split on / / and then determine what we got. So it's something like this:
#!/usr/bin/perl
while (<>) {
    # collapse each run of whitespace to a single space
    s/\s+/ /g;
    @x = split / /;
    foreach (@x) {
        # ... determine what we have based on previous field(s) seen and content
    }
}
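For the simplified three-field case described above, the "determine what we have" step could look roughly like this. It is a sketch, not the project's actual code: the helper name normalize_line is hypothetical, and it classifies each token by content alone, whereas the real data also needs to take the previously seen fields into account.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: put each token of a line into slot 0, 1 or 2 based
# on its content alone. Missing fields are left as empty strings.
sub normalize_line {
    my ($line) = @_;
    my @rec = ('', '', '');
    # split ' ' skips leading whitespace and splits on runs of whitespace
    for my $tok (split ' ', $line) {
        if    ($tok =~ /^[A-Za-z]+$/) { $rec[0] = $tok }   # field one: alphabetic
        elsif ($tok =~ /^\d+$/)       { $rec[1] = $tok }   # field two: whole integer
        elsif ($tok =~ /^\d+\.\d+$/)  { $rec[2] = $tok }   # field three: has a decimal point
        else  { warn "unrecognized token '$tok'\n" }
    }
    return @rec;
}

# The three sample lines from the text, written out tab-delimited
for my $line ("ABC 982.00", "8", "15.45") {
    print join("\t", normalize_line($line)), "\n";
}
```

Once every record has all three slots, even if some are empty, the output can be handled as an ordinary tab-delimited file.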
© Copyright 2005 A.P. Lawrence

