Handling Missing Data in Inputs

By Tony Lawrence
2005-05-03



Missing data can be very annoying to a programmer. In fact, it is so annoying that we often write separate programs to clean up the data and eliminate unpleasant conditions so that the main program doesn't have to deal with them. Here, I'll show some examples of the kinds of problems we see.

Let's take a common data format, a TAB delimited file. A simplistic Perl program to read such a file might be:

#!/usr/bin/perl

while (<>) {
#split on tab into @x array
@x=split /\t/;
#print first three elements
print "$x[0]\t$x[1]\t$x[2]\n";
}
An equivalent shell script might be:

IFS="(tab here)"

while read a b c d
do
echo "$a $b $c"
done
The Perl script works, but the shell script doesn't. Here's what happens when the input file looks like this:

$ cat t;hexdump -c t

1       2       3       4
1               3       4
        2       3       4
1       2               4
                3       4
0000000 1 \t 2 \t 3 \t 4 \n 1 \t \t 3 \t 4 \n \t
0000010 2 \t 3 \t 4 \n 1 \t 2 \t \t 4 \n \t \t 3
0000020 \t 4 \n
0000023
The Perl script produces

1       2       3
1               3
        2       3
1       2
                3
but the shell script messes up:

1 2 3
1 3 4
2 3 4
1 2 4
3 4
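The shell's behavior can be reproduced in Perl for comparison. This is only a sketch and not part of the original scripts: split ' ' skips leading whitespace and treats a run of whitespace as one delimiter, which is roughly what read does when IFS holds only a tab. The helper names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# Split on every single tab: empty fields keep their position.
sub split_on_tab {
    my ($line) = @_;
    chomp $line;
    return split /\t/, $line;
}

# Split on runs of whitespace: empty fields disappear and
# everything after them slides left, as in the shell output.
sub split_on_whitespace {
    my ($line) = @_;
    chomp $line;
    return split ' ', $line;
}

my $line = "1\t\t3\t4\n";
print join("|", split_on_tab($line)), "\n";         # prints 1||3|4
print join("|", split_on_whitespace($line)), "\n";  # prints 1|3|4
```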
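The shell treats a tab in IFS as whitespace, so read collapses runs of tabs and strips leading ones; the empty fields vanish and everything after them shifts left. If this were a problem with Perl, we'd handle it like this: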

#!/usr/bin/perl

while (<>) {
# make sure there is at least one space between adjacent tabs
# (the lookahead also catches runs of three or more tabs)
s/\t(?=\t)/\t /g;
#split on tab into @x array
@x=split /\t/;
#print first three elements
print "$x[0]\t$x[1]\t$x[2]\n";
}
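One case the sample file doesn't exercise is a missing last field. By default, split drops trailing empty fields, so a line that ends in a tab yields fewer elements than expected; passing a negative limit keeps them. A minimal sketch, with a made-up helper name and an assumed field count of four:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a TAB delimited line and always
# return exactly $want fields, using '' for anything missing.
sub tab_fields {
    my ($line, $want) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields.
    my @x = split /\t/, $line, -1;
    # Pad short lines so later code can index safely.
    push @x, '' while @x < $want;
    return @x[0 .. $want - 1];
}

my @x = tab_fields("1\t2\t\t\n", 4);
print scalar(@x), " fields: ", join("|", @x), "\n";  # prints 4 fields: 1|2||
```

With every line normalized to the same number of fields, a missing value is just an empty string at a known index.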
But things can be worse. For example, if we are processing what was once a report format, we may have no delimiters, just empty space. We might see something like this:

Date          Customer             Phone           Terms     Balance

09/04/04      ABCD Corp.                           PPD       0.00
09/04/04      Abba Corp.          555-5555         Net 30    985.00
You can't process that with delimiters, but you can use unpack:

#!/usr/bin/perl

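while(<>) {
# fixed widths: date 8, gap 6, customer 20, phone 17, terms 9, balance 12
@x=unpack("A8A6A20A17A9A12",$_);
# $x[1] is the gap after the date, so skip it
print "$x[0]:$x[2]:$x[3]:$x[4]:$x[5]\n";
}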
Which will produce:

Date:Customer: Phone:Terms: Balance
::::
09/04/04:ABCD Corp.::PPD: 0.00
09/04/04:Abba Corp.:555-5555:Net 30: 985.00
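Note the stray leading spaces in that output: the "A" format strips trailing spaces but not leading ones. The sketch below is one way to tidy that up and to flag which columns are empty. The field names come from the report header; the helper name and the hash layout are assumptions for illustration, and the sample line is built with sprintf so the column widths are exact.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Column names taken from the report header; the hash layout
# and the helper name are assumptions for illustration.
my @names = qw(date customer phone terms balance);

sub parse_report_line {
    my ($line) = @_;
    chomp $line;
    # Same widths as above; the A6 gap column is thrown away.
    my ($date, undef, @rest) = unpack("A8A6A20A17A9A12", $line);
    my %rec;
    @rec{@names} = ($date, @rest);
    for my $name (@names) {
        # guard against a short line yielding fewer values
        $rec{$name} = '' unless defined $rec{$name};
        # "A" strips trailing spaces only, so trim leading ones
        $rec{$name} =~ s/^\s+//;
    }
    return \%rec;
}

# Build a sample line with a blank phone column.
my $sample = sprintf("%-8s%-6s%-20s%-17s%-9s%12s",
    "09/04/04", "", "ABCD Corp.", "", "PPD", "0.00");
my $rec = parse_report_line($sample);
for my $name (@names) {
    print "$name is missing\n" if $rec->{$name} eq '';
}
# prints phone is missing
```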
Comma separated value files can be annoying if they also contain commas within quoted fields. You can't simply split on commas because of that. There are at least two ways to handle it. Either use the Text::ParseWords module:

#!/usr/bin/perl

use Text::ParseWords;
while(<>) {
@x=quotewords(",",0,$_);
foreach (@x) {
print " $_";
}
print "\n";
}
Or (assuming the data is regular enough), replace commas not inside quotes with a different delimiter and then split it. I think ParseWords is easier.
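For completeness, the second approach might look something like the sketch below. It assumes quoted fields never contain escaped quotes and that the data never contains the stand-in delimiter; \x1F (the ASCII unit separator) is an arbitrary choice, and the helper name is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper. Assumes quoted fields contain no escaped
# quotes and that the data never contains the \x1F character.
sub csv_fields {
    my ($line) = @_;
    chomp $line;
    # Keep quoted strings as they are; turn every other comma
    # into the stand-in delimiter.
    $line =~ s/("[^"]*")|,/defined $1 ? $1 : "\x1F"/ge;
    # -1 keeps trailing empty fields
    my @x = split /\x1F/, $line, -1;
    # Drop the surrounding quotes
    s/^"(.*)"$/$1/ for @x;
    return @x;
}

print join("|", csv_fields(qq{"Smith, John",,"Boston, MA"\n})), "\n";
# prints Smith, John||Boston, MA
```

The empty middle field survives as an empty string, so missing values stay in their own column.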

But sometimes none of that is going to work either. I'm working on a project right now where the input data can have up to three fields, but any of the three can be missing and there are no delimiters and no fixed spacing. The only way to determine what we have is to know that field one, if present, is alpha, field two is a whole integer, and field three will always have a decimal point. So

ABC  982.00
8
15.45
means that I have fields 1 and 3 on line 1, only field 2 on line 2, and only field 3 on line 3. It's actually much worse than this; there are other fields, some of which are always present and some of which are not, and it is quite a challenge to normalize this stuff to be able to massage the data. The way to handle it is to split on / / and then determine what we got. So it's something like this:

#!/usr/bin/perl

while(<>) {
# collapse runs of whitespace into single spaces
s/\s+/ /g;
@x=split / /;
foreach (@x) {
# ... determine what we have based on previous field(s) seen and content
}
}
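As a rough sketch of that last step, using only the three-field rule described above (alpha, whole integer, decimal), the classification could look like this. The helper name and hash keys are made up for illustration, and real data would still need the "previous field(s) seen" logic, which is left out here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: sort the tokens on one line into up to
# three slots by content alone.
sub classify_line {
    my ($line) = @_;
    my %rec = (alpha => '', integer => '', decimal => '');
    # split ' ' ignores leading whitespace and collapses runs
    for my $tok (split ' ', $line) {
        if    ($tok =~ /^\d+\.\d+$/)  { $rec{decimal} = $tok }
        elsif ($tok =~ /^\d+$/)       { $rec{integer} = $tok }
        elsif ($tok =~ /^[A-Za-z]+$/) { $rec{alpha}   = $tok }
        else  { warn "unrecognized token '$tok'\n" }
    }
    return \%rec;
}

for my $line ("ABC  982.00", "8", "15.45") {
    my $r = classify_line($line);
    print join(":", @{$r}{qw(alpha integer decimal)}), "\n";
}
# prints ABC::982.00
#        :8:
#        ::15.45
```

Each line comes out with all three slots present, so the rest of the program can treat an empty string as "missing" without caring which line layout it came from.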




© Copyright 2005 A.P. Lawrence

