Solved: Re: How do I read messy raw data as one observation per line?

jwinkle · Posted 07-25-2016 09:33 AM

Hi all,

I am having an issue reading raw data into Enterprise Guide due to the nature of the data. For reasons that are too long to explain in this forum, I cannot use the import wizard.

Essentially, I have multiple variables per line of data, but the placement of delimiters (and even amount of variables per line) is inconsistent. What I would just like to do at this point is import the data so that each line of data is its own observation for ONE variable.

I have gotten to the point where I have a variable that contains the beginning of each line, but once it reaches the first delimiter, it skips the rest of that line and begins again on the next line.

Here is a sample of the data (2 consecutive dashes represent tabs for the purpose of demonstration):

SPAIN -- -- -- -- -- --

Catalunya -- -- -- -- -- --

Mercedes Benz -- 4992 -- Barcelona, ES -- CR -- -- -- --

123 Tapas Street -- Paraque -- 21441 -- 12 -- 9842

456 Toro Avenue -- NoSe -- 78443 -- 97 -- 821

Z 789 Del Mar Road -- Refor -- 12743 -- 50 -- 6612

TOTALS FOR ENTITY 4992 -- 3 Office(s) -- 17,275

This would be one sample section of data, but some other chunks may have only 1 address or 10 addresses per entity (ie there is very little consistency). Also, there is sometimes additional data in between each of these country sections (like seen above) that is very different. The point is, I would like to just read it as one line=one observation with a total of one variable. So, the data above would look something like this:

Obs VAR1

1 SPAIN

2 Catalunya

3 Mercedes Benz 4992 Barcelona, ES CD

.

7 TOTALS FOR ENTITY 3 Office(s) 17,275

------------------------------------------------------------------------------

THANKS IN ADVANCE FOR THE HELP!!

ballardw · Posted 07-25-2016 10:59 AM

Without more complete examples it will be hard to address specific items.

It is very likely that you want to start with

input @;

To hold the input line at the current location.

The parse out the value

if index(_infile_,' ') then region = input(_infile_,$50.);

Which gets a bit messier for the the other rows.

Then test for the presence of specific. Also if your data is irregular in the number of lines you will need to read for a specific record you may end up wanting to RETAIN variables (such as what appears to be Country and Region) and have an OUTPUT statement only after you have read the detail line of the data..

Note: messy data often equals messy code to deal with it.

And no need to apologize for not using a wizard. The wizards are set up for fairly regular data, which this isnt'.

View solution in original post

LinusH · Posted 07-25-2016 09:40 AM

After an INPUT statement, the automatic variable _infile_ is created for you, which you could assign to var1.

Be sure to specify a length to var1 that can hod the entire line.

Data never sleeps

jwinkle · Posted 07-25-2016 10:02 AM

Thanks for the quick reply!

The only problem I am having with this solution is that I'm trying to write it so that if a line contains a certain string, it will print that into a specific variable, but when it finds the string I specify, it prints the NEXT line to that variable.....

EXAMPLE :

if index(_infile_,' ') then
input Region $CHAR50.;

So in this example, it does not print Catalunya as the region, but rather the line starting with "mercedes benz ..."

ballardw · Posted 07-25-2016 10:59 AM