How do I read messy raw data as one observation per line?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 5
Accepted Solution

How do I read messy raw data as one observation per line?

Hi all,

 

I am having an issue reading raw data into Enterprise Guide due to the nature of the data.  For reasons that are too long to explain in this forum, I cannot use the import wizard.  

 

Essentially, I have multiple variables per line of data, but the placement of delimiters (and even amount of variables per line) is inconsistent.  What I would just like to do at this point is import the data so that each line of data is its own observation for ONE variable.  

 

I have gotten to the point where I have a variable that contains the beginning of each line, but once it reaches the first delimiter, it skips the rest of that line and begins again on the next line.  

 

Here is a sample of the data (2 consecutive dashes represent tabs for the purpose of demonstration):

 

SPAIN -- -- -- -- -- --

  Catalunya -- -- -- -- -- --

      Mercedes Benz -- 4992 -- Barcelona, ES -- CR -- -- -- --

           123 Tapas Street -- Paraque -- 21441 -- 12 -- 9842

           456 Toro Avenue -- NoSe -- 78443 -- 97 -- 821

Z     789 Del Mar Road -- Refor -- 12743 -- 50 -- 6612

           TOTALS FOR ENTITY 4992 -- 3 Office(s) -- 17,275

 

 

This would be one sample section of data, but some other chunks may have only 1 address or 10 addresses per entity (ie there is very little consistency).  Also, there is sometimes additional data in between each of these country sections (like seen above) that is very different.  The point is, I would like to just read it as one line=one observation with a total of one variable.  So, the data above would look something like this:

 

Obs        VAR1

 

1             SPAIN

2                Catalunya

3                     Mercedes Benz  4992  Barcelona, ES CD

.

.

.

7                     TOTALS FOR ENTITY  3 Office(s)  17,275

------------------------------------------------------------------------------

 

THANKS IN ADVANCE FOR THE HELP!!

 

 


Accepted Solutions
Solution
‎07-26-2016 08:03 AM
Super User
Posts: 11,343

Re: How do I read messy raw data as one observation per line?

Without more complete examples it will be hard to address specific items.

It is very likely that you want to start with

 

input @;

To hold the input line at the current location.

The parse out the value

if index(_infile_,'      ') then region = input(_infile_,$50.);

Which gets a bit messier for the the other rows.

 

Then test for the presence of specific. Also if your data is irregular in the number of lines you will need to read for a specific record you may end up wanting to RETAIN variables (such as what appears to be Country and Region) and have an OUTPUT statement only after you have read the detail line of the data..

 

Note: messy data often equals messy code to deal with it.

 

And no need to apologize for not using a wizard. The wizards are set up for fairly regular data, which this isnt'.

View solution in original post


All Replies
Super User
Posts: 5,431

Re: How do I read messy raw data as one observation per line?

After an INPUT statement, the automatic variable _infile_ is created for you, which you could assign to var1.

Be sure to specify a length to var1 that can hod the entire line.

Data never sleeps
Occasional Contributor
Posts: 5

Re: How do I read messy raw data as one observation per line?

Thanks for the quick reply!

 

The only problem I am having with this solution is that I'm trying to write it so that if a line contains a certain string, it will print that into a specific variable, but when it finds the string I specify, it prints the NEXT line to that variable.....

 

EXAMPLE :

 

if index(_infile_,'      ') then
input Region $CHAR50.;

 

So in this example, it does not print Catalunya as the region, but rather the line starting with "mercedes benz ..."

Solution
‎07-26-2016 08:03 AM
Super User
Posts: 11,343

Re: How do I read messy raw data as one observation per line?

Without more complete examples it will be hard to address specific items.

It is very likely that you want to start with

 

input @;

To hold the input line at the current location.

The parse out the value

if index(_infile_,'      ') then region = input(_infile_,$50.);

Which gets a bit messier for the the other rows.

 

Then test for the presence of specific. Also if your data is irregular in the number of lines you will need to read for a specific record you may end up wanting to RETAIN variables (such as what appears to be Country and Region) and have an OUTPUT statement only after you have read the detail line of the data..

 

Note: messy data often equals messy code to deal with it.

 

And no need to apologize for not using a wizard. The wizards are set up for fairly regular data, which this isnt'.

SAS Super FREQ
Posts: 8,868

Re: How do I read messy raw data as one observation per line?

I am curious about the note in an earlier post about the use of TAB characters. Are there TAB characters in the data? If the TAB characters are regularly spaced, then you may be able to use the TAB as a delimiter to parse out the value you want.

If you are not using the TAB, or they are just random in the input file, then that idea will not be useful to you.

cynthia
Super User
Posts: 10,035

Re: How do I read messy raw data as one observation per line?

You'd better attach a TEXT file to let us test it . and of course, not forget to post the output you want to see.


☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 687 views
  • 0 likes
  • 5 in conversation