11-03-2017 05:09 AM
I'm currently studying for the Base programming Sas certification and as I try to always go a little further on each issue, i got stuck at "Chapter 19 - Creating a Single Observation from Multiple Records". Indeed, it is explained that in order to use Line Pointer controls to read raw datasets, "the raw data file must contain the same number of records for each observation" (which make perfect sense). But no explanation is given on how to deal with missing records. Only this sentence : "For more information about working with raw data files that contain missing records, see the SAS documentation."
So I searched through the SAS documentation/the web, but couldn't find any solution. This is why I'm posting here my question : How can I manage raw datasets when observations are divided on multiple records and there are missing records? Should I use other controls than line pointers? Or maybe this is not possible through the use of pointer controls and I need to apply more "complex" methods to this kind of raw dataset?
Here is the example raw dataset used in the book.
|1215 RAINTREE CIRCLE|
|PHOENIX AZ 85044|
|1751 DIEHL ROAD|
|131 DONERAIL DRIVE|
|ATLANTA GA 30363|
First observation is divided on 3 records, same for the third, but the second has a missing record. How can i deal with it?
Thank you for any enlightenment you could give me on this.
11-03-2017 05:30 AM
This is more a theory question than a coding one. How do you know (well more how would tell someone else) that the third row is missing for the second observation, and it is not MYRON BARKER? If it was just those three then a simple if check would get you the result:
data want; infile "c:\abc.txt"; length name addr city $200; retain name addr city; if _n_=1 then i=1; input; select(i); when(1) do; name=_input_; i=2; end; when(2) do; addr=_input_; i=3; when(3) do; if _input_ in ("PHOENIX","ATLANTA") then do; city=_input_; output; i=1; end; else do; output; name=_input_; i=2; end; otherwise; end; run;
11-03-2017 06:53 AM
The programming is tedious, and beyond what I have patience for. But the theory isn't too bad. For example:
It's entirely possible that you need to retain the three variables that you want, and wipe out their values after outputting. It's much simpler if you are certain that the middle line of data is the only one that might be missing. I would suggest that this assumption is sufficient for "taking it a step further".
11-03-2017 10:19 AM
In my personal opinion any multiline record format that does not include a line role identifier is asking for problems. And yes I've had to deal with more than a few of those.
|Name: LEE ATHNOS|
|Street: 1215 RAINTREE CIRCLE|
|City: PHOENIX AZ 85044|
|Name: HEIDIE BAKER|
|Street: 1751 DIEHL ROAD|
|Name: MYRON BARKER|
|Street: 131 DONERAIL DRIVE|
|City: ATLANTA GA 30363|
Then you have a consistent string to look for and allows cleaner reading code.
Of course the original problem will be traced back to data entry and the format used to collect the information and/or create the file you use. If this type of data is too be handled frequently it will be worth the effort to go back in the chain and fix it. If you are contracting work based on someone else's data then showing them the $$$ impact of you having to fix their process may get some attention.
11-07-2017 08:49 AM
Thank you all for your advices. I suspected there will be no global solution and that it will more likely be a case by case assessment. I guess I was misleaded by what is written in the "SAS Cert Prep Guide - Base programming" book, suggesting that if I searched thoroughly the SAS documentation, I would find some indication on how to process raw data files with missing records. I mean, there aren't several ways to understand "For more information about working with raw data files that contains missing records, see the SAS documentation".
Again, thank you all for your replies.