DATA Step, Macro, Functions and more

Solved
Contributor
Posts: 33

I could use help reading a file that looks like this (below)

The first six columns are irrelevant.  The ID starts at column 7, and is 9 digits long (7-15). At 16-19 is irrelevent.  The variable titles (or Item Codes, variable codes) are four digits long beginning at Column 20-23, followed by the value associated with the variable code in columns 24-33 (9 digits long); followed by another Var item code (length 4), followed by the value of item code (9 columns)… repeated 4 times (four item codes per line of data).   Each ID (7-15) has up to 191 possible item codes (variables) and can take up several lines in the data file. The following IDs are truncated (varied in length).  No important information existes beyond column 72 (111,211 are irrelevent).   Any ideas on reading this.  My code doesn't work.

data batchX;
array ic (1:191) p0001-p0191; /* These are the variable names (Item Codes) */
infile "F:\DATA\batch01.dat" lrecl=85 missover;
input @7 ID 9. @;
x=20;
do while (x<72);
input @x code 4. @;
x=x+4;
if code gt 0 then do;
input @x ic{code} 9. @;
x=x+9;
end;
end;
run;

19000190703674001010170000000006017900000000201800000000020181000000003 211
19000190703674001010182000000002018300000000201840000000020185000000003 211
19000190703674001010186000000002018700000000201880000000020189000000002 211
190001907036740010101900000000020191000000001                                                     211
19000194754213001010001000000004000200000000100030000000100004000002015 111
19000194754213001010005000002000000600000018300070000000010008000000300 211
19000194754213001010009000000012001000000000200110000000500012000000015 211
19000194754213001010014000000006001500000000500230000000010026000000001 211
19000194754213001010165000000001016700000000301730000000090179000000002 211
19000194754213001010184000000002018500000000101860000000030187000000003 211
19000194754213001010188000000003018900000000201900000000020191000000001 211
1900019475421300101                                                                                                         211
19000195700697001010001000000003000200000000201920000000140191000000002 111
1900019570069700101                                                                                                          211
19000130127654001010001000000001000200000000201690000000050179000000003 111
19000130127654001010180000000003018100000000301820000000030183000000003 211
19000130127654001010184000000003018500000000301860000000010187000000001 211
19000130127654001010188000000001018900000000101900000000010191000000002 211
1900013012765400101                                                                                                         211
19000190701425001010001000000005000200000000201920000000140191000000002 111
1900019070142500101                                                                                                         211
19000186716534001010001000000006000200000000201650000000010167000000003 111
19000186716534001010169000000005017900000000301800000000020181000000002 211
19000186716534001010182000000002018300000000201840000000010185000000001 211
19000186716534001010186000000002018700000000201880000000020189000000003 211
190001867165340010101900000000030191000000001                                                     211

Accepted Solutions
Solution
‎03-23-2017 03:40 PM
Super User
Posts: 7,932

[ Edited ]

I would read it into a vertical table.  So this program will read the CODE/VALUE pairs as separate observations. It adds the variable N to count which position it was in.

``````data want ;
input id 7-15 @20 @ ;
retain n ;
if id ne lag(id) then n=1 ;
do n=n to n+3 ;
input code 4. value 9. @;
if n(code,value) > 0 then output;
end;
cards;
19000190703674001010170000000006017900000000201800000000020181000000003 211
19000190703674001010182000000002018300000000201840000000020185000000003 211
19000190703674001010186000000002018700000000201880000000020189000000002 211
190001907036740010101900000000020191000000001                           211
19000194754213001010001000000004000200000000100030000000100004000002015 111
19000194754213001010005000002000000600000018300070000000010008000000300 211
19000194754213001010009000000012001000000000200110000000500012000000015 211
19000194754213001010014000000006001500000000500230000000010026000000001 211
19000194754213001010165000000001016700000000301730000000090179000000002 211
19000194754213001010184000000002018500000000101860000000030187000000003 211
19000194754213001010188000000003018900000000201900000000020191000000001 211
1900019475421300101                                                     211
19000195700697001010001000000003000200000000201920000000140191000000002 111
1900019570069700101                                                     211
19000130127654001010001000000001000200000000201690000000050179000000003 111
19000130127654001010180000000003018100000000301820000000030183000000003 211
19000130127654001010184000000003018500000000301860000000010187000000001 211
19000130127654001010188000000001018900000000101900000000010191000000002 211
1900013012765400101                                                     211
19000190701425001010001000000005000200000000201920000000140191000000002 111
1900019070142500101                                                     211
19000186716534001010001000000006000200000000201650000000010167000000003 111
19000186716534001010169000000005017900000000301800000000020181000000002 211
19000186716534001010182000000002018300000000201840000000010185000000001 211
19000186716534001010186000000002018700000000201880000000020189000000003 211
190001867165340010101900000000030191000000001                           211
;;;;``````

All Replies
Super User
Posts: 13,299

What would the desired output look like for that data? Are your codes actually numeric or would a leading zero be needed later on in the processing? For instance, if I count correctly, on the first row of example data you have 0601 in columns 32-35. Do you need "0601" or 601?

If you have a potential of 191 codes I would tend to think of creating an array of boolean values that indicate whether that code is present or not. Or if the number of times a code appears within and ID value is critical then a count. Which would turn this, at least partially into a table look up type problem as you grab pieces of the input line past column 19. If so then may help to provide a list of all of the possible legal values, the 191 codes.

And in you actual data file are the "blanks" in the line, such as on line4 actually TAB characters or spaces? When I paste your example data into a text editor I get tab behavior. Which bit would be important to know.

And considering the what I think of as a pretty stupid file format this seems likely to be medical data of some form.

Contributor
Posts: 33

Ballardw, the end result will be a dataset with an ID, and 191 vars for a total of 192 columns.  There hundreds of IDs. And an unknown number of batches waiting.

the var list will be p0001 through p0191.

The value of 6 that you are referring to in column 32 belongs to the var code 0170 in columns 20-23

Thanks!

Solution
‎03-23-2017 03:40 PM
Super User
Posts: 7,932

[ Edited ]

I would read it into a vertical table.  So this program will read the CODE/VALUE pairs as separate observations. It adds the variable N to count which position it was in.

``````data want ;
input id 7-15 @20 @ ;
retain n ;
if id ne lag(id) then n=1 ;
do n=n to n+3 ;
input code 4. value 9. @;
if n(code,value) > 0 then output;
end;
cards;
19000190703674001010170000000006017900000000201800000000020181000000003 211
19000190703674001010182000000002018300000000201840000000020185000000003 211
19000190703674001010186000000002018700000000201880000000020189000000002 211
190001907036740010101900000000020191000000001                           211
19000194754213001010001000000004000200000000100030000000100004000002015 111
19000194754213001010005000002000000600000018300070000000010008000000300 211
19000194754213001010009000000012001000000000200110000000500012000000015 211
19000194754213001010014000000006001500000000500230000000010026000000001 211
19000194754213001010165000000001016700000000301730000000090179000000002 211
19000194754213001010184000000002018500000000101860000000030187000000003 211
19000194754213001010188000000003018900000000201900000000020191000000001 211
1900019475421300101                                                     211
19000195700697001010001000000003000200000000201920000000140191000000002 111
1900019570069700101                                                     211
19000130127654001010001000000001000200000000201690000000050179000000003 111
19000130127654001010180000000003018100000000301820000000030183000000003 211
19000130127654001010184000000003018500000000301860000000010187000000001 211
19000130127654001010188000000001018900000000101900000000010191000000002 211
1900013012765400101                                                     211
19000190701425001010001000000005000200000000201920000000140191000000002 111
1900019070142500101                                                     211
19000186716534001010001000000006000200000000201650000000010167000000003 111
19000186716534001010169000000005017900000000301800000000020181000000002 211
19000186716534001010182000000002018300000000201840000000010185000000001 211
19000186716534001010186000000002018700000000201880000000020189000000003 211
190001867165340010101900000000030191000000001                           211
;;;;``````
Contributor
Posts: 33

Tom, it reads in great. But I can't get it to transpose so that the 'code' becomes the header for row 1.

Any help would be appreciated.

jake

Super User
Posts: 7,932

Use PROC TRANSPOSE

``````proc transpose data=have out=want ;
by id ;
id code ;
var value ;
run;``````
PROC Star
Posts: 8,145

If your example data are correct, then your codes actually go up to 192. Guesstimating that your example data represent six subjects, then the following (I think) does what you want:

```data batchX;
array ic (1:192) p0001-p0192; /* These are the variable names (Item Codes) */
infile "F:\DATA\batch01.dat" lrecl=85 end=eof;
retain ic;
input @7 ID 9. @;
if id ne lag(id) and _n_ gt 1 then output;
x=20;
do while (x<72);
input @x code 4. @;
x=x+4;
if code gt 0 then do;
input @x ic{code} 9. @;
x=x+9;
end;
end;
input;
if eof then output;
run;
```

Art, CEO, AnalystFinder.com

Contributor
Posts: 33

Art, I'm getting an ERROR: Array subscript out of range at line 32, column 16.

do you understand the reason?

thank you!

Super User
Posts: 13,299

jakestat wrote:

Art, I'm getting an ERROR: Array subscript out of range at line 32, column 16.

do you understand the reason?

thank you!

Most likely is that somehow a value greater than 192 or less than 1 ended up in the CODE position.

Suppose the code was supposed to be 107 and some accidentally typed a 1007 for the code value. Then the program looks for an array element 1007 but the array has a maximum of 192 spaces assigned. So the array index of 1007 is greater than 192 and you get that message. Similar if the value is 0000, 0 was not defined as an expected index for this array.

PROC Star
Posts: 8,145

@jakestat: your error message should have also included the value of _n_ where the error occured. Knowing which record where the discrepancy existed should make it easy to discover what the offending value was

As I'd mentioned in my original post, that was how I discovered the code 192 in your example data.

Art, CEO, AnalystFinder.com

☑ This topic is solved.