Hello,
I have a situation in which I have a data set that contains respondent ID numbers in a single column and corresponding respondent ages in multiple columns. I need to match each individual respondent ID (PID) with its corresponding respondent age (a1-a5). A variable named TY is a binary variable that is coded as "1" or "2." TY as "1" is primary row that contains ages a1-a5. When "TY" is coded as "2" there are no row value for age. As you can see there are more ages than PIDs. I am working on a large dataset and seek a clean and quick method of matching each PID to each age. Any suggestions? Thanks. Here's how the data looks like:
data one;
input id$ PID$ TY a1 a2 a3 a4 a5;
cards;
CT0121 CT0121A1 1 23 35 45 65 76
CT0121 CT0121A2 2 . . . . .
CT0121 CT0121A3 2 . . . . .
CT0121 CT0121A4 2 . . . . .
CT0200 CT0200A1 1 34 36 38 40 42
CT0200 CT0200A2 2 . . . . .
NS10500 NS10500A1 1 21 23 24 25 90
NS10500 NS10500A2 2 . . . . .
NS10500 NS10500A3 2 . . . . .
;
run; quit;
Awesome! Will do!
And what would the data look like afterward?
From this data structure:
CT0121 CT0121A1 1 23 35 45 65 76
CT0121 CT0121A2 2 . . . . .
CT0121 CT0121A3 2 . . . . .
CT0121 CT0121A4 2 . . . . .
CT0200 CT0200A1 1 34 36 38 40 42
CT0200 CT0200A2 2 . . . . .
NS10500 NS10500A1 1 21 23 24 25 90
NS10500 NS10500A2 2 . . . . .
NS10500 NS10500A3 2 . . . . .
To here's how it needs to look like:
ID PID Age
CT0121 CT0121A1 23
CT0121 CT0121A2 35
CT0121 CT0121A3 45
CT0121 CT0121A4 65
CT0200 CT0200A1 34
CT0200 CT0200A2 36
NS10500 NS10500A1 21
NS10500 NS10500A2 23
NS10500 NS10500A3 24
Thanks for the reply!
A simple guess..
data two;
set one;
array a{5};
do i = 1 to dim(a);
if not missing(a{i}) then do;
age = a{i};
output;
end;
end;
keep id pid age;
run;
Still not resolved. PIDs are changed to single value when I began with individual values. Please refer to last post of what data should look like. Thanks.
Obs id PID age
1 CT0121 CT0121A1 23
2 CT0121 CT0121A1 35
3 CT0121 CT0121A1 45
4 CT0121 CT0121A1 65
5 CT0121 CT0121A1 76
6 CT0200 CT0200A1 34
7 CT0200 CT0200A1 36
8 CT0200 CT0200A1 38
9 CT0200 CT0200A1 40
10 CT0200 CT0200A1 42
11 NS10500 NS10500A 21
12 NS10500 NS10500A 23
13 NS10500 NS10500A 24
14 NS10500 NS10500A 25
15 NS10500 NS10500A 90
Less of a guess this time:
data one;
input id :$10. PID :$12. TY a1 a2 a3 a4 a5;
cards;
CT0121 CT0121A1 1 23 35 45 65 76
CT0121 CT0121A2 2 . . . . .
CT0121 CT0121A3 2 . . . . .
CT0121 CT0121A4 2 . . . . .
CT0200 CT0200A1 1 34 36 38 40 42
CT0200 CT0200A2 2 . . . . .
NS10500 NS10500A1 1 21 23 24 25 90
NS10500 NS10500A2 2 . . . . .
NS10500 NS10500A3 2 . . . . .
;
data two;
set one;
array a{5};
array v{5} _temporary_;
if ty = 1 then
do i = 1 to dim(a);
v{i} = a{i};
end;
i = input(scan(pid, -1, "A"), 2.);
if i > 0 and i <= dim(v) then do;
age = v{i};
output;
end;
keep id pid age;
run;
HI! Thanks but I don't get all the observations processed in the array you sent out.
I run this code:
data two;
set one;
array a{5};
array v{5} _temporary_;
if ty = 1 then
do i = 1 to dim(a);
v{i} = a{i};
end;
i = input(scan(pid, -1, "A"), 2.);
if i > 0 and i <= dim(v) then do;
age = v{i};
output;
end;
keep id pid age;
run;
proc print data=two; var id pid age; run;
This is the ouput I get:
The SAS System 08:40 Monday, April 4, 2016 30
Obs id PID age
1 CT0121 CT0121A1 23
2 CT0121 CT0121A2 35
3 CT0121 CT0121A3 45
4 CT0121 CT0121A4 65
5 CT0200 CT0200A1 34
6 CT0200 CT0200A2 36
And I get the following from the log output:
383
384 data two;
385 set one;
386 array a{5};
387 array v{5} _temporary_;
388 if ty = 1 then
389 do i = 1 to dim(a);
390 v{i} = a{i};
391 end;
392 i = input(scan(pid, -1, "A"), 2.);
393 if i > 0 and i <= dim(v) then do;
394 age = v{i};
395 output;
396 end;
397 keep id pid age;
398 run;
NOTE: Invalid argument to function INPUT at line 392 column 5.
id=NS10500 PID=NS10500A TY=1 a1=21 a2=23 a3=24 a4=25 a5=90 i=. age=. _ERROR_=1 _N_=7
NOTE: Invalid argument to function INPUT at line 392 column 5.
id=NS10500 PID=NS10500A TY=2 a1=. a2=. a3=. a4=. a5=. i=. age=. _ERROR_=1 _N_=8
NOTE: Invalid argument to function INPUT at line 392 column 5.
id=NS10500 PID=NS10500A TY=2 a1=. a2=. a3=. a4=. a5=. i=. age=. _ERROR_=1 _N_=9
NOTE: Mathematical operations could not be performed at the following places. The results of the
operations have been set to missing values.
Each place is given by: (Number of times) at (Line):(Column).
3 at 392:5
NOTE: There were 9 observations read from the data set WORK.ONE.
NOTE: The data set WORK.TWO has 6 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
399
400 proc print data=two; var id pid age; run;
NOTE: There were 6 observations read from the data set WORK.TWO.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
I don't understand the error message about the INPUT Statement. Any suggestions? For some reason the last observations with a "NS" prefix are not included due to the error in the PUT statment.
Thanks a lot!
The problem is with the way you read the data in. Variables id and pid are too short to contain the longer codes. My program uses the last part if the pid code (after the "A") to decide which array element to output. When that part of the pid code is missing, an error occurs.
Make sure you read the data in properly, as I did in the example ( input id :$10. PID :$12. ), and the rest of the program will work.
Awesome! Will do!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.