Hi,
I have a large dataset with >1,000,000 observations for ~10,000 patients.
Each patient has a unique ID.
data have;
input ID ICD10 $;
datalines;
11111 E0800
11111 E0801
11111 E0803
22222 J45909
22222 J45908
33333 G4001
33333 G4002
33333 G4003
44444 E8883
44444 E8882
;
run;
I need to add an age variable to the main dataset with >1,000,000 observations.
I have a separate age dataset which has the age variable and corresponding patient ID.
data age;
input ID age;
datalines;
1 57
2 64
3 67
4 81
;
Run;
Before merging the two datasets, I have set the length and format for the ID variable on which I am merging,
data have;
length ID $ 5;
set have;
format ID $8.;
informat ID $10.;
run;
data age;
length ID $ 5;
set age;
format ID $8.;
informat ID $10.;
run;
/*merge have and age datasets*/
Proc sort data=have; by ID; run;
Proc sort data=age; by ID; run;
data merged;
merge have (in=a) age(in=b);
by ID;
if a and b;
run;
However, I am encountering issues where age does not appear in the merged dataset for all observations (even though the ID does have an age value available in the age dataset), for example:
Data merged;
Input ID ICD10 $ age;
Datalines;
11111 E0800 .
11111 E0801 57
11111 E0803 57
22222 J45909 64
22222 J45908 .
33333 G4001 67
33333 G4002 .
33333 G4003 .
44444 E8883 .
44444 E8882 81
;
run;
When I first discovered the issue, I also set the ID informat the same in both datasets (above code), in an attempt to see if this would resolve the issue (but it has not).
Any help would be greatly appreciated.
Thank you in advance,
Thanks Paige for replying.
Yes, I know, I can't share my dataset here, and the issue I was having was not replicated in the example data I provided... hence why I shared an example merged dataset of what I was getting.
However, I think I have now discovered the problem, in the have dataset, there was a variable called "age". When I dropped this, the merge worked seamlessly, as per normal!...
I do not get the output you get. I get output with no missing values.
data have;
input ID ICD10 $;
datalines;
11111 E0800
11111 E0801
11111 E0803
22222 J45909
22222 J45908
33333 G4001
33333 G4002
33333 G4003
44444 E8883
44444 E8882
;
data age;
input ID age;
datalines;
11111 57
22222 64
33333 67
44444 81
;
data have;
length ID $ 5;
set have;
format ID $8.;
informat ID $10.;
run;
data age;
length ID $ 5;
set age;
format ID $8.;
informat ID $10.;
run;
/*merge have and age datasets*/
Proc sort data=have; by ID; run;
Proc sort data=age; by ID; run;
data merged;
merge have (in=a) age(in=b);
by ID;
if a and b;
run;
proc print data=merged;
run;
I'm not sure why you would think an informat would make a difference. Informats only matter when you are reading in data. They have no impact if the data comes from a SET statement. In fact, if you remove the second DATA HAVE; step and second DATA AGE; step from the program, it still works fine.
Thanks Paige for replying.
Yes, I know, I can't share my dataset here, and the issue I was having was not replicated in the example data I provided... hence why I shared an example merged dataset of what I was getting.
However, I think I have now discovered the problem, in the have dataset, there was a variable called "age". When I dropped this, the merge worked seamlessly, as per normal!...
1) could you run something like this example to test if that ID are really the same?
data test;
ID='11111';
output;
iD='1l111';
output;
run;
data test2;
set test;
the_same=put(ID, $hex64.);
put _all_;
run;
2) is your code "just merge" or you have there some other data transformations too ?
Bart
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.