Hello Folks:
I have three tables sorted by two categorical values 'index' and 'description' and one table by 'description' only. Below is the desired output. I I did many-to-many merge while not having 'index' variable in the STANDARD dataset.
This seems to be working on small mock datasets. However, my actual datasets are large.
My question is: Do you have any comments if I'm doing merge right here? Is it safe approach to go ahead and apply to my actual large datasets?
DATA ONE;
INPUT INDEX DESCRIPTION $ VALUE1;
CARDS;
1 A 0.1
1 B 0.2
1 C 0.25
2 A 0.33
2 B 0.41
2 C 0.48
3 A 0.56
3 B 0.63
3 C 0.71
;
DATA TWO;
INPUT INDEX DESCRIPTION $ VALUE2;
CARDS;
1 A 0.1
1 B 0.2
1 C 0.25
2 A 0.33
2 B 0.41
2 C 0.48
3 A 0.56
3 B 0.63
3 C 0.71
;
DATA THREE;
INPUT INDEX DESCRIPTION $ VALUE3;
CARDS;
1 A 0.1
1 B 0.2
1 C 0.25
2 A 0.33
2 B 0.41
2 C 0.48
3 A 0.56
3 B 0.63
3 C 0.71
;
DATA STANDARD;
INPUT DESCRIPTION $ VALUE;
CARDS;
A 0.3
B 0.25
C 0.22
;
PROC SORT DATA=ONE;
BY DESCRIPTION INDEX;
PROC SORT DATA=TWO;
BY DESCRIPTION INDEX;
PROC SORT DATA=THREE;
BY DESCRIPTION INDEX;
PROC SORT DATA=STANDARD;
BY DESCRIPTION;
DATA MERGED;
MERGE ONE TWO THREE STANDARD;
BY DESCRIPTION;
RUN;
PROC PRINT; RUN;
PROC SORT DATA=MERGED;
BY INDEX DESCRIPTION;
RUN;
It's not safe, and can generate incorrect matches. For example, try removing an observation from ONE, TWO, or THREE, and see if you like the result.
In a small sample, everything matches up. In larger data sets, it's unlikely that there will be 100% matches for INDEX and DESCRIPTION.
Unfortunately, you need to match ONE, TWO, and THREE by both INDEX and DESCRIPTION. Then you can add STANDARD in a separate step.
It's not safe, and can generate incorrect matches. For example, try removing an observation from ONE, TWO, or THREE, and see if you like the result.
In a small sample, everything matches up. In larger data sets, it's unlikely that there will be 100% matches for INDEX and DESCRIPTION.
Unfortunately, you need to match ONE, TWO, and THREE by both INDEX and DESCRIPTION. Then you can add STANDARD in a separate step.
@Astounding this might explain why my summary plot is far off. How about to duplicate the STANDARD dataset by the level of INDEX in the three other datasets and merge? In other words, bringing STANDARD table to the same level of other datasets for one-to-one merge?
Will PROC SQL be a bad idea? Just curious .... also you did not mention how big is the data you have.
This is the question I have (to others seeing this message as well) .. is it better to use MERGE in DATA step or use PROC SQL?
Under what conditions we should sway towards one method rather than the other?
SAS experts here please.
My personal rule-of-thumb: use SQL when I positively need it (eg cartesian joins). Otherwise use proc sort and data steps.
Also see Maxim 10.
@Kurt_Bremser Thank you.
Create a format from table STANDARD.
Sort ONE, TWO and THREE by index and description.
Merge ONE, TWO and THREE by index and description, apply the format to description to derive value.
I understand execpt how to create format from STANDARD and applying the format to description to derive value.
@Cruise wrote:
I understand execpt how to create format from STANDARD and applying the format to description to derive value.
Since the value you want to create is a number then create an INFORMAT and use the INPUT() function to create the value.
To make a format/informat from data create a dataset in the format needed for the CNTLIN= option of PROC FORMAT.
DATA STANDARD;
INPUT DESCRIPTION $ VALUE;
CARDS;
A 0.3
B 0.25
C 0.22
;
data fmt;
fmtname='STANDARD';
type='I';
set standard;
rename description=start value=label;
run;
proc format cntlin=fmt;
run;
data test;
input description $ ;
value=input(description,standard.);
cards;
A
B
C
;
proc print;
run;
Obs description value 1 A 0.30 2 B 0.25 3 C 0.22
Sorry, if I wasn't clear. My goal was to duplicate STANDARD data to the level INDEX in the other datasets so that I can do one-to-one merge. Right now STANDARD has 3 rows and ONE, TWO and THREE datasets have total 9 rows (3 descriptions by 3 index). If I duplicate STANDARD 3 times by adding INDEX variable for each duplicates then I'll be able to do one-to-one merge.
However, I find Astounding's solution easier since I don't bother merging ONE, TWO, THREE by INDEX and DECSRIPTION variables first and then merge STANDARD to the resulting merged dataset by DESCRIPTION variable.
Please let me know if I'm missing something here.
data cntlin;
set standard (rename=(description=start value=label));
fmtname = 'myfmt';
type = 'C';
run;
proc format cntlin=cntlin;
run;
Now you can do
value = input(put(description,$myfmt.),best.);
It's probably better to create an informat and use only the input() function, but I'm not in front of SAS right now to test that. I only know formats by heart, but not informats 😉
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.