Solved: Merge multiple tables / many-to-many merge / unbalanced datasets

Cruise · Posted 06-28-2019 09:37 AM

Hello Folks:

I have three tables sorted by two categorical values 'index' and 'description' and one table by 'description' only. Below is the desired output. I I did many-to-many merge while not having 'index' variable in the STANDARD dataset.

This seems to be working on small mock datasets. However, my actual datasets are large.

My question is: Do you have any comments if I'm doing merge right here? Is it safe approach to go ahead and apply to my actual large datasets?

DATA ONE; 
INPUT INDEX	DESCRIPTION $ VALUE1;
CARDS;
1	A	0.1
1	B	0.2
1	C	0.25
2	A	0.33
2	B	0.41
2	C	0.48
3	A	0.56
3	B	0.63
3	C	0.71
;

DATA TWO; 
INPUT INDEX	DESCRIPTION $ VALUE2;
CARDS;
1	A	0.1
1	B	0.2
1	C	0.25
2	A	0.33
2	B	0.41
2	C	0.48
3	A	0.56
3	B	0.63
3	C	0.71
;

DATA THREE; 
INPUT INDEX	DESCRIPTION $ VALUE3;
CARDS;
1	A	0.1
1	B	0.2
1	C	0.25
2	A	0.33
2	B	0.41
2	C	0.48
3	A	0.56
3	B	0.63
3	C	0.71
;

DATA STANDARD; 
INPUT DESCRIPTION $ VALUE;
CARDS;
A	0.3
B	0.25
C	0.22
;

PROC SORT DATA=ONE;
BY DESCRIPTION INDEX;
PROC SORT DATA=TWO;
BY DESCRIPTION INDEX;
PROC SORT DATA=THREE;
BY DESCRIPTION INDEX;
PROC SORT DATA=STANDARD;
BY DESCRIPTION;

DATA MERGED;
MERGE ONE TWO THREE STANDARD;
BY DESCRIPTION;
RUN; 
PROC PRINT; RUN;

PROC SORT DATA=MERGED;
BY INDEX DESCRIPTION;
RUN;

Astounding · Posted 06-28-2019 09:43 AM

It's not safe, and can generate incorrect matches. For example, try removing an observation from ONE, TWO, or THREE, and see if you like the result.

In a small sample, everything matches up. In larger data sets, it's unlikely that there will be 100% matches for INDEX and DESCRIPTION.

Unfortunately, you need to match ONE, TWO, and THREE by both INDEX and DESCRIPTION. Then you can add STANDARD in a separate step.

View solution in original post

Astounding · Posted 06-28-2019 09:43 AM

It's not safe, and can generate incorrect matches. For example, try removing an observation from ONE, TWO, or THREE, and see if you like the result.

In a small sample, everything matches up. In larger data sets, it's unlikely that there will be 100% matches for INDEX and DESCRIPTION.

Unfortunately, you need to match ONE, TWO, and THREE by both INDEX and DESCRIPTION. Then you can add STANDARD in a separate step.

Cruise · Posted 06-28-2019 09:48 AM

@Astounding this might explain why my summary plot is far off. How about to duplicate the STANDARD dataset by the level of INDEX in the three other datasets and merge? In other words, bringing STANDARD table to the same level of other datasets for one-to-one merge?

koyelghosh · Posted 06-28-2019 10:11 AM

Will PROC SQL be a bad idea? Just curious .... also you did not mention how big is the data you have.

This is the question I have (to others seeing this message as well) .. is it better to use MERGE in DATA step or use PROC SQL?

Under what conditions we should sway towards one method rather than the other?

SAS experts here please.

Kurt_Bremser · Posted 06-28-2019 10:45 AM

My personal rule-of-thumb: use SQL when I positively need it (eg cartesian joins). Otherwise use proc sort and data steps.

Also see Maxim 10.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

koyelghosh · Posted 06-28-2019 10:48 AM

@Kurt_Bremser Thank you.

Kurt_Bremser · Posted 06-28-2019 10:50 AM

Create a format from table STANDARD.

Sort ONE, TWO and THREE by index and description.

Merge ONE, TWO and THREE by index and description, apply the format to description to derive value.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cruise · Posted 06-28-2019 10:52 AM

@Kurt_Bremser

I understand execpt how to create format from STANDARD and applying the format to description to derive value.

Tom · Posted 06-28-2019 11:03 AM

@Cruise wrote:

@Kurt_Bremser

I understand execpt how to create format from STANDARD and applying the format to description to derive value.

Since the value you want to create is a number then create an INFORMAT and use the INPUT() function to create the value.

To make a format/informat from data create a dataset in the format needed for the CNTLIN= option of PROC FORMAT.

DATA STANDARD; 
INPUT DESCRIPTION $ VALUE;
CARDS;
A 0.3
B 0.25
C 0.22
;

data fmt;
 fmtname='STANDARD';
 type='I';
 set standard;
 rename description=start value=label;
run;
proc format cntlin=fmt;
run;


data test;
  input description $ ;
  value=input(description,standard.);
cards;
A
B
C
;

proc print;
run;

Obs    description    value

 1          A          0.30
 2          B          0.25
 3          C          0.22

Cruise · Posted 06-28-2019 01:28 PM

Sorry, if I wasn't clear. My goal was to duplicate STANDARD data to the level INDEX in the other datasets so that I can do one-to-one merge. Right now STANDARD has 3 rows and ONE, TWO and THREE datasets have total 9 rows (3 descriptions by 3 index). If I duplicate STANDARD 3 times by adding INDEX variable for each duplicates then I'll be able to do one-to-one merge.

However, I find Astounding's solution easier since I don't bother merging ONE, TWO, THREE by INDEX and DECSRIPTION variables first and then merge STANDARD to the resulting merged dataset by DESCRIPTION variable.

Please let me know if I'm missing something here.

Kurt_Bremser · Posted 06-28-2019 11:05 AM

data cntlin;
set standard (rename=(description=start value=label));
fmtname = 'myfmt';
type = 'C';
run;

proc format cntlin=cntlin;
run;

Now you can do

value = input(put(description,$myfmt.),best.);

It's probably better to create an informat and use only the input() function, but I'm not in front of SAS right now to test that. I only know formats by heart, but not informats 😉

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 06-28-2019 02:39 PM

If formats are too difficult, sort and merge BY DESCRIPTION INDEX (not BY INDEX DESCRIPTION). Then you can merge in summary data BY DESCRIPTION.

Cruise · Posted 06-28-2019 02:58 PM

Thank you. I will.

Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Re: Merge multiple tables / many-to-many merge / unbalanced datasets

Registration is open

SAS Training: Just a Click Away