Solved: Re: SAS Merge

singhsahab · Posted 09-26-2018 10:20 PM

Dear All,

I've tow dataset, dataset1 (contains ID & Name) & dataset2 (contains Id1,Id2 & Id3) ,values of dataset2 variable coming from dataset1 variables (ID). later i want to join them in a single DATA step. Can you please help me how i can perform SAS merging to get my final output metion in FinalDataset dataset. In real time datsets contains more then 10000+ records.

DATASET1:

ID NAME

1 ABC
2 CDF
3 KLM
4 JKL

DATASET2:

ID1 ID2 ID3
1 2 3
3 . 4

AFTER MERGING DATSET1 & DATASET2 THE FINAL DATASET WOULD BE:

FinalDataset:

ID1 ID2 ID3 NAME
1 2 3 ABC,CDF,KLM
3 . 4 KLM,JKL

Thank you !!

Kurt_Bremser · Posted 10-01-2018 08:10 AM

Expand the format with a value for "Other" and check for that before concatenating:

data dataset3;
input id site name $;
cards;
1 22 ABC
2 22 CDF
3 22 KLM
4 34 JKL
;
run;

data dataset4;
input id1 id2 id3 site;
cards;
1 2 3 22
3 . 4 34
4 . . 34
;
run;

data test (keep=id name);
set dataset3;
id=strip(id)||strip(site);
run;

data test1 (keep=id1 id2 id3);
set dataset4;
if id1 ne . and site ne . then id1=strip(id1)||strip(site);
if id2 ne . and site ne . then id2=strip(id2)||strip(site);
if id3 ne . and site ne . then id3=strip(id3)||strip(site);
run;

data cntlin2;
set test (rename=(id=start name=label)) end=eof ;
fmtname='idnamee';
type='N';
output;
if eof
then do;
  start = .;
  label = '***';
  hlo = 'O';
  output;
end;
run;

proc sort data=cntlin2 nodupkey;
by start;
run;

proc format cntlin=cntlin2;
run;

data want1;
set test1;
array ids{*} id:;
length name $20;
do i = 1 to dim(ids);
if not missing(ids{i}) and put(ids{i},idnamee.) ne '***'
then name=catx(',',name,put(ids{i},idnamee.));
end;
drop i;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

novinosrin · Posted 09-26-2018 10:48 PM

A very simple hash find()=0 pull

load dtset1 in hash table

loop through array id1-id3

use each of those array elements as keys & pull the names with find method

Kurt_Bremser · Posted 09-27-2018 02:20 AM

10000 can be handled with a format:

data dataset1;
input id name $;
cards;
1 ABC
2 CDF
3 KLM
4 JKL
;
run;

data dataset2;
input id1 id2 id3;
cards;
1 2 3
3 . 4
;
run;

data cntlin;
set dataset1;
rename id=start name=label;
fmtname='idname';
type='N';
run;

proc format cntlin=cntlin;
run;

data want;
set dataset2;
array ids{*} id:;
length name $20;
do i = 1 to dim(ids);
  if not missing(ids{i}) then name=catx(',',name,put(ids{i},idname.));
end;
drop i;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

singhsahab · Posted 09-29-2018 11:24 AM

Dear sir ,

Thank you so much for letting me know this approach of merging.

I tried same approach for more then one variable , but getting unexpected output. Below is my code.

data dataset3;
input id site name $;
cards;
1 22 ABC
2 22 CDF
3 22 KLM
4 34 JKL
;
run;

data dataset4;
input id1 id2 id3 site;
cards;
1 2 3 22
3 . 4 34
4 . . 34
;
run;

data test (keep=id name);
set dataset3;
id=strip(id)||strip(site);
run;

data test1 (keep=id1 id2 id3);
set dataset4;
if id1 ne . and site ne . then id1=strip(id1)||strip(site);
if id2 ne . and site ne . then id2=strip(id2)||strip(site);
if id3 ne . and site ne . then id3=strip(id3)||strip(site);
run;

data cntlin2;
set test;
rename id=start name=label;
fmtname='idnamee';
type='N';
run;

proc sort data=cntlin2 nodupkey;
by start;
run;

proc format cntlin=cntlin2;
run;

data want1;
set test1;
array ids{*} id:;
length name $20;
do i = 1 to dim(ids);
if not missing(ids{i}) then name=catx(',',name,put(ids{i},idnamee.));
end;
drop i;
run;

Final Output :

OBS

id1

id2

id3

name

1

122

222

322

ABC,CDF,KLM

2

334

.

434

334,JKL

3

434

.

JKL

Once again Thank you !!

Kurt_Bremser · Posted 10-01-2018 02:47 AM

You have an ID in test1/id1 that has no match in test (334), so the format will display the value as-is.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

singhsahab · Posted 10-01-2018 05:32 AM

Dear Sir,

Can you please elaborate , in this case how to get only matching record.

Many Thanks...

Kurt_Bremser · Posted 10-01-2018 08:10 AM

Expand the format with a value for "Other" and check for that before concatenating:

data dataset3;
input id site name $;
cards;
1 22 ABC
2 22 CDF
3 22 KLM
4 34 JKL
;
run;

data dataset4;
input id1 id2 id3 site;
cards;
1 2 3 22
3 . 4 34
4 . . 34
;
run;

data test (keep=id name);
set dataset3;
id=strip(id)||strip(site);
run;

data test1 (keep=id1 id2 id3);
set dataset4;
if id1 ne . and site ne . then id1=strip(id1)||strip(site);
if id2 ne . and site ne . then id2=strip(id2)||strip(site);
if id3 ne . and site ne . then id3=strip(id3)||strip(site);
run;

data cntlin2;
set test (rename=(id=start name=label)) end=eof ;
fmtname='idnamee';
type='N';
output;
if eof
then do;
  start = .;
  label = '***';
  hlo = 'O';
  output;
end;
run;

proc sort data=cntlin2 nodupkey;
by start;
run;

proc format cntlin=cntlin2;
run;

data want1;
set test1;
array ids{*} id:;
length name $20;
do i = 1 to dim(ids);
if not missing(ids{i}) and put(ids{i},idnamee.) ne '***'
then name=catx(',',name,put(ids{i},idnamee.));
end;
drop i;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

novinosrin · Posted 09-27-2018 11:11 AM

To illustrate what i said last night:

data dataset1;
input id name $;
cards;
1 ABC
2 CDF
3 KLM
4 JKL
;
run;

data dataset2;
input id1 id2 id3;
cards;
1 2 3
3 . 4
;
run;

data want(drop=id name);
if 0 then set dataset2  dataset1;
dcl hash h(dataset:'dataset1',ordered:'y');
h.definekey('id');
h.definedata('name');
h.definedone();
do until(lr);
set dataset2 end=lr;
array t id1-id3;
length new_name $20;
call missing(new_name);
do over t;
_iorc_=h.find(key:t);
if _iorc_=0 then new_name=catx(',',new_name,name);
end;
output;
end;
stop;
run;

mkeintz · Posted 09-27-2018 11:08 PM

@novinosrin

Just a nit. You don't need the "ordered:'a'" parameter for this code, since the hash object in not to be output. You only need the hash for lookup - adding the ordered attribute will slow down building the hash and retrieving from it. Compare the below with vs without the ordered attribute:

data _null_;
  declare hash h (ordered:'a');
    h.definekey('x');
	h.definedata('x','y');
	h.definedone();

  do x=1 to 10000000;y=-1*x;h.add();end;
  /*do x=1 to 10000000;h.find();end;*/
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 09-27-2018 11:22 PM

Thank you Sir @mkeintz. I want to take advantage of you to allow me a major request at a truly convenient time of yours. i know you may be busy with wharton, family and self etc. However, when and if time, heart, mind and soul permits, may i request you to pen down how you often efficiently

1. use self merge datastep

2. your classy wayof using lag on a condition

3. interleave akin to 1

If you do and when you do and supposing that you may use some well simulated use cases, kindly plug me plz.

My mom(retired forex analyst and doesn't have an account here) often reads your thread with greater attention to detail. So, If it's not a bother, I wish you may.Thank you!

PS If that happens, our library is probably the best place to post for the benefit of wider audience

singhsahab · Posted 10-19-2018 02:19 PM

@novinosrin : Hello Sir,

First of all , i'm thank full to you to share the new concept of data merge specially for me . Today i get a chance to learn hash concept and got stuck on solving the same question by adding one more matching variable (or more then one matching variable) . Here i'm not sure how i'll use new variable with array as key. Below is the sample dataset, where i added one more matching variable site and trying to getting same output as pasted below.

data dataset1;
input id site name $;
cards;
1 22 ABC
2 22 CDF
3 22 KLM
4 34 JKL
;
run;

data dataset2;
input id1 id2 id3 site;

cards;
1 2 3 22
3 . 4 34
4 . . 34
;
run;

novinosrin · Posted 10-19-2018 02:23 PM

Haha Plz don;t call me Sir. I haven't earned it/nor I deserve it.

To your question, give me sometime as I am away from SAS software now. Will respond soon as i can

novinosrin · Posted 10-19-2018 02:56 PM

@singhsahab Ok, adding site to the data will not produce any changes to your result because the key in the hash table is ID and data value id name. So the look up is basically from each element in your array i.e id1-id3 to id in hash table to get the name value. This is the existing logic.

You need to explain "Here i'm not sure how i'll use new variable with array as key. " in terms of what do you want to use as look up keys to fetch what as you have added site variable in both datasets.

Are you looking to use composite hash keys using id+site? Your requirement is not quite clear

singhsahab · Posted 10-20-2018 02:11 PM

@novinosrin : Yes,i,m looking to use composite hash keys using id+site to join dataset. I define site as a key but getting error. I'm not sure how i will use composite key to join.

novinosrin · Posted 10-20-2018 02:24 PM

data dataset1;
input id site name $;
cards;
1 22 ABC
2 22 CDF
3 22 KLM
4 34 JKL
;
run;

data dataset2;
input id1 id2 id3 site;

cards;
1 2 3 22
3 . 4 34
4 . . 34
;
run;


data want(drop=id name);
if 0 then set dataset2  dataset1;
dcl hash h(dataset:'dataset1');
h.definekey('id','site');
h.definedata('name');
h.definedone();
do until(lr);
set dataset2 end=lr;
array t id1-id3;
length new_name $20;
call missing(new_name);
do over t;
_iorc_=h.find(key:t,key:site);
if _iorc_=0 then new_name=catx(',',new_name,name);
end;
output;
end;
stop;
run;

SAS Innovate 2025: Save the Date