BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ihtishamsultan
Obsidian | Level 7

Hello community,

 

I am trying to do one is to many datasets merge and used the following code

 

DATA HB.IPADM;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI103 (IN=B)
IPADM.CCAEI113 (IN=C)
IPADM.CCAEI123 (IN=D)
;
BY ENROLID;
IF A AND (B OR C OR D) ;
RUN;

The intent is to get all all data from B,C and D if they have a common enrolid of A.

When I run the  code above I get 455 observation

 

 

 

When I run the merge statement individually I get different number of observations.

I ran the following code:

DATA HB.IPADM1;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI103 (IN=B);
BY ENROLID;
IF A AND B ;
RUN;

 

DATA HB.IPADM2;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI113 (IN=B);
BY ENROLID;
IF A AND B ;
RUN;

 

DATA HB.IPADM3;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI123 (IN=B);
BY ENROLID;
IF A AND B ;
RUN;

 

After running the above three codes and combining them I get 555 observations. What am I doing incorrectly in the first code of one to many merge?

 

Many thanks for the help.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Use a VIEW.

 

data all / view=all;
   set b c d ;
   by enrolid;
run;

data want;
  merge a(in=ina) all (in=inall);
  by enrolid;
  if ina and inall;
run;

If A is tiny pull the list of ENROLID values into a macro variable and use it in a WHERE statement.

proc sql noprint;
select distinct quote(trim(enrolid)) into :list separated by ' ' from A;
quit;
data want;
  set b c d ;
  by enrolid;
  where enrolid in (&list);
run;

If A is small enough load it into a hash and use the FIND() method to check if the value is found.

 

If A is too large to load into a hash then create an index for A and use that in the data step to test if the id is in the A dataset.

data want;
  set b c d ;
  by enrolid;
  set a key=enrolid;
_error_=0; if _iorc_ then delete; run;

View solution in original post

12 REPLIES 12
ballardw
Super User

What happens in your first data step is that whenever more than one of the IPADM.CCAEI sets has the same Enrolid the all of those sets get merged so would have fewer records.

Example:

data example1;
  input enrolid x;
datalines;
1 1
2 2
;

data example2;
  input enrolid y;
datalines;
1  11
;

data example3;
  input enrolid z;
datalines;
1 111
;

data example4;
   input enrolid t;
datalines;
1 1111
;

data merged;
   merge 
      example1 (in=a)
      example2 (in=b)
      example3 (in=c)
      example4 (in=d)
    ;
    by enrolid;
    IF A AND (B OR C OR D) ;
run;

All 3 of the "other" datasets match but only one record is in the output.

 

You may need to provide some small examples, such as I did above, of your existing data and what you expect the output to look like.

 

If by "combining" the sets you create with the separate merges used a SET statement then you could set those 3 sets together and do a single merge (likely after sorting)

 

 

ihtishamsultan
Obsidian | Level 7
Thank you, I now understand that duplicates are likely to be represented in one row with the merge statement. Is there a way where I can have separate rows for all the duplicates?

For instance :
data example1;
input enrolid ;
datalines;
1
2
;

data example2;
input enrolid y;
datalines;
1 11
1 12
3 11
;

data example3;
input enrolid y;
datalines;
1 111
1 222
2 411
;

data example4;
input enrolid y;
datalines;
1 1111
4 111
;

data merged;
merge
example1 (in=a)
example2 (in=b)
example3 (in=c)
example4 (in=d)
;
by enrolid;
IF A AND (B OR C OR D) ;
run;


Dataset wanted:
Enrolid y
1 11
1 12
1 111
1 222
1 1111
2 411

I need all entrees of ids of dataset A present in B,C and D.
ballardw
Super User

This could be accomplished by using a SET statement to combine example2, 3 and 4. Then merge with example1.

However, this will fail if you have duplicates of the By variable in both example1 and the combined data.

 

data example1;
   input enrolid ;
datalines;
1
2
;

data example2;
   input enrolid y;
datalines;
1 11
1 12
3 11
;

data example3;
   input enrolid y;
datalines;
1 111
1 222
2 411
;

data example4;
   input enrolid y;
datalines;
1 1111
4 111
;

Data stacked;
   set example2
       example3
       example4
   ;
run;

proc sort data=stacked;
  by enrolid;
run;

data merged;
   merge
      example1 (in=a)
      stacked (in=b)
   ;
   by enrolid;
   IF A AND B ;
run;

Please post code in either a text box, opened on the forum with the </> icon that appears above the message window, or the code box opened with the "running man" icon. It helps to tell when something is code or part of a question or comment and they will not reformat pasted text like the main message window. The main message windows reformat text making any code indenting go away as well as other things that can result in not-seen characters that can impact whether code will run.

ihtishamsultan
Obsidian | Level 7
Apologies. I will use that from now on.

My problem with using the set statement to combine B,C and D is that the dataset B,C and D are too big to be combined on my computer (almost 200GB). Also there are no duplicates in dataset A but there are duplicates in B,C and D.

Is there an alternative?
Tom
Super User Tom
Super User

Use a VIEW.

 

data all / view=all;
   set b c d ;
   by enrolid;
run;

data want;
  merge a(in=ina) all (in=inall);
  by enrolid;
  if ina and inall;
run;

If A is tiny pull the list of ENROLID values into a macro variable and use it in a WHERE statement.

proc sql noprint;
select distinct quote(trim(enrolid)) into :list separated by ' ' from A;
quit;
data want;
  set b c d ;
  by enrolid;
  where enrolid in (&list);
run;

If A is small enough load it into a hash and use the FIND() method to check if the value is found.

 

If A is too large to load into a hash then create an index for A and use that in the data step to test if the id is in the A dataset.

data want;
  set b c d ;
  by enrolid;
  set a key=enrolid;
_error_=0; if _iorc_ then delete; run;
ballardw
Super User

Set before or three data steps with merge and then set (or proc append which would be faster generally). There is likely to be a Proc SQL solution but if your data sets are "too big" to combine first then likely the performance in SQL is going to be slower.

SASKiwi
PROC Star

The MERGE statement cannot handle multiple input datasets where there are duplicate ENROLIDs in more than one of those input datasets.

 

Test each of your input datasets for duplicate ENROLIDs and if more than one has this issue then that explains the drop in observation counts.

 

If you are wanting to keep all duplicate ENROLIDs then you will need to switch to using SQL.

ihtishamsultan
Obsidian | Level 7

Thank you, I want to keep all the duplicate records as well. Can you please help with the proc sql code. I haven't used sql much.

ChrisNZ
Tourmaline | Level 20

> I want to keep all the duplicate records as well.

How?  What output do you expect if there are duplicates in A and B but not C?  Or B and C and not A?

ihtishamsultan
Obsidian | Level 7
In the code below my dataset example1 represents ids that I need the data for. The needed data is available in b,c and d.

For instance :
data example1;
input enrolid ;
datalines;
1
2
;

data example2;
input enrolid y;
datalines;
1 11
1 12
3 11
;

data example3;
input enrolid y;
datalines;
1 111
1 222
2 411
;

data example4;
input enrolid y;
datalines;
1 1111
4 111
;

data merged;
merge
example1 (in=a)
example2 (in=b)
example3 (in=c)
example4 (in=d)
;
by enrolid;
IF A AND (B OR C OR D) ;
run;


Dataset wanted:
Enrolid y
1 11
1 12
1 111
1 222
1 1111
2 411

I need all entrees of ids of dataset A present in B,C and D.
Kurt_Bremser
Super User

So you want to use example1 as a lookup table to filter data out of the others. This is easily done with a hash object:

data want;
set
  example2
  example3
  example4
;
if _n_ = 1
then do;
  declare hash ex1 (dataset:"example1");
  ex1.definekey("enrolid");
  ex1.definedone();
end;
if ex1.check() = 0;
run;
ChrisNZ
Tourmaline | Level 20

Do you get any message in the log?

With so few observations, it should be easy to look at the observations that are duplicated.

Not knowing what you want, it's is difficult tot comment further.

Please post code using the appropriate icon.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1972 views
  • 4 likes
  • 6 in conversation