Solved: Merging multiple datasets reduces no of observations.

ihtishamsultan · Posted 08-25-2021 10:29 PM

Hello community,

I am trying to do one is to many datasets merge and used the following code

DATA HB.IPADM;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI103 (IN=B)
IPADM.CCAEI113 (IN=C)
IPADM.CCAEI123 (IN=D)
;
BY ENROLID;
IF A AND (B OR C OR D) ;
RUN;

The intent is to get all all data from B,C and D if they have a common enrolid of A.

When I run the code above I get 455 observation

When I run the merge statement individually I get different number of observations.

I ran the following code:

DATA HB.IPADM1;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI103 (IN=B);
BY ENROLID;
IF A AND B ;
RUN;

DATA HB.IPADM2;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI113 (IN=B);
BY ENROLID;
IF A AND B ;
RUN;

DATA HB.IPADM3;
MERGE SORT.ALLENRL2(IN=A KEEP=ENROLID)
IPADM.CCAEI123 (IN=B);
BY ENROLID;
IF A AND B ;
RUN;

After running the above three codes and combining them I get 555 observations. What am I doing incorrectly in the first code of one to many merge?

Many thanks for the help.

Tom · Posted 08-25-2021 11:58 PM

Use a VIEW.

data all / view=all;
   set b c d ;
   by enrolid;
run;

data want;
  merge a(in=ina) all (in=inall);
  by enrolid;
  if ina and inall;
run;

If A is tiny pull the list of ENROLID values into a macro variable and use it in a WHERE statement.

proc sql noprint;
select distinct quote(trim(enrolid)) into :list separated by ' ' from A;
quit;
data want;
  set b c d ;
  by enrolid;
  where enrolid in (&list);
run;

If A is small enough load it into a hash and use the FIND() method to check if the value is found.

If A is too large to load into a hash then create an index for A and use that in the data step to test if the id is in the A dataset.

data want;
  set b c d ;
  by enrolid;
  set a key=enrolid;
  _error_=0;
  if _iorc_ then delete;
run;

View solution in original post

ballardw · Posted 08-25-2021 10:55 PM

What happens in your first data step is that whenever more than one of the IPADM.CCAEI sets has the same Enrolid the all of those sets get merged so would have fewer records.

Example:

data example1;
  input enrolid x;
datalines;
1 1
2 2
;

data example2;
  input enrolid y;
datalines;
1  11
;

data example3;
  input enrolid z;
datalines;
1 111
;

data example4;
   input enrolid t;
datalines;
1 1111
;

data merged;
   merge 
      example1 (in=a)
      example2 (in=b)
      example3 (in=c)
      example4 (in=d)
    ;
    by enrolid;
    IF A AND (B OR C OR D) ;
run;

All 3 of the "other" datasets match but only one record is in the output.

You may need to provide some small examples, such as I did above, of your existing data and what you expect the output to look like.

If by "combining" the sets you create with the separate merges used a SET statement then you could set those 3 sets together and do a single merge (likely after sorting)

ihtishamsultan · Posted 08-25-2021 11:11 PM

Thank you, I now understand that duplicates are likely to be represented in one row with the merge statement. Is there a way where I can have separate rows for all the duplicates?

For instance :
data example1;
input enrolid ;
datalines;
1
2
;

data example2;
input enrolid y;
datalines;
1 11
1 12
3 11
;

data example3;
input enrolid y;
datalines;
1 111
1 222
2 411
;

data example4;
input enrolid y;
datalines;
1 1111
4 111
;

data merged;
merge
example1 (in=a)
example2 (in=b)
example3 (in=c)
example4 (in=d)
;
by enrolid;
IF A AND (B OR C OR D) ;
run;

Dataset wanted:
Enrolid y
1 11
1 12
1 111
1 222
1 1111
2 411

I need all entrees of ids of dataset A present in B,C and D.

ballardw · Posted 08-25-2021 11:25 PM

This could be accomplished by using a SET statement to combine example2, 3 and 4. Then merge with example1.

However, this will fail if you have duplicates of the By variable in both example1 and the combined data.

data example1;
   input enrolid ;
datalines;
1
2
;

data example2;
   input enrolid y;
datalines;
1 11
1 12
3 11
;

data example3;
   input enrolid y;
datalines;
1 111
1 222
2 411
;

data example4;
   input enrolid y;
datalines;
1 1111
4 111
;

Data stacked;
   set example2
       example3
       example4
   ;
run;

proc sort data=stacked;
  by enrolid;
run;

data merged;
   merge
      example1 (in=a)
      stacked (in=b)
   ;
   by enrolid;
   IF A AND B ;
run;

Please post code in either a text box, opened on the forum with the </> icon that appears above the message window, or the code box opened with the "running man" icon. It helps to tell when something is code or part of a question or comment and they will not reformat pasted text like the main message window. The main message windows reformat text making any code indenting go away as well as other things that can result in not-seen characters that can impact whether code will run.

ihtishamsultan · Posted 08-25-2021 11:31 PM

Apologies. I will use that from now on.

My problem with using the set statement to combine B,C and D is that the dataset B,C and D are too big to be combined on my computer (almost 200GB). Also there are no duplicates in dataset A but there are duplicates in B,C and D.

Is there an alternative?

Tom · Posted 08-25-2021 11:58 PM

Use a VIEW.

data all / view=all;
   set b c d ;
   by enrolid;
run;

data want;
  merge a(in=ina) all (in=inall);
  by enrolid;
  if ina and inall;
run;

If A is tiny pull the list of ENROLID values into a macro variable and use it in a WHERE statement.

proc sql noprint;
select distinct quote(trim(enrolid)) into :list separated by ' ' from A;
quit;
data want;
  set b c d ;
  by enrolid;
  where enrolid in (&list);
run;

If A is small enough load it into a hash and use the FIND() method to check if the value is found.

If A is too large to load into a hash then create an index for A and use that in the data step to test if the id is in the A dataset.

data want;
  set b c d ;
  by enrolid;
  set a key=enrolid;
  _error_=0;
  if _iorc_ then delete;
run;

ballardw · Posted 08-26-2021 10:19 AM

Set before or three data steps with merge and then set (or proc append which would be faster generally). There is likely to be a Proc SQL solution but if your data sets are "too big" to combine first then likely the performance in SQL is going to be slower.

SASKiwi · Posted 08-25-2021 10:56 PM

The MERGE statement cannot handle multiple input datasets where there are duplicate ENROLIDs in more than one of those input datasets.

Test each of your input datasets for duplicate ENROLIDs and if more than one has this issue then that explains the drop in observation counts.

If you are wanting to keep all duplicate ENROLIDs then you will need to switch to using SQL.

ihtishamsultan · Posted 08-25-2021 10:59 PM

Thank you, I want to keep all the duplicate records as well. Can you please help with the proc sql code. I haven't used sql much.

ChrisNZ · Posted 08-25-2021 11:02 PM

> I want to keep all the duplicate records as well.

How? What output do you expect if there are duplicates in A and B but not C? Or B and C and not A?

High-Performance SAS Coding - Third Edition

ihtishamsultan · Posted 08-25-2021 11:19 PM

In the code below my dataset example1 represents ids that I need the data for. The needed data is available in b,c and d.

For instance :
data example1;
input enrolid ;
datalines;
1
2
;

data example2;
input enrolid y;
datalines;
1 11
1 12
3 11
;

data example3;
input enrolid y;
datalines;
1 111
1 222
2 411
;

data example4;
input enrolid y;
datalines;
1 1111
4 111
;

data merged;
merge
example1 (in=a)
example2 (in=b)
example3 (in=c)
example4 (in=d)
;
by enrolid;
IF A AND (B OR C OR D) ;
run;

Dataset wanted:
Enrolid y
1 11
1 12
1 111
1 222
1 1111
2 411

I need all entrees of ids of dataset A present in B,C and D.

Kurt_Bremser · Posted 08-26-2021 03:56 AM

So you want to use example1 as a lookup table to filter data out of the others. This is easily done with a hash object:

data want;
set
  example2
  example3
  example4
;
if _n_ = 1
then do;
  declare hash ex1 (dataset:"example1");
  ex1.definekey("enrolid");
  ex1.definedone();
end;
if ex1.check() = 0;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 08-25-2021 11:00 PM

Do you get any message in the log?

With so few observations, it should be easy to look at the observations that are duplicated.

Not knowing what you want, it's is difficult tot comment further.

Please post code using the appropriate icon.

High-Performance SAS Coding - Third Edition

Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Re: Merging multiple datasets reduces no of observations.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!