Merge two datasets

CathyVI · Posted 04-11-2023 06:06 PM

Hi,

I am not sure which is the best way to merge this dataset. I am not pleased with the result I got when I merged it so please I will like to get help on the best type of merge. Of course I have read lots of materials on one-to-one merge, many to many both in the data step and proc sql. I know the question might be what type of outcome do I want? I want an outcome that will provide the best merge.

I have millions of record but I have create a sample code, my result & concern below.

data data1;
input ID$ 1-8 ID2$ 9-16 DT1 17-25 X 27-28 Y 29-30 Z 31-32 DT2;
datalines;
l2020642 F223220 20190620 0 0 1 20190620
l2020642 Z799199 20180607 1 0 0
l2020642 Z837743 20180418 0 0 1
l2020642 Z242142 20181221 0 0 1
;
run;

data data2;
input ID$ 1-8 MTH$ 9-11 YR 12-16 ;
datalines;
l2020642 08 2017
l2020642 01 2010
;
run;

proc sort data=data1; by ID;run;
proc sort data=data2; by ID;run;

Data Want;
merge data2 (IN=IN1) data1 (IN=In2);
by ID;
if In1 and In2;
run;

Data2 is my main file so I want to join data1 to data2.

The result generate 3 rows but am worried why the MTH and YR 08/2017 came out once and 01/2010 came out twice. I don't know if the right joining should be one-to-many or many-to-many. I think am not using the right joining. Please providing a code will be help. Thanks in advance

Reeza · Posted 04-11-2023 06:11 PM

What do you want as output? I suspect in this case you need to filter your join by more than just ID.

Tom · Posted 04-12-2023 09:32 AM

The first data step is not going to work right. Your column numbers are off, but the real issue is the last field you did not list any columns, so SAS will use LIST MODE input. Which means for that it will jump to the next line to find something to read into the DT2 variable.

data data1;
  input ID$ 1-8 ID2$ 10-16 DT1 18-25 X 27 Y 29 Z 31 DT2 33-40;
datalines;
l2020642 F223220 20190620 0 0 1 20190620
l2020642 Z799199 20180607 1 0 0
l2020642 Z837743 20180418 0 0 1
l2020642 Z242142 20181221 0 0 1
;

If you fix that then you get this as the result.

Obs       ID       MTH     YR       ID2         DT1      X    Y    Z       DT2

 1     l2020642    08     2017    F223220    20190620    0    0    1    20190620
 2     l2020642    01     2010    Z799199    20180607    1    0    0           .
 3     l2020642    01     2010    Z837743    20180418    0    0    1           .
 4     l2020642    01     2010    Z242142    20181221    0    0    1           .

What output would you like instead?

Also notice that you get are warning message in the log:

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 2 observations read from the data set WORK.DATA2.
NOTE: There were 4 observations read from the data set WORK.DATA1.

When you MERGE two datasets that both have repeating observations then SAS will match then one for one until one of the datasets runs out of observations. In that case the values contributed by the "short" dataset will stay the same as the values of the last observation in the group.

If you want to eliminate those "retained" values you can make the data step a little more complex.

data want;
  merge data2(in=in1) data1(in=in2);
  by id;
  if in1 and in2 then output;
  call missing(of _all_);
run;

Results:

Obs       ID       MTH     YR       ID2         DT1      X    Y    Z       DT2

 1     l2020642    08     2017    F223220    20190620    0    0    1    20190620
 2     l2020642    01     2010    Z799199    20180607    1    0    0           .
 3     l2020642              .    Z837743    20180418    0    0    1           .
 4     l2020642              .    Z242142    20181221    0    0    1           .

If you want something else then explain what you want and how the program can KNOW which observations to output.

Kurt_Bremser · Posted 04-13-2023 04:51 AM

You really must show us what you expect as a result from this example datasets.

And do you really store your dates like this, and a month as character?

Or should your data be this:

data data1;
infile datalines truncover;
input ID$ ID2$ DT1 :mmddyy8. X Y Z DT2 :yymmdd8.;
format dt1 dt2 yymmddn8.;
datalines;
l2020642 F223220 20190620 0 0 1 20190620
l2020642 Z799199 20180607 1 0 0
l2020642 Z837743 20180418 0 0 1
l2020642 Z242142 20181221 0 0 1
;

For your second dataset, I would recommend to combine your month/year into a SAS date:

data data2;
input ID$ MTH YR;
monyear = mdy(mth,1,yr);
format monyear yymmn6.;
datalines;
l2020642 08 2017
l2020642 01 2010
;

This enables you to use SAS tools for dates (INTCK, INTHX and so on).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Merge two datasets

Re: Merge two datasets

Re: Merge two datasets

Re: Merge two datasets

Catch up on SAS Innovate 2026

Merge two datasets

Re: Merge two datasets

Re: Merge two datasets

Re: Merge two datasets

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away