BookmarkSubscribeRSS Feed
CathyVI
Pyrite | Level 9

Hi,

I am not sure which is the best way to merge this dataset. I am not pleased with the result I got when I merged it so please I will like to get help on the best type of merge. Of course I have read lots of materials on one-to-one merge, many to many both in the data step and proc sql. I know the question might be what type of outcome do I want? I want an outcome that will provide the best merge.

I have millions of record but I have create a sample code, my result & concern below.

data data1;
input ID$ 1-8 ID2$ 9-16 DT1 17-25 X 27-28 Y 29-30 Z 31-32 DT2;
datalines;
l2020642 F223220 20190620 0 0 1 20190620
l2020642 Z799199 20180607 1 0 0
l2020642 Z837743 20180418 0 0 1
l2020642 Z242142 20181221 0 0 1
;
run;

data data2;
input ID$ 1-8 MTH$ 9-11 YR 12-16 ;
datalines;
l2020642 08 2017
l2020642 01 2010
;
run;

proc sort data=data1; by ID;run;
proc sort data=data2; by ID;run;

Data Want;
merge data2 (IN=IN1) data1 (IN=In2);
by ID;
if In1 and In2;
run;

Data2 is my main file so I want to join data1 to data2.

The result generate 3 rows but am worried why the MTH and YR 08/2017 came out once and 01/2010 came out twice. I don't know if the right joining should be one-to-many or many-to-many. I think am not using the right joining. Please providing a code will be help. Thanks in advance

 
 
 
 
 
 
3 REPLIES 3
Reeza
Super User
What do you want as output? I suspect in this case you need to filter your join by more than just ID.
Tom
Super User Tom
Super User

The first data step is not going to work right.  Your column numbers are off, but the real issue is the last field you did not list any columns, so SAS will use LIST MODE input.  Which means for that it will jump to the next line to find something to read into the DT2 variable.

data data1;
  input ID$ 1-8 ID2$ 10-16 DT1 18-25 X 27 Y 29 Z 31 DT2 33-40;
datalines;
l2020642 F223220 20190620 0 0 1 20190620
l2020642 Z799199 20180607 1 0 0
l2020642 Z837743 20180418 0 0 1
l2020642 Z242142 20181221 0 0 1
;

If you fix that then you get this as the result.

Obs       ID       MTH     YR       ID2         DT1      X    Y    Z       DT2

 1     l2020642    08     2017    F223220    20190620    0    0    1    20190620
 2     l2020642    01     2010    Z799199    20180607    1    0    0           .
 3     l2020642    01     2010    Z837743    20180418    0    0    1           .
 4     l2020642    01     2010    Z242142    20181221    0    0    1           .

What output would you like instead?

 

Also notice that you get are warning message in the log:

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 2 observations read from the data set WORK.DATA2.
NOTE: There were 4 observations read from the data set WORK.DATA1.

When you MERGE two datasets that both have repeating observations then SAS will match then one for one until one of the datasets runs out of observations.  In that case the values contributed by the "short" dataset will stay the same as the values of the last observation in the group.

 

If you want to eliminate those "retained" values you can make the data step a little more complex.

data want;
  merge data2(in=in1) data1(in=in2);
  by id;
  if in1 and in2 then output;
  call missing(of _all_);
run;

Results:

Obs       ID       MTH     YR       ID2         DT1      X    Y    Z       DT2

 1     l2020642    08     2017    F223220    20190620    0    0    1    20190620
 2     l2020642    01     2010    Z799199    20180607    1    0    0           .
 3     l2020642              .    Z837743    20180418    0    0    1           .
 4     l2020642              .    Z242142    20181221    0    0    1           .

If you want something else then explain what you want and how the program can KNOW which observations to output.

Kurt_Bremser
Super User

You really must show us what you expect as a result from this example datasets.

 

And do you really store your dates like this, and a month as character?

Or should your data be this:

data data1;
infile datalines truncover;
input ID$ ID2$ DT1 :mmddyy8. X Y Z DT2 :yymmdd8.;
format dt1 dt2 yymmddn8.;
datalines;
l2020642 F223220 20190620 0 0 1 20190620
l2020642 Z799199 20180607 1 0 0
l2020642 Z837743 20180418 0 0 1
l2020642 Z242142 20181221 0 0 1
;

For your second dataset, I would recommend to combine your month/year into a SAS date:

data data2;
input ID$ MTH YR;
monyear = mdy(mth,1,yr);
format monyear yymmn6.;
datalines;
l2020642 08 2017
l2020642 01 2010
;

This enables you to use SAS tools for dates (INTCK, INTHX and so on).

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 543 views
  • 0 likes
  • 4 in conversation