I am not sure which is the best way to merge this dataset. I am not pleased with the result I got when I merged it so please I will like to get help on the best type of merge. Of course I have read lots of materials on one-to-one merge, many to many both in the data step and proc sql. I know the question might be what type of outcome do I want? I want an outcome that will provide the best merge.
I have millions of record but I have create a sample code, my result & concern below.
data data1; input ID$ 1-8 ID2$ 9-16 DT1 17-25 X 27-28 Y 29-30 Z 31-32 DT2; datalines; l2020642 F223220 20190620 0 0 1 20190620 l2020642 Z799199 20180607 1 0 0 l2020642 Z837743 20180418 0 0 1 l2020642 Z242142 20181221 0 0 1 ; run; data data2; input ID$ 1-8 MTH$ 9-11 YR 12-16 ; datalines; l2020642 08 2017 l2020642 01 2010 ; run; proc sort data=data1; by ID;run; proc sort data=data2; by ID;run; Data Want; merge data2 (IN=IN1) data1 (IN=In2); by ID; if In1 and In2; run;
Data2 is my main file so I want to join data1 to data2.
The result generate 3 rows but am worried why the MTH and YR 08/2017 came out once and 01/2010 came out twice. I don't know if the right joining should be one-to-many or many-to-many. I think am not using the right joining. Please providing a code will be help. Thanks in advance
The first data step is not going to work right. Your column numbers are off, but the real issue is the last field you did not list any columns, so SAS will use LIST MODE input. Which means for that it will jump to the next line to find something to read into the DT2 variable.
data data1; input ID$ 1-8 ID2$ 10-16 DT1 18-25 X 27 Y 29 Z 31 DT2 33-40; datalines; l2020642 F223220 20190620 0 0 1 20190620 l2020642 Z799199 20180607 1 0 0 l2020642 Z837743 20180418 0 0 1 l2020642 Z242142 20181221 0 0 1 ;
If you fix that then you get this as the result.
Obs ID MTH YR ID2 DT1 X Y Z DT2 1 l2020642 08 2017 F223220 20190620 0 0 1 20190620 2 l2020642 01 2010 Z799199 20180607 1 0 0 . 3 l2020642 01 2010 Z837743 20180418 0 0 1 . 4 l2020642 01 2010 Z242142 20181221 0 0 1 .
What output would you like instead?
Also notice that you get are warning message in the log:
NOTE: MERGE statement has more than one data set with repeats of BY values. NOTE: There were 2 observations read from the data set WORK.DATA2. NOTE: There were 4 observations read from the data set WORK.DATA1.
When you MERGE two datasets that both have repeating observations then SAS will match then one for one until one of the datasets runs out of observations. In that case the values contributed by the "short" dataset will stay the same as the values of the last observation in the group.
If you want to eliminate those "retained" values you can make the data step a little more complex.
data want; merge data2(in=in1) data1(in=in2); by id; if in1 and in2 then output; call missing(of _all_); run;
Obs ID MTH YR ID2 DT1 X Y Z DT2 1 l2020642 08 2017 F223220 20190620 0 0 1 20190620 2 l2020642 01 2010 Z799199 20180607 1 0 0 . 3 l2020642 . Z837743 20180418 0 0 1 . 4 l2020642 . Z242142 20181221 0 0 1 .
If you want something else then explain what you want and how the program can KNOW which observations to output.
You really must show us what you expect as a result from this example datasets.
And do you really store your dates like this, and a month as character?
Or should your data be this:
data data1; infile datalines truncover; input ID$ ID2$ DT1 :mmddyy8. X Y Z DT2 :yymmdd8.; format dt1 dt2 yymmddn8.; datalines; l2020642 F223220 20190620 0 0 1 20190620 l2020642 Z799199 20180607 1 0 0 l2020642 Z837743 20180418 0 0 1 l2020642 Z242142 20181221 0 0 1 ;
For your second dataset, I would recommend to combine your month/year into a SAS date:
data data2; input ID$ MTH YR; monyear = mdy(mth,1,yr); format monyear yymmn6.; datalines; l2020642 08 2017 l2020642 01 2010 ;
This enables you to use SAS tools for dates (INTCK, INTHX and so on).
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.