I have merged two datasets. The first has 55859 observations and the other has 57658. Adding the two gives 113517, but the merged dataset has only 89765 observations. So some observations must match on fyear and cusip.
My question is: does the difference of 23752 observations (113517 - 89765) correspond to the observations that are the same in the two datasets?
How can I get the observations which are the same in the two datasets by deleting those which do not match each other?
data mydata.E1;
merge mydata.sdc mydata.compusip;
by fyear cusip;
run;
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.E1 has 89765 observations and 120 variables.
NOTE: DATA statement used (Total process time):
real time 0.14 seconds
cpu time 0.07 seconds
@Jahanzaib wrote:
@ballardw: your code produces the same number of observations.
merge mydata.sdc (in=In1)
mydata.compusip (in=In2);
by fyear cusip;
InSDC=In1;
InCompusip=In2;
run;
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.MERGED2 has 89765 observations and 122 variables.
If you reread my post you will see that the code adds two variables to let you know which data set contributed to each record. You can do as you will with that information: for example, send the records that appear only in set 1 to one output data set, those only in set 2 to another, and those in both to a third, or select some other combination. The purpose was to show how to get information about contributing data sets, an approach that extends to more than two sets.
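The routing described above could be sketched roughly as follows. The output data set names (both, only_sdc, only_compusip) are hypothetical and not from the thread, and the sketch assumes both inputs are already sorted by fyear and cusip.

```sas
/* Sketch only: output data set names are made up for illustration. */
data both only_sdc only_compusip;
  merge mydata.sdc      (in=In1)
        mydata.compusip (in=In2);
  by fyear cusip;
  if In1 and In2 then output both;      /* BY group present in both inputs */
  else if In1 then output only_sdc;     /* present only in mydata.sdc      */
  else output only_compusip;            /* present only in mydata.compusip */
run;
```

Because both inputs have repeated BY values, the IN= flags describe the BY group rather than each individual row, so the counts in these three data sets should be read with that in mind.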
You sound surprised that you are actually getting matches when doing a merge. Isn't that what you should expect?
If no matches were expected, perhaps you were looking to append the data.
The exact arithmetic of the observation counts can be uncertain: you can get unpredictable results if one or both data sets have duplicates.
If you only want matching rows, in the data step use the IN= data set option on each input (for example in=a and in=b) together with a subsetting IF: if a and b;
In SQL, do an inner join.
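A minimal sketch of both suggestions, assuming the inputs are sorted by fyear and cusip for the data step. The output names matched and matched_sql are hypothetical.

```sas
/* Data step: keep only BY groups found in both inputs. */
data matched;
  merge mydata.sdc      (in=a)
        mydata.compusip (in=b);
  by fyear cusip;
  if a and b;                 /* subsetting IF: drop non-matching rows */
run;

/* PROC SQL: inner join on the same keys.                          */
/* select * warns about same-named columns; list the columns       */
/* explicitly in real use.                                         */
proc sql;
  create table matched_sql as
  select *
  from mydata.sdc as s
       inner join
       mydata.compusip as c
  on s.fyear = c.fyear and s.cusip = c.cusip;
quit;
```

The two approaches agree only when at least one input has a single row per fyear and cusip. With repeats on both sides, the SQL join returns every pairing within a BY group, while the data step pairs rows in order, so the row counts can differ.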
You can use the IN= data set option to tell you which data set(s) contribute to the current record. Since the IN= variables are temporary, you need to assign them to new variables to keep the values.
This code adds two variables that will have a value of 1 if the data set contributed. The ones where both InSDC and InCompusip = 1 are matches (and values for other common variables come from compusip).
data mydata.E1;
merge mydata.sdc (in=In1)
mydata.compusip (in=In2);
by fyear cusip;
InSDC=In1;
InCompusip=In2;
run;
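One way to see how many records fall into each category is to cross-tabulate the two flag variables. This is only a sketch, and it assumes the step above has been run so that mydata.E1 contains InSDC and InCompusip.

```sas
/* Counts of records by contributing data set(s). */
proc freq data=mydata.E1;
  tables InSDC*InCompusip / norow nocol nopercent;
run;
```

The cell where both flags equal 1 holds the records from BY groups found in both data sets.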
Hi:
I suggest you revisit the lesson in Programming 1 on how merges work. As an example, here is a simple merge with much smaller data sets. You can work out with pencil and paper which rows are matches and which are non-matches. Using the IN= option allows you to control the output of matches and/or non-matches, as shown below.
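A minimal sketch of such a small merge, using hypothetical data sets a and b with made-up values that are not from the original post:

```sas
/* Two tiny inputs, already in id order. */
data a;
  input id x;
  datalines;
1 10
2 20
3 30
;
run;

data b;
  input id y;
  datalines;
2 200
3 300
4 400
;
run;

/* IN= flags control where matches and non-matches go. */
data matches only_a only_b;
  merge a(in=ina) b(in=inb);
  by id;
  if ina and inb then output matches;   /* ids 2 and 3 */
  else if ina then output only_a;       /* id 1        */
  else output only_b;                   /* id 4        */
run;
```

Here six rows are read in total and a plain merge would write four, so the difference of two equals the number of matched ids. That simple arithmetic holds only because each id appears at most once in each input.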
cynthia
Thank you that was clear.
You are merging two datasets that BOTH have multiple observations per BY group. Are you sure you want to do that? Is there not another variable you can add to your BY statement to make the merge either 1 to 1 or at least 1 to N?
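To see which BY groups have repeats, a query along these lines may help. It is only a sketch, and the output names sdc_repeats and compusip_repeats and the column n_obs are hypothetical.

```sas
/* List fyear/cusip combinations that occur more than once in each input. */
proc sql;
  create table sdc_repeats as
  select fyear, cusip, count(*) as n_obs
  from mydata.sdc
  group by fyear, cusip
  having count(*) > 1;

  create table compusip_repeats as
  select fyear, cusip, count(*) as n_obs
  from mydata.compusip
  group by fyear, cusip
  having count(*) > 1;
quit;
```

Looking at a few of the repeated groups usually shows which additional variable, if any, would make the key unique.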
What SAS will do when merging N to N is match the observations in the BY group in the order that they appear. So the first observation from table A is matched to the first observation from table B, and so on. If one of the two datasets contributes fewer observations than the other, then the values from its last observation are retained for the rest of the BY group. This includes the value of the variable specified in the IN= dataset option.
If for some strange reason you do want to continue merging the data in this way, and merely want to eliminate the extra records in each BY group so that the output contains only observations with data from both inputs, then you will need to reset the IN= variables so that they reflect whether a new observation has been read from that source. So if a BY group has 5 observations from A and only 3 from B, you could get SAS to output only the first 3 observations for that BY group.
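data want ;
  /* MERGE, not SET: with SET the two IN= flags are never both true */
  merge a(in=in1) b(in=in2);
  by id ;
  /* keep only rows where both inputs supplied a new observation */
  if in1 and in2 then output;
  /* reset the flags so they are set again only on the next read */
  call missing(in1,in2);
run;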
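A small test of this technique, using made-up data with 5 observations from a and 3 from b in a single BY group, might look like the sketch below. The values are hypothetical and only serve to make the row counts easy to check.

```sas
/* Five rows for id 1 in a, three rows for id 1 in b. */
data a;
  input id x;
  datalines;
1 1
1 2
1 3
1 4
1 5
;
run;

data b;
  input id y;
  datalines;
1 101
1 102
1 103
;
run;

data want;
  merge a(in=in1) b(in=in2);
  by id;
  if in1 and in2 then output;   /* both inputs supplied a row this pass */
  call missing(in1, in2);       /* clear flags before the next read     */
run;
```

Per the explanation above, want should end up with the first 3 paired rows. Without the call missing line, in2 would stay at 1 for the rest of the BY group and all 5 rows would be written.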