Solved: Re: Merge and keep same observations

Jahanzaib · Posted 06-02-2017 08:52 AM

I have merged two datasets. The first has 55859 observations and the other has 57658. Together that is 113517, but the merged dataset has only 89765 observations. So some observations must match on fyear and cusip.

My question: is the difference of 23752 observations (113517 - 89765) the number of observations that are the same in the two datasets?

How can I keep only the observations that are the same in both datasets and delete those that do not match?

data mydata.E1;
   merge mydata.sdc mydata.compusip;
   by fyear cusip;
run;

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.E1 has 89765 observations and 120 variables.
NOTE: DATA statement used (Total process time):
      real time           0.14 seconds
      cpu time            0.07 seconds


LinusH · Posted 06-02-2017 09:52 AM

You sound surprised that you are actually getting matches when doing a merge. Isn't that what you should expect?

If no matches were expected, perhaps you were looking to append the data instead.

The exact arithmetic of the observation counts can be uncertain: you can get unpredictable results if one or both data sets have duplicate BY values.
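
To make the arithmetic concrete: in a match-merge, a BY group with a observations in one input and b in the other produces max(a, b) output observations, so the gap between the summed input counts and the merged count is the sum of min(a, b) over all BY groups. It counts pairings within BY groups, not distinct matching keys. A rough sketch for checking this against the log, assuming the two data sets from the question; the WORK table names and the n_sdc and n_comp columns are made up for illustration:

```sas
proc sql;
   /* observations per BY group in each input */
   create table work.n_sdc as
   select fyear, cusip, count(*) as n_sdc
   from mydata.sdc
   group by fyear, cusip;

   create table work.n_comp as
   select fyear, cusip, count(*) as n_comp
   from mydata.compusip
   group by fyear, cusip;

   /* a full join keeps BY groups that occur in only one input;  */
   /* COALESCE turns the missing count on the other side into 0  */
   select sum(max(coalesce(a.n_sdc, 0), coalesce(b.n_comp, 0))) as merged_rows,
          sum(min(coalesce(a.n_sdc, 0), coalesce(b.n_comp, 0))) as paired_rows
   from work.n_sdc as a
        full join work.n_comp as b
        on a.fyear = b.fyear and a.cusip = b.cusip;
quit;
```

If the reasoning holds for these data, merged_rows should come out as 89765 and paired_rows as 23752.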

If you only want matching rows, in the DATA step use the IN= data set option on each input (in=a, in=b) together with a subsetting IF: if a and b;

In SQL, do an inner join.
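
A minimal sketch of both suggestions, assuming the data sets from the question are already sorted by fyear and cusip; the output names work.matched and work.matched_sql are hypothetical:

```sas
/* DATA step: keep only BY groups found in both inputs */
data work.matched;
   merge mydata.sdc      (in=a)
         mydata.compusip (in=b);
   by fyear cusip;
   if a and b;            /* subsetting IF: drop observations without a match */
run;

/* PROC SQL: inner join on the same keys */
proc sql;
   create table work.matched_sql as
   select s.*             /* add columns from c explicitly as needed */
   from mydata.sdc as s
        inner join mydata.compusip as c
        on s.fyear = c.fyear and s.cusip = c.cusip;
quit;
```

The two are not equivalent when both inputs repeat BY values: the DATA step pairs observations in order within each BY group, while the SQL join returns every combination of matching rows, so the row counts can differ.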

Jahanzaib · Posted 06-02-2017 10:15 AM

No, I am not surprised. I expect matches, but I want to keep only the matched observations, not the others.

ballardw · Posted 06-02-2017 10:21 AM

You can use the IN= data set option to tell you which data set(s) contributed to the current record. Since the IN= variables are temporary, you need to assign them to other variables to keep the values.

This code adds two variables that will have a value of 1 if the data set contributed. The ones where both InSDC and InCompusip = 1 are matches (and values for other common variables come from compusip).

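data mydata.E1;
   /* IN= creates temporary flags: 1 when that data set contributed */
   merge mydata.sdc      (in=In1)
         mydata.compusip (in=In2);
   by fyear cusip;
   /* copy the temporary flags into variables that are kept in the output */
   InSDC=In1;
   InCompusip=In2;
run;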
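
One way to see how many records came from each source is to cross-tabulate the two flag variables. This is only a sketch and assumes the step above has been run so that mydata.E1 contains InSDC and InCompusip:

```sas
proc freq data=mydata.E1;
   /* the cell with InSDC=1 and InCompusip=1 counts the matched records */
   tables InSDC*InCompusip / missing norow nocol nopercent;
run;
```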

Jahanzaib · Posted 06-02-2017 09:00 PM

@ballardw: your code produces the same number of observations.
   merge mydata.sdc (in=In1)
         mydata.compusip (in=In2);
   ;
   by fyear cusip;
   InSDC=In1;
   InCompusip=In2;
run;

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.MERGED2 has 89765 observations and 122 variables.

Cynthia_sas · Posted 06-04-2017 10:47 AM

Hi:

I suggest you revisit the lesson in Programming 1 on how merges work. As an example, here is a simple merge with much smaller data sets. You can work out with pencil and paper which rows are matches and which are non-matches. Using the IN= option allows you to control the output of matches and/or non-matches, as shown below.
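
A minimal sketch of such a small merge. The data sets work.one and work.two and all of their values are made up for illustration:

```sas
data work.one;
   input id $ x;
   datalines;
A 1
B 2
C 3
;
run;

data work.two;
   input id $ y;
   datalines;
B 20
C 30
D 40
;
run;

/* both inputs are already in id order, so no PROC SORT is needed here */
data work.matches work.nonmatches;
   merge work.one (in=in_one)
         work.two (in=in_two);
   by id;
   if in_one and in_two then output work.matches;   /* id found in both */
   else output work.nonmatches;                     /* id found in only one */
run;
```

Here ids B and C go to work.matches, while A (only in work.one) and D (only in work.two) go to work.nonmatches.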

cynthia

Saawan · Posted 09-16-2017 05:58 PM

Thank you, that was clear.

ballardw · Posted 06-05-2017 12:49 PM

@Jahanzaib wrote:
@ballardw: your code produces the same number of observations.
   merge mydata.sdc (in=In1)
         mydata.compusip (in=In2);
   ;
   by fyear cusip;
   InSDC=In1;
   InCompusip=In2;
run;

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.MERGED2 has 89765 observations and 122 variables.

If you reread my post you will see that the code adds two variables to let you know which data set contributed to each record. You can do as you will with that information, such as sending the records that appear only in set 1 to one output data set, those that appear only in set 2 to a different one, and those in both to a third. Or select some other desired combination. The purpose was to show how to get information about the contributing data sets, an approach that extends to more than two sets.
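
For instance, routing the records to three output data sets could be sketched like this; the names work.both, work.sdc_only and work.compusip_only are hypothetical:

```sas
data work.both work.sdc_only work.compusip_only;
   merge mydata.sdc      (in=In1)
         mydata.compusip (in=In2);
   by fyear cusip;
   if In1 and In2 then output work.both;      /* BY group present in both inputs */
   else if In1 then output work.sdc_only;     /* only in mydata.sdc */
   else output work.compusip_only;            /* only in mydata.compusip */
run;
```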

Tom · Posted 06-04-2017 01:27 PM

You are merging two datasets that BOTH have multiple observations per BY group. Are you sure you want to do that? Is there not another variable you can add to your BY statement to make the merge either 1 to 1 or at least 1 to N?
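
A quick way to see which BY groups repeat, sketched here for mydata.sdc (the same query can be run against mydata.compusip); the column name n_obs is made up:

```sas
proc sql;
   /* list every fyear/cusip combination that occurs more than once */
   select fyear, cusip, count(*) as n_obs
   from mydata.sdc
   group by fyear, cusip
   having count(*) > 1;
quit;
```

Looking at the repeated groups may suggest which additional variable would make the key unique.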

What SAS will do when merging N to N is match the observations in the BY group in the order that they appear. So the first observation from table A is matched to the first observation from table B, and so on. If one of the two datasets contributes fewer observations than the other, then the values from its last observation are retained for the rest of the BY group. This includes the value of the variable specified in the IN= dataset option.

If for some strange reason you do want to continue merging the data in this way and merely want to eliminate the extra records in each BY group, so that the output only contains observations with data from both inputs, then you will need to reset the IN= variables so that they reflect whether a new observation has been read from that source. So if a BY group has 5 observations from A and only 3 from B, you could get SAS to output only the first 3 observations for that BY group.

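data want ;
  merge a(in=in1) b(in=in2);   /* match-merge a and b by id */
  by id ;
  if in1 and in2 then output;  /* keep only observations read from both inputs */
  call missing(in1,in2);       /* reset the flags so retained values do not count as new reads */
run;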
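
To try this on the 5-versus-3 case described above, a small test could look like the following. The data sets work.a and work.b and their values are invented for the illustration:

```sas
data work.a;            /* five observations in one BY group */
   input id x;
   datalines;
1 1
1 2
1 3
1 4
1 5
;
run;

data work.b;            /* three observations in the same BY group */
   input id y;
   datalines;
1 10
1 20
1 30
;
run;

data work.want;
   merge work.a (in=in1) work.b (in=in2);
   by id;
   if in1 and in2 then output;
   call missing(in1, in2);   /* flags become true again only on a fresh read */
run;
```

Following the explanation above, work.want should end up with three observations, one for each observation of work.b paired in order with the first three of work.a.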
