Jahanzaib
Quartz | Level 8

I have merged two datasets. The first has 55859 observations and the other has 57658. Added together that is 113517, but the merged dataset has only 89765 observations. So there must be some observations that match on fyear and cusip.

My question is: does the difference of 23752 observations (113517 - 89765) correspond to the observations that are the same in the two datasets?

How can I keep only the observations that are the same in both datasets, deleting those that do not match?

data mydata.E1;
   merge mydata.sdc mydata.compusip;
   by fyear cusip;
run;

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.E1 has 89765 observations and 120 variables.
NOTE: DATA statement used (Total process time):
      real time           0.14 seconds
      cpu time            0.07 seconds
ACCEPTED SOLUTION
ballardw
Super User

@Jahanzaib wrote:
@ballardw: your code produces the same number of observations.
   merge mydata.sdc (in=In1)
         mydata.compusip (in=In2);
   ;
   by fyear cusip;
   InSDC=In1;
   InCompusip=In2;
run;

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.MERGED2 has 89765 observations and 122 variables.

If you reread my post you will see that the code adds two variables to let you know which data set contributed to each record. You can do as you will with that information: for example, send the records that appear only in set 1 to one output data set, those that appear only in set 2 to a different one, and those that appear in both to a third. Or select some other combination. The purpose was to show how to get information about the contributing data sets, and the approach extends to more than two sets.
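As a minimal sketch of the three-way split described above, the IN= variables can drive explicit OUTPUT statements. The output data set names (both, only_sdc, only_compusip) are hypothetical and not from the thread. Because both inputs have repeated BY values, the IN= flags are retained within a BY group, so the counts should be read with that caveat in mind.

```sas
/* Hypothetical output names; the inputs must be sorted by fyear cusip */
data both only_sdc only_compusip;
   merge mydata.sdc      (in=In1)
         mydata.compusip (in=In2);
   by fyear cusip;
   if In1 and In2 then output both;      /* BY group present in both inputs */
   else if In1 then output only_sdc;     /* present only in mydata.sdc      */
   else output only_compusip;            /* present only in mydata.compusip */
run;
```

The log then reports one observation count per output data set, which makes it easy to see how many records matched.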

8 REPLIES
LinusH
Tourmaline | Level 20

You sound surprised that you are actually getting matches when doing a merge. Isn't that what you should expect?

If no matches were expected, perhaps you were looking to append the data instead.

The exact arithmetic of the observation counts can be uncertain: you can get unpredictable results if one or both data sets have duplicate BY values.

If you only want matching rows, use the IN= data set option in the data step (in=a on one data set, in=b on the other) together with a subsetting IF: if a and b;

In SQL, do an inner join.
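A rough sketch of both suggestions, using the data set and BY variable names from the question. The output names matched and matched_sql are assumptions for illustration only.

```sas
/* Data step: keep only BY groups that occur in both inputs */
data matched;
   merge mydata.sdc      (in=a)
         mydata.compusip (in=b);
   by fyear cusip;
   if a and b;   /* subsetting IF: drop rows not contributed to by both */
run;

/* PROC SQL: inner join on the same keys.                              */
/* SELECT * may log a warning about variables that exist in both       */
/* tables; list the columns explicitly to avoid it.                    */
proc sql;
   create table matched_sql as
   select *
   from mydata.sdc as s
        inner join mydata.compusip as c
        on s.fyear = c.fyear and s.cusip = c.cusip;
quit;
```

The two results are not guaranteed to have the same number of rows here: when both inputs have repeated key values, the SQL join pairs every matching row with every other matching row, while the data step merge pairs them in order.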

Data never sleeps
Jahanzaib
Quartz | Level 8
No, I am not surprised; I expected matches. But I want to keep only the observations that matched, not the others.
ballardw
Super User

You can use the IN= data set option to tell you which data set(s) contribute to the current record. Since the IN= variables are temporary, you need to assign them to other variables to keep the values.

This code adds two variables that will have a value of 1 if the data set contributed. The records where both InSDC and InCompusip = 1 are matches (and values for other common variables come from compusip).

 

data mydata.E1;
   merge mydata.sdc      (in=In1)
         mydata.compusip (in=In2);
   by fyear cusip;
   /* IN= variables are not written to the output, so copy them */
   InSDC=In1;
   InCompusip=In2;
run;
Jahanzaib
Quartz | Level 8
@ballardw: your code produces the same number of observations.
   merge mydata.sdc (in=In1)
         mydata.compusip (in=In2);
   ;
   by fyear cusip;
   InSDC=In1;
   InCompusip=In2;
run;

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 55859 observations read from the data set MYDATA.SDC.
NOTE: There were 57658 observations read from the data set MYDATA.COMPUSIP.
NOTE: The data set MYDATA.MERGED2 has 89765 observations and 122 variables.
Cynthia_sas
SAS Super FREQ

Hi:

  I suggest you revisit the lesson in Programming 1 on how merges work. As an example, here is a simple merge with much smaller data sets. You can work out with pencil and paper which rows are matches and which are non-matches. Using the IN= option allows you to control the output of matches and/or non-matches, as shown below.

 

cynthia

how_merge_works.png
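The attached image is not reproduced here. As a stand-in, here is a small pencil-and-paper style example with made-up data; the data set names (small_a, small_b) and all values are hypothetical and are not taken from the screenshot.

```sas
/* Two tiny inputs with unique BY values */
data small_a;
   input id x;
   datalines;
1 10
2 20
3 30
;

data small_b;
   input id y;
   datalines;
2 200
3 300
4 400
;

/* ids 2 and 3 match; id 1 is only in small_a; id 4 is only in small_b */
data matches only_a only_b;
   merge small_a (in=inA) small_b (in=inB);
   by id;
   if inA and inB then output matches;
   else if inA then output only_a;
   else output only_b;
run;
```

With unique BY values, a plain merge of these inputs would give 4 rows from 3 + 3 = 6 input rows, and the difference of 2 equals the number of matches. That simple arithmetic no longer holds once both inputs have repeated BY values, as in the original question.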

Saawan
Obsidian | Level 7

Thank you, that was clear.

Tom
Super User

You are merging two datasets that BOTH have multiple observations per BY group.  Are you sure you want to do that? Is there not another variable you can add to your BY statement to make the merge either 1 to 1 or at least 1 to N?
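One way to check which BY groups have repeats is to count observations per fyear/cusip combination. This is only a sketch; the column alias n_obs is made up for illustration.

```sas
/* List BY groups that occur more than once in mydata.sdc */
proc sql;
   select fyear, cusip, count(*) as n_obs
   from mydata.sdc
   group by fyear, cusip
   having count(*) > 1;
quit;
```

Running the same query against mydata.compusip shows whether the merge is 1 to N or N to N for a given key.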

 

What SAS will do when merging N to N is match the observations within the BY group in the order that they appear. So the first observation from table A is matched to the first observation from table B, and so on. If one of the two datasets contributes fewer observations than the other, then the values from its last observation are retained for the rest of the BY group. This includes the value of the variable specified in the IN= dataset option.
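The behaviour described above can be seen with a toy example. The values below are invented for illustration; the variables fromA and fromB are hypothetical copies of the IN= flags.

```sas
/* One BY group: 3 observations in a, 2 in b */
data a;
   input id x;
   datalines;
1 11
1 12
1 13
;

data b;
   input id y;
   datalines;
1 21
1 22
;

data n_to_n;
   merge a (in=inA) b (in=inB);
   by id;
   fromA = inA;   /* copy the temporary IN= flags so they can be inspected */
   fromB = inB;
run;

proc print data=n_to_n;
run;
```

The output has three rows. The third row pairs x=13 with y=22, the retained last value from b, and fromB is still 1 on that row even though b contributed no new observation.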

 

If for some strange reason you do want to continue merging the data in this way and merely want to eliminate the extra records in each BY group, so that the output contains only observations with data from both inputs, then you will need to reset the IN= variables so that they reflect whether a new observation has been read from that source. So if a BY group has 5 observations from A and only 3 from B, you could get SAS to output only the first 3 observations for that BY group.

 

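data want ;
  /* MERGE, not SET: SET would interleave, so in1 and in2 are never both true */
  merge a(in=in1) b(in=in2);
  by id ;
  if in1 and in2 then output;
  /* reset the flags so they are true only when a new observation is read */
  call missing(in1,in2);
run;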
