02-20-2016 10:10 PM
I used Proc SQL to create a data-set from three different data sets based on identification code and year and quarter. However, the final sample size does not add up. Is there a command that can help me to make sure the final data is ok and includes all three data sets(it's a big data)? Variable wise, it is right and I checked.
CREATE TABLE SASUSER.FINALBEH AS
from SASUSER.feb20INF a, allanalyst b
and a.YEAR=b.YEAR AND A.QUARTER=B.QUARTER;
02-20-2016 10:35 PM
It's hard to say without more information about your sampling methodology and data.
How big is big?
Is it worth creating a view of the data and then using proc surveyselect?
02-20-2016 10:55 PM
thank you for replying!
My first data set has 17931 observations, second data 5429(183 variables) observations and thrid one has 535000(117 variables) obs. in total I have 355 variables. All data sets share common identifier and dates. I need to match data set 2 and 3 with 1(the main data base). So basically, I'm matching observations from data set 2 and 3 to data set 1 based on unique identifier and date(year and quarter).
I acquire different results with left join or natural join.
I'm not familiar with proc surveyselect.
02-20-2016 10:59 PM
To find entries in feb20inf that are not matched in allanalyst, try
proc sql; create table test as select * from sasuser.feb20inf as a where not exists ( select * from allanalyst where permno=a.permno and year=a.year and quarter=a.quarter ) ; quit;
02-21-2016 09:20 PM
If there are duplicated value for permno or YEAR or QUARTER , you will get more obs than expect , Try MERGE statement.
merge SASUSER.feb20INF allanalyst
by permno YEAR QUARTER;
02-23-2016 11:15 AM
If I had to investigate this situation, I would create a temporary dataset from each of the three original tables by restricting them to the key variables.
data keyall; set allanalyst(keep=permno year quarter); run;
Using the resulting three comparably small datasets you can easily
The results of these operations will give you a clear picture of the situation and hence guide you how to proceed with your "big" datasets. You will then know for sure in advance how many observations to expect from your intended joins/merges.
You should also make sure that there are no common variables except the key variables or, if there are some, that they are handled properly.