DATA Step, Macro, Functions and more

Proc SQL and sample size

Occasional Contributor
Posts: 17

Proc SQL and sample size

Hi Guys,


I used Proc SQL to create a data-set from three different data sets based on identification code and year and quarter. However, the final sample size does not add up. Is there a command that can help me to make sure the final data is ok and includes all three data sets(it's a big data)? Variable wise, it is right and I checked. 


proc sql;
select *
from SASUSER.feb20INF a, allanalyst b
where a.permno=b.permno



Super User
Posts: 17,930

Re: Proc SQL and sample size

It's hard to say without more information about your sampling methodology and data. 


How big is big?

Is it worth creating a view of the data and then using proc surveyselect?

Occasional Contributor
Posts: 17

Re: Proc SQL and sample size

thank you for replying! 

My first data set  has 17931 observations, second data 5429(183 variables) observations and thrid one has 535000(117 variables) obs. in total I have 355 variables. All data sets share common identifier and dates. I need to match data set 2 and 3 with 1(the main data base). So basically, I'm matching observations from data set 2 and 3 to data set 1 based on unique identifier and date(year and quarter). 

I acquire different results with left join or natural join. 


I'm not familiar with proc surveyselect. 

Respected Advisor
Posts: 4,659

Re: Proc SQL and sample size

To find entries in feb20inf that are not matched in allanalyst, try


proc sql;
create table test as
select  *
	sasuser.feb20inf as a
	where not exists (
		select * 
		from allanalyst
		where permno=a.permno and year=a.year and quarter=a.quarter )
Super User
Posts: 9,691

Re: Proc SQL and sample size

If there are duplicated value for permno or YEAR or QUARTER , you will get more obs than expect , Try MERGE statement.


data want;

merge SASUSER.feb20INF  allanalyst 
by permno YEAR QUARTER;

Trusted Advisor
Posts: 1,115

Re: Proc SQL and sample size

Hi @HannahPhD,


If I had to investigate this situation, I would create a temporary dataset from each of the three original tables by restricting them to the key variables.



data keyall;
set allanalyst(keep=permno year quarter);

Using the resulting three comparably small datasets you can easily

  • check for duplicate key combinations
  • perform frequency counts
  • sort by key variables (if necessary)
  • eliminate duplicates (if any)
  • create a "membership profile" for each key combination (e.g. "101" means "contained in the first and third dataset, not in the second")
  • perform "test merges"

The results of these operations will give you a clear picture of the situation and hence guide you how to proceed with your "big" datasets. You will then know for sure in advance how many observations to expect from your intended joins/merges.


You should also make sure that there are no common variables except the key variables or, if there are some, that they are handled properly.

Ask a Question
Discussion stats
  • 5 replies
  • 1 like
  • 5 in conversation