BookmarkSubscribeRSS Feed
HannahPhD
Obsidian | Level 7

Hi Guys,

 

I used Proc SQL to create a data-set from three different data sets based on identification code and year and quarter. However, the final sample size does not add up. Is there a command that can help me to make sure the final data is ok and includes all three data sets(it's a big data)? Variable wise, it is right and I checked. 

 

proc sql;
CREATE TABLE SASUSER.FINALBEH AS
select *
from SASUSER.feb20INF a, allanalyst b
where a.permno=b.permno
and a.YEAR=b.YEAR AND A.QUARTER=B.QUARTER;
quit;

 

Thanks! 

5 REPLIES 5
Reeza
Super User

It's hard to say without more information about your sampling methodology and data. 

 

How big is big?

Is it worth creating a view of the data and then using proc surveyselect?

HannahPhD
Obsidian | Level 7

thank you for replying! 

My first data set  has 17931 observations, second data 5429(183 variables) observations and thrid one has 535000(117 variables) obs. in total I have 355 variables. All data sets share common identifier and dates. I need to match data set 2 and 3 with 1(the main data base). So basically, I'm matching observations from data set 2 and 3 to data set 1 based on unique identifier and date(year and quarter). 

I acquire different results with left join or natural join. 

 

I'm not familiar with proc surveyselect. 

PGStats
Opal | Level 21

To find entries in feb20inf that are not matched in allanalyst, try

 

proc sql;
create table test as
select  *
from 
	sasuser.feb20inf as a
	where not exists (
		select * 
		from allanalyst
		where permno=a.permno and year=a.year and quarter=a.quarter )
;
quit;
PG
Ksharp
Super User

If there are duplicated value for permno or YEAR or QUARTER , you will get more obs than expect , Try MERGE statement.

 

data want;

merge SASUSER.feb20INF  allanalyst 
by permno YEAR QUARTER;
run;

FreelanceReinh
Jade | Level 19

Hi @HannahPhD,

 

If I had to investigate this situation, I would create a temporary dataset from each of the three original tables by restricting them to the key variables.

 

Example:

data keyall;
set allanalyst(keep=permno year quarter);
run;

Using the resulting three comparably small datasets you can easily

  • check for duplicate key combinations
  • perform frequency counts
  • sort by key variables (if necessary)
  • eliminate duplicates (if any)
  • create a "membership profile" for each key combination (e.g. "101" means "contained in the first and third dataset, not in the second")
  • perform "test merges"


The results of these operations will give you a clear picture of the situation and hence guide you how to proceed with your "big" datasets. You will then know for sure in advance how many observations to expect from your intended joins/merges.

 

You should also make sure that there are no common variables except the key variables or, if there are some, that they are handled properly.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1476 views
  • 1 like
  • 5 in conversation