BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pamplemouse22
Calcite | Level 5

Hi all, 

 

Data management question. I am trying to create a dataset using information from two datasets (not merged) and would like some suggestions how to complete this task. I'd like to sum down a column, by group, from dataset "have_a", using a condition that uses data in dataset "have_b". Is this possible without merging the two datasets?

 

I would like to generate dataset "want" by computing the average count from dataset "have_a" over the date range specified in dataset "have_b".

 

The following describes what I have and want. I hope this is clear. Thank you!! 

 

	data have_a;
	input group count date; 
	datalines; 
	1 50 01JAN2005
	1 51 02JAN2005
	1 49 03JAN2005
	1 34 04JAN2005
	1 34 05JAN2005
	;
	run;

	data have_b; 
	input group date_start date_end; 
	datalines; 
	1 01JAN2005 05JAN2005
	1 02JAN2005 05JAN2005
	; 
	run;

	data want; 
	input group date_start date_end mean; 
	datalines; 
	1 01JAN2005 05JAN2005 43.6
	1 02JAN2005 05JAN2005 42.0
	;
	run;

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Not the most efficient:

proc sql;
   create table want as
   select distinct have_b.*, mean(have_a.count) as mean
   from have_b, have_a
   where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end
   ;
quit;

Note that this requires all of the date variables to be SAS date valued variables.

 

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

@pamplemouse22

You will have to combine the data somehow in order to make the grouping information from your 2nd dataset available to your first dataset.

 

You could create a format from your 2nd dataset (using Proc Format / CNTLIN) and then directly use this format in one of the SAS procs for calculation of aggregated results.

pamplemouse22
Calcite | Level 5

Hi Patrick, 

 

Thanks! The only way I could think of merging the data is to create a binary variable "start"-1/0 and then have a single date variable and then I can merge by date, but this would involve making my dataset larger than it is - not sure I want to do that. the proc sql option below seems like a good option ! Am going to try that. Thanks for your help. 

ballardw
Super User

Not the most efficient:

proc sql;
   create table want as
   select distinct have_b.*, mean(have_a.count) as mean
   from have_b, have_a
   where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end
   ;
quit;

Note that this requires all of the date variables to be SAS date valued variables.

 

pamplemouse22
Calcite | Level 5

Thank you so much! This is exactly what I was looking for. I have a few more conditions that I didn't describe that are needed to make the merge, so I will attempt adjusting your example code, accordingly. Thanks for getting it started! I appreciate it. Might be back with additional questions. 

pamplemouse22
Calcite | Level 5

Hi ballardw,

 

Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you! 

 

 where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end

 

ballardw
Super User

@pamplemouse22 wrote:

Hi ballardw,

 

Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you! 

 

 where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end

 


The where clause restricts the data to where the "have_a" data set date is between the the start and end dates in the "have_b" data set allowing the endpoints to match.

The group by takes the results creating a "group" for the combinations of date_start and date_end for the summarization function

 

The reason I say this isn't the most effecient (in potential run time) is that this code combines every record in Have_b with every record in Have_a, tests to see if the Have_a data is the desired interval (if you have 20 records in Have_b and 100 in Have_a that is 20*100 comparisons which can get large quickly => slow run time), groups by the desired interval and then summarizes calculating the mean.

pamplemouse22
Calcite | Level 5

That is very clear, thank you. I am testing this out with a small sample of my data (and it works perfectly), but, the real dataset is very large. I will allow ample time for running. 

 

Desired intervals at the moment vary because of missing data, but with some more steps I can create a standard interval (where some of the date values are missing instead of being skipped). 

 

If there are quick adjustments to make this process more efficient, suggestions welcome.

 

Thank you so much! Appreciate it. 

Patrick
Opal | Level 21

@pamplemouse22

What does "very large" mean? 1, 10, 100, ... millions of rows?

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1154 views
  • 0 likes
  • 3 in conversation