DATA Step, Macro, Functions and more

Sum down a column using information from 2 unmerged datasets

Accepted Solution Solved
Reply
Contributor
Posts: 39
Accepted Solution

Sum down a column using information from 2 unmerged datasets

[ Edited ]

Hi all, 

 

Data management question. I am trying to create a dataset using information from two datasets (not merged) and would like some suggestions how to complete this task. I'd like to sum down a column, by group, from dataset "have_a", using a condition that uses data in dataset "have_b". Is this possible without merging the two datasets?

 

I would like to generate dataset "want" by computing the average count from dataset "have_a" over the date range specified in dataset "have_b".

 

The following describes what I have and want. I hope this is clear. Thank you!! 

 

	data have_a;
	input group count date; 
	datalines; 
	1 50 01JAN2005
	1 51 02JAN2005
	1 49 03JAN2005
	1 34 04JAN2005
	1 34 05JAN2005
	;
	run;

	data have_b; 
	input group date_start date_end; 
	datalines; 
	1 01JAN2005 05JAN2005
	1 02JAN2005 05JAN2005
	; 
	run;

	data want; 
	input group date_start date_end mean; 
	datalines; 
	1 01JAN2005 05JAN2005 43.6
	1 02JAN2005 05JAN2005 42.0
	;
	run;

 

 

 

 

 


Accepted Solutions
Solution
‎09-22-2017 09:34 PM
Super User
Posts: 13,947

Re: Sum down a column using information from 2 unmerged datasets

[ Edited ]
Posted in reply to pamplemouse22

Not the most efficient:

proc sql;
   create table want as
   select distinct have_b.*, mean(have_a.count) as mean
   from have_b, have_a
   where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end
   ;
quit;

Note that this requires all of the date variables to be SAS date valued variables.

 

View solution in original post


All Replies
Respected Advisor
Posts: 4,802

Re: Sum down a column using information from 2 unmerged datasets

Posted in reply to pamplemouse22

@pamplemouse22

You will have to combine the data somehow in order to make the grouping information from your 2nd dataset available to your first dataset.

 

You could create a format from your 2nd dataset (using Proc Format / CNTLIN) and then directly use this format in one of the SAS procs for calculation of aggregated results.

Contributor
Posts: 39

Re: Sum down a column using information from 2 unmerged datasets

Hi Patrick, 

 

Thanks! The only way I could think of merging the data is to create a binary variable "start"-1/0 and then have a single date variable and then I can merge by date, but this would involve making my dataset larger than it is - not sure I want to do that. the proc sql option below seems like a good option ! Am going to try that. Thanks for your help. 

Solution
‎09-22-2017 09:34 PM
Super User
Posts: 13,947

Re: Sum down a column using information from 2 unmerged datasets

[ Edited ]
Posted in reply to pamplemouse22

Not the most efficient:

proc sql;
   create table want as
   select distinct have_b.*, mean(have_a.count) as mean
   from have_b, have_a
   where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end
   ;
quit;

Note that this requires all of the date variables to be SAS date valued variables.

 

Contributor
Posts: 39

Re: Sum down a column using information from 2 unmerged datasets

Thank you so much! This is exactly what I was looking for. I have a few more conditions that I didn't describe that are needed to make the merge, so I will attempt adjusting your example code, accordingly. Thanks for getting it started! I appreciate it. Might be back with additional questions. 

Contributor
Posts: 39

Re: Sum down a column using information from 2 unmerged datasets

[ Edited ]

Hi ballardw,

 

Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you! 

 

 where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end

 

Super User
Posts: 13,947

Re: Sum down a column using information from 2 unmerged datasets

Posted in reply to pamplemouse22

pamplemouse22 wrote:

Hi ballardw,

 

Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you! 

 

 where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end

 


The where clause restricts the data to where the "have_a" data set date is between the the start and end dates in the "have_b" data set allowing the endpoints to match.

The group by takes the results creating a "group" for the combinations of date_start and date_end for the summarization function

 

The reason I say this isn't the most effecient (in potential run time) is that this code combines every record in Have_b with every record in Have_a, tests to see if the Have_a data is the desired interval (if you have 20 records in Have_b and 100 in Have_a that is 20*100 comparisons which can get large quickly => slow run time), groups by the desired interval and then summarizes calculating the mean.

Contributor
Posts: 39

Re: Sum down a column using information from 2 unmerged datasets

That is very clear, thank you. I am testing this out with a small sample of my data (and it works perfectly), but, the real dataset is very large. I will allow ample time for running. 

 

Desired intervals at the moment vary because of missing data, but with some more steps I can create a standard interval (where some of the date values are missing instead of being skipped). 

 

If there are quick adjustments to make this process more efficient, suggestions welcome.

 

Thank you so much! Appreciate it. 

Respected Advisor
Posts: 4,802

Re: Sum down a column using information from 2 unmerged datasets

Posted in reply to pamplemouse22

@pamplemouse22

What does "very large" mean? 1, 10, 100, ... millions of rows?

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 308 views
  • 0 likes
  • 3 in conversation