Solved: Re: Sum down a column using information from 2 unmerged datasets

pamplemouse22 · Posted 09-22-2017 05:38 PM

Hi all,

Data management question. I am trying to create a dataset using information from two datasets (not merged) and would like some suggestions how to complete this task. I'd like to sum down a column, by group, from dataset "have_a", using a condition that uses data in dataset "have_b". Is this possible without merging the two datasets?

I would like to generate dataset "want" by computing the average count from dataset "have_a" over the date range specified in dataset "have_b".

The following describes what I have and want. I hope this is clear. Thank you!!

	data have_a;
	input group count date; 
	datalines; 
	1 50 01JAN2005
	1 51 02JAN2005
	1 49 03JAN2005
	1 34 04JAN2005
	1 34 05JAN2005
	;
	run;

	data have_b; 
	input group date_start date_end; 
	datalines; 
	1 01JAN2005 05JAN2005
	1 02JAN2005 05JAN2005
	; 
	run;

	data want; 
	input group date_start date_end mean; 
	datalines; 
	1 01JAN2005 05JAN2005 43.6
	1 02JAN2005 05JAN2005 42.0
	;
	run;

ballardw · Posted 09-22-2017 06:55 PM

Not the most efficient:

proc sql;
   create table want as
   select distinct have_b.*, mean(have_a.count) as mean
   from have_b, have_a
   where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end
   ;
quit;

Note that this requires all of the date variables to be SAS date valued variables.

View solution in original post

Patrick · Posted 09-22-2017 06:48 PM

@pamplemouse22

You will have to combine the data somehow in order to make the grouping information from your 2nd dataset available to your first dataset.

You could create a format from your 2nd dataset (using Proc Format / CNTLIN) and then directly use this format in one of the SAS procs for calculation of aggregated results.

pamplemouse22 · Posted 09-22-2017 09:32 PM

Hi Patrick,

Thanks! The only way I could think of merging the data is to create a binary variable "start"-1/0 and then have a single date variable and then I can merge by date, but this would involve making my dataset larger than it is - not sure I want to do that. the proc sql option below seems like a good option ! Am going to try that. Thanks for your help.

ballardw · Posted 09-22-2017 06:55 PM

Not the most efficient:

proc sql;
   create table want as
   select distinct have_b.*, mean(have_a.count) as mean
   from have_b, have_a
   where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end
   ;
quit;

Note that this requires all of the date variables to be SAS date valued variables.

pamplemouse22 · Posted 09-22-2017 09:33 PM

Thank you so much! This is exactly what I was looking for. I have a few more conditions that I didn't describe that are needed to make the merge, so I will attempt adjusting your example code, accordingly. Thanks for getting it started! I appreciate it. Might be back with additional questions.

pamplemouse22 · Posted 09-25-2017 01:14 PM

Hi ballardw,

Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you!

 where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end

ballardw · Posted 09-26-2017 11:02 AM

@pamplemouse22 wrote:

Hi ballardw,

Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you!
 where  have_b.date_start le have_a.date le have_b.date_end
   group by have_b.date_start,  have_b.date_end

The where clause restricts the data to where the "have_a" data set date is between the the start and end dates in the "have_b" data set allowing the endpoints to match.

The group by takes the results creating a "group" for the combinations of date_start and date_end for the summarization function

The reason I say this isn't the most effecient (in potential run time) is that this code combines every record in Have_b with every record in Have_a, tests to see if the Have_a data is the desired interval (if you have 20 records in Have_b and 100 in Have_a that is 20*100 comparisons which can get large quickly => slow run time), groups by the desired interval and then summarizes calculating the mean.

pamplemouse22 · Posted 09-26-2017 11:39 AM

That is very clear, thank you. I am testing this out with a small sample of my data (and it works perfectly), but, the real dataset is very large. I will allow ample time for running.

Desired intervals at the moment vary because of missing data, but with some more steps I can create a standard interval (where some of the date values are missing instead of being skipped).

If there are quick adjustments to make this process more efficient, suggestions welcome.

Thank you so much! Appreciate it.

Patrick · Posted 09-26-2017 05:19 PM

@pamplemouse22

What does "very large" mean? 1, 10, 100, ... millions of rows?

Classroom Training Available!