Hi all,
Data management question. I am trying to create a dataset using information from two datasets (not merged) and would like some suggestions how to complete this task. I'd like to sum down a column, by group, from dataset "have_a", using a condition that uses data in dataset "have_b". Is this possible without merging the two datasets?
I would like to generate dataset "want" by computing the average count from dataset "have_a" over the date range specified in dataset "have_b".
The following describes what I have and want. I hope this is clear. Thank you!!
data have_a; input group count date; datalines; 1 50 01JAN2005 1 51 02JAN2005 1 49 03JAN2005 1 34 04JAN2005 1 34 05JAN2005 ; run; data have_b; input group date_start date_end; datalines; 1 01JAN2005 05JAN2005 1 02JAN2005 05JAN2005 ; run; data want; input group date_start date_end mean; datalines; 1 01JAN2005 05JAN2005 43.6 1 02JAN2005 05JAN2005 42.0 ; run;
Not the most efficient:
proc sql; create table want as select distinct have_b.*, mean(have_a.count) as mean from have_b, have_a where have_b.date_start le have_a.date le have_b.date_end group by have_b.date_start, have_b.date_end ; quit;
Note that this requires all of the date variables to be SAS date valued variables.
You will have to combine the data somehow in order to make the grouping information from your 2nd dataset available to your first dataset.
You could create a format from your 2nd dataset (using Proc Format / CNTLIN) and then directly use this format in one of the SAS procs for calculation of aggregated results.
Hi Patrick,
Thanks! The only way I could think of merging the data is to create a binary variable "start"-1/0 and then have a single date variable and then I can merge by date, but this would involve making my dataset larger than it is - not sure I want to do that. the proc sql option below seems like a good option ! Am going to try that. Thanks for your help.
Not the most efficient:
proc sql; create table want as select distinct have_b.*, mean(have_a.count) as mean from have_b, have_a where have_b.date_start le have_a.date le have_b.date_end group by have_b.date_start, have_b.date_end ; quit;
Note that this requires all of the date variables to be SAS date valued variables.
Thank you so much! This is exactly what I was looking for. I have a few more conditions that I didn't describe that are needed to make the merge, so I will attempt adjusting your example code, accordingly. Thanks for getting it started! I appreciate it. Might be back with additional questions.
Hi ballardw,
Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you!
where have_b.date_start le have_a.date le have_b.date_end group by have_b.date_start, have_b.date_end
@pamplemouse22 wrote:
Hi ballardw,
Can you please explain to me how this part of your code pulls the right data for the calculation? Thank you!
where have_b.date_start le have_a.date le have_b.date_end group by have_b.date_start, have_b.date_end
The where clause restricts the data to where the "have_a" data set date is between the the start and end dates in the "have_b" data set allowing the endpoints to match.
The group by takes the results creating a "group" for the combinations of date_start and date_end for the summarization function
The reason I say this isn't the most effecient (in potential run time) is that this code combines every record in Have_b with every record in Have_a, tests to see if the Have_a data is the desired interval (if you have 20 records in Have_b and 100 in Have_a that is 20*100 comparisons which can get large quickly => slow run time), groups by the desired interval and then summarizes calculating the mean.
That is very clear, thank you. I am testing this out with a small sample of my data (and it works perfectly), but, the real dataset is very large. I will allow ample time for running.
Desired intervals at the moment vary because of missing data, but with some more steps I can create a standard interval (where some of the date values are missing instead of being skipped).
If there are quick adjustments to make this process more efficient, suggestions welcome.
Thank you so much! Appreciate it.
What does "very large" mean? 1, 10, 100, ... millions of rows?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.