Hi,
I have two questions and I would appreciate if you help me.
1- I have more than 240 data sets with the same variables, each of which contains about 2 millions observations. How can I run the same statement for all of them without merging them together? For example, when I want to define a dummy variable for one of the variables that exists in all data sets, how can I define the dummy variable one time and apply it for all of the data sets?
2- As I mentioned, each of the data sets contains about 2 millions observations. Suppose that he data set contains the information of students of a university. We have three observations: name of the students, Exam#n, scores. How can I ask the SAS to calculate the average scores of each persons in all the exams she has taken and put the results in another column, like average scores?
Thanks,
You can't really just add a column to a table, you end up re-writing the whole table so it's inefficient, which is why I was asking what the rest of your code was.
anyways, you can use proc sql to add in the average:
proc sql;
create table want as
select *, avg(age) as avg_age
from sashelp.class
group by sex;
quit;
proc print; run;
As for #1... use a macro. Say you have datasets ds001, ds002, ab01b and you want to do the dummy variable assignment on all of them. Here is what your code may look like:
%MACRO repeat(dslist=);
%DO i=1 %TO %SYSFUNC(COUNTW(&dslist));
%LET ds = %SCAN(&dslist, &i);
DATA &ds;
SET &ds;
SELECT (var1);
WHEN ("Yes") dum1=1;
WHEN("No") dum2=1;
OTHERWISE dum3=1;
END;
RUN;
%END;
%MEND repeat;
Calling the macro:
%repeat(dslist= ds001 ds002 ab01b);
Or you could open the directory where all of the files are in a null data step and then call a similar macro definition except without the DO LOOP.
Message was edited by: Lauren Parlett
How you have access to the names of all 240 data sets will be a key as to how to efficiently call the macro's
1. Why do you need a dummy variable? SAS has class statements that allow you to avoid having to create dummy variables usually.
2. Did you need the average score added in to the original 2 million row data set or to a different data set.
1- Thank you for mentioning the class statement. Very helpful! But I still need to call some certain observations from every data set, for example to do some statistic analysis on them.
2- I need them to add in the original data sets.
How many variables sizing more detailed estimates for the output.
The class statement is a good one an d may there are more to think on.
I made just a mistake and my test dataset (UE) became 5Gn with 130M records still running within 10 minutes.
Better to develop something with small data with bug data in mind. 480M records start feeling as big data.
44 variables. I am working with just one of the data sets, but I need to write my code in a way that can work for all data sets.
You can add them in view with where processing ..... Please do a more complete description.
240 datesets 2M records each 3 variables name-student Exam score (type & length)
What certain observations do you need (where processing?) what is some statistics adding (proc append? or adding/updating a column)
What are your hardware restrcitions. SAs version.
student case was just an example. Each data set contains 44 VARs.
Perhaps post your code for your one data set and we can help generalize. Again, how do you plan to refer to each of the 240 data sets, is there a naming convention of sorts? I'm strongly against adding 44 new variables to datasets that are 20 million in size already so if you explain more perhaps there maybe more efficient methods to accomplish what you want.
The data sets contains some private information. However the logic is the same as student case below:
Name | Number of exam | Score | Average Score |
Alice | 1 | 80 | |
Alice | 2 | 100 | |
Alice | 3 | 90 | |
Alice | 4 | 70 | |
Peter | 1 | 75 | |
Peter | 2 | 71 | |
Peter | 3 | 67 | |
Peter | 4 | 63 | |
Peter | 5 | 59 | |
John | 1 | 55 | |
John | 2 | 51 | |
John | 3 | 47 | |
John | 4 | 43 | |
John | 5 | 39 | |
John | 6 | 35 |
What I want to do is calculating the average score for each person and putting it in the average score column.
The data set contains 44 VARs, but I need to do this operation just for one of them. So, I will add just one more variable.
You can't really just add a column to a table, you end up re-writing the whole table so it's inefficient, which is why I was asking what the rest of your code was.
anyways, you can use proc sql to add in the average:
proc sql;
create table want as
select *, avg(age) as avg_age
from sashelp.class
group by sex;
quit;
proc print; run;
Thank you.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.