BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
m1986MM
Obsidian | Level 7

Hi,

I have two questions and I would appreciate if you help me.

1- I have more than 240 data sets with the same variables, each of which contains about 2 millions observations. How can I run the same statement for all of them without merging them together? For example, when I want to define a dummy variable for one of the variables that exists in all data sets, how can I define the dummy variable one time and apply it for all of the data sets?

2- As I mentioned, each of the data sets contains about 2 millions observations. Suppose that he data set contains the information of students of a university. We have three observations: name of the students, Exam#n, scores. How can I ask the SAS to calculate the average scores of each persons in all the exams she has taken and put the results in another column, like average scores?

Thanks,

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

You can't really just add a column to a table, you end up re-writing the whole table so it's inefficient, which is why I was asking what the rest of your code was.

anyways, you can use proc sql to add in the average:

proc sql;

create table want as

select *, avg(age) as avg_age

from sashelp.class

group by sex;

quit;

proc print; run;

View solution in original post

11 REPLIES 11
bailunrui
Fluorite | Level 6

As for #1... use a macro. Say you have datasets ds001, ds002, ab01b and you want to do the dummy variable assignment on all of them. Here is what your code may look like:

%MACRO repeat(dslist=);

%DO i=1 %TO %SYSFUNC(COUNTW(&dslist));

     %LET ds = %SCAN(&dslist, &i);

     DATA &ds;

          SET &ds;

          SELECT (var1);

               WHEN ("Yes") dum1=1;

               WHEN("No")   dum2=1;

               OTHERWISE dum3=1;

          END;

     RUN;

%END;

%MEND repeat;

Calling the macro:

%repeat(dslist= ds001 ds002 ab01b);

Or you could open the directory where all of the files are in a null data step and then call a similar macro definition except without the DO LOOP.

Message was edited by: Lauren Parlett

Reeza
Super User

How you have access to the names of all 240 data sets will be a key as to how to efficiently call the macro's

1. Why do you need a dummy variable? SAS has class statements that allow you to avoid having to create dummy variables usually.

2. Did you need the average score added in to the original 2 million row data set or to a different data set.

m1986MM
Obsidian | Level 7

1- Thank you for mentioning the class statement. Very helpful! But I still need to call some certain observations from every data set, for example to do some statistic analysis on them.

2- I need them to add in the original data sets.

jakarman
Barite | Level 11

How many variables sizing more detailed estimates for the output.

The class statement is a good one an d may there are more to think on.

I made just a mistake and my test dataset (UE) became 5Gn with 130M records still running within 10 minutes.

Better to develop something with small data with bug data in mind. 480M records start feeling as big data. 

---->-- ja karman --<-----
m1986MM
Obsidian | Level 7

44 variables. I am working with just one of the data sets, but I need to write my code in a way that can work for all data sets.

jakarman
Barite | Level 11

You can add them in view with where processing .....  Please do a more complete description.

240 datesets 2M records each 3 variables name-student Exam score (type & length)

What certain observations do you need (where processing?)  what is some statistics  adding (proc append? or adding/updating a column)

What are your hardware restrcitions. SAs version.

---->-- ja karman --<-----
m1986MM
Obsidian | Level 7

student case was just an example. Each data set contains 44 VARs.

Reeza
Super User

Perhaps post your code for your one data set and we can help generalize. Again, how do you plan to refer to each of the 240 data sets, is there a naming convention of sorts? I'm strongly against adding 44 new variables to datasets that are 20 million in size already so if you explain more perhaps there maybe more efficient methods to accomplish what you want.

m1986MM
Obsidian | Level 7

The data sets contains some private information. However the logic is the same as student case below:

NameNumber
of exam
ScoreAverage Score
Alice 180
Alice 2100
Alice 390
Alice 470
Peter175
Peter271
Peter367
Peter463
Peter559
John155
John251
John347
John443
John539
John635

What I want to do is calculating the average score for each person and putting it in the average score column.

The data set contains 44 VARs, but I need to do this operation just for one of them. So, I will add just one more variable.

Reeza
Super User

You can't really just add a column to a table, you end up re-writing the whole table so it's inefficient, which is why I was asking what the rest of your code was.

anyways, you can use proc sql to add in the average:

proc sql;

create table want as

select *, avg(age) as avg_age

from sashelp.class

group by sex;

quit;

proc print; run;

m1986MM
Obsidian | Level 7

Thank you.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 8796 views
  • 10 likes
  • 4 in conversation