Solved: Re: How to run the same code for different data sets?

m1986MM · Posted 10-08-2014 02:07 PM

Hi,

I have two questions and I would appreciate if you help me.

1- I have more than 240 data sets with the same variables, each of which contains about 2 millions observations. How can I run the same statement for all of them without merging them together? For example, when I want to define a dummy variable for one of the variables that exists in all data sets, how can I define the dummy variable one time and apply it for all of the data sets?

2- As I mentioned, each of the data sets contains about 2 millions observations. Suppose that he data set contains the information of students of a university. We have three observations: name of the students, Exam#n, scores. How can I ask the SAS to calculate the average scores of each persons in all the exams she has taken and put the results in another column, like average scores?

Thanks,

Reeza · Posted 10-08-2014 08:16 PM

You can't really just add a column to a table, you end up re-writing the whole table so it's inefficient, which is why I was asking what the rest of your code was.

anyways, you can use proc sql to add in the average:

proc sql;

create table want as

select *, avg(age) as avg_age

from sashelp.class

group by sex;

quit;

proc print; run;

View solution in original post

bailunrui · Posted 10-08-2014 02:24 PM

As for #1... use a macro. Say you have datasets ds001, ds002, ab01b and you want to do the dummy variable assignment on all of them. Here is what your code may look like:

%MACRO repeat(dslist=);

%DO i=1 %TO %SYSFUNC(COUNTW(&dslist));

%LET ds = %SCAN(&dslist, &i);

DATA &ds;

SET &ds;

SELECT (var1);

WHEN ("Yes") dum1=1;

WHEN("No") dum2=1;

OTHERWISE dum3=1;

END;

RUN;

%END;

%MEND repeat;

Calling the macro:

%repeat(dslist= ds001 ds002 ab01b);

Or you could open the directory where all of the files are in a null data step and then call a similar macro definition except without the DO LOOP.

Message was edited by: Lauren Parlett

Reeza · Posted 10-08-2014 03:00 PM

How you have access to the names of all 240 data sets will be a key as to how to efficiently call the macro's

1. Why do you need a dummy variable? SAS has class statements that allow you to avoid having to create dummy variables usually.

2. Did you need the average score added in to the original 2 million row data set or to a different data set.

m1986MM · Posted 10-08-2014 05:28 PM

1- Thank you for mentioning the class statement. Very helpful! But I still need to call some certain observations from every data set, for example to do some statistic analysis on them.

2- I need them to add in the original data sets.

jakarman · Posted 10-08-2014 05:11 PM

How many variables sizing more detailed estimates for the output.

The class statement is a good one an d may there are more to think on.

I made just a mistake and my test dataset (UE) became 5Gn with 130M records still running within 10 minutes.

Better to develop something with small data with bug data in mind. 480M records start feeling as big data.

---->-- ja karman --<-----

m1986MM · Posted 10-08-2014 05:39 PM

44 variables. I am working with just one of the data sets, but I need to write my code in a way that can work for all data sets.

jakarman · Posted 10-08-2014 05:40 PM

You can add them in view with where processing ..... Please do a more complete description.

240 datesets 2M records each 3 variables name-student Exam score (type & length)

What certain observations do you need (where processing?) what is some statistics adding (proc append? or adding/updating a column)

What are your hardware restrcitions. SAs version.

---->-- ja karman --<-----

m1986MM · Posted 10-08-2014 06:35 PM

student case was just an example. Each data set contains 44 VARs.

Reeza · Posted 10-08-2014 07:04 PM

Perhaps post your code for your one data set and we can help generalize. Again, how do you plan to refer to each of the 240 data sets, is there a naming convention of sorts? I'm strongly against adding 44 new variables to datasets that are 20 million in size already so if you explain more perhaps there maybe more efficient methods to accomplish what you want.

m1986MM · Posted 10-08-2014 07:56 PM

The data sets contains some private information. However the logic is the same as student case below:

Name	Number of exam	Score	Average Score
Alice	1	80
Alice	2	100
Alice	3	90
Alice	4	70
Peter	1	75
Peter	2	71
Peter	3	67
Peter	4	63
Peter	5	59
John	1	55
John	2	51
John	3	47
John	4	43
John	5	39
John	6	35

What I want to do is calculating the average score for each person and putting it in the average score column.

The data set contains 44 VARs, but I need to do this operation just for one of them. So, I will add just one more variable.

Reeza · Posted 10-08-2014 08:16 PM

You can't really just add a column to a table, you end up re-writing the whole table so it's inefficient, which is why I was asking what the rest of your code was.

anyways, you can use proc sql to add in the average:

proc sql;

create table want as

select *, avg(age) as avg_age

from sashelp.class

group by sex;

quit;

proc print; run;

m1986MM · Posted 10-08-2014 08:38 PM

Thank you.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away