BookmarkSubscribeRSS Feed
sasecn
Quartz | Level 8

I have a data set with hundreds of variables and millions of observations. I need to constantly modify this data (about 30 GB). For example, creating a new variable by take sum of two existing variables, etc. I used the following data step:

 

data mydata;

  set mydata;

  new_var=var1+var2;

run;

 

My understanding is that even the data has the same name, but SAS is creating a new data based on the original one. This will take longer time every time since the data size is large. I am wondering if there is other more efficient ways to add new variable to existing data. 

 

6 REPLIES 6
Reeza
Super User

Not really, once you're creating new variables it will recreate the data. SQL seems like it doesn't using insert, but it still does as well. 

 

What type of efficiency are you trying to gain here? Are you trying to speed up your processing to reduce wait time while programming or your time?

sasecn
Quartz | Level 8

want to reduce waiting time Smiley Happy

Reeza
Super User

I don't know enough to do that 😞 But the approaches I use:

 

1. hash tables for joins, assuming you have enough memory

2. Limit the number of obs when testing code, I'll usually only work with about 1-2 million observations, enough to have all the things I'm likely to encounter. If you use that, be very careful with sorts - especially ones without OUT data sets. It will create a data set with only 1000000 records. 

option obs=1000000;

3. Make sure  to do multiple steps together rather than have 5 different data steps that each create a single variable.

4. Create indexes to make sure the usage afterwards is quick. 

 


@sasecn wrote:

want to reduce waiting time Smiley Happy


 

sasecn
Quartz | Level 8

Thanks for the ideas!

ballardw
Super User

@sasecn wrote:

I have a data set with hundreds of variables and millions of observations. I need to constantly modify this data (about 30 GB). For example, creating a new variable by take sum of two existing variables, etc. I used the following data step:

 

 


You might look into WHY you have to "constantly modify" existing data. Doing so is often indicative of poor process planning.

Are the new variables used frequently or just once?

 

If for a specific analysis task it is often preferable to extract only the variables needed and then add the variables just for that task. Such as:

 

Data analysistask;

   set mydata (keep= var1 var2 <other variables needed for the specific task>);

   new_var = sum(var1, var2); /* or var1+var2 if you know why there is a difference between SUM and +*/

run;

and use the analysistask set for the next steps.

Another approach instead of creating a huge data set might be to create a view from the existing data. The view basically has the code to create stuff from the current version of the data set and trades execution time for storage space

Data analysistask /view=analysistask;

   set mydata (keep= var1 var2 <other variables needed for the specific task>);

   new_var = sum(var1, var2); /* or var1+var2 if you know why there is a difference between SUM and +*/

run;

 

Creates a reference that will read the current set mydata.

Then something like

 

Proc Reg data=analysistask;

<proc options>

run;

in effect reads Mydata, adds the new_var and then uses the result in the procedure. Which will be slower to execute but doesn't have the opportunity to overwrite existing data.

 

BTW habitual use of the structure:

Data mydata;

   set mydata;

  <modification code>

;

will eventually lead to case where one or more variables has values you do not expect and you will have a lot of fun trying to recover data.

sasecn
Quartz | Level 8

Thanks for the reply! Good thoughts!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1880 views
  • 1 like
  • 3 in conversation