Splitting a SAS dataset into multiple datasets

Frequent Contributor
Posts: 107

Splitting a SAS dataset into multiple datasets


Hello all 

 

Please help.....

 

I have a dataset with 850 variables in it and I would like to split it into multiple datasets by variable name. I was thinking of getting the contents of the table with PROC CONTENTS and putting the variable names into a macro variable which can be used for the split.

 

Please, can someone suggest a better way to do this?

 

Current dataset:    Name Age sex  a_abc   a_xyz   a_chf   a_hhh    b_gfy b_kiy b_gyr c_rrr c_hhh c_kkk

 

Required datasets:         Name age sex a_abc a_xyz a_chf a_hhh

                                       Name age sex b_gfy b_kiy b_gyr

                                       Name age sex c_rrr c_hhh c_kkk

 

This process of splitting should run daily and append to the existing tables.

 

Thank you very much in advance. Looking forward to your suggestions.
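For reference, the PROC CONTENTS route mentioned above could be sketched roughly as follows (dataset names such as have, cont and a_table are just placeholders):

```sas
/* Pull the variable names from the wide table */
proc contents data=have out=cont (keep=name) noprint;
run;

/* Collect the a_ variables into a macro variable        */
/* (the underscore must be escaped, as LIKE treats it    */
/*  as a single-character wildcard)                      */
proc sql noprint;
   select name into :a_vars separated by ' '
   from cont
   where upcase(name) like 'A^_%' escape '^';
quit;

/* Split off one of the required tables */
data a_table;
   set have (keep=name age sex &a_vars.);
run;
```

The same SELECT could be repeated per prefix, or wrapped in a small macro that takes the prefix as a parameter.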

 

 

Super User
Posts: 17,912

Re: Splitting a SAS dataset into multiple datasets

I would use dynamic KEEP statements and develop your list of variables somehow. You can possibly use the colon prefix for variable lists, or you may want to create a dataset that drives your process.

 

Your code will end up looking something like the step below. If you do create a table with the mappings, i.e. the table name and the variable list, I'd recommend CALL EXECUTE to process the appends.

 

proc append base=master_table_name data=new_data (keep = variable_list);

run;

 

i.e. a driver data set, append_daily_task:

master_table   var_list
table1         id b:
table2         id c:
table3         id d:

And then 

data _null_;
   set append_daily_task;
   /* build one PROC APPEND call per target table */
   str = catt("proc append base=", master_table,
              " data=new_data (keep=", var_list, "); run;");
   call execute(str);
run;

 

Super User
Posts: 5,093

Re: Splitting a SAS dataset into multiple datasets


If these are the actual variable names, the splitting would be straightforward:

 

data more_a_vars (keep=name age sex a_: )

     more_b_vars (keep=name age sex b_: )

     more_c_vars (keep=name age sex c_: );

set have;

run;

 

You didn't actually mention the names of the tables that these get appended to.  (In fact, I'm not sure that they exist yet.)  But PROC APPEND would work:

 

proc append base=all_a_vars data=more_a_vars;

run;

proc append base=all_b_vars data=more_b_vars;

run;

proc append base=all_c_vars data=more_c_vars;

run;

Super User
Posts: 7,417

Re: Splitting a SAS dataset into multiple datasets

First off, why do you have a dataset with 850 variables in it?  50 variables would be a lot to my mind, so how have you ended up with so many?  Fix that in the first instance and this question becomes irrelevant.  Maybe use a normalised data structure (parameter/response going down the page).

Secondly, a simple formula: 

number of datasets * hours of coding = time that needs to be spent.

In almost all cases, creating many datasets rather than processing one will result in more coding requirements, and generally take longer to run.  

Just look at your process: a dataset with masses of columns which needs to be split up, then appended back.  Compare that with a normalised dataset which can be appended to directly; which involves more work?
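As a rough illustration of that normalised (parameter/response) structure, the wide table could be flipped into long form with PROC TRANSPOSE; the dataset names here are only placeholders:

```sas
/* PROC TRANSPOSE with BY needs the data sorted by the id variables */
proc sort data=have;
   by name age sex;
run;

/* One row per id/variable combination instead of 850 columns */
proc transpose data=have
               out=long (rename=(_name_=parameter col1=response));
   by name age sex;      /* id variables carried down the page   */
   var a_: b_: c_: ;     /* all prefixed measure variables become rows */
run;
```

The long table can then simply be appended to each day, with no per-prefix splitting at all.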

Frequent Contributor
Posts: 107

Re: Splitting a SAS dataset into multiple datasets

Thanks for your response 

 

Basically we import the data from a SQL table every day, which is enormously wide (850 columns). After we import the data we use a simple merge to update and insert into the history table in SAS (850 variables), which is a very big table (60 million records). This merge is taking a long time.

Now the idea is to split the history table into smaller tables by brand, and to do the same for the table we import from SQL every day. I guess this will make the merge easier?

Super User
Posts: 7,417

Re: Splitting a SAS dataset into multiple datasets

When you say "import the data from a SQL table every day which is enormously wide (850 columns)", do you really mean that all you do is:

select * from sql_table

And that returns a dataset with 850 columns?  I would strongly doubt that is the case; I don't think anyone who uses a database would set up one table with that many columns.  Databases are relational models, hence the idea is to break the data up into the smallest possible chunks, never repeating data items.

 

As for the data, well, 60 million records is going to take time; there is no getting away from that.  The question is, do you really need all that history data, and what is it used for?  Perhaps (just guessing) you do some calculations over the data, say keeping a balance or something for years.  If so, then why not keep the calculated-values dataset, calculate the values on the new data only, then add the summarised data together, rather than combining raw data and then calculating?  There are many ways to do any process.

If, however, you are stuck with that way of doing things, then maybe Base SAS isn't the right tool.  Maybe look at one of SAS's big data tools.
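The "keep the calculated values, not the raw history" idea above might look something like this sketch (table names, the id variable and the balance variable are made up for illustration):

```sas
/* Summarise only today's extract */
proc summary data=new_data nway;
   class id;
   var balance;
   output out=day_summary (drop=_type_ _freq_) sum=;
run;

/* Fold today's totals into the stored running totals */
data combined;
   set running_totals day_summary;
run;

proc summary data=combined nway;
   class id;
   var balance;
   output out=running_totals (drop=_type_ _freq_) sum=;
run;
```

This way only the (much smaller) summary table is rewritten each day, never the 60-million-row raw history.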

Super User
Posts: 17,912

Re: Splitting a SAS dataset into multiple datasets

Define "long" and "easier"?

 

That's a big table no matter how you slice it.

 

Doing smaller merges against a bigger table may not get you a massive reduction in speed. Splitting it up also means more code maintenance, which means more room for errors.

 

Having 850 columns does seem excessive as well. Perhaps it's worth splitting the source and/or possibly transposing? Not that 1 billion rows is going to be much faster. You're definitely in big data territory, so you need to be efficient about your processes.

Respected Advisor
Posts: 3,901

Re: Splitting a SAS dataset into multiple datasets


I assume what you're creating is something like an analytical base table. 850 columns is excessive, but I've seen it before, and it normally happens when you've got an analytics department as your customer and they just ask for "everything" instead of thinking. So I guess you have no choice; it is what it is.

 

We would need to understand your process in more detail to give you detailed advice, so here are just some ideas.

 

Do you know which step in your process takes the most time? And why?

 

My assumptions:

1. single threaded read from SQL server

2. full re-creation of the SAS table as part of a data step merge

 

What you could try:

- store your analytical base table in SAS using the SPDE engine. 

http://support.sas.com/documentation/cdl/en/engspde/67961/HTML/default/viewer.htm#p000szxwmf95tmn1vw...

- create indexes for the keys used for merging

- don't re-create the table but use a modify statement instead to apply changes to your table in place 

https://support.sas.com/documentation/cdl/en/lestmtsref/68024/HTML/default/viewer.htm#n0g9jfr4x5hgsf...

 

...and if your dataset contains quite a few long character variables which are not "filled up" (e.g. $50 but mostly blank or with only a few characters in them), then compression should also be beneficial (compress=yes).
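Putting the SPDE, index and MODIFY ideas together, a hedged sketch might look like this (the library path, key variable cust_id and dataset names are assumptions):

```sas
/* SPDE library for the big analytical base table */
libname spdelib spde '/sasdata/abt';

/* One-off: index the key used for merging */
proc datasets library=spdelib nolist;
   modify history;
   index create cust_id;
quit;

/* Apply the daily changes in place rather than re-creating */
/* the 60-million-row table.  Note: transaction rows with   */
/* no matching key write a note to the log.                 */
data spdelib.history;
   modify spdelib.history daily_extract;
   by cust_id;
run;
```

With MODIFY, only the changed observations are rewritten, which avoids copying the whole table on every daily run.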

 

 

 
