09-29-2016 03:36 PM - edited 09-29-2016 03:46 PM
I have a data set with 850 variables in it, and I would like to split it into multiple data sets by variable name. I was thinking of getting the variable names with PROC CONTENTS and putting them into a macro variable that could drive the split.
Can someone suggest a better way to do this?
Current dataset: Name Age Sex a_abc a_xyz a_chf a_hhh b_gfy b_kiy b_gyr c_rrr c_hhh c_kkk
Required datasets:
- Name Age Sex a_abc a_xyz a_chf a_hhh
- Name Age Sex b_gfy b_kiy b_gyr
- Name Age Sex c_rrr c_hhh c_kkk
This process of splitting should run daily and append to the existing tables.
Thank you very much in advance. Looking forward to your suggestions.
09-29-2016 03:50 PM
I would use dynamic KEEP statements and build your list of variables somehow. You can possibly use the colon prefix for variable lists, or you may want to create a data set that drives your process.
Your code will end up being something like the PROC APPEND below. If you do create a table with the mappings, i.e. master table name and variable list, I'd recommend CALL EXECUTE to process the appends.
proc append base=master_table_name data=new_data (keep=variable_list);
run;
I.e. a driver data set, append_daily_task:

master_table  var_list
table1        id b:
table2        id c:
table3        id d:

data _null_;
    set append_daily_task;
    str = catt("proc append base=", master_table,
               " data=dataset (keep=", var_list, "); run;");
    call execute(str);
run;
09-29-2016 03:52 PM - edited 09-29-2016 03:52 PM
If these are the actual variable names, the splitting would be straightforward:
data more_a_vars (keep=name age sex a_: )
     more_b_vars (keep=name age sex b_: )
     more_c_vars (keep=name age sex c_: );
    set have;   /* "have" stands in for your incoming daily data set */
run;
You didn't actually mention the names of the tables that these get appended to. (In fact, I'm not sure that they exist yet.) But PROC APPEND would work:
proc append base=all_a_vars data=more_a_vars;
run;
proc append base=all_b_vars data=more_b_vars;
run;
proc append base=all_c_vars data=more_c_vars;
run;
09-30-2016 04:59 AM
First off, why do you have a data set with 850 variables in it? 50 variables would be many to my mind, so how have you ended up with so many? Fix that in the first instance and this question becomes irrelevant. Maybe use a normalised data structure (parameter/response going down the page).
Secondly, a simple formula:
number of datasets * hours of coding = time that needs to be spent.
In almost all cases, creating many datasets rather than processing one will result in more coding requirements, and generally take longer to run.
Just look at your process: a data set with masses of columns which needs to be split up, then appended back. Conversely, a normalised data set can be appended to directly. Which one is more work?
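To make the normalised alternative concrete, here is a minimal sketch that turns the wide record into parameter/response rows with PROC TRANSPOSE (the data set names have and have_long are assumed; the a_:/b_:/c_: prefixes come from the original post):

```
/* BY-group transpose needs sorted input */
proc sort data=have;
    by name age sex;
run;

/* One row per (person, variable): a long, normalised structure */
proc transpose data=have
               out=have_long (rename=(_name_=parameter col1=response));
    by name age sex;
    var a_: b_: c_: ;
run;
```

The daily job then becomes a single PROC APPEND of the long table, with no per-brand splitting at all.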
09-30-2016 06:25 AM
Thanks for your response.
Basically we import the data from a SQL table every day, and it is enormously wide (850 columns). After we import the data we use a simple merge to update and insert into the history table in SAS (850 variables), which is a very big table (60 million records). This merge is taking a long time.
The idea now is to split the history table into smaller tables by brand, and to do the same for the table we import from SQL every day. I guess this will make the merge easier?
09-30-2016 06:47 AM
When you say "import the data from a SQL table everyday which is enormously wide(850 columns)", do you really mean that all you do is:
select * from sql_table
And that returns a data set with 850 columns? I would strongly doubt that is the case. I don't think anyone who uses a database would set up one table with that many columns; databases are relational models, so the idea is to break the data up into the smallest possible chunks, never repeating data items.
As for the data, well, 60 million records is going to take time; there is no getting away from that. The question is, do you really need all that history data, and what is it used for? Perhaps (just guessing) you do some calculations over the data, say keeping a balance or something for years. If so, why not keep a data set of the calculated values, calculate the values on the new data only, and then add the summarised data together, rather than combining the raw data and then calculating? There are many ways to do any process.
If however you are stuck with that way of doing things, then maybe Base SAS isn't the right tool. Maybe look at one of SAS's big data tools.
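To sketch the summarise-then-combine idea from above (all names here, daily_extract, account_id, balance, history_summary, are hypothetical):

```
/* Summarise only the small daily extract... */
proc summary data=daily_extract nway;
    class account_id;
    var balance;
    output out=daily_summary (drop=_type_ _freq_) sum=;
run;

/* ...and append it to the running summary table, instead of
   merging the raw daily rows into 60 million history records */
proc append base=history_summary data=daily_summary;
run;
```

If the same account can appear on multiple days you would re-aggregate history_summary periodically, but the heavy lifting stays off the 60-million-row raw table.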
09-30-2016 06:49 AM
Define "long" and "easier"?
That's a big table no matter how you slice it.
Doing smaller merges against a big table may not get you a massive reduction in speed. Splitting it up also means more code maintenance, which means more room for errors.
Having 850 columns does seem excessive as well. Perhaps it is worth splitting the source and/or transposing? Not that 1 billion rows is going to be much faster. You're definitely in big data territory, so you need to be efficient about your processes.
10-01-2016 08:18 PM - edited 10-01-2016 08:22 PM
I assume what you're creating is something like an analytical base table. 850 columns is excessive but I've seen that before and it normally happens when you've got an analytics department as your customer and they are just asking for "everything" instead of thinking. So I guess you have no choice but it is what it is.
We would need to understand your process more in detail to give you detailed advice so here just some ideas.
Do you know which step in your process takes the most time? And why?
1. single threaded read from SQL server
2. full re-creation of the SAS table as part of a data step merge
What you could try:
- store your analytical base table in SAS using the SPDE engine.
- create indexes for the keys used for merging
- don't re-create the table but use a modify statement instead to apply changes to your table in place
...and if your data set contains quite a few long character variables which are not "filled up" (e.g. $50 but mostly blank or with only a few characters in them), then compression should also be beneficial (compress=yes).
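A rough sketch putting those three suggestions together (all paths and names, spdelib, history, daily_extract, cust_id, are placeholders):

```
/* 1. SPDE library with compression; metadata and data paths are placeholders */
libname spdelib spde '/sas/meta'
        datapath=('/sas/data1' '/sas/data2')
        compress=yes;

/* 2. One-off: index the merge key on the big history table */
proc datasets library=spdelib nolist;
    modify history;
    index create cust_id;
quit;

/* 3. Daily update in place: MODIFY with KEY= changes matching rows
      without re-creating the 60-million-row table */
data spdelib.history;
    set work.daily_extract;
    modify spdelib.history key=cust_id;
    if _iorc_ = 0 then replace;   /* key found: update in place */
    else do;                      /* key not found: add the row */
        _error_ = 0;
        output;
    end;
run;
```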