Hi Everybody!
I recently came across acode online that works fantastically well in splitting a big dataset into smaller datasets. However, I was wondering if someone could help me modify it so the smaller datasets are more aligned with what I want.
Right now, the code (see below) splits the big dataset based on one variable (a character variable). What I want to do is split the big dataset by two different variables, the current variable in the code but also by a second variable (not currently in code).
For example, let's assume I have a dataset for the entire United States for six years (2009-2015). I want to split that dataset per state per year (Texas2012).
Here is the code I am using:
%LET TABLE=PQIEX.STELIGENROLL_KEYVAR;
%LET COLUMN=SDA;
PROC SQL NOPRINT;
/* build a mini program for each value */
/* create a table with valid chars from data value */
SELECT DISTINCT
CAT("SDA.PROFILE_",COMPRESS(&COLUMN.,,'KAD'),"(WHERE=(&COLUMN.=",QUOTE(&COLUMN.),"))")
INTO :ALLSTEPS SEPARATED BY ''
FROM &TABLE.;
QUIT;
%PUT NOTE:&ALLSTEPS.;
DATA SDA.&ALLSTEPS.;
SET &TABLE;
RUN;
Some additional information that might be helpful:
1) SAS Version 9.2
2) Library name: PQIEX
3) Dataset: STELIGENROLL_KEYVAR
4) First Splitting Variable: SDA (character variable with 13 different values)
5) Second Splitting Variable (the one I am having trouble incorporating into the code): FFYEAR (numeric variable for the years 2009-2015)
I am fairly new with SAS and just have no clue how to modify this code in order to split the dataset by two variables. My mind starts getting confused on how to treat the numeric variable and how to incorporate it into the name of the new datasets. If y'all can please help me (and have patience with my ignorance), I would greatly appreciate it!
Muchas Gracias
Try the code below.
I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.
Hope it helps,
Ahmed
proc sql noprint;
select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
into :g_outputNames separated by ' '
,:g_conditions separated by ' '
from sashelp.prdsal2
order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/
DATA &g_outputNames;
SET sashelp.prdsal2;
SELECT;
&g_conditions
OTHERWISE;
END;
RUN;
I don't have a lot of time now to look at your code but maybe the cats function could be useful here.
select distinct(cats(state, year))
where ....
I tried doing this but my attemps end in failure. I don't fully grasp how to modify the current code and how it handles the cat function to include the second variable.
Why do you want to split your dataset?
How many records in your big dataset vs each smaller ones (on average).
This isn't usually a recommended step, there are a lot of ways to use data within SAS without splltting it up.
My big dataset has over 300 million observations; my computer takes a long time to do basic things with such a massive dataset. I also need to run specific analysis based on the observations that are in specific region and for specific years. I cannot give you an average of how many observations would be allocated into each smaller dataset since the regions and years vary considerably with their observations.
Ok...that seems somewhat reasonable 🙂
Here's a blog that goes over splitting a data set.
http://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/
One thing to keep in mind, if you have two variables that you're splliting on you can add the two variables you need to design your naming structure of your new datasets with some thought so that you can easily query them later on. If you're likely to use year more often or state most often then that should come first in your naming convention.
I would think with 300 million splitting by only one variable would be enough to obtain efficient results. Perhaps split by one and index by the other? My suggestion would be split by year and then index by state.
Look into using call execute to do this, instead of a macro or in combination with a macro.
I'll try and post some code tonight if someone else hasn't already answered it 🙂
Hi Reeza,
You found the original blog I found the code in! I actually reached out to the individual asking for help and he directed me to posting the question on here.
I really want to split my dataset by region and year because there is a good chance I'll be performing different analyses on them. I understand your point of trying to keep the dataset together as much as possible, but considering all the different moving parts in my project, having a separate dataset per region per year is a lot more efficient.
I do appreciate your help getting this resolved!
So what's your naming convention for the files?
Try the code below.
I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.
Hope it helps,
Ahmed
proc sql noprint;
select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
into :g_outputNames separated by ' '
,:g_conditions separated by ' '
from sashelp.prdsal2
order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/
DATA &g_outputNames;
SET sashelp.prdsal2;
SELECT;
&g_conditions
OTHERWISE;
END;
RUN;
If you really must split, use the code @AhmedAl_Attar provided.
You'll create all your tables in one go rather than reading your large table over and over.
To decide wether to split, you have to weigh:
- the time and storage wasted in splitting the original table
versus
- the time used to create an index and the potential additional time processing subsets of the original table directly (as compared to processing smaller tables)
I also agree with @LinusH that SPDE is worth a try. Its indexes are very fast, and its binary compression very effective.
Maybe duplicating and splitting is the answer. Or maybe not.
I'd try to get the list of distinct state/year combinations via a data step and proc sort nodupkey. select distinct in proc sql can be very resource and time consuming with big datasets.
The code worked perfectly! Thanks for the help Ahmed 🙂
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.