DATA Step, Macro, Functions and more

Splitting big dataset into smaller dataset

Accepted Solution Solved
Reply
Contributor
Posts: 45
Accepted Solution

Splitting big dataset into smaller dataset

Hi Everybody!

 

I recently came across acode online that works fantastically well in splitting a big dataset into smaller datasets. However, I was wondering if someone could help me modify it so the smaller datasets are more aligned with what I want.

 

Right now, the code (see below) splits the big dataset based on one variable (a character variable). What I want to do is split the big dataset by two different variables, the current variable in the code but also by a second variable (not currently in code).

 

For example, let's assume I have a dataset for the entire United States for six years (2009-2015). I want to split that dataset per state per year (Texas2012).

 

Here is the code I am using: 

 

%LET TABLE=PQIEX.STELIGENROLL_KEYVAR;
%LET COLUMN=SDA;
PROC SQL NOPRINT;
/* build a mini program for each value */
/* create a table with valid chars from data value */
SELECT DISTINCT
CAT("SDA.PROFILE_",COMPRESS(&COLUMN.,,'KAD'),"(WHERE=(&COLUMN.=",QUOTE(&COLUMN.),"))")
INTO :ALLSTEPS SEPARATED BY ''
FROM &TABLE.;
QUIT;
%PUT NOTE:&ALLSTEPS.;

DATA SDA.&ALLSTEPS.;
SET &TABLE;
RUN;

 

Some additional information that might be helpful: 

 

1) SAS Version 9.2

2) Library name: PQIEX

3) Dataset: STELIGENROLL_KEYVAR

4) First Splitting Variable: SDA (character variable with 13 different values)

5) Second Splitting Variable (the one I am having trouble incorporating into the code): FFYEAR (numeric variable for the years 2009-2015)

 

I am fairly new with SAS and just have no clue how to modify this code in order to split the dataset by two variables. My mind starts getting confused on how to treat the numeric variable and how to incorporate it into the name of the new datasets. If y'all can please help me (and have patience with my ignorance), I would greatly appreciate it!

 

Muchas Gracias

 


Accepted Solutions
Solution
‎04-06-2016 10:42 AM
Regular Contributor
Posts: 222

Re: Splitting big dataset into smaller dataset

Posted in reply to TXSASneophyte

Try the code below.

I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.

 

Hope it helps,

Ahmed

 

proc sql noprint;
    select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
            ,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
    into    :g_outputNames separated by ' '
            ,:g_conditions separated by ' '
    from    sashelp.prdsal2
    order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/

DATA &g_outputNames;
    SET sashelp.prdsal2;
    SELECT;
        &g_conditions
        OTHERWISE;
    END;
RUN;

View solution in original post


All Replies
Occasional Contributor
Posts: 11

Re: Splitting big dataset into smaller dataset

Posted in reply to TXSASneophyte

 

I don't have a lot of time now to look at your code but maybe the cats function could be useful here. 

 

select distinct(cats(state, year))

where ....

 

 

Contributor
Posts: 45

Re: Splitting big dataset into smaller dataset

Posted in reply to Julie4435637

I tried doing this but my attemps end in failure. I don't fully grasp how to modify the current code and how it handles the cat function to include the second variable. 

Super User
Posts: 19,878

Re: Splitting big dataset into smaller dataset

Posted in reply to TXSASneophyte

Why do you want to split your dataset?

How many records in your big dataset vs each smaller ones (on average). 

 

This isn't usually a recommended step, there are a lot of ways to use data within SAS without splltting it up. 

Contributor
Posts: 45

Re: Splitting big dataset into smaller dataset

My big dataset has over 300 million observations; my computer takes a long time to do basic things with such a massive dataset. I also need to run specific analysis based on the observations that are in specific region and for specific years. I cannot give you an average of how many observations would be allocated into each smaller dataset since the regions and years vary considerably with their observations. 

Super User
Posts: 5,441

Re: Splitting big dataset into smaller dataset

Posted in reply to TXSASneophyte
A combined index on year and state would probably give a perform boost (if that's a normal query).
Also, use SPDE instead of the BASE engine could speed up IO.
Investing in faster disks and more RAM, depending in your common query pattern of course. HW is cheap, man power isn't.
Data never sleeps
Super User
Posts: 19,878

Re: Splitting big dataset into smaller dataset

Ok...that seems somewhat reasonable Smiley Happy

 

Here's a blog that goes over splitting a data set. 

 

http://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

 

One thing to keep in mind, if you have two variables that you're splliting on you can add the two variables you need to design your naming structure of your new datasets with some thought so that you can easily query them later on. If  you're likely to use year more often or state most often then that should come first in your naming convention.  

 

I would think with 300 million splitting by only one variable would be enough to obtain efficient results. Perhaps split by one and index by the other?  My suggestion would be split by year and then index by state. 

 

Look into using call execute to do this, instead of a macro or in combination with a macro. 

 

I'll try and post some code tonight if someone else hasn't already answered it Smiley Happy

Contributor
Posts: 45

Re: Splitting big dataset into smaller dataset

Hi Reeza,

 

You found the original blog I found the code in! I actually reached out to the individual asking for help and he directed me to posting the question on here. 

I really want to split my dataset by region and year because there is a good chance I'll be performing different analyses on them. I understand your point of trying to keep the dataset together as much as possible, but considering all the different moving parts in my project, having a separate dataset per region per year is a lot more efficient.  

 

I do appreciate your help getting this resolved!

Super User
Posts: 19,878

Re: Splitting big dataset into smaller dataset

Posted in reply to TXSASneophyte

So what's your naming convention for the files?

Solution
‎04-06-2016 10:42 AM
Regular Contributor
Posts: 222

Re: Splitting big dataset into smaller dataset

Posted in reply to TXSASneophyte

Try the code below.

I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.

 

Hope it helps,

Ahmed

 

proc sql noprint;
    select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
            ,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
    into    :g_outputNames separated by ' '
            ,:g_conditions separated by ' '
    from    sashelp.prdsal2
    order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/

DATA &g_outputNames;
    SET sashelp.prdsal2;
    SELECT;
        &g_conditions
        OTHERWISE;
    END;
RUN;

PROC Star
Posts: 1,760

Re: Splitting big dataset into smaller dataset

Posted in reply to AhmedAl_Attar

If you really must split, use the code @AhmedAl_Attar provided.

 

You'll create all your tables in one go rather than reading your large table over and over.

 

To decide wether to split, you have to weigh:

- the time and storage wasted in splitting the original table

versus

- the time used to create an index and the potential additional time processing subsets of the original table directly (as compared to processing smaller tables)

 

I also agree with @LinusH that SPDE is worth a try. Its indexes are very fast, and its binary compression very effective.

 

Maybe duplicating and splitting is the answer. Or maybe not.

 

Super User
Posts: 7,868

Re: Splitting big dataset into smaller dataset

Posted in reply to AhmedAl_Attar

I'd try to get the list of distinct state/year combinations via a data step and proc sort nodupkey. select distinct in proc sql can be very resource and time consuming with big datasets.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 45

Re: Splitting big dataset into smaller dataset

Posted in reply to AhmedAl_Attar

The code worked perfectly! Thanks for the help Ahmed Smiley Happy

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 454 views
  • 2 likes
  • 7 in conversation