BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
TXSASneophyte
Obsidian | Level 7

Hi Everybody!

 

I recently came across acode online that works fantastically well in splitting a big dataset into smaller datasets. However, I was wondering if someone could help me modify it so the smaller datasets are more aligned with what I want.

 

Right now, the code (see below) splits the big dataset based on one variable (a character variable). What I want to do is split the big dataset by two different variables, the current variable in the code but also by a second variable (not currently in code).

 

For example, let's assume I have a dataset for the entire United States for six years (2009-2015). I want to split that dataset per state per year (Texas2012).

 

Here is the code I am using: 

 

%LET TABLE=PQIEX.STELIGENROLL_KEYVAR;
%LET COLUMN=SDA;
PROC SQL NOPRINT;
/* build a mini program for each value */
/* create a table with valid chars from data value */
SELECT DISTINCT
CAT("SDA.PROFILE_",COMPRESS(&COLUMN.,,'KAD'),"(WHERE=(&COLUMN.=",QUOTE(&COLUMN.),"))")
INTO :ALLSTEPS SEPARATED BY ''
FROM &TABLE.;
QUIT;
%PUT NOTE:&ALLSTEPS.;

DATA SDA.&ALLSTEPS.;
SET &TABLE;
RUN;

 

Some additional information that might be helpful: 

 

1) SAS Version 9.2

2) Library name: PQIEX

3) Dataset: STELIGENROLL_KEYVAR

4) First Splitting Variable: SDA (character variable with 13 different values)

5) Second Splitting Variable (the one I am having trouble incorporating into the code): FFYEAR (numeric variable for the years 2009-2015)

 

I am fairly new with SAS and just have no clue how to modify this code in order to split the dataset by two variables. My mind starts getting confused on how to treat the numeric variable and how to incorporate it into the name of the new datasets. If y'all can please help me (and have patience with my ignorance), I would greatly appreciate it!

 

Muchas Gracias

 

1 ACCEPTED SOLUTION

Accepted Solutions
AhmedAl_Attar
Ammonite | Level 13

Try the code below.

I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.

 

Hope it helps,

Ahmed

 

proc sql noprint;
    select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
            ,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
    into    :g_outputNames separated by ' '
            ,:g_conditions separated by ' '
    from    sashelp.prdsal2
    order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/

DATA &g_outputNames;
    SET sashelp.prdsal2;
    SELECT;
        &g_conditions
        OTHERWISE;
    END;
RUN;

View solution in original post

12 REPLIES 12
Julie4435637
Obsidian | Level 7

 

I don't have a lot of time now to look at your code but maybe the cats function could be useful here. 

 

select distinct(cats(state, year))

where ....

 

 

TXSASneophyte
Obsidian | Level 7

I tried doing this but my attemps end in failure. I don't fully grasp how to modify the current code and how it handles the cat function to include the second variable. 

Reeza
Super User

Why do you want to split your dataset?

How many records in your big dataset vs each smaller ones (on average). 

 

This isn't usually a recommended step, there are a lot of ways to use data within SAS without splltting it up. 

TXSASneophyte
Obsidian | Level 7

My big dataset has over 300 million observations; my computer takes a long time to do basic things with such a massive dataset. I also need to run specific analysis based on the observations that are in specific region and for specific years. I cannot give you an average of how many observations would be allocated into each smaller dataset since the regions and years vary considerably with their observations. 

LinusH
Tourmaline | Level 20
A combined index on year and state would probably give a perform boost (if that's a normal query).
Also, use SPDE instead of the BASE engine could speed up IO.
Investing in faster disks and more RAM, depending in your common query pattern of course. HW is cheap, man power isn't.
Data never sleeps
Reeza
Super User

Ok...that seems somewhat reasonable 🙂

 

Here's a blog that goes over splitting a data set. 

 

http://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

 

One thing to keep in mind, if you have two variables that you're splliting on you can add the two variables you need to design your naming structure of your new datasets with some thought so that you can easily query them later on. If  you're likely to use year more often or state most often then that should come first in your naming convention.  

 

I would think with 300 million splitting by only one variable would be enough to obtain efficient results. Perhaps split by one and index by the other?  My suggestion would be split by year and then index by state. 

 

Look into using call execute to do this, instead of a macro or in combination with a macro. 

 

I'll try and post some code tonight if someone else hasn't already answered it 🙂

TXSASneophyte
Obsidian | Level 7

Hi Reeza,

 

You found the original blog I found the code in! I actually reached out to the individual asking for help and he directed me to posting the question on here. 

I really want to split my dataset by region and year because there is a good chance I'll be performing different analyses on them. I understand your point of trying to keep the dataset together as much as possible, but considering all the different moving parts in my project, having a separate dataset per region per year is a lot more efficient.  

 

I do appreciate your help getting this resolved!

Reeza
Super User

So what's your naming convention for the files?

AhmedAl_Attar
Ammonite | Level 13

Try the code below.

I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.

 

Hope it helps,

Ahmed

 

proc sql noprint;
    select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
            ,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
    into    :g_outputNames separated by ' '
            ,:g_conditions separated by ' '
    from    sashelp.prdsal2
    order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/

DATA &g_outputNames;
    SET sashelp.prdsal2;
    SELECT;
        &g_conditions
        OTHERWISE;
    END;
RUN;

ChrisNZ
Tourmaline | Level 20

If you really must split, use the code @AhmedAl_Attar provided.

 

You'll create all your tables in one go rather than reading your large table over and over.

 

To decide wether to split, you have to weigh:

- the time and storage wasted in splitting the original table

versus

- the time used to create an index and the potential additional time processing subsets of the original table directly (as compared to processing smaller tables)

 

I also agree with @LinusH that SPDE is worth a try. Its indexes are very fast, and its binary compression very effective.

 

Maybe duplicating and splitting is the answer. Or maybe not.

 

Kurt_Bremser
Super User

I'd try to get the list of distinct state/year combinations via a data step and proc sort nodupkey. select distinct in proc sql can be very resource and time consuming with big datasets.

TXSASneophyte
Obsidian | Level 7

The code worked perfectly! Thanks for the help Ahmed 🙂

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 4115 views
  • 2 likes
  • 7 in conversation