BookmarkSubscribeRSS Feed
csessa3
Calcite | Level 5

Hello,

 

I would like to make state indicator dummy variables for a large dataset (n > 600,000) I have found documentation suggesting to run a prog glm or logistic. This method works for me if I reduce the sample size, but it will not run on my computer with the entire sample.

 

I have also hard coded it before, but am hoping to find something more efficient as it is making my code very long and difficult to sift through.

 

I have also seen the following code, but I am receiving an error saying "Array Subscript out of range." I have yet to decipher what that means.

 

This is the code I have most recently tried:

data indicators;
set miss_states1;
array dummys {*} st1 - st2;
do i=1 to DIM(dummys);
	dummys(i) = 0;
end;
dummys(state) = 1;
run;

Thank you in advance to anyone who can help!

16 REPLIES 16
ghosh
Barite | Level 11

The error is in this line

dummys(state) = 1;

 

csessa3
Calcite | Level 5

Hello, thank you for identifying the issue. Do you know how to correct it? What is wrong with it?

Astounding
PROC Star
Something shorter and more efficient? Here's an idea. Do nothing. Leave your data just as it is.

Both GLM and LOGISTIC support the CLASS statement which let's the procedure create the dummy variables for you.
csessa3
Calcite | Level 5

Hello. I would be fine using the logistic class statement, but it will not run for my large sample. Which is why I am looking for an alternate solution

Reeza
Super User
Have you tried the approaches outlined here, specifically GLMSELECT?
My concern is if that doesn't run, your actual model won't run either?

Could you create a small sample that has all the values using PROC SURVEYSELECT and STRATA to ensure your categorical variables are included and then use one of the shown methods?
csessa3
Calcite | Level 5

The GLMSELECT and the proc logistic work for creating the categorical variables when the sample size is reduced. I have previously hard coded the state indicators and run my final regression model with no issue, so I am not worried about my final model not working.

 

I am not familiar about the PROC SURVEYSELECT and STRATA method you have suggested. Would you be able to give me a little more clarification?

ballardw
Super User

@csessa3 wrote:

Hello,

 

I would like to make state indicator dummy variables for a large dataset (n > 600,000) I have found documentation suggesting to run a prog glm or logistic. This method works for me if I reduce the sample size, but it will not run on my computer with the entire sample.

 

I have also hard coded it before, but am hoping to find something more efficient as it is making my code very long and difficult to sift through.

 

I have also seen the following code, but I am receiving an error saying "Array Subscript out of range." I have yet to decipher what that means.

 

This is the code I have most recently tried:

data indicators;
set miss_states1;
array dummys {*} st1 - st2;
do i=1 to DIM(dummys);
	dummys(i) = 0;
end;
dummys(state) = 1;
run;

Thank you in advance to anyone who can help!


What is the value of the variable State? Your array dummys contains exactly two elements and as defined would allow use of index values of 1 and 2. If STATE contains any value other than 1 or 2 then that is the cause of the "out of range" error.

 

And if you are having issues with the Proc Logistic or Proc GLM I strongly suggest that you show the LOG for entire procedure, code, warnings, errors and notes. Copy the entire proc from the log and paste the text into  a code box opened with the </> icon.

csessa3
Calcite | Level 5

State is a string variable with the state abbreviation (FL, GA, AL, etc.)

 

The error is that my machine runs out of memory. It will spin for 20-30 minutes and then say insufficient memory. When I open the file it will have categorical variables for 10 of the 50 states. So it works, but it can't complete the task for the entire file.

ballardw
Super User

@csessa3 wrote:

State is a string variable with the state abbreviation (FL, GA, AL, etc.)

 

The error is that my machine runs out of memory. It will spin for 20-30 minutes and then say insufficient memory. When I open the file it will have categorical variables for 10 of the 50 states. So it works, but it can't complete the task for the entire file.


So not only are you attempting to use a variable with more than 2 values as the index of the array you are attempting to use a Character variable where a Numeric is required.

 

And again, show the proc logistic/glm code you are attempting. 600,000 records is not really a "large" data set. Search this forum an you will find references to folks using way more records than that.

 

The most likely cause for running out of computational resources is likely to be number of variables. If you read the documentation for Proc Logistics in the details section is a topic labeled "computational resources" that shows the memory needed.

And additional memory is needed if the SELECTION option is used.

So, SHOW the code, from the log.

 

Adding additional variables, i.e. your hand made indicators is likely to increase memory problems not reduce them.

 

 

PaigeMiller
Diamond | Level 26

Although this has been mentioned earlier in this thread, I mention it again for emphasis.

 

You usually do not need to create your own dummy variables in SAS. Many SAS analyses procedures allow you to use a CLASS statement, so the SAS PROC creates the dummy variables for you, and you know they are correct — no fumbling around in a data step to get the correct dummy variables.


So, what do you plan to do with these dummy variables after you create them?

--
Paige Miller
csessa3
Calcite | Level 5

It's a long story, but the gist is that I have a dataset of medical providers who are listed in multiple states within the dataset. I am going to use the dummy variables to calculate which state they are listed in the most to use this as their "primary" state. 

 

So for example, if doctor A is listed in Florida 2 times and Georgia 1 time, I want to say he is a Florida doctor. 

PaigeMiller
Diamond | Level 26

@csessa3 wrote:

It's a long story, but the gist is that I have a dataset of medical providers who are listed in multiple states within the dataset. I am going to use the dummy variables to calculate which state they are listed in the most to use this as their "primary" state. 

 

So for example, if doctor A is listed in Florida 2 times and Georgia 1 time, I want to say he is a Florida doctor. 


PROC SUMMARY or PROC FREQ, then you don't need to create dummy variables yourself.

--
Paige Miller
Astounding
PROC Star

SAS can easily track data for 50 states without running out of memory.  So your post tells me that you are doing something different, such as trying to track the states for all providers at the same time.  Bottom line:  you will need to post the log from the program that is running out of memory if you want helpful feedback.

ballardw
Super User

@csessa3 wrote:

It's a long story, but the gist is that I have a dataset of medical providers who are listed in multiple states within the dataset. I am going to use the dummy variables to calculate which state they are listed in the most to use this as their "primary" state. 

 

So for example, if doctor A is listed in Florida 2 times and Georgia 1 time, I want to say he is a Florida doctor. 


Here is an example of one way involving simple steps to accomplish the part of identifying the "primary" state involved. Note that if there are two or more states with the same count there is nothing supplying any rule for tie breaking.

data have;
   input id $ state $;
datalines;
1 AL
1 AL
1 NM
2 AZ
2 NM
2 NM
;
Proc summary data=have nway;
   class id state;
   output out=work.freq (drop=_type_);
run;
proc sort data=work.freq;
   by id descending _freq_;
run;
data work.temp;
   set work.freq;
   by id;
   if first.id;
run;

proc sql;
   create table want as
   select a.*, b.state as PrimaryState
   from have as a
        left join
        work.temp as b
        on a.id =b.id
   ;
quit;

The first data step is just to have something to code against. The Proc summary gets a count; the sort orders the count data to get the most common state first, the temp data step gets only one record for each id and the proc SQL adds a new variable of "PrimaryState" back to your original data.

 

However, you still have not described how this is going to impact your Proc Logistic/GLM issue. And if you mean to add 50 more variables for "dummies" then that is not the correct approach.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1801 views
  • 2 likes
  • 7 in conversation