Creating Dummy Variables from Categorical Variable for large dataset

csessa3 · Posted 07-19-2020 08:29 PM

Hello,

I would like to make state indicator dummy variables for a large dataset (n > 600,000) I have found documentation suggesting to run a prog glm or logistic. This method works for me if I reduce the sample size, but it will not run on my computer with the entire sample.

I have also hard coded it before, but am hoping to find something more efficient as it is making my code very long and difficult to sift through.

I have also seen the following code, but I am receiving an error saying "Array Subscript out of range." I have yet to decipher what that means.

This is the code I have most recently tried:

data indicators;
set miss_states1;
array dummys {*} st1 - st2;
do i=1 to DIM(dummys);
	dummys(i) = 0;
end;
dummys(state) = 1;
run;

Thank you in advance to anyone who can help!

ghosh · Posted 07-19-2020 10:14 PM

The error is in this line

dummys(state) = 1;

csessa3 · Posted 07-20-2020 08:43 AM

Hello, thank you for identifying the issue. Do you know how to correct it? What is wrong with it?

Astounding · Posted 07-19-2020 10:57 PM

Something shorter and more efficient? Here's an idea. Do nothing. Leave your data just as it is.

Both GLM and LOGISTIC support the CLASS statement which let's the procedure create the dummy variables for you.

csessa3 · Posted 07-20-2020 08:42 AM

Hello. I would be fine using the logistic class statement, but it will not run for my large sample. Which is why I am looking for an alternate solution

Reeza · Posted 07-19-2020 11:00 PM

Have you tried the approaches outlined here, specifically GLMSELECT?
My concern is if that doesn't run, your actual model won't run either?

Could you create a small sample that has all the values using PROC SURVEYSELECT and STRATA to ensure your categorical variables are included and then use one of the shown methods?

csessa3 · Posted 07-20-2020 08:48 AM

The GLMSELECT and the proc logistic work for creating the categorical variables when the sample size is reduced. I have previously hard coded the state indicators and run my final regression model with no issue, so I am not worried about my final model not working.

I am not familiar about the PROC SURVEYSELECT and STRATA method you have suggested. Would you be able to give me a little more clarification?

ballardw · Posted 07-20-2020 12:44 AM

@csessa3 wrote:

Hello,

I would like to make state indicator dummy variables for a large dataset (n > 600,000) I have found documentation suggesting to run a prog glm or logistic. This method works for me if I reduce the sample size, but it will not run on my computer with the entire sample.

I have also hard coded it before, but am hoping to find something more efficient as it is making my code very long and difficult to sift through.

I have also seen the following code, but I am receiving an error saying "Array Subscript out of range." I have yet to decipher what that means.

This is the code I have most recently tried:
data indicators;
set miss_states1;
array dummys {*} st1 - st2;
do i=1 to DIM(dummys);
	dummys(i) = 0;
end;
dummys(state) = 1;
run;
Thank you in advance to anyone who can help!

What is the value of the variable State? Your array dummys contains exactly two elements and as defined would allow use of index values of 1 and 2. If STATE contains any value other than 1 or 2 then that is the cause of the "out of range" error.

And if you are having issues with the Proc Logistic or Proc GLM I strongly suggest that you show the LOG for entire procedure, code, warnings, errors and notes. Copy the entire proc from the log and paste the text into a code box opened with the </> icon.

csessa3 · Posted 07-20-2020 08:58 AM

State is a string variable with the state abbreviation (FL, GA, AL, etc.)

The error is that my machine runs out of memory. It will spin for 20-30 minutes and then say insufficient memory. When I open the file it will have categorical variables for 10 of the 50 states. So it works, but it can't complete the task for the entire file.

ballardw · Posted 07-20-2020 11:12 AM

@csessa3 wrote:

State is a string variable with the state abbreviation (FL, GA, AL, etc.)

The error is that my machine runs out of memory. It will spin for 20-30 minutes and then say insufficient memory. When I open the file it will have categorical variables for 10 of the 50 states. So it works, but it can't complete the task for the entire file.

So not only are you attempting to use a variable with more than 2 values as the index of the array you are attempting to use a Character variable where a Numeric is required.

And again, show the proc logistic/glm code you are attempting. 600,000 records is not really a "large" data set. Search this forum an you will find references to folks using way more records than that.

The most likely cause for running out of computational resources is likely to be number of variables. If you read the documentation for Proc Logistics in the details section is a topic labeled "computational resources" that shows the memory needed.

And additional memory is needed if the SELECTION option is used.

So, SHOW the code, from the log.

Adding additional variables, i.e. your hand made indicators is likely to increase memory problems not reduce them.

PaigeMiller · Posted 07-20-2020 06:31 AM

Although this has been mentioned earlier in this thread, I mention it again for emphasis.

You usually do not need to create your own dummy variables in SAS. Many SAS analyses procedures allow you to use a CLASS statement, so the SAS PROC creates the dummy variables for you, and you know they are correct — no fumbling around in a data step to get the correct dummy variables.

So, what do you plan to do with these dummy variables after you create them?

--
Paige Miller

csessa3 · Posted 07-20-2020 08:55 AM

It's a long story, but the gist is that I have a dataset of medical providers who are listed in multiple states within the dataset. I am going to use the dummy variables to calculate which state they are listed in the most to use this as their "primary" state.

So for example, if doctor A is listed in Florida 2 times and Georgia 1 time, I want to say he is a Florida doctor.

PaigeMiller · Posted 07-20-2020 09:01 AM

@csessa3 wrote:

It's a long story, but the gist is that I have a dataset of medical providers who are listed in multiple states within the dataset. I am going to use the dummy variables to calculate which state they are listed in the most to use this as their "primary" state.

So for example, if doctor A is listed in Florida 2 times and Georgia 1 time, I want to say he is a Florida doctor.

PROC SUMMARY or PROC FREQ, then you don't need to create dummy variables yourself.

--
Paige Miller

Astounding · Posted 07-20-2020 11:13 AM

SAS can easily track data for 50 states without running out of memory. So your post tells me that you are doing something different, such as trying to track the states for all providers at the same time. Bottom line: you will need to post the log from the program that is running out of memory if you want helpful feedback.

ballardw · Posted 07-20-2020 12:57 PM

@csessa3 wrote:

It's a long story, but the gist is that I have a dataset of medical providers who are listed in multiple states within the dataset. I am going to use the dummy variables to calculate which state they are listed in the most to use this as their "primary" state.

So for example, if doctor A is listed in Florida 2 times and Georgia 1 time, I want to say he is a Florida doctor.

Here is an example of one way involving simple steps to accomplish the part of identifying the "primary" state involved. Note that if there are two or more states with the same count there is nothing supplying any rule for tie breaking.

data have;
   input id $ state $;
datalines;
1 AL
1 AL
1 NM
2 AZ
2 NM
2 NM
;
Proc summary data=have nway;
   class id state;
   output out=work.freq (drop=_type_);
run;
proc sort data=work.freq;
   by id descending _freq_;
run;
data work.temp;
   set work.freq;
   by id;
   if first.id;
run;

proc sql;
   create table want as
   select a.*, b.state as PrimaryState
   from have as a
        left join
        work.temp as b
        on a.id =b.id
   ;
quit;

The first data step is just to have something to code against. The Proc summary gets a count; the sort orders the count data to get the most common state first, the temp data step gets only one record for each id and the proc SQL adds a new variable of "PrimaryState" back to your original data.

However, you still have not described how this is going to impact your Proc Logistic/GLM issue. And if you mean to add 50 more variables for "dummies" then that is not the correct approach.

Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

Re: Creating Dummy Variables from Categorical Variable for large dataset

SAS Innovate 2025: Call for Content

Classroom Training Available!