Help using Base SAS procedures

Dummy variables for very large data set

Reply
Occasional Contributor
Posts: 5

Dummy variables for very large data set

I have a large data set (thousands of observations) and I have a variable with 200 different values. I need to create a dummy variable where each value takes a number (e.g., the variable is Product Type and I need a new dummy variable where Product Type "food" is classified as 1, Type "linens" is 2, etc.). So the dummy variable would have numbers from 1 - 200 representing each product type. The problem is, because the data set is so large, every procedure I try to use causes SAS to completely freeze after running forever. 

 

What might be the best (and easiest) way to accomplish this? 

Super User
Super User
Posts: 9,599

Re: Dummy variables for very large data set

Probably format would be quickest.  Normally you can just apply a format to the variable in question without updating the table, but if you really need a new variable then you will need to write it.

Format

proc format;
  value $product
    "food"=1
    "linens"=2
    ...;
run;
    

To apply without modification - does not create a new variable just adds the format (so you effectively have both):

proc datasets library=work;
  modify yourds
    format product_type $product.;
run;

To create a format you would write the dataset again:

data want;
  set yourds;
  new_var=put(product_type,$product.);
run;

If your dataset is that large however you might want to consider changing to big data procedures, working on grid, breaking data up etc. as normal SAS should be fine with a few million records.

Super User
Posts: 10,280

Re: Dummy variables for very large data set


@cbing wrote:

I have a large data set (thousands of observations)


"Thousands" is not even small nowadays, it's actually puny. If that is not a typing mistake, then something else must be at play that causes your problems. To determine that, we will need example data and code (see my footnotes for hints how to post those).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
SAS Super FREQ
Posts: 4,245

Re: Dummy variables for very large data set

What do you intend to do with these dummy variables? Are you planning to run an analysis? Compute summary statistics? If you provide sample data and indicate what you want to acheive, we can advise the best way to accomplish your objective.

Respected Advisor
Posts: 3,065

Re: Dummy variables for very large data set

So the dummy variable would have numbers from 1 - 200 representing each product type. The problem is, because the data set is so large, every procedure I try to use causes SAS to completely freeze after running forever. 

 

You certainly didn't provide us any useful details. What procedures did you try? Be specific. Show us the code for one or two of them. As long as I'm asking for details ... how large is the dataset? How many rows? How many columns? 


If we have this information, we might be able to provide help.

 

Also, variables having values 1-200 are not dummy variables in the traditional sense. Doing this gives you no benefit over using categories like "food" and "linens", it just makes the results harder to interpret because now you have to translate the numbers back to words that people can understand. Many SAS procedures do this translation to dummy variable internally, so you don't have to create the dummy variables anyway, and perhaps some of the unmentioned procedures in SAS that you tried actually have this feature.

--
Paige Miller
Trusted Advisor
Posts: 1,345

Re: Dummy variables for very large data set

As I understand it, you have a character variable, say PRODUCT_TYPE, that has hundreds of possible values (say 222).  You want to create a numeric category variable (not a dummy variable) NEW_VAR with values from 1 through 222 to use in other procedures,  reports, and data transformations.

 

If so, you can run a data step which reads each record in dataset HAVE, and uses a hash object to keep track of the product_type values already encountered (and their associated NEW_VAR).  And each time a new product_type is found, NEW_VAR is incremented and the product_type/new_var association is added to the hash object: 

 

data want (drop=rc);
  set have;
  if _n_=1 then do;
    declare hash h ();
      h.definekey('product_type');
      h.definedata('new_var');
      h.definedone();
  end;
  rc=h.find(key:product_type); /*Already in hash h?*/
  if rc^=0 then do;            /* If not, then add this item to h*/
    new_var=h.num_items+1;
    rc=h.add();
  end;
run;

 

You now have a dataset with the original variable and its categorical value.  If needed, you can use these to build the format for the original variable (and the new variable) if needed.

 

 

SAS Super FREQ
Posts: 4,245

Re: Dummy variables for very large data set

After re-reading the OP's statement and several of the comments, it appears that the OP's problem is not related to dummy variables. It sounds like the issue is "how to recode a character variable as a numeric variable."  If that is true, then again I ask "why do you want to do this?" SAS is perfectly capable of using character variables as CLASS variables, in plots, etc. What is the goal of this task? What problem are you trying to solve?

Occasional Contributor
Posts: 5

Re: Dummy variables for very large data set

I am doing a replication, and quite frankly, I don't know why the dummy variable is being used. I used the product type as an example, but I realize that was a bad choice. It is actually a numeric variable with product types being depicted as numbers (e.g., product type 33 which actually means "food," but the 33 is in numeric form in the data set; I apologize for the vagueness in this example, I thought it'd be clearer using the character values instead of saying I need 33=1, 39=2, and so forth since the values are not continuous 33, 34, 35, etc.). In the study I'm trying to replicate, they used dummy variables for the product type (though I'm unsure why, but I assume there's a reason that would affect my results?). The product type is meant to be a control variable, but from what I understand it functions just like any other independent variable. I can run a regular OLS regression with the variable as-is, without the dummy variables, but I'm not getting the correct results - they are actually way off, not just a little bit. So I was trying to figure out another way.

 

From my regression I ultimately need the interaction term coefficients for the two groups (part of the interaction term in the equation below), and then their mean differences for each year.

 

model: DV = Year Year*Group ProductType

 

The previous codes I used to try and do the dummy variables include proc glmmod, and I also tried a macro code, but I no longer have the code because it was a mess and not what I needed. 

Super User
Posts: 10,280

Re: Dummy variables for very large data set

It's best if you now supply some example data (see my footnotes) and what you expect the dataset to look like after the creation of the dummy variable(s), so we can make code suggestions for how to achieve that.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Respected Advisor
Posts: 3,065

Re: Dummy variables for very large data set

[ Edited ]

I apologize for the vagueness in this example, I thought it'd be clearer using the character values instead of saying I need 33=1, 39=2, and so forth since the values are not continuous 33, 34, 35, etc.). In the study I'm trying to replicate, they used dummy variables for the product type (though I'm unsure why, but I assume there's a reason that would affect my results?). 

 

Apologizing is nice, but providing a clear problem statement is even better. Anyway, changing 33 to 1, and 39 to 2, and so on, is pointless, and gains you nothing. Now perhaps you have the situation where BOTH 33 and 67 get changed to 1, in which case such a transformation is useful and not pointless; but you didn't say that. So I will stick with my pointless statement.

 

In the study I'm trying to replicate, they used dummy variables for the product type (though I'm unsure why, but I assume there's a reason that would affect my results?). The product type is meant to be a control variable, but from what I understand it functions just like any other independent variable. I can run a regular OLS regression with the variable as-is, without the dummy variables, but I'm not getting the correct results - they are actually way off, not just a little bit. So I was trying to figure out another way.

 

Let's forget this whole dummy variable thing for now. It is not relevant to describing the problem clearly, and getting good results. Furthermore, as I stated earlier, many SAS procedures, including PROC GLM (which is the PROC I think you should use) does not need dummy variables to be created beforehand. GLM will create the dummy variables for you, so you don't have to.

 

If you are not getting correct results, there are many possible reasons for this, and so you can't simply conclude that you need to add dummy variables.

 

Let me ask you this ... are you using the exact same data as the study you are trying to replicate? Or are you using your own data but using the design of the study you are trying to replicate? This isn't clear either.

 

 

--
Paige Miller
Ask a Question
Discussion stats
  • 9 replies
  • 250 views
  • 8 likes
  • 6 in conversation