Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- SAS Procedures
- /
- Dummy variables for very large data set

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 05-15-2018 11:01 PM
(1478 views)

I have a large data set (thousands of observations) and I have a variable with 200 different values. I need to create a dummy variable where each value takes a number (e.g., the variable is Product Type and I need a new dummy variable where Product Type "food" is classified as 1, Type "linens" is 2, etc.). So the dummy variable would have numbers from 1 - 200 representing each product type. The problem is, because the data set is so large, every procedure I try to use causes SAS to completely freeze after running forever.

What might be the best (and easiest) way to accomplish this?

9 REPLIES 9

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Probably format would be quickest. Normally you can just apply a format to the variable in question without updating the table, but if you really need a new variable then you will need to write it.

Format

proc format; value $product "food"=1 "linens"=2 ...; run;

To apply without modification - does not create a new variable just adds the format (so you effectively have both):

proc datasets library=work; modify yourds format product_type $product.; run;

To create a format you would write the dataset again:

data want; set yourds; new_var=put(product_type,$product.); run;

If your dataset is that large however you might want to consider changing to big data procedures, working on grid, breaking data up etc. as normal SAS should be fine with a few million records.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@cbing wrote:

I have a large data set (thousands of observations)

"Thousands" is not even *small* nowadays, it's actually *puny*. If that is not a typing mistake, then something else must be at play that causes your problems. To determine that, we will need example data and code (see my footnotes for hints how to post those).

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

So the dummy variable would have numbers from 1 - 200 representing each product type. The problem is, because the data set is so large, every procedure I try to use causes SAS to completely freeze after running forever.

You certainly didn't provide us any useful details. What procedures did you try? Be specific. Show us the code for one or two of them. As long as I'm asking for details ... how large is the dataset? How many rows? How many columns?

If we have this information, we might be able to provide help.

Also, variables having values 1-200 are not dummy variables in the traditional sense. Doing this gives you no benefit over using categories like "food" and "linens", it just makes the results harder to interpret because now you have to translate the numbers back to words that people can understand. Many SAS procedures do this translation to dummy variable internally, so you don't have to create the dummy variables anyway, and perhaps some of the unmentioned procedures in SAS that you tried actually have this feature.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

As I understand it, you have a character variable, say PRODUCT_TYPE, that has hundreds of possible values (say 222). You want to create a numeric category variable (not a dummy variable) NEW_VAR with values from 1 through 222 to use in other procedures, reports, and data transformations.

If so, you can run a data step which reads each record in dataset HAVE, and uses a hash object to keep track of the product_type values already encountered (and their associated NEW_VAR). And each time a new product_type is found, NEW_VAR is incremented and the product_type/new_var association is added to the hash object:

```
data want (drop=rc);
set have;
if _n_=1 then do;
declare hash h ();
h.definekey('product_type');
h.definedata('new_var');
h.definedone();
end;
rc=h.find(key:product_type); /*Already in hash h?*/
if rc^=0 then do; /* If not, then add this item to h*/
new_var=h.num_items+1;
rc=h.add();
end;
run;
```

You now have a dataset with the original variable and its categorical value. If needed, you can use these to build the format for the original variable (and the new variable) if needed.

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I am doing a replication, and quite frankly, I don't know why the dummy variable is being used. I used the product type as an example, but I realize that was a bad choice. It is actually a numeric variable with product types being depicted as numbers (e.g., product type 33 which actually means "food," but the 33 is in numeric form in the data set; I apologize for the vagueness in this example, I thought it'd be clearer using the character values instead of saying I need 33=1, 39=2, and so forth since the values are not continuous 33, 34, 35, etc.). In the study I'm trying to replicate, they used dummy variables for the product type (though I'm unsure why, but I assume there's a reason that would affect my results?). The product type is meant to be a control variable, but from what I understand it functions just like any other independent variable. I can run a regular OLS regression with the variable as-is, without the dummy variables, but I'm not getting the correct results - they are actually way off, not just a little bit. So I was trying to figure out another way.

From my regression I ultimately need the interaction term coefficients for the two groups (part of the interaction term in the equation below), and then their mean differences for each year.

model: DV = Year Year*Group ProductType

The previous codes I used to try and do the dummy variables include proc glmmod, and I also tried a macro code, but I no longer have the code because it was a mess and not what I needed.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I apologize for the vagueness in this example, I thought it'd be clearer using the character values instead of saying I need 33=1, 39=2, and so forth since the values are not continuous 33, 34, 35, etc.). In the study I'm trying to replicate, they used dummy variables for the product type (though I'm unsure why, but I assume there's a reason that would affect my results?).

Apologizing is nice, but providing a clear problem statement is even better. Anyway, changing 33 to 1, and 39 to 2, and so on, is pointless, and gains you nothing. Now perhaps you have the situation where BOTH 33 and 67 get changed to 1, in which case such a transformation is useful and not pointless; but you didn't say that. So I will stick with my pointless statement.

In the study I'm trying to replicate, they used dummy variables for the product type (though I'm unsure why, but I assume there's a reason that would affect my results?). The product type is meant to be a control variable, but from what I understand it functions just like any other independent variable. I can run a regular OLS regression with the variable as-is, without the dummy variables, but I'm not getting the correct results - they are actually way off, not just a little bit. So I was trying to figure out another way.

Let's forget this whole dummy variable thing for now. It is not relevant to describing the problem clearly, and getting good results. Furthermore, as I stated earlier, many SAS procedures, including PROC GLM (which is the PROC I think you should use) does not need dummy variables to be created beforehand. GLM will create the dummy variables for you, so you don't have to.

If you are not getting correct results, *there are many possible reasons for this*, and so you can't simply conclude that you need to add dummy variables.

Let me ask you this ... are you using the exact same data as the study you are trying to replicate? Or are you using your own data but using the design of the study you are trying to replicate? This isn't clear either.

--

Paige Miller

Paige Miller

**Available on demand!**

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.