turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- Merging variables

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-12-2016 12:23 AM

Hi I have a dataset with 3 variables that have the following freqs for each of the variables by year

var1*year

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012

22 34 43 21 0 0 0 1 0 1 1 0 0

33 44 17 13 1 1 0 0 0 0 0 1 0

var2*year

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012

0 1 1 0 35 24 42 22 0 0 0 0 2

1 0 0 0 54 13 24 16 2 1 0 0 0

var3*year

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012

0 0 0 0 1 0 1 1 33 54 23 14 78

1 2 0 0 0 0 0 0 65 44 123 34 89

So I need to join these 3 variables into one variable using only years 2000-2003 for var1, 2004 to 2007 for var2 and 2008-2012 for var3.

Thank you

Accepted Solutions

Solution

05-19-2016
10:08 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-17-2016 03:41 PM

Thanks for clarifying that this is about cancer staging schemes and for the attached cross(?)-tabulations.

I find it confusing that both the tables in your initial post and the new ones have titles suggesting cross-tabulations (var1*year etc.), but have *no row headers*. This makes it difficult to compare different tables appropriately, because it is not clear which categories (rows) in different tables should be compared.

Each of the tables in the attachment has a total row at the bottom, which can be guessed only from the numbers.

Also, you use variable name VarComb (which I used for the CATX-concatenated variable) for a variable derived using the COALESCE function (an approach which I advised against).

The attached tables present a different picture than those in your initial post: Table "var3*year" in your initial post suggested that for the years 2004-2006 VAR3 has only very few non-missing values. The attached table with the same title, however, shows large numbers in all (displayed) categories. At the same time, table "Var2*year" has substantial numbers in only two of the eight categories shown. Of course, this must have to do with the significant change in the assignment of years to variables which has occurred between your initial and current post.

I tend to believe that, at the current stage, this is not a programming question. You should rather talk to a medical expert who is familiar with those different cancer staging schemes and who can advise you if and how different schemes can be unified for further analysis. As soon as you can describe the rule clearly, many forum members will be able to help you implementing it in SAS.

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-12-2016 01:17 AM

Your question is unclear.

Post what you have, as identical as possible, and what you want. Ideally, what you've tried as well and what isn't working.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-12-2016 10:21 AM

Do you have your data in SAS? Since you can't have a variable named "2000" then that structure you demonstrate woule have to have many more variables with different names than implied.

And how are you "joining" variables? Concatenate, add, multiply or something else entirely?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-12-2016 11:02 AM

Hi @mayasak,

Maybe this can be a start:

```
data want;
set have;
if year in (2000:2003) then var_new=var1;
else if year in (2004:2007) then var_new=var2;
else if year in (2008:2012) then var_new=var3;
run;
```

Then you can compare your PROC FREQ output to that of the following step:

```
proc freq data=want;
tables var_new*year;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-16-2016 01:51 PM

Thank you FreelanceReinhard. That was perfect. I also wanted to know how merge the variables without considering the years so that I can compare both sets and see if there are any overlaps within the different variables for the same years.

Thanks

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-16-2016 02:58 PM

Without considering the years, you could use the COALESCE function to obtain an aggregated ("merged") variable (by taking the first non-missing value, if any, from VAR1, VAR2, VAR3 in each observation):

```
data want2a;
set have;
var_new=coalesce(of var1-var3);
run;
proc compare data=want2a c=want;
run;
```

All differences shown in the PROC COMPARE output would be due to "overlaps" of "different variables for the same years." However, there can be overlaps without affecting the PROC COMPARE output, for example if VAR1 and VAR2 had (possibly different) non-missing values in an observation of the year 2000, VAR_NEW in both WANT and WANT2A would have the value from VAR1. Therefore, I wouldn't recommend this approach in general.

I would rather look at the combinations of VAR1 - VAR3:

```
data want2b;
length varcomb $30; /* please adapt the length as appropriate */
set have;
varcomb=catx(of var1-var3);
run;
proc freq data=want2b order=freq;
tables varcomb;
run;
```

Now, overlaps can be clearly recognized in the PROC FREQ output. For instance, a value VARCOMB='1,.,0' would mean "VAR1=1, VAR2=. (missing), VAR3=0." Typically, the corresponding frequency of such combinations will be low. Due to the ORDER= option these exceptional cases will appear closer to the bottom of the frequency table, whereas the predominant cases will be found in the first few rows of the table.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-16-2016 06:22 PM

I'm not sure if I'm doing the right thing. Let me explain what I'm supposed to do.

The data is for cancer and the three variables are for cancer staging. The problem is that the staging schemes for cancer changed throughout the years. So the first variable is for staging used from year 1988 to 2000. The second variable from year 2001-2003 and the third for years 2004 to 2003. However, I found that there were overlap in years using different staging schemes. What I have to do is to compare between the total cases per each year vs the number of records returned with a populated field value for each respective year/ staging scheme.

I've attached Freq tests for the var2*year and var3*year

Also

CombVar*year (using “if year in” codes that you provided in the beginning)

And

VarComb*year (using the coalesce statement)

Thanks

Solution

05-19-2016
10:08 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-17-2016 03:41 PM

Thanks for clarifying that this is about cancer staging schemes and for the attached cross(?)-tabulations.

I find it confusing that both the tables in your initial post and the new ones have titles suggesting cross-tabulations (var1*year etc.), but have *no row headers*. This makes it difficult to compare different tables appropriately, because it is not clear which categories (rows) in different tables should be compared.

Each of the tables in the attachment has a total row at the bottom, which can be guessed only from the numbers.

Also, you use variable name VarComb (which I used for the CATX-concatenated variable) for a variable derived using the COALESCE function (an approach which I advised against).

The attached tables present a different picture than those in your initial post: Table "var3*year" in your initial post suggested that for the years 2004-2006 VAR3 has only very few non-missing values. The attached table with the same title, however, shows large numbers in all (displayed) categories. At the same time, table "Var2*year" has substantial numbers in only two of the eight categories shown. Of course, this must have to do with the significant change in the assignment of years to variables which has occurred between your initial and current post.

I tend to believe that, at the current stage, this is not a programming question. You should rather talk to a medical expert who is familiar with those different cancer staging schemes and who can advise you if and how different schemes can be unified for further analysis. As soon as you can describe the rule clearly, many forum members will be able to help you implementing it in SAS.