Solved: Re: Renaming variables that have the same prefix

kmardinian · Posted 09-19-2017 06:17 PM

Hi, I'm running a dataset filled with different gene mutations listed like AKT1 EL7K, AKT1 E49K, and AKT1 F55L. And I'd like to group them all into one variable named AKT1. This is what I had based on what I foudn online, but it hasn't been working.

Data MCCAssign1;
Set CombinedData (keep= N Name Notes protocol);
If missing(protocol) then delete;
if findw(N,'AKT1') then Ncat=AKT; end;

Any help is much appreciated!

Thank you!

mkeintz · Posted 09-20-2017 02:36 PM

Are you saying you have 50 vars for each individual? But you want a single frequency table over all those 50, right?

Presumably you want to run a proc freq on a data set derived from your data set. The derived data set (NEED below) would have 50 observations for each observation of 50 vars in your original dataset HAVE. You can do this in a data step with an array declaration containing all of your original vars as elements of the array, and then loop over the array, generating and outputting the main gene from each array element.

data need (keep=name main_gene);
  set have;
  array n {*} n1-n50;
  do I=1 to dim(n);
    main_gene=scan(n{I},1);
    output;
  end;
run;

proc freq data=need;
  tables main_gene;
run;

Notes:

If a given main gene value occurs more than once in the 50 vars, this program counts it every time. If you only want to count main gene once, then the simplest solution is to put this between the DATA step and the PROC FREQ:
     proc sort data=need noduprec;
       by name main_gene;
     run;
If your original vars are NOT neatly named N1, N2, ... N50, you can still use the array. Just replace the "N1-N50" with the list of vars in the array statement. But note the program assume each of the original gene values has main gene as the first space-separated word.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 09-19-2017 06:21 PM

Please post some example data detailing what you have and what you want as output.
This does not need to be real data, it can be fake, but it should be as simple as possible, while complex enough to illustrate your problem.

kmardinian · Posted 09-19-2017 06:34 PM

My data looks like this:

Name N

John Doe EGFR R32R

John Doe EGFR 5539

Jane Williams BRCA1 6006

Jane Williams BRCA1 4003

Jane Williams AKT1 3997

So each person has several different mutations and I want to group their mutations based on the prefix of each one. So I would like to have them renamed as EGFR, BRCA1, and AKT1, erasing the numbers at the end.

As for my output, I want to find the total frequency of each mutation in my population, which is why I'd like to focus on the main gene and not the numbers at the end.

Thank you!

Reeza · Posted 09-19-2017 06:51 PM

If there's a space, as in your sample use SCAN()

prefix = scan(N_var, 1);

kmardinian · Posted 09-19-2017 06:56 PM

Hi Reeza,

Thank you for taking the time to help me with this. I'm not sure I have the syntax right.

Data Assign1;
ABL1 = scan(N_var, 1);
Run;

I'm not sure what to put for N_var.

Thank you!

Reeza · Posted 09-19-2017 07:05 PM

The variable name. The old variable that is, the left hand side of the equation is the new variable name.

new_variable = scan( old_variable, 1);

All functions are listed here and you should bookmark this page:

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#p0w6napahk6...

It's helpful to search by category when you don't know exactly what you're looking for.

SCAN function specifically:

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#p0jshdjy2z9...

kmardinian · Posted 09-19-2017 07:12 PM

So the only issue with the scan function, is that I have over 50 variables with EGFR as a prefix, so is the only way to write out each line of code for the 50 variables?

Thank you!

Reeza · Posted 09-19-2017 07:48 PM

CONTAINS/FINDW/INDEXW all work to find strings but I think you haven't explained your problem thoroughly.

SCAN separates the word regardless of what the prefix is, so in my opinion that's a better method, given your example.

Going back to my first response:

This does not need to be real data, it can be fake, but it should be as simple as possible, while complex enough to illustrate your problem.

Given your last answer, I suggest you provide a better example of what you're trying to achieve.

You have several options but it also depends on what you want as output which you haven't specified. You can use an array to loop through as many variables as you want, or you could transpose your data to a long format which is more suitable for analysis and follows the concept of 'Tidy Data'.

So please provide a better example of your sample data AND what you want you want as output.

kmardinian · Posted 09-19-2017 09:15 PM

John Doe	EGFR RS2R
John Doe	EGFR 5539
Jane Williams	BRCA1 6006
Jane Williams	BRCA1 2002
Tom Ford	BRCA1 4008
Tom Ford	BRCA1 2343
Tom Ford	BRCA1 2343
Tom Ford	EGFR 6382
Luis Mo	ALK1 8373
Luis Mo	EGFR 3378
Katie Lu	BRCA1 3873
katie Lu	EGFR 8739

This is a tiny example of what my dataset looks like. I have 2,000 different individuals, each with their own different set of mutations. I want to find the frequency of each mutation (ex. EGFR, BRCA1, etc ) in my total population, regardless of what location the mutation is on the gene (so I don't care about the numbers). So I wanted to find an easy way to fine the freuquency of EGFR mutations, by grouping all the EGFR XXXX into one variable category (EGFR), the BRAC1 XXXX into another, etc... without having to manually do it for 2,000 people. I would like my output to show the percentage of each mutation in my population with the new variables I create.

I'm sorry my explanations are poor, as you can see I am definitely a SAS beginner and have not mastered a lot of the data management aspect.

Thank you!

ballardw · Posted 09-20-2017 10:42 AM

@kmardinian wrote:

So the only issue with the scan function, is that I have over 50 variables with EGFR as a prefix, so is the only way to write out each line of code for the 50 variables?

Thank you!

Your examples have all showed a single variable, not clearly named, to process. If you have have multiple variables then you should show an example with multiple variables and what the result should be.

It is possible that if you are looking to search a number of variables for a single value that an array with the variables to search would be the likely approach. But a clearer example would help.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

The above would show us exactly what a subset of your data would look like.

Then you could show, using the same approach, what the results for that example data should be.

mkeintz · Posted 09-20-2017 02:36 PM

Are you saying you have 50 vars for each individual? But you want a single frequency table over all those 50, right?

Presumably you want to run a proc freq on a data set derived from your data set. The derived data set (NEED below) would have 50 observations for each observation of 50 vars in your original dataset HAVE. You can do this in a data step with an array declaration containing all of your original vars as elements of the array, and then loop over the array, generating and outputting the main gene from each array element.

data need (keep=name main_gene);
  set have;
  array n {*} n1-n50;
  do I=1 to dim(n);
    main_gene=scan(n{I},1);
    output;
  end;
run;

proc freq data=need;
  tables main_gene;
run;

Notes:

If a given main gene value occurs more than once in the 50 vars, this program counts it every time. If you only want to count main gene once, then the simplest solution is to put this between the DATA step and the PROC FREQ:
     proc sort data=need noduprec;
       by name main_gene;
     run;
If your original vars are NOT neatly named N1, N2, ... N50, you can still use the array. Just replace the "N1-N50" with the list of vars in the array statement. But note the program assume each of the original gene values has main gene as the first space-separated word.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

kmardinian · Posted 09-20-2017 07:33 PM

Yes, that worked nicely! Thank you everyone for all your help. This seemed to be the easiest way to do this, I appreciate the advice.

Thank you!

kmardinian · Posted 09-19-2017 07:07 PM

I also did find a contains statement that might possibly work, but haven't been able to figure it out either. Here is an example of that below

data test;
   set test;
   if find(name,'ian','i') ge 1;
   /* equivalent WHERE clause */
   *where upcase(name) contains 'IAN';
run;

It is from this website: http://support.sas.com/kb/43/303.html

Michelle · Posted 09-20-2017 10:35 AM

Have you tried running Reeza's suggestion to use the SCAN in your original code? You would replace it in your original code like this:

Data MCCAssign1;
Set CombinedData (keep= N Name Notes protocol);
If missing(protocol) then delete;

Ncat = SCAN( N, 1);

run;

And you should end up with the desired Ncat variable for all of the cases.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away