Hi, I'm running a dataset filled with different gene mutations listed like AKT1 EL7K, AKT1 E49K, and AKT1 F55L. And I'd like to group them all into one variable named AKT1. This is what I had based on what I foudn online, but it hasn't been working.
Data MCCAssign1;
Set CombinedData (keep= N Name Notes protocol);
If missing(protocol) then delete;
if findw(N,'AKT1') then Ncat=AKT; end;
Any help is much appreciated!
Thank you!
Are you saying you have 50 vars for each individual? But you want a single frequency table over all those 50, right?
Presumably you want to run a proc freq on a data set derived from your data set. The derived data set (NEED below) would have 50 observations for each observation of 50 vars in your original dataset HAVE. You can do this in a data step with an array declaration containing all of your original vars as elements of the array, and then loop over the array, generating and outputting the main gene from each array element.
data need (keep=name main_gene);
set have;
array n {*} n1-n50;
do I=1 to dim(n);
main_gene=scan(n{I},1);
output;
end;
run;
proc freq data=need;
tables main_gene;
run;
Notes:
Please post some example data detailing what you have and what you want as output.
This does not need to be real data, it can be fake, but it should be as simple as possible, while complex enough to illustrate your problem.
My data looks like this:
Name N
John Doe EGFR R32R
John Doe EGFR 5539
Jane Williams BRCA1 6006
Jane Williams BRCA1 4003
Jane Williams AKT1 3997
So each person has several different mutations and I want to group their mutations based on the prefix of each one. So I would like to have them renamed as EGFR, BRCA1, and AKT1, erasing the numbers at the end.
As for my output, I want to find the total frequency of each mutation in my population, which is why I'd like to focus on the main gene and not the numbers at the end.
Thank you!
If there's a space, as in your sample use SCAN()
prefix = scan(N_var, 1);
Hi Reeza,
Thank you for taking the time to help me with this. I'm not sure I have the syntax right.
Data Assign1;
ABL1 = scan(N_var, 1);
Run;
I'm not sure what to put for N_var.
Thank you!
The variable name. The old variable that is, the left hand side of the equation is the new variable name.
new_variable = scan( old_variable, 1);
All functions are listed here and you should bookmark this page:
It's helpful to search by category when you don't know exactly what you're looking for.
SCAN function specifically:
So the only issue with the scan function, is that I have over 50 variables with EGFR as a prefix, so is the only way to write out each line of code for the 50 variables?
Thank you!
CONTAINS/FINDW/INDEXW all work to find strings but I think you haven't explained your problem thoroughly.
SCAN separates the word regardless of what the prefix is, so in my opinion that's a better method, given your example.
Going back to my first response:
This does not need to be real data, it can be fake, but it should be as simple as possible, while complex enough to illustrate your problem.
Given your last answer, I suggest you provide a better example of what you're trying to achieve.
You have several options but it also depends on what you want as output which you haven't specified. You can use an array to loop through as many variables as you want, or you could transpose your data to a long format which is more suitable for analysis and follows the concept of 'Tidy Data'.
So please provide a better example of your sample data AND what you want you want as output.
John Doe | EGFR RS2R |
John Doe | EGFR 5539 |
Jane Williams | BRCA1 6006 |
Jane Williams | BRCA1 2002 |
Tom Ford | BRCA1 4008 |
Tom Ford | BRCA1 2343 |
Tom Ford | BRCA1 2343 |
Tom Ford | EGFR 6382 |
Luis Mo | ALK1 8373 |
Luis Mo | EGFR 3378 |
Katie Lu | BRCA1 3873 |
katie Lu | EGFR 8739 |
This is a tiny example of what my dataset looks like. I have 2,000 different individuals, each with their own different set of mutations. I want to find the frequency of each mutation (ex. EGFR, BRCA1, etc ) in my total population, regardless of what location the mutation is on the gene (so I don't care about the numbers). So I wanted to find an easy way to fine the freuquency of EGFR mutations, by grouping all the EGFR XXXX into one variable category (EGFR), the BRAC1 XXXX into another, etc... without having to manually do it for 2,000 people. I would like my output to show the percentage of each mutation in my population with the new variables I create.
I'm sorry my explanations are poor, as you can see I am definitely a SAS beginner and have not mastered a lot of the data management aspect.
Thank you!
@kmardinian wrote:
So the only issue with the scan function, is that I have over 50 variables with EGFR as a prefix, so is the only way to write out each line of code for the 50 variables?
Thank you!
Your examples have all showed a single variable, not clearly named, to process. If you have have multiple variables then you should show an example with multiple variables and what the result should be.
It is possible that if you are looking to search a number of variables for a single value that an array with the variables to search would be the likely approach. But a clearer example would help.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.
The above would show us exactly what a subset of your data would look like.
Then you could show, using the same approach, what the results for that example data should be.
Are you saying you have 50 vars for each individual? But you want a single frequency table over all those 50, right?
Presumably you want to run a proc freq on a data set derived from your data set. The derived data set (NEED below) would have 50 observations for each observation of 50 vars in your original dataset HAVE. You can do this in a data step with an array declaration containing all of your original vars as elements of the array, and then loop over the array, generating and outputting the main gene from each array element.
data need (keep=name main_gene);
set have;
array n {*} n1-n50;
do I=1 to dim(n);
main_gene=scan(n{I},1);
output;
end;
run;
proc freq data=need;
tables main_gene;
run;
Notes:
Yes, that worked nicely! Thank you everyone for all your help. This seemed to be the easiest way to do this, I appreciate the advice.
Thank you!
I also did find a contains statement that might possibly work, but haven't been able to figure it out either. Here is an example of that below
data test; set test; if find(name,'ian','i') ge 1; /* equivalent WHERE clause */ *where upcase(name) contains 'IAN'; run;
It is from this website: http://support.sas.com/kb/43/303.html
Have you tried running Reeza's suggestion to use the SCAN in your original code? You would replace it in your original code like this:
Data MCCAssign1;
Set CombinedData (keep= N Name Notes protocol);
If missing(protocol) then delete;
Ncat = SCAN( N, 1);
run;
And you should end up with the desired Ncat variable for all of the cases.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.