Thank you very much in advance!
I use SAS 9.4. I want to add a column to my sas dataset. Generally, I know how it can be done. However, the condition under which I put values in the new column is complicated. So. I had to ask this question. I also apoligize if my statements for describing the problem are not very straightforward.
My original dataset is very large, so I provided a part of it here:
ENROLID | DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | DX7 | DX8 | DX9 | DX10 |
133058304 | 73730 | 51889 | V151 | 73739 | V454 | V5489 | V5849 | 4271 | 4589 | 7452 |
133058304 | 51884 | 7452 | 75672 | 4660 | 51889 | 7931 | 4293 | 78609 | 5180 | |
133058304 | 51881 | 2761 | 5180 | 51889 | 7931 | 7452 | 49390 | 4589 | 486 | 78609 |
133062302 | 486 | 78060 | 78701 | 481 | 78650 | |||||
133062601 | 8208 | 2720 | 4010 | 41400 | 49390 | 8209 | 2724 | 71945 | 4019 | |
133062601 | 41400 | 4111 | 496 | 41070 | 4019 | |||||
133062601 | 1623 | 4019 | 41400 | 1629 | 7866 | 7856 | ||||
133064602 | 40311 | 5849 | 5855 | 27651 | 4280 | 5859 | 5939 | 32723 | 25040 | 496 |
133064602 | 5849 | 2762 | 40391 | 586 | 78050 | 78079 | V5882 | 4293 | 2449 | 25000 |
133064602 | 99673 | 40391 | 5856 | 78650 | 25000 | 5859 | 586 | 78659 | ||
133068301 | V5811 | 20150 | 2859 | 20280 | 45386 | 78701 | ||||
133068301 | V5811 | 20151 | 78060 | 20150 | 2768 | |||||
133068301 | 28800 | 20190 | 2874 | 78061 | ||||||
133068301 | 20158 | 4233 | 486 | 20150 | 20190 | 2859 | 78060 | 20280 | 42789 | 51889 |
In this data, I want to add a new column based on the values under variables "DX1, ..., DX10."
For each observation (i.e., row), if any of the values under these variables (i.e., DX1, ..., DX10) follow the numbers provided in the table below, then the value of the new column (for that specific row) will be one of 7 categories (shown in the excel file). So, to make it clear, the elements of the new column should be one of those 7 categories.
category 1 | category 2 | category 3 | category 4 | category 5 | category 6 | category 7 |
30780 | 2761 | 30400 | 30550 | 5859 | 96501 | 5855 |
4019 | 20158 | 51889 | 30551 | 78650 | E8500 | 7452 |
25000 | 33900 | 30402 | 30552 | 96502 | 51883 | |
30789 | 33901 | 30403 | 30553 | 96509 | 7489 | |
3380 | 33902 | 30470 | E8501 | 78600 | ||
33811 | 33905 | 30471 | 7991 | |||
33812 | 33910 | 30472 | 99739 | |||
33818 | 33911 | 30473 | ||||
33819 | 33912 | |||||
33821 | 33920 | |||||
33828 | ||||||
33829 | ||||||
3384 | ||||||
3502 | ||||||
37991 |
I will be very thankful if anyone can help me in this regard!
Here is one way:
data have; infile cards dlm='09'x truncover; informat ENROLID $9.; input ENROLID (DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 DX9 DX10) ($); cards; 133058304 73730 51889 V151 73739 V454 V5489 V5849 4271 4589 7452 133058304 51884 7452 75672 4660 51889 7931 4293 78609 5180 133058304 51881 2761 5180 51889 7931 7452 49390 4589 486 78609 133062302 486 78060 78701 481 78650 133062601 8208 2720 4010 41400 49390 8209 2724 71945 4019 133062601 41400 4111 496 41070 4019 133062601 1623 4019 41400 1629 7866 7856 133064602 40311 5849 5855 27651 4280 5859 5939 32723 25040 496 133064602 5849 2762 40391 586 78050 78079 V5882 4293 2449 25000 133064602 99673 40391 5856 78650 25000 5859 586 78659 133068301 V5811 20150 2859 20280 45386 78701 133068301 V5811 20151 78060 20150 2768 133068301 28800 20190 2874 78061 133068301 20158 4233 486 20150 20190 2859 78060 20280 42789 51889 ; data want (drop=i); array c1(15) $ _temporary_ ('30780' '4019' '25000' '30789' '3380' '33811' '33812' '33818' '33819' '33821' '33828' '33829' '3384' '3502' '37991'); array c2(10) $ _temporary_ ('2761' '20158' '33900' '33901' '33902' '33905' '33910' '33911' '33912' '33920'); array c3(8) $ _temporary_ ('30400' '51889' '30402' '30403' '30470' '30471' '30472' '30473'); array c4(4) $ _temporary_ ('30550' '30551' '30552' '30553'); array c5(5) $ _temporary_ ('5859' '78650' '96502' '96509' 'E8501'); array c6(2) $ _temporary_ ('96501' 'E8500'); array c7(7) $ _temporary_ ('5855' '7452' '51883' '7489' '78600' '7991' '99739'); array dx(*) $ dx1-dx10; set have; do i=1 to dim(dx); if dx(i) in c1 then category=1; else if dx(i) in c2 then category=2; else if dx(i) in c3 then category=2; else if dx(i) in c4 then category=2; else if dx(i) in c5 then category=2; else if dx(i) in c6 then category=2; else if dx(i) in c7 then category=2; else call missing(category); if not missing(category) then leave; end; run;
Art, CEO, AnalystFinder.com
What if multiple entries (in a row) match the values of two or more columns?
Art, CEO, AnalystFinder.com
In the actual data, that won't happen (categories are mutually exclusive). But, if that generally happens, we assign one category at random.
Here is one way:
data have; infile cards dlm='09'x truncover; informat ENROLID $9.; input ENROLID (DX1 DX2 DX3 DX4 DX5 DX6 DX7 DX8 DX9 DX10) ($); cards; 133058304 73730 51889 V151 73739 V454 V5489 V5849 4271 4589 7452 133058304 51884 7452 75672 4660 51889 7931 4293 78609 5180 133058304 51881 2761 5180 51889 7931 7452 49390 4589 486 78609 133062302 486 78060 78701 481 78650 133062601 8208 2720 4010 41400 49390 8209 2724 71945 4019 133062601 41400 4111 496 41070 4019 133062601 1623 4019 41400 1629 7866 7856 133064602 40311 5849 5855 27651 4280 5859 5939 32723 25040 496 133064602 5849 2762 40391 586 78050 78079 V5882 4293 2449 25000 133064602 99673 40391 5856 78650 25000 5859 586 78659 133068301 V5811 20150 2859 20280 45386 78701 133068301 V5811 20151 78060 20150 2768 133068301 28800 20190 2874 78061 133068301 20158 4233 486 20150 20190 2859 78060 20280 42789 51889 ; data want (drop=i); array c1(15) $ _temporary_ ('30780' '4019' '25000' '30789' '3380' '33811' '33812' '33818' '33819' '33821' '33828' '33829' '3384' '3502' '37991'); array c2(10) $ _temporary_ ('2761' '20158' '33900' '33901' '33902' '33905' '33910' '33911' '33912' '33920'); array c3(8) $ _temporary_ ('30400' '51889' '30402' '30403' '30470' '30471' '30472' '30473'); array c4(4) $ _temporary_ ('30550' '30551' '30552' '30553'); array c5(5) $ _temporary_ ('5859' '78650' '96502' '96509' 'E8501'); array c6(2) $ _temporary_ ('96501' 'E8500'); array c7(7) $ _temporary_ ('5855' '7452' '51883' '7489' '78600' '7991' '99739'); array dx(*) $ dx1-dx10; set have; do i=1 to dim(dx); if dx(i) in c1 then category=1; else if dx(i) in c2 then category=2; else if dx(i) in c3 then category=2; else if dx(i) in c4 then category=2; else if dx(i) in c5 then category=2; else if dx(i) in c6 then category=2; else if dx(i) in c7 then category=2; else call missing(category); if not missing(category) then leave; end; run;
Art, CEO, AnalystFinder.com
Thank you vrey much! But, when I run the code, both "Have" and "Want" data files only show ENROLID (i.e., DX1,...DX10 are not shown).
Furthermore, as I mentioned in my first message, the data provided is just a part of the original data (with almost 10 million observations). So, if I follow your approach, how can I accomodate the data?
Thanks again!
When I run the code I get enrollid, dx1-dx10, and category. You would have to show your log.
Should work regardless of how many records there are.
Art, CEO, AnalystFinder.com
I was able to find the answer. I used a part of your code. That DEFINITELY helped. Thanks a lot!
I am doing a similar analysis and would appreciate if you can post your solution.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.