Solved: Re: Add a new column based on the entries of other columns

Alireza_Boloori · Posted 07-03-2017 09:44 PM

Thank you very much in advance!

I use SAS 9.4. I want to add a column to my sas dataset. Generally, I know how it can be done. However, the condition under which I put values in the new column is complicated. So. I had to ask this question. I also apoligize if my statements for describing the problem are not very straightforward.

My original dataset is very large, so I provided a part of it here:

ENROLID	DX1	DX2	DX3	DX4	DX5	DX6	DX7	DX8	DX9	DX10
133058304	73730	51889	V151	73739	V454	V5489	V5849	4271	4589	7452
133058304	51884	7452	75672	4660	51889	7931	4293	78609	5180
133058304	51881	2761	5180	51889	7931	7452	49390	4589	486	78609
133062302	486	78060	78701	481	78650
133062601	8208	2720	4010	41400	49390	8209	2724	71945	4019
133062601	41400	4111	496	41070	4019
133062601	1623	4019	41400	1629	7866	7856
133064602	40311	5849	5855	27651	4280	5859	5939	32723	25040	496
133064602	5849	2762	40391	586	78050	78079	V5882	4293	2449	25000
133064602	99673	40391	5856	78650	25000	5859	586	78659
133068301	V5811	20150	2859	20280	45386	78701
133068301	V5811	20151	78060	20150	2768
133068301	28800	20190	2874	78061
133068301	20158	4233	486	20150	20190	2859	78060	20280	42789	51889

In this data, I want to add a new column based on the values under variables "DX1, ..., DX10."

For each observation (i.e., row), if any of the values under these variables (i.e., DX1, ..., DX10) follow the numbers provided in the table below, then the value of the new column (for that specific row) will be one of 7 categories (shown in the excel file). So, to make it clear, the elements of the new column should be one of those 7 categories.

category 1	category 2	category 3	category 4	category 5	category 6	category 7
30780	2761	30400	30550	5859	96501	5855
4019	20158	51889	30551	78650	E8500	7452
25000	33900	30402	30552	96502		51883
30789	33901	30403	30553	96509		7489
3380	33902	30470		E8501		78600
33811	33905	30471				7991
33812	33910	30472				99739
33818	33911	30473
33819	33912
33821	33920
33828
33829
3384
3502
37991

I will be very thankful if anyone can help me in this regard!

art297 · Posted 07-03-2017 10:32 PM

Here is one way:

data have;
  infile cards dlm='09'x truncover;
  informat ENROLID $9.;
  input ENROLID	(DX1	DX2	DX3	DX4	DX5	DX6	DX7	DX8	DX9	DX10) ($);
  cards;
133058304	73730	51889	V151	73739	V454	V5489	V5849	4271	4589	7452
133058304	51884	7452	75672	4660	51889	7931	4293	78609	5180	 
133058304	51881	2761	5180	51889	7931	7452	49390	4589	486	78609
133062302	486	78060	78701	481	78650	 	 	 	 	 
133062601	8208	2720	4010	41400	49390	8209	2724	71945	4019	 
133062601	41400	4111	496	41070	4019	 	 	 	 	 
133062601	1623	4019	41400	1629	7866	7856	 	 	 	 
133064602	40311	5849	5855	27651	4280	5859	5939	32723	25040	496
133064602	5849	2762	40391	586	78050	78079	V5882	4293	2449	25000
133064602	99673	40391	5856	78650	25000	5859	586	78659	 	 
133068301	V5811	20150	2859	20280	45386	78701	 	 	 	 
133068301	V5811	20151	78060	20150	2768	 	 	 	 	 
133068301	28800	20190	2874	78061	 	 	 	 	 	 
133068301	20158	4233	486	20150	20190	2859	78060	20280	42789	51889
;

data want (drop=i);
  array c1(15) $ _temporary_ ('30780' '4019' '25000' '30789' '3380' '33811' '33812' '33818'
                 '33819' '33821' '33828' '33829' '3384' '3502' '37991');
  array c2(10) $ _temporary_ ('2761' '20158' '33900' '33901' '33902' '33905' '33910' '33911' '33912' '33920');
  array c3(8) $ _temporary_ ('30400' '51889' '30402' '30403' '30470' '30471' '30472' '30473');
  array c4(4) $ _temporary_ ('30550' '30551' '30552' '30553');
  array c5(5) $ _temporary_ ('5859' '78650' '96502' '96509' 'E8501');
  array c6(2) $ _temporary_ ('96501' 'E8500');
  array c7(7) $ _temporary_ ('5855' '7452' '51883' '7489' '78600' '7991' '99739');
  array dx(*) $  dx1-dx10;
  set have;
  do i=1 to dim(dx);
    if dx(i) in c1 then category=1;
    else if dx(i) in c2 then category=2;
    else if dx(i) in c3 then category=2;
    else if dx(i) in c4 then category=2;
    else if dx(i) in c5 then category=2;
    else if dx(i) in c6 then category=2;
    else if dx(i) in c7 then category=2;
    else call missing(category);
    if not missing(category) then leave;
  end;
run;

Art, CEO, AnalystFinder.com

View solution in original post

art297 · Posted 07-03-2017 09:53 PM

What if multiple entries (in a row) match the values of two or more columns?

Art, CEO, AnalystFinder.com

Alireza_Boloori · Posted 07-03-2017 09:55 PM

In the actual data, that won't happen (categories are mutually exclusive). But, if that generally happens, we assign one category at random.

art297 · Posted 07-03-2017 10:32 PM

Here is one way:

data have;
  infile cards dlm='09'x truncover;
  informat ENROLID $9.;
  input ENROLID	(DX1	DX2	DX3	DX4	DX5	DX6	DX7	DX8	DX9	DX10) ($);
  cards;
133058304	73730	51889	V151	73739	V454	V5489	V5849	4271	4589	7452
133058304	51884	7452	75672	4660	51889	7931	4293	78609	5180	 
133058304	51881	2761	5180	51889	7931	7452	49390	4589	486	78609
133062302	486	78060	78701	481	78650	 	 	 	 	 
133062601	8208	2720	4010	41400	49390	8209	2724	71945	4019	 
133062601	41400	4111	496	41070	4019	 	 	 	 	 
133062601	1623	4019	41400	1629	7866	7856	 	 	 	 
133064602	40311	5849	5855	27651	4280	5859	5939	32723	25040	496
133064602	5849	2762	40391	586	78050	78079	V5882	4293	2449	25000
133064602	99673	40391	5856	78650	25000	5859	586	78659	 	 
133068301	V5811	20150	2859	20280	45386	78701	 	 	 	 
133068301	V5811	20151	78060	20150	2768	 	 	 	 	 
133068301	28800	20190	2874	78061	 	 	 	 	 	 
133068301	20158	4233	486	20150	20190	2859	78060	20280	42789	51889
;

data want (drop=i);
  array c1(15) $ _temporary_ ('30780' '4019' '25000' '30789' '3380' '33811' '33812' '33818'
                 '33819' '33821' '33828' '33829' '3384' '3502' '37991');
  array c2(10) $ _temporary_ ('2761' '20158' '33900' '33901' '33902' '33905' '33910' '33911' '33912' '33920');
  array c3(8) $ _temporary_ ('30400' '51889' '30402' '30403' '30470' '30471' '30472' '30473');
  array c4(4) $ _temporary_ ('30550' '30551' '30552' '30553');
  array c5(5) $ _temporary_ ('5859' '78650' '96502' '96509' 'E8501');
  array c6(2) $ _temporary_ ('96501' 'E8500');
  array c7(7) $ _temporary_ ('5855' '7452' '51883' '7489' '78600' '7991' '99739');
  array dx(*) $  dx1-dx10;
  set have;
  do i=1 to dim(dx);
    if dx(i) in c1 then category=1;
    else if dx(i) in c2 then category=2;
    else if dx(i) in c3 then category=2;
    else if dx(i) in c4 then category=2;
    else if dx(i) in c5 then category=2;
    else if dx(i) in c6 then category=2;
    else if dx(i) in c7 then category=2;
    else call missing(category);
    if not missing(category) then leave;
  end;
run;

Art, CEO, AnalystFinder.com

Alireza_Boloori · Posted 07-04-2017 12:02 PM

Thank you vrey much! But, when I run the code, both "Have" and "Want" data files only show ENROLID (i.e., DX1,...DX10 are not shown).

Furthermore, as I mentioned in my first message, the data provided is just a part of the original data (with almost 10 million observations). So, if I follow your approach, how can I accomodate the data?

Thanks again!

art297 · Posted 07-04-2017 12:15 PM

When I run the code I get enrollid, dx1-dx10, and category. You would have to show your log.

Should work regardless of how many records there are.

Art, CEO, AnalystFinder.com

Alireza_Boloori · Posted 07-04-2017 12:18 PM

I was able to find the answer. I used a part of your code. That DEFINITELY helped. Thanks a lot!

mmx5c · Posted 07-21-2019 05:52 PM

I am doing a similar analysis and would appreciate if you can post your solution.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away