Data Manipulation

newsas007 · Posted 03-17-2021 06:36 PM

Hi there,

I have data that has pre and post indo. I need a new data that should not include conditions that occurred in pre. For example, ID#1 has Anemia and Diarrhea in pre and post both. The new data needs to delete these observations.

Data have:

ID	DateX	Conditions	Type
1	12/25/2019	Anemia	Post
1	11/5/2018	Colitis	Post
1	11/5/2018	Diarrhea	Post
1	11/24/2019	Kidney failure/disease	Post
1	12/12/2018	Neuro	Post
1	5/20/2017	Anemia	Pre
1	5/20/2017	Diarrhea	Pre
1	9/14/2017	Myositis	Pre
1	4/24/2017	Polyneuropathy	Pre
2	8/19/2020	Adrenal/Adrenocortical	Post
2	3/23/2020	Colitis	Post
2	10/7/2019	Neutropenia	Post
3	7/9/2017	Dermatitis/Rash	Post
3	4/22/2018	Liver failure/disease	Post
3	4/5/2015	Myositis	Post
3	6/8/2015	Neuro	Post

Data Want:

ID	DateX	Conditions	Type
1	11/5/2018	Colitis	Post
1	11/24/2019	Kidney failure	Post
1	12/12/2018	Neuro	Post
1	9/14/2017	Myositis	Pre
1	4/24/2017	Polyneuropathy	Pre
2	8/19/2020	Adrenal	Post
2	3/23/2020	Colitis	Post
2	10/7/2019	Neutropenia	Post
3	7/9/2017	Dermatitis/Rash	Post
3	4/22/2018	Liver failure/disease	Post
3	4/5/2015	Myositis	Post
3	6/8/2015	Neuro	Post

Shmuel · Posted 03-17-2021 07:16 PM

It seems to me that you did not delete the required observations from WANT.

Check next code and adapt to the right requirements:

data have;
  infile cards dlm='09'x truncover;
  input ID DateX :mmddyy10. Conditions $ Type $;
  format datex date9.;
cards;
1	12/25/2019	Anemia	Post
1	11/5/2018	Colitis	Post
1	11/5/2018	Diarrhea	Post
1	11/24/2019	Kidney failure/disease	Post
1	12/12/2018	Neuro	Post
1	5/20/2017	Anemia	Pre
1	5/20/2017	Diarrhea	Pre
1	9/14/2017	Myositis	Pre
1	4/24/2017	Polyneuropathy	Pre
2	8/19/2020	Adrenal/Adrenocortical	Post
2	3/23/2020	Colitis	Post
2	10/7/2019	Neutropenia	Post
3	7/9/2017	Dermatitis/Rash	Post
3	4/22/2018	Liver failure/disease	Post
3	4/5/2015	Myositis	Post
3	6/8/2015	Neuro	Post
; run;

proc sort data=have;
  by ID conditions type;
run;
data want;
 merge have(where=(type='Pre') in=in1)
       have(where=(type='Post') in=in2);
  by ID conditions type;
     if not (in1 and in2);  
run;

ballardw · Posted 03-17-2021 07:33 PM

@newsas007 wrote:

Hi there,

I have data that has pre and post indo. I need a new data that should not include conditions that occurred in pre. For example, ID#1 has Anemia and Diarrhea in pre and post both. The new data needs to delete these observations.

Data have:

ID DateX Conditions Type

1 12/25/2019 Anemia Post

1 11/5/2018 Colitis Post

1 11/5/2018 Diarrhea Post

1 11/24/2019 Kidney failure/disease Post

1 12/12/2018 Neuro Post

1 5/20/2017 Anemia Pre

1 5/20/2017 Diarrhea Pre

1 9/14/2017 Myositis Pre

1 4/24/2017 Polyneuropathy Pre

2 8/19/2020 Adrenal/Adrenocortical Post

2 3/23/2020 Colitis Post

2 10/7/2019 Neutropenia Post

3 7/9/2017 Dermatitis/Rash Post

3 4/22/2018 Liver failure/disease Post

3 4/5/2015 Myositis Post

3 6/8/2015 Neuro Post

Data Want:

ID DateX Conditions Type

1 12/25/2019 Anemia Post

1 11/5/2018 Colitis Post

1 11/5/2018 Diarrhea Post

1 11/24/2019 Kidney failure/disease Post

1 12/12/2018 Neuro Post

1 5/20/2017 Anemia Pre

1 5/20/2017 Diarrhea Pre

1 9/14/2017 Myositis Pre

1 4/24/2017 Polyneuropathy Pre

2 8/19/2020 Adrenal/Adrenocortical Post

2 3/23/2020 Colitis Post

2 10/7/2019 Neutropenia Post

3 7/9/2017 Dermatitis/Rash Post

3 4/22/2018 Liver failure/disease Post

3 4/5/2015 Myositis Post

3 6/8/2015 Neuro Post

Please double check your "want". It looks exactly like Have to me.

Provide your data as data step code pasted into a text box using the </> icon on the forum such as

data have;
   infile datalines dlm=',';
   input ID $ DateX :mmddyy10. Conditions :$25. Type :$4.;
   format datex mmddyy10.;
datalines;
1,12/25/2019,Anemia,Post
1,11/5/2018,Colitis,Post
1,11/5/2018,Diarrhea,Post
1,11/24/2019,Kidney failure/disease,Post
1,12/12/2018,Neuro,Post
1,5/20/2017,Anemia,Pre
1,5/20/2017,Diarrhea,Pre
1,9/14/2017,Myositis,Pre
1,4/24/2017,Polyneuropathy,Pre
2,8/19/2020,Adrenal/Adrenocortical,Post
2,3/23/2020,Colitis,Post
2,10/7/2019,Neutropenia,Post
3,7/9/2017,Dermatitis/Rash,Post
3,4/22/2018,Liver failure/disease,Post
3,4/5/2015,Myositis,Post
3,6/8/2015,Neuro,Post
;

I can't tell if you want just the "pre" , just the "post" or both removed.

The following code removes both.

proc sort data=have;
   by id conditions datex;
run;

data want;
   set have;
   by id conditions;
   if type='Pre' and first.conditions and not (last.conditions) then delete;
   if type='Post' and not(first.conditions) and last.conditions then delete;
run;

You can remove one of the "if" that you don't need.

This works by getting like values of Conditions in chronological order for each ID. If your Datex is not a numeric date value then you need to make a date value, preferably with an appropriate format for legibility.

The BY statement in the Want data step code creates automatic true/false variables for each of the variables on the By statement. So each record can tell whether the Condition is the first time that appears for each id. If a value is the only one then both First and Last are true.

Warning: If you have 3 values of the same condition for an ID this is likely to not do what you want.

Also if the "Pre" and "Post" are incorrect for chronological order then this will fail.

Both of the above would require a more complete example data, example results and likely additional actual description of rules involved.

mkeintz · Posted 03-18-2021 02:10 AM

You apparently want conditions that appear exclusively in pre, or exclusively in post:

data have;
   infile datalines dlm=',';
   input ID $ DateX :mmddyy10. Conditions :$25. Type :$4.;
   format datex mmddyy10.;
datalines;
1,12/25/2019,Anemia,Post
1,11/5/2018,Colitis,Post
1,11/5/2018,Diarrhea,Post
1,11/24/2019,Kidney failure/disease,Post
1,12/12/2018,Neuro,Post
1,5/20/2017,Anemia,Pre
1,5/20/2017,Diarrhea,Pre
1,9/14/2017,Myositis,Pre
1,4/24/2017,Polyneuropathy,Pre
2,8/19/2020,Adrenal/Adrenocortical,Post
2,3/23/2020,Colitis,Post
2,10/7/2019,Neutropenia,Post
3,7/9/2017,Dermatitis/Rash,Post
3,4/22/2018,Liver failure/disease,Post
3,4/5/2015,Myositis,Post
3,6/8/2015,Neuro,Post
run;

data want;
  set have;
  if _n_=1 then do;
    declare hash pre_conditions (dataset:'have (where=(type="Pre")');
      pre_conditions.definekey('id','conditions');
      pre_conditions.definedone();
    declare hash post_conditions (dataset:'have (where=(type="Post")');
      post_conditions.definekey('id','conditions');
      post_conditions.definedone();
  end;
  if (type='Pre' and post_conditions.check()^=0)  or
     (type='Post' and pre_conditions.check()^=0);
run;

This creates a hash object (think lookup table) for all the pre's, and a separate hash object for all the post's. Then for each incoming pre record, keep it only if that id/condition doesn't appear in the post hash. And vice versa.

or, slightly more efficient, because it constructs one large hash object instead of two smaller ones:

data want;
  set have;
  if _n_=1 then do;
    declare hash all_conditions (dataset:'have');
      all_conditions.definekey('id','conditions','type');
      all_conditions.definedone();
  end;
  if (type='Pre' and all_conditions.check(key:id,key:conditions,key:'Post')^=0) or
     (type='Post' and all_conditions.check(key:id,key:conditions,key:'Pre')^=0) ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Data Manipulation

Re: Data Manipulation

Re: Data Manipulation

Re: Data Manipulation

Data Manipulation

Re: Data Manipulation

Re: Data Manipulation

Re: Data Manipulation

Registration is open

SAS Training: Just a Click Away