HI, In the dataset below, The dataset has values Level1, Level2, Level3, Level4 and Blank for the variables named MONTH. I would like to keep only rows with one or more values missing but missing has to be consecutive. i.e. Delete ID 3,6,13,14,21 and keep the rest.
In the new dataset created I would like to keep only those with value as Level1 and consecutive missing values i.e. Delete ID 9 and 23.
Here is the sample of my dataset.
ID | Store | Month 1 | Month 2 | Month 3 | Month 4 | Month 5 | Month 6 | Month 7 | Month 8 |
1 | A | Level1 | Level1 | Level1 | Level1 | ||||
2 | A | Level1 | Level1 | ||||||
3 | B | Level1 | Level1 | Level1 | Level1 | Level1 | |||
4 | C | Level1 | Level1 | Level2 | Level2 | Level2 | |||
5 | B | Level1 | |||||||
6 | B | Level1 | Level1 | Level1 | Level3 | Level3 | Level1 | Level1 | |
7 | A | Level1 | Level1 | ||||||
8 | A | Level1 | Level1 | Level1 | Level1 | Level1 | |||
9 | D | Level1 | Level4 | Level4 | Level4 | Level4 | Level4 | Level4 | |
10 | D | Level1 | Level1 | Level1 | |||||
11 | E | Level1 | |||||||
12 | E | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | |
13 | D | Level1 | Level1 | ||||||
14 | B | Level1 | Level1 | Level1 | Level3 | Level3 | Level3 | Level3 | |
15 | A | Level1 | Level1 | Level1 | Level1 | Level1 | |||
16 | D | Level1 | Level1 | ||||||
17 | B | Level1 | Level1 | Level1 | |||||
18 | E | Level1 | Level1 | Level1 | |||||
19 | F | Level1 | Level1 | Level1 | Level1 | Level1 | |||
20 | D | Level1 | Level1 | Level1 | Level1 | Level1 | |||
21 | A | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | |
22 | A | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | |
23 | D | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level2 | |
24 | E | Level1 | Level1 |
what source dataset are you testing against.
I tested again for you, here is my log
3952 data have;
3953 input ID Store$ (Month1-Month8)($);
3954 infile datalines dsd missover;
3955 datalines;
NOTE: The data set WORK.HAVE has 24 observations and 10 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
3980 ;
3981
3982 data want;
3983 set have;
3984 array t(*) month:;
3985 k=whichc(' ', of t(*));
3986 if k>1 then do;
3987 do _n_=dim(t) by -1 to k+1;
3988 if not missing(t(_n_)) then do; _t=1;leave;end;
3989 end;
3990 end;
3991 if _t then delete;
3992 drop _t;
3993 run;
NOTE: There were 24 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 19 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
This is my output:
ID | Store | Month1 | Month2 | Month3 | Month4 | Month5 | Month6 | Month7 | Month8 |
1 | A | Level1 | Level1 | Level1 | Level1 | ||||
2 | A | Level1 | Level1 | ||||||
4 | C | Level1 | Level1 | Level2 | Level2 | Level2 | |||
5 | B | Level1 | |||||||
7 | A | Level1 | Level1 | ||||||
8 | A | Level1 | Level1 | Level1 | Level1 | Level1 | |||
9 | D | Level1 | Level4 | Level4 | Level4 | Level4 | Level4 | Level4 | |
10 | D | Level1 | Level1 | Level1 | |||||
11 | E | Level1 | |||||||
12 | E | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | |
15 | A | Level1 | Level1 | Level1 | Level1 | Level1 | |||
16 | D | Level1 | Level1 | ||||||
17 | B | Level1 | Level1 | Level1 | |||||
18 | E | Level1 | Level1 | Level1 | |||||
19 | F | Level1 | Level1 | Level1 | Level1 | Level1 | |||
20 | D | Level1 | Level1 | Level1 | Level1 | Level1 | |||
22 | A | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | |
23 | D | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level2 | |
24 | E | Level1 | Level1 |
Just to understand your logic, why is ID=2 not deleted? It has a consecutive missing value for Month 5?
If variables MONTH has a value and then has a consecutive missing values then I want to keep it so ID 2 is a keep and so is 1,4,5 and so on.
Ok. I think you are looking for something like this
data have;
input ID Store$ (Month1-Month8)($);
infile datalines dsd missover;
datalines;
1,A,Level1,Level1,Level1,Level1,,,,
2,A,Level1,Level1,,,,,,
3,B,Level1,Level1,,,Level1,,Level1,Level1
4,C,Level1,Level1,Level2,Level2,Level2,,,
5,B,Level1,,,,,,,
6,B,Level1,,Level1,Level1,Level3,Level3,Level1,Level1
7,A,Level1,Level1,,,,,,
8,A,Level1,Level1,Level1,Level1,Level1,,,
9,D,Level1,Level4,Level4,Level4,Level4,Level4,Level4,
10,D,Level1,Level1,Level1,,,,,
11,E,Level1,,,,,,,
12,E,Level1,Level1,Level1,Level1,Level1,Level1,Level1,
13,D,Level1,,Level1,,,,,
14,B,Level1,Level1,Level1,,Level3,Level3,Level3,Level3
15,A,Level1,Level1,Level1,Level1,Level1,,,
16,D,Level1,Level1,,,,,,
17,B,Level1,Level1,Level1,,,,,
18,E,Level1,Level1,Level1,,,,,
19,F,Level1,Level1,Level1,Level1,Level1,,,
20,D,Level1,Level1,Level1,Level1,Level1,,,
21,A,Level1,Level1,,Level1,Level1,Level1,Level1,Level1
22,A,Level1,Level1,Level1,Level1,Level1,Level1,Level1,
23,D,Level1,Level1,Level1,Level1,Level1,Level1,Level2,
24,E,Level1,Level1,,,,,,
;
data want(drop=i);
set have;
array months{*} Month1-Month8;
do i=1 to dim(months)-1;
if (months[i] = '' and months[i+1] ne '') then delete;
end;
run;
HI, Thanks so much.
I think this is working but the problem I found is that in my original dataset I have 947 Obs.
When I run your code below, I got 842 Obs which is fine.
data want(drop=i);
set have;
array months{*} Month1-Month8; do i=1 to dim(months)-1;
if (months[i] = '' and months[i+1] ne '') then delete;
end;
run;
BUT then I run following code, to check the deleted obs, but I get 107 Obs. 107 + 842 does not total 947 obs.
data want(drop=i);
set have;
array months{*} Month1-Month8; do i=1 to dim(months)-1;
if (months[i] = '' and months[i+1] ne '') then Output;
end;
run;
@d0816, that is because the two are not additive. The contition
if (months[i] = '' and months[i+1] ne '')
can be true more than once for the same observation. However, you can delete an observation only once (Your first data step), while you can output an observation more than once (Your second data step).
You can check this by looking at the data created by your second data step. I suspect than you have multiple lines with the same ID.
The solution works for the first part of my original question. Solution by novisnorin also worked for the first part of my question. Thanks a lot.
something like this
data abc;
infile datalines missover dlm ='09'x;
input
ID 1-2 Store $ 3-4 Month1 $ Month2 $ Month3 $ Month4 $ Month5 $ Month6 $ Month7 $ Month8 $;
datalines;
1 A Level1 Level1 Level1 Level1
2 A Level1 Level1
3 B Level1 Level1 Level1 Level1 Level1
4 C Level1 Level1 Level2 Level2 Level2
5 B Level1
6 B Level1 Level1 Level1 Level3 Level3 Level1 Level1
7 A Level1 Level1
8 Level1 Level1 Level1 Level1 Level1
9 D Level1 Level4 Level4 Level4 Level4 Level4 Level4
10 D Level1 Level1 Level1
11 E Level1
12 E Level1 Level1 Level1 Level1 Level1 Level1 Level1
13 D Level1 Level1
14 B Level1 Level1 Level1 Level3 Level3 Level3 Level3
15 A Level1 Level1 Level1 Level1 Level1
16 D Level1 Level1
17 B Level1 Level1 Level1
18 E Level1 Level1 Level1
19 F Level1 Level1 Level1 Level1 Level1
20 D Level1 Level1 Level1 Level1 Level1
21 A Level1 Level1 Level1 Level1 Level1 Level1 Level1
22 A Level1 Level1 Level1 Level1 Level1 Level1 Level1
23 D Level1 Level1 Level1 Level1 Level1 Level1 Level2
24 E Level1 Level1
;
data abss;
length newval $ 100;
set abc;
tot= 0;
array varsnew (8) $ newmonth1-newmonth8;
array vars (8) $ month1-month8;
do i = 1 to 8;
if vars{i} = '' then varsnew{i} = '-';
else varsnew{i} =vars{i};
if vars{i} not in ('Level1') then tot=tot+1;
else tot=tot+0;
end;
newval= cats(of newmonth1-newmonth8);
if index(newval,'--' )> 0 or tot >1 then delete;
drop i newmonth1-newmonth8 newval tot;
run;
some mathematical thinking to make it easy:
data have;
input ID Store$ (Month1-Month8)($);
infile datalines dsd missover;
datalines;
1,A,Level1,Level1,Level1,Level1,,,,
2,A,Level1,Level1,,,,,,
3,B,Level1,Level1,,,Level1,,Level1,Level1
4,C,Level1,Level1,Level2,Level2,Level2,,,
5,B,Level1,,,,,,,
6,B,Level1,,Level1,Level1,Level3,Level3,Level1,Level1
7,A,Level1,Level1,,,,,,
8,A,Level1,Level1,Level1,Level1,Level1,,,
9,D,Level1,Level4,Level4,Level4,Level4,Level4,Level4,
10,D,Level1,Level1,Level1,,,,,
11,E,Level1,,,,,,,
12,E,Level1,Level1,Level1,Level1,Level1,Level1,Level1,
13,D,Level1,,Level1,,,,,
14,B,Level1,Level1,Level1,,Level3,Level3,Level3,Level3
15,A,Level1,Level1,Level1,Level1,Level1,,,
16,D,Level1,Level1,,,,,,
17,B,Level1,Level1,Level1,,,,,
18,E,Level1,Level1,Level1,,,,,
19,F,Level1,Level1,Level1,Level1,Level1,,,
20,D,Level1,Level1,Level1,Level1,Level1,,,
21,A,Level1,Level1,,Level1,Level1,Level1,Level1,Level1
22,A,Level1,Level1,Level1,Level1,Level1,Level1,Level1,
23,D,Level1,Level1,Level1,Level1,Level1,Level1,Level2,
24,E,Level1,Level1,,,,,,
;
data want;
set have;
array t(*) month:;
k=whichc(' ', of t(*));
if k>1 then do;
do _n_=dim(t) by -1 to k+1;
if not missing(t(_n_)) then do; _t=1;leave;end;
end;
end;
if _t then delete;
drop _t k;
run;
Hi, got the following warning message when running the code:
data want;
set have;
array t(*) month:;
WARNING: Defining an array with zero elements
k=whichc(' ', of t(*));
if k>1 then do; do _n_=dim(t) by -1 to k+1;
if not missing(t(_n_)) then do; _t=1;leave;end;
end;
end;
if _t then delete;
drop _t;
run;
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
420:18
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WANT may be incomplete. When this step was stopped there
were 0 observations and 11 variables.
WARNING: Data set WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
what source dataset are you testing against.
I tested again for you, here is my log
3952 data have;
3953 input ID Store$ (Month1-Month8)($);
3954 infile datalines dsd missover;
3955 datalines;
NOTE: The data set WORK.HAVE has 24 observations and 10 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
3980 ;
3981
3982 data want;
3983 set have;
3984 array t(*) month:;
3985 k=whichc(' ', of t(*));
3986 if k>1 then do;
3987 do _n_=dim(t) by -1 to k+1;
3988 if not missing(t(_n_)) then do; _t=1;leave;end;
3989 end;
3990 end;
3991 if _t then delete;
3992 drop _t;
3993 run;
NOTE: There were 24 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 19 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
This is my output:
ID | Store | Month1 | Month2 | Month3 | Month4 | Month5 | Month6 | Month7 | Month8 |
1 | A | Level1 | Level1 | Level1 | Level1 | ||||
2 | A | Level1 | Level1 | ||||||
4 | C | Level1 | Level1 | Level2 | Level2 | Level2 | |||
5 | B | Level1 | |||||||
7 | A | Level1 | Level1 | ||||||
8 | A | Level1 | Level1 | Level1 | Level1 | Level1 | |||
9 | D | Level1 | Level4 | Level4 | Level4 | Level4 | Level4 | Level4 | |
10 | D | Level1 | Level1 | Level1 | |||||
11 | E | Level1 | |||||||
12 | E | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | |
15 | A | Level1 | Level1 | Level1 | Level1 | Level1 | |||
16 | D | Level1 | Level1 | ||||||
17 | B | Level1 | Level1 | Level1 | |||||
18 | E | Level1 | Level1 | Level1 | |||||
19 | F | Level1 | Level1 | Level1 | Level1 | Level1 | |||
20 | D | Level1 | Level1 | Level1 | Level1 | Level1 | |||
22 | A | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | |
23 | D | Level1 | Level1 | Level1 | Level1 | Level1 | Level1 | Level2 | |
24 | E | Level1 | Level1 |
It worked for the first part of my original question. Thanks a lot.
did you try my answer, I think it works for 2nd part of question too
I don't know why my reply to your post is not being visible here. did you get my reply. I did it 3 times already.
Ok sorry, let me help you with the 2nd part ina bit. I didn't see it. too much coffee caffeine effect 😞
Your wrote:
"In the new dataset created I would like to keep only those with value as Level1 and consecutive missing values i.e. Delete ID 9 and 23."
can you explain more clearly as there many records with level1 and other levels? please
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.