- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I want to keep only those id's having a flag=1 for two or more consecutive months for the same variables for the first as well as the second month. You see some id have a flag=1 for different variables in the first and second month.
I want only the same variables in the first month to contribute to the case not a different variable in the second month.
Based on the data below, for instance id 1 have flag=1 for var 6 and 7 at month 7 and 8.... this perfectly applies to my case.
if you see id 2, there is a flag=1 for var 4 and 6 at month 12 2012, but in the next month, i.e. month 1 of 2013, the value for var 6 is 0 not 1 that leads id 2 to be excluded in the output.
The output desired in the dataset have below is only id '1'.
data have;
input id $ year month var1-var7;
datalines;
1 2012 7 0 0 0 0 1 1 1
1 2012 8 0 0 0 0 0 1 1
1 2012 9 1 0 0 0 0 0 0
1 2012 10 0 0 1 0 0 0 0
1 2012 11 0 1 0 1 0 1 0
1 2012 12 0 0 0 0 0 0 1
1 2013 1 0 0 1 1 1 0 0
2 2012 7 0 0 0 0 0 0 0
2 2012 8 0 0 0 0 0 0 0
2 2012 9 0 0 0 0 0 0 0
2 2012 10 0 0 0 0 0 0 0
2 2012 11 0 0 0 0 0 0 0
2 2012 12 0 0 0 1 0 1 0
2 2013 1 0 0 1 1 1 0 0
2 2013 3 1 1 0 0 0 0 0
3 2012 7 0 0 0 0 1 1 1
3 2012 8 0 0 0 0 0 0 0
3 2012 9 0 0 0 0 0 0 0
3 2012 10 0 0 1 0 0 0 0
4 2012 11 0 1 0 0 0 1 0
4 2012 12 0 0 0 0 0 0 0
4 2013 1 1 0 0 0 0 1 0
4 2013 4 0 0 1 1 0 0 0
;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You've now clarified your object. For any pair of consecutive observations within an ID, you want to assign flag=1 for both of those observations if they have at least 2 variables with consecutive values of 1.
This is a single step solution that, for each ID, reads the series of observations twice. In the first pass, each record F=1,2,3, ... is compared to the prior record and a temporary variable consec_ones{F} counts the number of vars with a value of 1 in the current and prior record.
The second pass, rereads the series, and for each record (S=1,2,3) and if the current record has consec_ones>1 or the next record has consec_ones > 1 then set the flag:
data have;
input id $ year month var1-var7 ;
datalines;
1 2012 7 0 0 0 0 1 1 1
1 2012 8 0 0 0 0 0 1 1
1 2012 9 1 0 0 0 0 0 0
1 2012 10 0 0 1 0 0 0 0
1 2012 11 0 1 0 1 0 1 0
1 2012 12 0 0 0 0 0 0 1
1 2013 1 0 0 1 1 1 0 0
2 2012 6 0 0 1 1 1 0 0
2 2012 7 0 0 0 0 0 0 0
2 2012 8 0 0 0 0 0 0 0
2 2012 9 0 0 0 0 0 0 0
2 2012 10 0 0 0 0 0 0 0
2 2012 11 0 0 0 0 0 0 0
2 2012 12 0 0 0 1 0 1 0
2 2013 1 0 0 1 1 1 0 0
2 2013 3 1 1 0 0 0 0 0
3 2012 7 0 0 0 0 1 1 1
3 2012 8 0 0 0 0 0 0 0
3 2012 9 0 0 0 0 0 0 0
3 2012 10 0 0 1 0 0 0 0
4 2012 11 0 1 0 0 0 1 0
4 2012 12 0 0 0 0 0 0 0
4 2013 1 1 0 0 0 0 1 0
4 2013 4 0 0 1 1 0 0 0
run;
data want (drop=v f s);
set have (in=firstpass)
have (in=secondpass);
by id;
array consec_ones{30} _temporary_;
array var{*} var: ;
if first.id then call missing(F,S,of consec_ones{*});
if firstpass then do;
F+1;
consec_ones{F}=0;
do v=1 to dim(var);
if dif(var{v})=0 and var{v}=1 then consec_ones{F}+1;
end;
end;
if first.id then consec_ones{1}=0;
if secondpass;
S+1;
flag= max(consec_ones{S},consec_ones{S+1})>1;
run;
Edit: I added the if first.id then consec_ones{1}=0; statement to eliminate artificial creation of consec_one{1} from comparing the beginning of one ID to its predecessor at the end of the previous ID.
Notes:
- Set the size of the array consec_ones to at least 1 greater than the largest number of observations expected for any ID. I used 30, meaning I expected no more that 29 records for any ID.
- F=1,2,3,... for record sequence within the first pass through an ID. S=1,2,3 for the same during the second pass.
- The DIF(x) function is equivalent to X-lag(X). So if DIF(x)=0 and the current value of X=1, then you've revealed consecutive 1's.
The major "trick" here is to interleave dataset HAVE with itself - the pair of statements SET HAVE (in=firstpass) HAVE (in=secondpass); BY ID; .
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
ID1, included because of these records? Even though Var5 is not the same?
1 2012 7 0 0 0 0 1 1 1
1 2012 8 0 0 0 0 0 1 1
ID2, not included even though VAR4 is the same?
2 2012 12 0 0 0 1 0 1 0
2 2013 1 0 0 1 1 1 0 0
This is inconsistent with the rule you've specified, so you either need to clarify your data or your rules.
I want to keep only those id's having a flag=1 for two or more consecutive months for the same variables for the first as well as the second month.
My translation of your criteria:
- Flag=1
- On two or more consecutive months
- On same variable
@Abimal_Zippi wrote:
I want to keep only those id's having a flag=1 for two or more consecutive months for the same variables for the first as well as the second month. You see some id have a flag=1 for different variables in the first and second month.
I want only the same variables in the first month to contribute to the case not a different variable in the second month.
Based on the data below, for instance id 1 have flag=1 for var 6 and 7 at month 7 and 8.... this perfectly applies to my case.
if you see id 2, there is a flag=1 for var 4 and 6 at month 12 2012, but in the next month, i.e. month 1 of 2013, the value for var 6 is 0 not 1 that leads id 2 to be excluded in the output.
The output desired in the dataset have below is only id '1'.
data have; input id $ year month var1-var7; datalines; 1 2012 7 0 0 0 0 1 1 1 1 2012 8 0 0 0 0 0 1 1 1 2012 9 1 0 0 0 0 0 0 1 2012 10 0 0 1 0 0 0 0 1 2012 11 0 1 0 1 0 1 0 1 2012 12 0 0 0 0 0 0 1 1 2013 1 0 0 1 1 1 0 0 2 2012 7 0 0 0 0 0 0 0 2 2012 8 0 0 0 0 0 0 0 2 2012 9 0 0 0 0 0 0 0 2 2012 10 0 0 0 0 0 0 0 2 2012 11 0 0 0 0 0 0 0 2 2012 12 0 0 0 1 0 1 0 2 2013 1 0 0 1 1 1 0 0 2 2013 3 1 1 0 0 0 0 0 3 2012 7 0 0 0 0 1 1 1 3 2012 8 0 0 0 0 0 0 0 3 2012 9 0 0 0 0 0 0 0 3 2012 10 0 0 1 0 0 0 0 4 2012 11 0 1 0 0 0 1 0 4 2012 12 0 0 0 0 0 0 0 4 2013 1 1 0 0 0 0 1 0 4 2013 4 0 0 1 1 0 0 0 ;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you,
The difference here, in id 1 there is a flag=1 for two or more variables where the names of the variables is same for the two consecutive months , but in id 2 there is a flag=1 for two or more variables but the name of the variables are different, so I will not count id2.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Please try this:
proc sort data = have out=sorted;
by id year month;
run;
data want(drop=i);
set sorted;
cnt=0;
array v(*) var1--var7;
do i = 1 to dim(v)-1;
if (v(i)=1 and v(i+1)=1) then cnt=cnt+1;
end;
if cnt>=1;
run;
proc transpose data=want out=final(where=(col1 ne 0));
by id year month;
var var1--var7;
run;
proc sql;
select id, year, _name_,sum(col1) from final
group by id, year, _name_
having mean(month)=median(month) and sum(col1)>1;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You've now clarified your object. For any pair of consecutive observations within an ID, you want to assign flag=1 for both of those observations if they have at least 2 variables with consecutive values of 1.
This is a single step solution that, for each ID, reads the series of observations twice. In the first pass, each record F=1,2,3, ... is compared to the prior record and a temporary variable consec_ones{F} counts the number of vars with a value of 1 in the current and prior record.
The second pass, rereads the series, and for each record (S=1,2,3) and if the current record has consec_ones>1 or the next record has consec_ones > 1 then set the flag:
data have;
input id $ year month var1-var7 ;
datalines;
1 2012 7 0 0 0 0 1 1 1
1 2012 8 0 0 0 0 0 1 1
1 2012 9 1 0 0 0 0 0 0
1 2012 10 0 0 1 0 0 0 0
1 2012 11 0 1 0 1 0 1 0
1 2012 12 0 0 0 0 0 0 1
1 2013 1 0 0 1 1 1 0 0
2 2012 6 0 0 1 1 1 0 0
2 2012 7 0 0 0 0 0 0 0
2 2012 8 0 0 0 0 0 0 0
2 2012 9 0 0 0 0 0 0 0
2 2012 10 0 0 0 0 0 0 0
2 2012 11 0 0 0 0 0 0 0
2 2012 12 0 0 0 1 0 1 0
2 2013 1 0 0 1 1 1 0 0
2 2013 3 1 1 0 0 0 0 0
3 2012 7 0 0 0 0 1 1 1
3 2012 8 0 0 0 0 0 0 0
3 2012 9 0 0 0 0 0 0 0
3 2012 10 0 0 1 0 0 0 0
4 2012 11 0 1 0 0 0 1 0
4 2012 12 0 0 0 0 0 0 0
4 2013 1 1 0 0 0 0 1 0
4 2013 4 0 0 1 1 0 0 0
run;
data want (drop=v f s);
set have (in=firstpass)
have (in=secondpass);
by id;
array consec_ones{30} _temporary_;
array var{*} var: ;
if first.id then call missing(F,S,of consec_ones{*});
if firstpass then do;
F+1;
consec_ones{F}=0;
do v=1 to dim(var);
if dif(var{v})=0 and var{v}=1 then consec_ones{F}+1;
end;
end;
if first.id then consec_ones{1}=0;
if secondpass;
S+1;
flag= max(consec_ones{S},consec_ones{S+1})>1;
run;
Edit: I added the if first.id then consec_ones{1}=0; statement to eliminate artificial creation of consec_one{1} from comparing the beginning of one ID to its predecessor at the end of the previous ID.
Notes:
- Set the size of the array consec_ones to at least 1 greater than the largest number of observations expected for any ID. I used 30, meaning I expected no more that 29 records for any ID.
- F=1,2,3,... for record sequence within the first pass through an ID. S=1,2,3 for the same during the second pass.
- The DIF(x) function is equivalent to X-lag(X). So if DIF(x)=0 and the current value of X=1, then you've revealed consecutive 1's.
The major "trick" here is to interleave dataset HAVE with itself - the pair of statements SET HAVE (in=firstpass) HAVE (in=secondpass); BY ID; .
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------