Re: Count successsive missing values in a row or register

dmarques1998 · Posted 04-01-2021 10:33 AM

Hi,

Is it possible to count sucessive number os missing values, i.e in the 3rd line count the missing values between "E" from "x" to "E" from "x+1". Ideally, the ideia is to count every successive missing values, and if they are interupt by a name, i.e E, Ambas or D, start a new count. I have already the total of missing values as you can see in the attached photo, However I cannot find a specific code or Macro to use in that situation.

Tom · Posted 04-01-2021 10:44 AM

That is a beautiful picture of your ~~baby~~ desktop. But can you share a description of your data and some actual data. To share data just post a simple SAS data step that reads the data from in-line text (aka CARDS or DATALINES). That way we can just copy and paste the text from your posting into our local SAS session and have example data to use to propose solutions.

dmarques1998 · Posted 04-01-2021 11:16 AM

Sorry, I'm new here. How can I copy the Data and put here?

Tom · Posted 04-01-2021 11:47 AM

@dmarques1998 wrote:
Sorry, I'm new here. How can I copy the Data and put here?

There are some posts for how to do this. I created this macro tool you could try and use.

If you have a large dataset a simple data step can generate some text to use. Take advantage of the OBS= dataset option to limit the number of observations.

data _null_;
  file log dsd ;
  set have (obs=10);
  put (_all_) (+0);
run;

Then just copy the text from the SAS log and you have the data for the first 10 observations in your existing HAVE dataset.

But you can just make up some simple data that demonstrates the issue. In that case why not jsut type it into the program editor.

data example;
  input id var1 var2 var3;
cards;
1 100 . 200
2 . . 300
;

A real advantage doing it this way is you can create some example data that clearly show the types of issues that are making the problem hard for you to solve. Much easier than trying to identify observations of the real data that have those situations represented.

Sometimes just taking the time to come up with good test data will help you clarify in your mind what the issue is and how to solve it.

dmarques1998 · Posted 04-01-2021 12:17 PM

As a hypothetical situation :

data example;
  input id var1 var2 var3 var4 var5 var6 var7 var8 var9 var10;
cards;
1 100 . 200 . . 300 . . . .
2 . . 300 . . . . 200 . . 
3 . . . . . . . . 100 . 
4 100 . . . . . . . 100 100
5 100 . . . 200 . . 200 . 
;

And, as a output I want something like that:

var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 1stSeq LastSeq

1 100 . 200 . . 300 . . . .  0 4
2 . . 300 . . . . 200 . .    2 2
3 . . . . . . . . 100 .      8 1
4 100 . . . . . . . 100 100  0 0
5 100 . . . 200 . . 200 .    0 1

Ksharp · Posted 04-02-2021 08:28 AM

data example;
  input id var1 var2 var3 var4 var5 var6 var7 var8 var9 var10;
cards;
1 100 . 200 . . 300 . . . .
2 . . 300 . . . . 200 . . 
3 . . . . . . . . 100 . 
4 100 . . . . . . . 100 100
5 100 . . . 200 . . 200 . 
;

data want;
 set example;
 all=cats(of var:);
pid=prxparse('/\.+/');
s=1;e=length(all);
call prxnext(pid,s,e,all,p,l);
do i=1 by 1 while(p>0);
  if i=1 then FirstSeq=l;
  call prxnext(pid,s,e,all,p,l);
  if l ne 0 then LastSeq=l;
end;
drop all i s e p l pid;
run;

Kurt_Bremser · Posted 04-01-2021 10:45 AM

You have data (dates) in structure (variable names), which is always a bad idea.

Transpose your dataset to a long layout, and then you can use BY processing and a RETAINed variable:

data wide;
infile datalines dsd;
input id $ (fonte201912 fonte202001 fonte202002 fonte202003) ($);
datalines;
1,x,,x,x
2,x,x,,x
3,,,x,x
;

proc transpose data=wide out=long (rename=(col1=fonte));
by id;
var fonte:;
run;

data long2;
set long;
period = input(substr(_name_,6),yymmn6.);
format period yymmn6.;
drop _name_;
run;

data want;
set long2;
by id;
if first.id then counter = 0;
if fonte > ""
then counter = 0;
else counter + 1;
run;

Dataset LONG2 is how your datasets should be organized: long layout, date values stored as proper SAS data values with a display format to your liking.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

dmarques1998 · Posted 04-01-2021 11:08 AM

The problem is that I have a long data, almost 1.3M of rows. It is more convinient to keep the data like that. Is it possible?

ballardw · Posted 04-01-2021 10:49 AM

Missing detail: When the missing are interrupted you now have two or more groups of missing values. So you have two or more counts of missing values possible to report. What is the actual rule for which number to keep? The largest missing sequence? The first? The last sequence? Something else?

Consider your row 10. There are three series of missing values.

When you need to process variables in a single observation in a specific order then an ARRAY is almost always involved in the solution.

Dummy code because I am not going to attempt to create data from a picture.

data want;
   set have;
   array v (*) <your list of variables to process>;
  do i= 1 to dim (v);
     if missing(v[i]) then misscount= sum(misscount,1);
     /* logic would go here about how to handle the 
        interruptions, and possibly before the IF*/
  end;
run;

dmarques1998 · Posted 04-01-2021 11:07 AM

For now, I want only the first and the last sequence, however to future's work I would like to understand how to do for all the possible sequences in a row.

ballardw · Posted 04-01-2021 12:02 PM

@dmarques1998 wrote:
For now, I want only the first and the last sequence, however to future's work I would like to understand how to do for all the possible sequences in a row.

If you have N variables to test then you have to consider up to N/2 intervals (separate variables to record the count), or if the data has an odd N number of variable N/2+1.

When you say "first" and "last" that means we need to add at least one temporary variable indicate where the current sequence is first or not.

Which may mean that @Kurt_Bremser's base approach is the most robust as you can set a "missing sequence" counter variable as well as the actual count.

There are lots of processes that transpose data to one shape for a process to be robust or easier code and then shape back into a needed format (if actually needed).

Best would be to manually process a smaller number of variables, 6 or 7 maybe, with different patterns of missing values, maybe 10 or 15 rows and show the expected output for those.

Hint: putting multiple values into a single variable is 1) much more work to do and 2) extremely hard to work with after it is done.

dmarques1998 · Posted 04-16-2021 07:29 AM

Hi @Kurt_Bremser @ballardw , have upload this data in excel as an example. In that specific case how can I count sucessive missing values in SAS, in the following names: Peter, Ann, Sarah and Bianca, which are the first and de last sets of missing values

Thanks for your help!

Kurt_Bremser · Posted 04-16-2021 07:38 AM

Since you seem to have an issue with Excel, you should ask your questions on a forum for that.

If you have an issue with SAS data, post it as such (data step with datalines). I will positively NOT start typing values off a picture, I've got better things to do.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

dmarques1998 · Posted 04-16-2021 10:05 AM

 Data example;
input Name	N01/2020	N02/2020	N03/2020	N04/2020	N05/2020	N06/2020	N07/2020	N08/2020	N09/2020	N10/2020	N11/2020	        N12/2020;
          
John	E	D	A	.	.	.	.	.	A	A	A	E
Peter	E	E	D	A	E	D	A	.	.	.	.	.
Ann		.	.	.	.	A	D	E	A	D	E	D   .
Steve	E	.	A	.	E	.	A	.	E	.	A	.
Frank	A	E	.	.	A	E	.	.	A	E	.	.
Joanne	E	D	A	.	.	.	.	.	A	A	A	E
Sarah	E	E	D	A	E	D	A	.	.	.	.	.
Bianca	.	.	.	.	.	A	D	E	A	D	E	D
Theresa	E	.	A	.	E	.	A	.	E	.	A	.
Beatric	A	E	.	.	A	E	.	.	A	E	.	.

Sorry for my mistake @Kurt_Bremser, here is the dataset. As I said, I want to count the successsive missing values, for the following names: Peter, Ann, Sarah and Bianca which are the first and de last sets of missing values. can you help me?

thank you

ballardw · Posted 04-16-2021 10:44 AM

@dmarques1998 wrote:
 Data example;
input Name	N01/2020	N02/2020	N03/2020	N04/2020	N05/2020	N06/2020	N07/2020	N08/2020	N09/2020	N10/2020	N11/2020	        N12/2020;
          
John	E	D	A	.	.	.	.	.	A	A	A	E
Peter	E	E	D	A	E	D	A	.	.	.	.	.
Ann		.	.	.	.	A	D	E	A	D	E	D   .
Steve	E	.	A	.	E	.	A	.	E	.	A	.
Frank	A	E	.	.	A	E	.	.	A	E	.	.
Joanne	E	D	A	.	.	.	.	.	A	A	A	E
Sarah	E	E	D	A	E	D	A	.	.	.	.	.
Bianca	.	.	.	.	.	A	D	E	A	D	E	D
Theresa	E	.	A	.	E	.	A	.	E	.	A	.
Beatric	A	E	.	.	A	E	.	.	A	E	.	.
Sorry for my mistake @Kurt_Bremser, here is the dataset. As I said, I want to count the successsive missing values, for the following names: Peter, Ann, Sarah and Bianca which are the first and de last sets of missing values. can you help me?

thank you

That data step doesn't run. / is not an acceptable character in variable names by default.

No Datalines; or Cards; or any of the acceptable variants to tell SAS where the data begins.

And in general having data like dates in variable names is adding at least one additional headache to the problem.

Plus how to show the multiple interruptions is still not actually addressed. What would the output look like.

Count successsive missing values in a row or register