BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
nduksy
Obsidian | Level 7

Hello,

 

I want to search through all the datasets in a specific library to see if two specific columns have blank records, if they do, write to a new file, the name of the dataset(s) with blank records, today's and the number of blank records found per dataset. So the output dataset will have three columns: Dataset name, today's date and number of blank rows:

 

I am stuck here so far.................

 


data blankClinicIdNum blankClinic blankIdNum;
set appl1.s:;
runDate=today();
format runDate date9.;
put dt ;
if clinic =. and idnum =. then output blankClinicIdNum;
if clinic =. then output blankClinic;
if idnum =. then output blankIdNum;
run;

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

data missingData;
set appl1.s: indsname =source;;
runDate=today();
format runDate date9.;

datasetName = source;

Missing_CLINIC_ID=0; MISSING_CLINIC=0; MISSING_ID=0;

if clinic =. and idnum =. then Missing_Clinic_ID = 1;
if clinic =. then MISSING_CLINIC=1;
if idnum =. then MISSING_ID=1;
run;

proc means data=missingData N SUM MEAN;
class datasetName runDate;
var MISSING_:;
output out summary = want;
run;

proc print data=want;
label N = "# of Observations'
SUM = '# Missing'
MEAN = '% Missing';
format mean percent12.1;
run;

An alternative approach.

 


@nduksy wrote:

Hello,

 

I want to search through all the datasets in a specific library to see if two specific columns have blank records, if they do, write to a new file, the name of the dataset(s) with blank records, today's and the number of blank records found per dataset. So the output dataset will have three columns: Dataset name, today's date and number of blank rows:

 

I am stuck here so far.................

 


data blankClinicIdNum blankClinic blankIdNum;
set appl1.s:;
runDate=today();
format runDate date9.;
put dt ;
if clinic =. and idnum =. then output blankClinicIdNum;
if clinic =. then output blankClinic;
if idnum =. then output blankIdNum;
run;


 

View solution in original post

6 REPLIES 6
Reeza
Super User

data missingData;
set appl1.s: indsname =source;;
runDate=today();
format runDate date9.;

datasetName = source;

Missing_CLINIC_ID=0; MISSING_CLINIC=0; MISSING_ID=0;

if clinic =. and idnum =. then Missing_Clinic_ID = 1;
if clinic =. then MISSING_CLINIC=1;
if idnum =. then MISSING_ID=1;
run;

proc means data=missingData N SUM MEAN;
class datasetName runDate;
var MISSING_:;
output out summary = want;
run;

proc print data=want;
label N = "# of Observations'
SUM = '# Missing'
MEAN = '% Missing';
format mean percent12.1;
run;

An alternative approach.

 


@nduksy wrote:

Hello,

 

I want to search through all the datasets in a specific library to see if two specific columns have blank records, if they do, write to a new file, the name of the dataset(s) with blank records, today's and the number of blank records found per dataset. So the output dataset will have three columns: Dataset name, today's date and number of blank rows:

 

I am stuck here so far.................

 


data blankClinicIdNum blankClinic blankIdNum;
set appl1.s:;
runDate=today();
format runDate date9.;
put dt ;
if clinic =. and idnum =. then output blankClinicIdNum;
if clinic =. then output blankClinic;
if idnum =. then output blankIdNum;
run;


 

nduksy
Obsidian | Level 7
Thank you! This had everything I wanted and more.
nduksy
Obsidian | Level 7

Hi Sorry to bother you again. I want to be able to run the program repeatedly without updating the "runDate". Let me explain further - it is possible for the program to be ran today and 1 blank row is returned for dataset X and tomorrow when it is run, same dataset X now has two blank rows. I want to be able to report both occurrences as separate rows, but with different run dates. see sample table below:

 

Dataset Name Run Date Count
s123 04OCT2021 2
s123 05OCT2021 5

 

Thanks as always!

Reeza
Super User
So you want to append your results to a data set each time you run it. What happens if the run fails? What happens if you run it mulitple times in one day?

Here's a rough example of how I usually handle these.
https://gist.github.com/statgeek/353374a5d8ea4f0c89ce5d80a47f4a4c
Tom
Super User Tom
Super User

Assuming all of the datasets have the two variables and start with the same prefix you could use a variation on your attempt.

 

Let's make some sample data first:

data x1(keep=name sex) x2(keep=name sex age);
  set sashelp.class;
  if _n_=3 then sex=' ';
  if _n_=5 then name=' ';
run;

Now we can make a view that will generate 0/1 flags for missing.

data missing / view=missing;
  set x: (keep=name sex) indsname=indsname;
  dsname=indsname;
  blank_name = missing(name);
  blank_sex = missing(sex);
  blank_name_sex = missing(name) and missing(sex);
  keep dsname blank_: ;
run;

Which we can then use PROC SUMMARY to aggregate.

proc summary data=missing nway;
  by dsname;
  var blank_:;
  output out=summary sum= ;
run;

Results:

                                      blank_    blank_     blank_
Obs    dsname     _TYPE_    _FREQ_     name       sex     name_sex

 1     WORK.X1       0        19         1         1          0
 2     WORK.X2       0        19         1         1          0

With a little finagling you can get the result in one step.

data test ;
  set x: (keep=name sex) indsname=indsname end=eof;
  if indsname ne lag(indsname) and _n_>1 then do;
     output;
     call missing(blank_name,blank_sex,blank_name_sex);
  end;
  dsname=indsname;
  retain dsname;
  blank_name + missing(name);
  blank_sex + missing(sex);
  blank_name_sex + missing(name) and missing(sex);
  if eof then output;
  keep dsname blank_: ;
run;
       blank_    blank_     blank_
Obs     name       sex     name_sex    dsname

 1        1         1          0       WORK.X1
 2        1         1          0       WORK.X2

 

ballardw
Super User

You don't say what you are actually stuck on ...

 

I think that you want to add the INDSNAME option to capture the data set contributing to the current record.

 

data blankClinicIdNum blankClinic blankIdNum;
set appl1.s: INDSNAME=DSN;
   Datasetname=DSN;
   runDate=today();
   format runDate date9.;
   put dt ;/* what is this for???*/
   if clinic =. and idnum =. then output blankClinicIdNum;
   if clinic =. then output blankClinic;
   if idnum =. then output blankIdNum;
run;

The INDSNAME option on the SET statement creates a temporary variable holding the name, library and dataset, of the data set providing the current record. To have the value in the output data set you need to assign it to a permanent variable.

 

THEN you use something like Proc Freq, SQL , Means or Summary to count the combinations of datasetname and date.

If you want the counts of blankclinic and blankidnum in the same data set then combine (merge or join) the count data sets by datasetname and date (though the dates should be the same at this point) if you ever have these created on different dates..

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1049 views
  • 1 like
  • 4 in conversation