Re: SAS Data set shouldn't have as many variables

JimDillon · Posted 05-07-2020 10:44 AM

Attached is the log from a SAS program. The code could probably be a lot more efficient but ...

My goal is to see what jobs haven't been run in the past year.

I'm reading in a lot of data so the information at the bottom of page 12 (408619 observations) appears to be correct. However the log text near the bottom of page 23 (end of data set TASKS) doesn't seem quite correct - if, for each observation in data set PDB, I increment a counter, I should have 1 observation in data set TASKS with 579 variables, and not 407419 observations.

Why so many observations?

Shmuel · Posted 05-07-2020 11:24 AM

I have selected lines from your log and you have probably missed something:

The data set WORK.PDBIN has 408619 observations and 4 variables

The data set WORK.TASKS has 407419 observations and 578 variables.

The data set WORK.TASKNAMES has 407419 observations and 579 variables.

1400 PROC SORT DATA=TaskNames NODUPS;
1401 BY TASKNAME;
1402 RUN;

NOTE: There were 407419 observations read from the data set WORK.TASKNAMES.
NOTE: 407238 duplicate observations were deleted.
NOTE: The data set WORK.TASKNAMES has 181 observations and 579 variable

So what is your exact question?

Try to post the first step making the issue.

JimDillon · Posted 05-12-2020 02:25 PM

Shmuel,

Per your question on what I'm trying to accomplish. I have a list of tasks that are permitted to execute through our security product. Over the years we have stopped running some but have not removed their definitions from the security product. I'm going through now to see if these tasks have run in the past year/ two years, and if they have not, remove them from the security product. I have the list of the definitions for these tasks in the security product. I also have a list of of what tasks (and many other jobs) that have actually ran in the past year/ two years. My idea was to bump the list of 578 permitted tasks against what has actually run (408619 objects) and if there is no record of it having been run, remove it from the security product.

Kurt_Bremser · Posted 05-12-2020 02:55 PM

So you have your list of jobs allowed in a dataset. Sort that by the jobname.

Sort the jobs that gave run with nodupkey.

Merge the two datasets, and keep those that are in allowed, but not in pdbin.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Shmuel · Posted 05-12-2020 03:09 PM

You can select distinct names of jobs run per year (let's say 2018 and 2019)

then you can select all those jobs which run in 2018 but not in 2019,

thus you get directly what you want - list of jobs not used any more.

adapt years as need.

FreelanceReinh · Posted 05-07-2020 11:25 AM

Hello @JimDillon and welcome to the SAS Support Communities!

There's nothing in this DATA step that prevents the implied OUTPUT statement from writing one observation per DATA step iteration to dataset Tasks, resulting in the same number of observations as have been read from PDBIN. You can wrap a DO-UNTIL loop around the SET statement to obtain a single observation, as shown in the example below:

data want(keep=t1-t3);
do until(last);
  set sashelp.class end=last;
  select;
    when(name=:'A') t1+1;
    when(name=:'J') t2+1;
    when(name=:'M') t3+1;
    otherwise;
  end;
end;
run;

Note that neither an array nor an initialization are needed. The latter is due to the use of sum statements (t1+1) instead of ordinary assignment statements (t1=t1+1). They initialize the variables (t1, etc.) to zero by default.

Kurt_Bremser · Posted 05-07-2020 11:31 AM

You have no explicit output statement in the data step that creates TASKS, so it will write one observation for every observation read from the input dataset.

But you seem to have never been introduced to procedures like FREQ and SUMMARY. To create a count of your jobs, just use PROC FREQ on table PDBIN.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

JimDillon · Posted 05-08-2020 09:58 AM

Kurt, thanks for your suggestions on the procedures. However, I'm looking in the Base SAS 9.4 Procedures Guide and don't find the FREQ procedure - i do see the SUMMARY proc which also references the MEANS proc.

Kurt_Bremser · Posted 05-08-2020 10:45 AM

A simple count can also be done in SQL:

proc sql;
create table want as
  select job, count(*) as count
  from pdbin
  group by job
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 05-12-2020 04:54 AM

@JimDillon wrote:
Kurt, thanks for your suggestions on the procedures. However, I'm looking in the Base SAS 9.4 Procedures Guide and don't find the FREQ procedure - i do see the SUMMARY proc which also references the MEANS proc.

The FREQ procedure is under "Base SAS Procedures Guide. Statistical Procedures", The FREQ Procedure

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 05-12-2020 05:01 AM

A quick example for creating a count with FREQ:

data pdbin;
input job :$8.;
datalines;
ABENDAID
AIITAS
AIITCP
AIITAS
ABENDAID
;

proc freq data=pdbin noprint;
tables job /out=want (drop=percent);
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ballardw · Posted 05-07-2020 11:35 AM

Any time you use a structure like:

Data new;
   set old;
  <do something>
run;

The data step processes every record in the data set OLD. If there were 100 records the <do something> code gets executed for each record, resulting in 100 records unless you do something to explicitly add records or only output the data for certain records (If <condition> then output;>

Since you read the entire input data set you get each record out with the variables incremented.

It might be a good idea to describe what you were attempting as this looks like a not very efficient attempt to count something and likely could be done with proc freq and/or transpose.

Your select is likely not doing what you expect anyway.

You show code like:

189          WHEN (JOB =: 'AS')       T19 = T19 + 1;
190          WHEN (JOB =: 'ASCH')     T20 = T20 + 1;

If you intend the =: to be the "begins with" then "ASCH<anytext>" will begin with 'AS' and get in the T19 slot.

Example:

data example;
   input job :$10.;
   array t(6);
   do i=1 to dim(t); t[i]=0;end;
   select;
      WHEN (JOB =: 'DSN')      T1 = T1 + 1;
      WHEN (JOB =: 'DSNCDBM1') T2 = T2 + 1;
      WHEN (JOB =: 'DSNCDIST') T3 = T3 + 1;
      WHEN (JOB =: 'DSNCIRLM') T4 = T4 + 1;
      WHEN (JOB =: 'DSNCMSTR') T5 = T5 + 1;
      WHEN (JOB =: 'DSNCWLMC') T6 = T6 + 1;
      otherwise ;
   end;
datalines;
DSN
DSNCDBM1
DSNCDIST
DSNCIRLM
DSNCMSTR
DSNCWLMC
;

Note that the only "counter" that gets incremented is for DSN.

If you don't mean to use "begins with" then you really need to be looking at Proc freq.

Or you may want a multilabel format based on the Job variable that would allow a procedure like Proc Tabulate or Report to show counts with the "DSN" group plus all the other "DSNxxx" job values.

Shmuel · Posted 05-07-2020 01:17 PM

I have selected few more rows from the log:

The data set WORK.PDBIN has 408619 observations and 4 variables

154         PROC SORT DATA=PDBIN NODUPS;
155          BY JOB;
156         RUN;

NOTE: There were 408619 observations read from the data set WORK.PDBIN.
NOTE: 1200 duplicate observations were deleted.
NOTE: The data set WORK.PDBIN has 407419 observations and 4 variables.
The data set WORK.TASKS has 407419 observations and 578 variables.

The data set WORK.TASKNAMES has 407419 observations and 579 variables.

1400        PROC SORT DATA=TaskNames NODUPS;
1401         BY TASKNAME;
1402        RUN;

NOTE: There were 407419 observations read from the data set WORK.TASKNAMES.
NOTE: 407238 duplicate observations were deleted.
NOTE: The data set WORK.TASKNAMES has 181 observations and 579 variables.

according to log you read 408619 rows from PDBIN.

there were 1200 duplicate JOBs.

clearing duplicates you remained 407419 JOBs.

then you counted tasks per selected jobs.

if you want only those jobs, you need a change in next step:

DATA Tasks (KEEP = /*TASKARRAY*/ JOB T1-T578);
Set PDBIN;
  BY JOB;    /* line added */
       Array TASKARRAY{578} T1-T578;
           /* initiallize the array */
          do i=1 to 578;
              TASKARRAY{i} = 0;
           end;
/* Search the PDB and for each task by that name found, increment counter */
SELECT;
    WHEN (JOB =: 'ABENDAID') T1 = T1 + 1;
     WHEN (JOB =: 'AIITAS')   T2 = T2 + 1;
    ........
     WHEN (JOB =: 'ZVIEW')    T578 = T578 + 1;
     OTHERWISE delete;   /* delete other jobs from input */
 END;
 Run;

Shmuel · Posted 05-07-2020 01:23 PM

but you don't need all those counters as each job has its own counter.
you just need:
data tasks;
set PDBIN;
by JOB;
if first.job then counter+1;
if JOB in (<list of JOBs) then output;
run;

JimDillon · Posted 05-08-2020 10:06 AM

Scmuel, thanks for your suggestion. I tried your first suggestion but I didn't see a decrease in the number of output observations. I then considered your second suggestion but it doesn't seem to support the 'begins with' concept that I have in place - and as pointed out, needs a little work - I don't know (as BallardW pointed out) if I have both "JOB =: 'AS'" and "JOB =: 'ASCH'" then the first would include the second - but how many jobs start with 'AS' other than 'ASCH'?

Registration is open

SAS Training: Just a Click Away