Solved: data quality check

BeatriceWang · Posted 07-26-2021 04:49 PM

Hi,

My question is referencing the below thread regarding how to skip a variable based on the value and go to the next variable.
https://communities.sas.com/t5/SAS-Programming/how-to-skip-a-variable-based-on-the-value-and-go-to-t...

I got a similar situation.

I tried the solution in the above thread. It seems working for me by performing some of the eye ball checking.

I need to conduct a data quality check.

Can somebody suggest an efficient approach (SAS code) to conduct a data quality check to ensure it has done the job appropriately?

Thank you very much in advance.

Patrick · Posted 07-26-2021 07:34 PM

@BeatriceWang The requirement in your other question here was: " If the grid_rate =D than i dont want that value and go on the the next grid rate". You didn't mention missings and that you want to skip them as well.

If you want to exclude multiple values then just change the condition from:

if rates[_i] ne 'D' then

to a list like:

if rates[_i] not in ('D',' ') then

Here implemented for one of the options posted in your other thread.

data have;
  infile datalines truncover dsd;
  input (contract_id rate1 rate2 rate3 rate4 want_rate1 want_rate2 want_rate3 want_rate4) ($);
  datalines;
XEA123,D,PRIME,TD49,TD58,PRIME,TD49,TD58
DEW343,,D,TD58,PRIME,TD49,TD58,PRIME
JRQ912,PRIME,TD49,TD58,,PRIME,TD49,TD58
;

data want(drop=_:);
  set have;
  array rates {*} rate1 - rate4;
  do _i=1 to dim(rates);
    if rates[_i] not in ('D',' ') then
      do;
        _ind=sum(_ind,1);
        rates[_ind]=rates[_i];
      end;
  end;
  _ind=sum(_ind,1);
  do _i=_ind to dim(rates);
    call missing(rates[_i]);
  end;
run;

View solution in original post

SASKiwi · Posted 07-26-2021 05:18 PM

@BeatriceWang - Your link doesn't appear to work.

BeatriceWang · Posted 07-26-2021 05:43 PM

The link works well for me. I tried directly clicking on it.

Reeza · Posted 07-26-2021 05:45 PM

I fixed the link for you, it didn't work previously. I assume that means I found the right link as well 🙂

SASKiwi · Posted 07-26-2021 05:53 PM

@BeatriceWang - Yep, must have been a website problem. Working now.

ballardw · Posted 07-26-2021 05:32 PM

Exactly what kind of check do you want?

BeatriceWang · Posted 07-26-2021 05:39 PM

I want to check the newvars consecutively stored the vars without blank fields on all records of my file(big).

I did some eye ball checking. It looks OK. But I could not eye check all of the records.

Please advise if you can.

ballardw · Posted 07-26-2021 07:26 PM

@BeatriceWang wrote:

I want to check the newvars consecutively stored the vars without blank fields on all records of my file(big).

I did some eye ball checking. It looks OK. But I could not eye check all of the records.

Please advise if you can.

What exactly does "consecutively stored the vars without blank fields" actually mean? We don't have any of your data, before or after so we have no idea which vars, and while it may be obvious to you, "consecutively stored" is not intuitively obvious. You can use the NMISS function if the variables are numeric to tell how many are missing (or blank).

Patrick · Posted 07-26-2021 07:34 PM

@BeatriceWang The requirement in your other question here was: " If the grid_rate =D than i dont want that value and go on the the next grid rate". You didn't mention missings and that you want to skip them as well.

If you want to exclude multiple values then just change the condition from:

if rates[_i] ne 'D' then

to a list like:

if rates[_i] not in ('D',' ') then

Here implemented for one of the options posted in your other thread.

data have;
  infile datalines truncover dsd;
  input (contract_id rate1 rate2 rate3 rate4 want_rate1 want_rate2 want_rate3 want_rate4) ($);
  datalines;
XEA123,D,PRIME,TD49,TD58,PRIME,TD49,TD58
DEW343,,D,TD58,PRIME,TD49,TD58,PRIME
JRQ912,PRIME,TD49,TD58,,PRIME,TD49,TD58
;

data want(drop=_:);
  set have;
  array rates {*} rate1 - rate4;
  do _i=1 to dim(rates);
    if rates[_i] not in ('D',' ') then
      do;
        _ind=sum(_ind,1);
        rates[_ind]=rates[_i];
      end;
  end;
  _ind=sum(_ind,1);
  do _i=_ind to dim(rates);
    call missing(rates[_i]);
  end;
run;

BeatriceWang · Posted 07-27-2021 07:15 AM

Thank you for all the responses. You guys are all very helpful.

My data looks like this:

var1	var2	var3	var4	var5	var6	newvar1	newvar2	newvar3	newvar4	newvar5	newvar6
mi	ab		cd			mi	ab	cd
ef			hi	xy		ef	hi	xy
wx	yz	jk	lm	no		wx	yz	jk	lm	no

The code you posted works for me.

Like I said earlier I need to do data quality check.

I am looking for efficient sas code to check if the job is appropriately done on all of the records in my file (big).

Please advise if you can. Thanks in advance.

Kurt_Bremser · Posted 07-27-2021 07:36 AM

You suffer from inefficient data design. Maxim 19: Long Beats Wide.

So, first transpose:

data have;
infile datalines dsd dlm=" " truncover;
input ID $ (var1 var2  var3  var4  var5  var6) ($);
datalines;
1 mi ab  cd
2 ef   hi xy 
3 wx yz jk lm no
;

proc transpose
  data=have
  out=long (
    rename=(_name_=var col1=value)
    where=(value ne " ")
  )
;
var var:;
by id;
run;

(I guess you have some identifying column like the ID I use here)

Then renumber:

data want;
set long;
by id;
if first.id
then i = 1;
else i + 1;
var = cats("var",i);
drop i;
run;

If you want, you can now re-transpose to wide:

proc transpose
  data=want
  out=want_wide (drop=_name_)
;
by id;
id var;
var value;
run;

Not that, throughout, the code needs not make any assumption about the number of entries for a single ID. And no space is wasted for missing values, after the initial transpose.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 07-27-2021 06:29 PM

@BeatriceWang

If you want to test if the code logic does what you need then create test cases. In your case create sample data that represent what you see in your data prior to running the code, then run the code against it and verify that the outcome meets expectations.

Writing "DQ" code you're asking for would require us to know how your data looks like - and if you know that then you can also create the test cases as explained above. That's how you test code.

Reeza · Posted 07-27-2021 06:35 PM

This seems like a different question than your original question. In that case, it may be better to post it as a new question with an more appropriate subject line eg "Timing a data step process"

data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Re: data quality check

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away