Solved: Re: File Review: check valid values

A_Swoosh · Posted 08-05-2020 08:15 PM

Hi SAS Community,

I am importing pipe delimited .txt and .csv files into SAS and want to run a series of checks:

1. Count the number of records

2. Count the number of distinct records for ID

3. Whether the variables match a list of set variables

4. Are the values expected or in the correct format

Is there a way to output 1 dataset with all this information? As it currently stands,I produce these results into a text file that are the results of multiple proc frequencies and proc sql.

These are my expected variables:

data VARS;
input vars $;
datalines;
ID
BDAY
GENDER
ADDRESS1
ADDRESS2
CITY
STATE
ZIP
ZIP4
COUNTY_CODE
;
run;

This is my sample data:

data sample;
infile datalines dlm="|" missover dsd; 
input ID $2 BDAY $10 GENDER $ ADDRESS1 $ ADDRESS2 $10. CITY $ STATE $ ZIP $ ZIP4 $ COUNTY_CODE $;
datalines;
A1|20200420|M|123 Main St.|Suite 201|Juneau|AK|99802||02112
B2|4/20/2020|Male|124 Main St.||Juneau|AK|99802|Juneau
C3|4/20/2020|M|125 Main St.||Juneau|AK|99802||02112
4-1|20200420|Male|126 Main St.|Suite 101|Juneau|AK|99802||Juneau
;
run;

Conditions:

ID:should be alpha numeric and 2 characters

BDAY: should be YYYYMMDD format

Gender: should be 1 characters

Address1 and Address 2 should be character

City should character

State should be 2 characters

zip should be 5 characters

zip4 should be no more than 4 characters

county should be the fips code which is 5 characters

Results:

Line 1: accurate and meets all criteria

Line 2: dob is not in correct format, gender is not 2 characters, does not have zip4 variable, county does not have 5 character

Line 3: dob is not in correct format

Line 4: id is not alpha numeric with no special characters, and gender is not 2 characters

So far I have:

/*Variables match list? */
		proc sql;
			create table plan_vars as
			select strip(upcase(name)) as vars 
			from sashelp.vcolumn
			where libname='WORK' and memname =SAMPLE;
		quit;


		proc sql;
			create table comparevar_&file. as
			select a.*, b.*,
			case when a.vars = b.varsm then 'Match'
			else 'No' end as var_match
			from VARS as a
			full join plan_vars(rename=(vars=varsm)) as b
			on a.vars=b.varsm;
		quit;

/*output records */
		ods rtf file="&output.\FileReview.rtf";  
		ods noptitle; 
		options nodate nonumber;

		proc sql;
			create table ctrltot_&file. as
			select count(*) as Total_number_of_records,
				count(distinct ID) as Total_number_of_unique_ID,
				count (ID) as tot_ID
			from sample;
		quit;

		proc freq data = sample;
		   	tables _all_;
		   	format _numeric_ _character_ $miss.;
		run;

		proc freq data = ctrltot_&file.;
		tables _all_;
		run;

		proc freq data=comparevar_&file.;
			tables _all_;
			where var_match = 'No';
		run;

ChrisNZ · Posted 08-06-2020 05:55 PM

Something like this?

proc sql;
   select count(*)                                    as NB_RECORDS
         ,count(distinct ID)                          as NB_UNQ_ID
         ,sum(^prxmatch('/^\w\w$/',ID))               as NB_BAD_ID
         ,sum(^input(BDAY,yymmdd8.))                  as NB_BAD_BDAY
         ,sum(^prxmatch('/^[MF]$/i',GENDER))          as NB_BAD_GENDER
         ,sum(^prxmatch('/^[A-Z ]*$/i',CITY))         as NB_BAD_CITY
         ,sum(^prxmatch('/^\d{5} *$/i',ZIP))          as NB_BAD_ZIP
         ,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4))       as NB_BAD_ZIP4
         ,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
   from TABLE;

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 08-06-2020 01:44 AM

> Line 1: accurate and meets all criteria

How?

ID=A1 is not 7 characters, GENDER=M is not 2 characters etc

High-Performance SAS Coding - Third Edition

A_Swoosh · Posted 08-06-2020 11:50 AM

Sorry, my apologies. It was a sample and I forgot to change the length. It should be 2 and 1, respectively.

ChrisNZ · Posted 08-06-2020 05:55 PM

Something like this?

proc sql;
   select count(*)                                    as NB_RECORDS
         ,count(distinct ID)                          as NB_UNQ_ID
         ,sum(^prxmatch('/^\w\w$/',ID))               as NB_BAD_ID
         ,sum(^input(BDAY,yymmdd8.))                  as NB_BAD_BDAY
         ,sum(^prxmatch('/^[MF]$/i',GENDER))          as NB_BAD_GENDER
         ,sum(^prxmatch('/^[A-Z ]*$/i',CITY))         as NB_BAD_CITY
         ,sum(^prxmatch('/^\d{5} *$/i',ZIP))          as NB_BAD_ZIP
         ,sum(^prxmatch('/^\d{1,4} *$/i',ZIP4))       as NB_BAD_ZIP4
         ,sum(^prxmatch('/^\d{5} *$/i',COUNTY_CODE )) as NB_BAD_CC
   from TABLE;

High-Performance SAS Coding - Third Edition

A_Swoosh · Posted 08-07-2020 06:59 PM

Yes, this is exactly what I'm trying to accomplish. Thank you for the help.
I don't have much experience with Perl Regular Expressions but is there a way to find a series of values that is the length of the variable and if it matches then identify as bad? I want to match repeating but instead of n it's the length of the variable? I also want to identify an ID that is not alphanumeric. For example the ID is a description instead of the A235324.

Case 1:

ID: 0000000.

Case 2:

ID: Physician

Case 3:

ID: A235324

ChrisNZ · Posted 08-08-2020 07:25 AM

How can ID be 00000000 when you define it as $2?

High-Performance SAS Coding - Third Edition

A_Swoosh · Posted 08-08-2020 02:09 PM

I was trying to present another example so I'm clear about the syntax involved with perl expressions since I'm new to those expressions.

If I have another dataset where I'm trying to identify

data sample;
infile datalines dlm="|" missover dsd; 
input CATEGORY $ ID $;
datalines;
Physician|A123242
|0000000
PS220|A123456
run;

Case 3 has the proper format for each variable while case 2 has both wrong, and case 1 has CATEGORY wrong.

ChrisNZ · Posted 08-09-2020 05:56 PM

> I want to match repeating but instead of n it's the length of the variable

prxmatch('/(\d)\1{4}/',STR);

matches a digit repeated 4 times (ie 5 identical digits)

Another way could be testing:

repeat(first(STR),4)=STR

For the rest you can look at the ANYxxxx and NOTxxxx functions

High-Performance SAS Coding - Third Edition

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away