djbateman
Lapis Lazuli | Level 10

This is just a quick question out of curiosity.  Please don't spend a lot of time trying to find a solution if you don't have one handy.

 

If possible, I would like to find any variables in a dataset that may contain leading zeroes.  The variable may be character or numeric.  For example, my clinical trials data might capture Site ID as a 3-digit value.  The first site would be entered as "001" as a character string.  I would like to loop through a dataset and find any such variables.  Is there a straightforward method for doing this?

 

My current thoughts are to loop through each variable.  If it is character string, I would check to see if there are any zeroes at the beginning of the string (no other digits between a "0" and a "." so "0.5" would not be considered a leading zero).  If it is numeric, I would convert it to a character string and then check like I would in the first step.  I would then flag any values that seem to fit this format.  I would then keep any variables that have at least one flag.  Does this sound like overkill?  It seems quite tedious and ineffective to me.  Is there a simpler way to accomplish this?
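
For concreteness, here is a rough sketch of the character-variable half of that brute-force check (HAVE and FLAGS are placeholder names; the numeric half would still need its own PUT/convert step):

data flags;
    set have;
    array cvars {*} _character_;          /* every character variable in the dataset */
    length varname $32 value $200;
    do i = 1 to dim(cvars);
        value = left(cvars{i});
        /* starts with "0" but is not a decimal such as "0.5" */
        if value =: '0' and substr(value, 2, 1) ne '.' then do;
            varname = vname(cvars{i});
            output;
        end;
    end;
    keep varname value;
run;

/* distinct list of variables with at least one flagged value */
proc sort data=flags(keep=varname) nodupkey;
    by varname;
run;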


16 REPLIES
Tom
Super User

Numbers cannot have leading zeros.  The values 1 and 001 are both stored as the same number.

 

You could check whether the values are integers below a certain magnitude.  For example, if you are looking for three-digit values like 001, check whether the value is less than 100.
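
A minimal sketch of that check, assuming a numeric SITEID that is supposed to fill three digits (HAVE and the flag name are placeholders):

data flagged;
    set have;
    /* an integer below 100 would need at least one leading zero to fill a 3-digit field */
    possible_leading_zero = (siteid = int(siteid) and siteid >= 0 and siteid < 100);
run;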

PaigeMiller
Diamond | Level 26

@djbateman wrote:

This is just a quick question out of curiosity.  Please don't spend a lot of time trying to find a solution if you don't have one handy.

 

If possible, I would like to find any variables in a dataset that may contain leading zeroes.  The variable may be character or numeric.  For example, my clinical trials data might capture Site ID as a 3-digit value.  The first site would be entered as "001" as a character string.  I would like to loop through a dataset and find any such variables.  Is there a straightforward method for doing this?

 

My current thoughts are to loop through each variable.  If it is character string, I would check to see if there are any zeroes at the beginning of the string (no other digits between a "0" and a "." so "0.5" would not be considered a leading zero).  If it is numeric, I would convert it to a character string and then check like I would in the first step.  I would then flag any values that seem to fit this format.  I would then keep any variables that have at least one flag.  Does this sound like overkill?  It seems quite tedious and ineffective to me.  Is there a simpler way to accomplish this?


Okay, I won't spend a lot of time on this. You will need some sort of intelligent looping that handles character variables differently from numeric variables.

 

But I'm curious what benefit this would bring if implemented. I can't imagine one off the top of my head.

--
Paige Miller
djbateman
Lapis Lazuli | Level 10

After I write programs for my client that look for dirty data, they like to run test data through the program if the report comes out empty.  We get test data by exporting the raw SAS data into Excel so the data manager can modify it.  I try to read this back into SAS to run through my programs, but when I read in from Excel to SAS, it doesn't always end up with the exact same data structure (e.g., dates might be in different formats, leading zeroes are lost, decimal precision might be different, and variable lengths might be different).  I'm writing a bit of code that will read in a test file, compare it to the raw data it was taken from, and format each variable to match exactly so that when I run it through the program I don't end up with errors/warnings.  The loss of leading zeroes causes issues: if my code is merging a raw dataset with a test dataset by SITEID, and the raw data has SITEID='001' while the test data has SITEID='1', then it won't find a match, and the test data becomes pointless.

 

Maybe I'm making more of an issue out of this than I need to, but I can't find a better way to do it.  There are times when the amount of test data is minimal, so I will do something like:

 

data test;
    set raw;
    if subjid='001-101-101' then do;
        /* manually type changes to the data */
    end;
run;

If there are better ways to do test data, I will gladly accept any help!
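
Not something from the thread, but one quick way to see where the re-imported test data has drifted structurally from the raw data before merging might be PROC COMPARE (RAW and TEST are placeholder names):

proc compare base=raw compare=test listvar;
run;

By default it reports conflicting types and lengths for matching variables, and LISTVAR lists variables that exist in only one of the two data sets.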

Kurt_Bremser
Super User

Have them save their spreadsheets to a usable text format, and read that with a data step, so that you get consistent results. Reading Excel files directly causes SAS to make assumptions and guesses, and produces inconsistent results.

djbateman
Lapis Lazuli | Level 10
Kurt,

I don't think I've ever heard of a usable text format. What do you mean by that? Is this just a .txt file?
Kurt_Bremser
Super User

@djbateman wrote:
Kurt,

I don't think I've ever heard of a usable text format. What do you mean by that? Is this just a .txt file?

It may be a .txt file with fixed column width, or tabs as delimiter, or a .csv with commas (or semicolons, because that's what Excel does).

All these file formats can be read with a data step where you as a programmer have full control over column types, sizes, formats.

Reading text files with data steps is an essential SAS skill; relying on Excel files only causes unnecessary stress.
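
For example, a minimal sketch of that kind of controlled import, assuming a semicolon-delimited export named sites.csv with a header row and just SITEID and SUBJID columns (all of these names are placeholders; take the lengths and informats from the raw SAS data):

data test;
    infile 'sites.csv' dlm=';' dsd firstobs=2 truncover;
    length siteid $3 subjid $11;    /* lengths copied from the raw data, so '001' stays '001' */
    input siteid $ subjid $;
run;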

 

djbateman
Lapis Lazuli | Level 10
Okay, I understand. Yes, I have been working with CSV files here and there, but it becomes cumbersome since you need a separate file for each dataset. I've just been playing around with the .xlsx files themselves, since I can use SAS/ACCESS to PC Files to read in one xlsx file and create a separate table per tab in the file.
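
For reference, a sketch of that XLSX-engine approach (the path and the DM sheet name are placeholders); every worksheet in the workbook shows up as a member of the library:

libname testxl xlsx '/path/to/test_data.xlsx';

data dm;
    set testxl.dm;    /* the DM worksheet */
run;

libname testxl clear;
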
ballardw
Super User

@djbateman wrote:

After I write programs for my client that look for dirty data, they like to run test data through the program if the report comes out empty.  We get test data by exporting the raw SAS data into Excel so the data manager can modify it.  I try to read this back into SAS to run through my programs, but when I read in from Excel to SAS, it doesn't always end up with the exact same data structure (e.g., dates might be in different formats, leading zeroes are lost, decimal precision might be different, and variable lengths might be different).  I'm writing a bit of code that will read in a test file, compare it to the raw data it was taken from, and format each variable to match exactly so that when I run it through the program I don't end up with errors/warnings.  The loss of leading zeroes causes issues: if my code is merging a raw dataset with a test dataset by SITEID, and the raw data has SITEID='001' while the test data has SITEID='1', then it won't find a match, and the test data becomes pointless.

 

Maybe I'm making more of an issue out of this than I need to, but I can't find a better way to do it.  There are times when the amount of test data is minimal, so I will do something like:

 

data test;
    set raw;
    if subjid='001-101-101' then do;
        /* manually type changes to the data */
    end;
run;

If there are better ways to do test data, I will gladly accept any help!


One suspects that part of the issue may be HOW you are reading the returned data into SAS.  If Proc Import is involved, it is likely part of the problem.

 

A second issue can arise from someone manually entering values into cells in Excel without paying close attention to what happens when they are entered. Consider a "General" format cell in Excel, which is the default: enter 5-1, and it gets turned into a date.

Astounding
PROC Star

A quick and dirty variation that still puts some of the work on you:

 

proc freq data=have;
tables _character_;      /* one one-way table per character variable */
format _character_ $1.;  /* group on the first character only, so leading zeroes surface as a "0" level */
run;
djbateman
Lapis Lazuli | Level 10
Astounding,

This seems to be the kind of solution I'm looking for; however, I'm not able to get the output retained in a table. How would you output the results? When I use "tables _character_ / out=lead0;", the LEAD0 table only retains the values of the last character variable. Any ideas what I'm doing wrong?
Astounding
PROC Star

You're right about the output data set.  There are two types of approaches that I would consider, given that you want to automate rather than look through the printed output:

 

  • ODS should be able to capture the results as a data set.
  • Macro language should be able to loop through the character variables, creating a separate TABLES statement for each (and each with its own OUTPUT statement).

I would lean toward ODS, since I believe that it can create a single data set with all the information. Begin with:

 

ODS TRACE ON;

 

Then run the PROC FREQ and find in the log the name of the ODS output that looks like it might be the right one.  For example, if you see something like ONEWAYFREQS, you would then add to the program:  ODS output ONEWAYFREQS=want;

 

Then work with the data set WANT to separate the parts you want vs. the parts you don't want.

 

Remember to turn ODS TRACE OFF at some point.
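
Putting that together, a minimal sketch of the ODS OUTPUT route (HAVE, WANT, and LEAD0 are placeholder names; the $1. trick from the earlier reply is assumed, and the F_ column handling is an assumption about the ONEWAYFREQS layout to verify against your own output):

ods trace on;                        /* confirm the ODS table name in the log */
ods output OneWayFreqs=want;         /* capture every one-way table in one data set */

proc freq data=have;
    tables _character_;
    format _character_ $1.;          /* first character only, as in the earlier reply */
run;

ods trace off;

/* keep the rows whose formatted (first-character) value is "0" */
data lead0;
    set want;
    length firstchar $1;
    firstchar = coalescec(of f_:);   /* the F_<varname> columns hold the formatted values */
    if firstchar = '0';
    keep table firstchar frequency percent;
run;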

ballardw
Super User

@djbateman wrote:
Astounding,

This seems to be the kind of solution I'm looking for; however, I'm not able to get the output retained in a table. How would you output the results? When I use "tables _character_ / out=lead0;", the LEAD0 table only retains the values of the last character variable. Any ideas what I'm doing wrong?

From documentation for the Tables OUT= option in Proc Freq:

OUT=SAS-data-set

 names an output data set that contains frequency or crosstabulation table counts and percentages. If more than one table request appears in the TABLES statement, the contents of the OUT= data set correspond to the last table request in the TABLES statement. The OUT= data set variable COUNT contains the frequencies and the variable PERCENT contains the percentages. For more information, see the section Output Data Sets. You can specify the following options to include additional information in the OUT= data set: OUTCUM, OUTEXPECT, and OUTPCT.

 

 

ballardw
Super User

The only numeric values that might show leading zeroes, IF GREATER THAN OR EQUAL TO ONE, would have to have a Zw. format applied. Those can be found by searching the DICTIONARY.COLUMNS table for numeric variables with a format like Zw. Might be fun if you have lots of custom formats that start with Z, though.

 

Are the potential numeric values you are interested in only integers? If not, then any value < 0.1 has a "leading zero". And depending on how you actually apply this rule, potentially any value < 1 qualifies, as many formats would display such a value as 0.999. So I think you need to spend a bit of time deciding and describing exactly what "leading zero" means for numeric values in your case.
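
A sketch of that dictionary lookup (WORK stands in for whichever library you want to scan; as noted above, the pattern will also pick up custom formats whose names start with Z):

proc sql;
    select libname, memname, name, format
    from dictionary.columns
    where libname = 'WORK'
      and type = 'num'
      and upcase(format) like 'Z%';   /* numeric variables with a Zw.-style format attached */
quit;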

ChrisNZ
Tourmaline | Level 20

1. Your question makes no sense whatsoever for numeric variables as mentioned.

 

2. For character variables, you have to decide on a format that the variables must follow, and enforce this format.

Maybe apply the z. format, or something else. Show us examples and we can guide you.

The only example you gave is 001-101-101, which Excel will either keep as is or totally mess up if it decides to treat it as a calculation.

 

3. In other words, your goal is still unclear.

 

 

 

 
