BookmarkSubscribeRSS Feed
TP055972
Calcite | Level 5

Hello there!

 

I am currently a student working on a dataset and one question I have is there a way to identify incomplete/partial data in a dataset?

Looking at the rows 1 by 1 isn't ideal as the dataset has over 900,000 rows. 

 

Thank you for time. 

5 REPLIES 5
PeterClemmensen
Tourmaline | Level 20

What do you mean by incomplete/partial data?

 

Can you provide some sample data? Makes it much easier to help you.

TP055972
Calcite | Level 5

Hi there!

 

What I mean by incomplete/partial is a data entry has been made but the data entry doesn't contain the full text.

 

For example a data entry to Column City has been made and the user wants to enter Chicago(Example) but in the dataset shows Chi. 

 

Is there a way to identify this type of data in Sas Viya or Sas studio without going through the entire the rows of data? 

 

andreas_lds
Jade | Level 19

You will need to define what "incomplete / partial" means for every variable.

Do you have a list of valid entries for the variable city? Here's an idea using sas code:

data have;
   length City $ 25;
   input City $25.;
   datalines;
Boston
Chic
Detroit
L.A.
;

proc format;
   invalue ValidCities (upcase)
      'BOSTON',
      'CHICAGO',
      'DETROIT',
      'LOS ANGELES',
      'MIAMI' = 1
      other = 0
   ;
run;
   
   
data invalid;
   set have;
   
   if input(upcase(City), ValidCities.) = 0;
run;

The format ValidCities can be created by importing list.

 

Maybe there are easier methods with Viya, but i don't think some magic method exists to validate all variables.

TP055972
Calcite | Level 5
HI there!

Alright will try out this method

Thank you for your time
ballardw
Super User

@TP055972 wrote:

Hi there!

 

What I mean by incomplete/partial is a data entry has been made but the data entry doesn't contain the full text.

 

For example a data entry to Column City has been made and the user wants to enter Chicago(Example) but in the dataset shows Chi. 

 

Is there a way to identify this type of data in Sas Viya or Sas studio without going through the entire the rows of data? 

 


 

If I did not create the SAS data set I run Proc Contents to get properties of the variables as they exist. If you look at the contents and your variable has length of 4 and is character you know that NONE of the values will contain more than 4 characters. If you expect them to that is an indicator that how the data set was built is incorrect. Compare the results of Proc Contents with the description you were given of the data  (you do have some description of what should be in the data correct?). 

 

If the properties are reasonable then thing I do is explore the data. That typically means running summary statistics on numeric variables like price, height, weight, instrument measurements (NOT account number, phone number or Id numbers) to get maximum, minimum, mean values and number of values (lots of missing being of possible interest) and frequencies on text variables to get a feel for what is in the data regardless of what the documentation says should be ther.

 

Proc freq will give you a count of each value that appears in the data for each variable requested so I often start there for the character variables. So you can see 1) all the values of a variable that by default will be sorted by the value (so similar values should appear together such as "Chi" "Chicago" "Chic. (Example)" or what have you. If those values look "partial" the next steps likely depend on the exact values.

 

Basic code to get frequencies of all the character variables.

Proc Freq data=yourdatasetname;

   tables _character_;

run;

If there are some that don't want to look at you can use a data set option to drop them from consideration:

Proc freq data=yourdatasetname (drop=thisvar thatvar someothervarname);

   tables _character_;

run;

 

Once you identify suspect values you can subset your data to examine more details with code similar to:

data detail;
   set yourdatasetname;
   where City in ('Chi' 'Chi.' 'Chgago' 'chi' );
run;

Pick your variable and the IN operator finds exact matches (from your Proc Freq output) and will have the entire record so you would have all the information in the set associated with the suspect values.

What you do after that depends.

 

Note: SAS datasets have Variables not columns.

 

 

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 961 views
  • 3 likes
  • 4 in conversation