deleted_user
Not applicable
Hello,

I need to analyze data fields in some large flat files and I was thinking about using PROC FREQ to get a list of unique values for every column for starters.

We have the Enterprise version of SAS, but I haven't used SAS in several years, so I'm wondering what SAS tools may be available to get a "bird's-eye view" of the data, or at least a list of unique values for every column in these files?

Right now, I'm thinking about running a PROC FREQ on every column and comparing the unique values in those outputs to determine data relationships. Before doing that, I wanted to check with y'all and see if you could suggest a more effective approach to this kind of data analysis.

Thanks very much for any insights!
BobK
6 REPLIES
Doc_Duke
Rhodochrosite | Level 12
Enterprise Guide uses SAS in a client/server configuration, so somewhere you have the complete SAS. SAS has several products that include the word "Enterprise" as an adjective.

In Enterprise Guide, you can use the "Characterize data" task to get a quick view of a dataset. It won't do all you want, but it will give you a high level overview of each file.

Doc Muhlbaier
Duke
Paige
Quartz | Level 8
I don't see much value in obtaining a "bird's-eye view" that consists of the unique values of every variable. I doubt it would help me in this situation.

Usually, for us to provide advice, and for you to make meaningful progress in understanding what the data is telling you, you need to formulate actual questions that you want to answer from the data.

Data analysis is pointless without some clear question(s) to answer. It's like computing the average phone number from a list of phone numbers: of course you can do the calculations, but why bother?
Cynthia_sas
SAS Super FREQ
Hi, Paige:
Long ago, in a galaxy far, far away, when I was a young SAS newbie (no comments from the peanut gallery, those of you who know how -long- ago that was when I was "young"), "characterize your data" meant doing a bunch of PROC FREQs on all the character variables in the dataset and a PROC UNIVARIATE on all the numeric variables (to find the extreme obs, the mean and the median values).

And, sure enough, if you check out the EG "Characterize Data" task, it's doing just that -- the equivalent of PROC CONTENTS, PROC FREQ and PROC UNIVARIATE -- along with some GCHARTS thrown in to graphically show you the data and some PROC PRINTS of the data.
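For readers without Enterprise Guide, the three procedures named above can be run by hand. This is only a rough sketch of the idea, not the code the "Characterize Data" task actually generates; the dataset name `work.loans` is a hypothetical placeholder.

```sas
/* Hypothetical dataset name: replace work.loans with your own data. */

/* Variable names, types, lengths and formats */
proc contents data=work.loans;
run;

/* One-way frequency table for every character variable,
   counting missing values as a level of their own */
proc freq data=work.loans;
  tables _character_ / missing;
run;

/* Moments, quantiles and extreme observations
   for every numeric variable */
proc univariate data=work.loans;
  var _numeric_;
run;
```

On very large files the PROC FREQ step can produce a lot of output if any character variable has many distinct values, so it may be worth trying this on a sample first.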

When I worked for lawyers, we'd give them that big stack of paper and then they'd formulate the questions (let's see the salary history of these 20 people; who makes over the median salary in these job categories, etc, etc.) but they'd always want the reams and reams of initial "paper" first.

cynthia
Paige
Quartz | Level 8
Cynthia

I have been there and done that myself.

My recommendation above is because I have never ever found any value in the PROC FREQ/PROC MEANS on everything approach. If the researcher cannot specify in advance some things he might want to know about the data, if it's just a huge big fishing expedition, then I don't expect much success other than by random chance.

So, if someone wants to compute means, or frequencies, on phone numbers, please go right ahead. I'll pass ...
deleted_user
Not applicable
Thanks Cynthia. I'm having the Enterprise Guide software installed this week for the first time, so I'm looking forward to giving it a try, especially the "Characterize Data" feature. I'm not expecting any miracles, but if I can just get lists of the finite, higher-level control-type values, that would be a good start. For example, from a list of mortgage types I would be able to say that this file only contains ALT-A, JUMBO and SUBPRIME data, etc.

Cynthia, I do remember when I used a PROC FREQ on a dollar value in a billion-record file and the output was huge. So I was wondering whether there is a feature in PROC FREQ or "Characterize Data" that would let me set a limit of, let's say, 10000, so that if the number of unique values accumulated by a PROC FREQ or CD exceeds 10000, the PROC FREQ or CD would abort?

Thanks Cynthia and all for your insights!
BobK
Doc_Duke
Rhodochrosite | Level 12
"Characterize data" does allow you to set a limit for the number of distinct values that it shows (the default is 30). It displays the **first** k distinct values, which can be misleading if there is some inherent order in the data that you don't know about. A separate limitation to "Characterize data" is that it only does the frequencies on character variables; it does mean/medians/etc. on numerics. Maybe EGuide 4.2 will be smarter.
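Outside of Enterprise Guide, one way to approximate the "limit" idea in plain SAS is to count the distinct levels of each variable first and only print frequency tables for the low-cardinality ones. The sketch below uses the NLEVELS option of PROC FREQ (available in SAS 9 and later, as far as I know). It does not abort at a threshold; it just avoids printing the huge tables. The names `work.loans`, `work.level_counts` and the macro variable `lowcard` are hypothetical, and the 10000 cutoff is taken from the question above.

```sas
/* Step 1: count distinct levels per variable without printing
   any frequency tables. NOPRINT on the TABLES statement suppresses
   the tables but still produces the NLevels summary. */
ods output nlevels=work.level_counts;   /* hypothetical output name */
proc freq data=work.loans nlevels;      /* hypothetical input name  */
  tables _all_ / noprint;
run;

/* Step 2: collect the variables with at most 10000 levels
   into a macro variable (the cutoff is an assumption). */
proc sql noprint;
  select TableVar
    into :lowcard separated by ' '
    from work.level_counts
    where NLevels <= 10000;
quit;

/* Step 3: print frequency tables only for those variables.
   If no variable qualifies, &lowcard is never created and
   this step will fail, so check the log first. */
proc freq data=work.loans;
  tables &lowcard / missing;
run;
```

Note that PROC FREQ still has to track every distinct value internally in step 1, so on a billion-record file with a near-unique column (such as a dollar amount) memory and run time may still be a concern. Dropping such columns up front with a `keep=` or `drop=` dataset option is one possible workaround.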
