BookmarkSubscribeRSS Feed
AJChamberlain
Obsidian | Level 7

Greetings SAS people,

 

I have a dataset with a number of different alphanumeric variables in it, something like this:

 

RecordSeededCountryCentre - CandidateTask Language CodeSectionDateMonthYearCriteriaScoreYearMonthQtrFormatVersionCentreKeyMark
1nUnited Arab EmiratesAX001170389525345BPFW2AC07432642817-Mar-183201813201831CD45R67FFDBCn
2nUnited Arab EmiratesAX001170389525345BPFW2AC07432642817-Mar-183201828201831CD45R67FFDBCn
3nUnited Arab EmiratesAX001170389525345BPFW2AC07432642817-Mar-183201835201831CD45R67FFDBCn
4nUnited Arab EmiratesAX001170389525345BPFW2AC07432642817-Mar-183201842201831CD45R67FFDBCn
5nUnited Arab EmiratesAX001170389525345BPFW2AC07432642817-Mar-183201854201831CD45R67FFDBCn
6nUnited Arab EmiratesAX001170389525345BPFW2AC07432642817-Mar-183201868201831PB45R67FFDBCy
7nUnited Arab EmiratesAX001170389525345BPFW1AC07432642817-Mar-183201814201831PB45R67FFDBCy
8nUnited Arab EmiratesAX001170389525345BPFW1AC07432642817-Mar-183201821201831PB45R67FFDBCn
9nUnited Arab EmiratesAX001170389525345BPFW1AC07432642817-Mar-183201832201831PB45R67FFDBCn
10nUnited Arab EmiratesAX001170389525345BPFW1AC07432642817-Mar-183201847201831PB45R67FFDBCy

 

This is a massive dataset (65m rows) and all the values change, sometimes there are only two different values, sometimes there are thousands. My task is to transform the data so that all of the content of every variable is numeric only. I could go through and identify all of the variations in data for each variable that is not already a numeric and change them all individually but that feels quite painful. Is there a clever way in SAS to transform a variable so that the transformed value is numeric only, and the system provides some kind of key to show what each original and transformed variable is?

 

Note I am not trying to convert the formats here, it's the contents that need to change.

 

Thanks

A

 

 

6 REPLIES 6
Reeza
Super User
Sounds like you want to create a snowflake or star schema data instead of what you have there? Do you have access to SAS DI or SQL Server, they're usually used to do this type of analysis.

This paper outlines some of the base method you can use here:
http://support.sas.com/resources/papers/proceedings11/145-2011.pdf

ballardw
Super User

The key to show a value that has been assigned a numeric value  is custom format such as:

Proc format;

   value yesno

   1 = 'Yes'

   0 = 'No'

   ;

run;

And then associate the format with a variable that takes numeric values of 0 and 1.

A similar INVALUE statement can create a numeric value when reading a variable.

So the tasks involved are

1) getting UNIQUE values

2) decide on format VALUE and INVALUE names (31 character limit for character variables as you need a $ in the name )

3) create the formats. Proc Format can use a specially structured data set to create a format so creating that set

4) use a data step to build a new dataset with the numeric variables and assign the formats.

 

Here is an example using a data set you should have available as a start.

data junk;
   set sashelp.class ;
   array c _character_;
   length vn $ 32. val $ 100;
   do i= 1 to dim(c);
      vn=vname(c[i]);
      val = upcase(c[i]);
      vl = vlength(c[i]);
      output;
   end;
   keep  vn val vl;
run;
proc sql;
   create table use as
   select distinct vn, val, vl
   from junk
   order by vn, val;
run;
/* create a data set with numeric values from 1 to n for each 
   unique value of the variable by name
*/
data makefmts;
   set use;
   by vn;
   retain num;
   if first.vn then num=1  ;
   else num+1;
run;

data cntlin;
   set makefmts;
   length fmtname $ 32 start label $ 100 hlo $ 3;
   fmtname=vn; 
   /* characteristics of the informat*/
   type='I' ;
   start = upcase(strip(val));
   label= strip(put(num,best5.));
   hlo='U'; 
   output;
   /* characteristics of the FORMAT*/
   type='N';
   start= strip(put(num,best5.));
   label= upcase(strip(val));
   default=vl;
   output;
run;

proc sort data=cntlin;
   by  fmtname type num;
run;

proc format library=work cntlin=cntlin cntlout=work.cntlout;
run;

data work.newclass;
   set sashelp.class;
   numname = input(name,name.);
   numsex  = input(sex,sex.);
   format numname name. numsex sex.;
run;
   

Warnings:

If you keep adding values there is no guarantee that a specific value will maintain the same numeric value. You likely should place the CNTLIN and generated CNTLOUT data sets in permanent library.

If you add variables then you have another start from scratch scenario.

Note that the SORTED order of the CNTLIN data set is very critical.

This uses the name of the variable to create both and informat and format with the same name. If you already have such then you may need to play around with the FMTSEARCH  system option and where to place the results of this Proc Format code.

All of the character values will be treated in uppercase. If this is not acceptable then you will need to write a lot of manual code. The purpose of the uppercase options are so that values like "UNITED ARAB EMIRATES" "United Arab Emirates" and "united arab emirates" all would get the same numeric value. If you want the same spelling but different combinations of upper and lower case to get different numeric values you are on your own as there are obviously many more rules involved than you have specified.

The length for VAL in the first data step should be length of the longest expected value of any of the character variables. If you have some sort of open text variable that would contain a paragraph of description you should really consider if it is needed at all for whatever the process may be.

 

Anything involving dates will in the long run benefit from creating a actual SAS date values you can generate Month, Year, yearmonth and Qtr (assuming it is a calendar quarter) all from the single date as needed if you have that as a date value and sorting and analysis will be much cleaner. By the way a year month like 20183 will not sort properly in either numeric or character form. You would have to make sure that the months were all two digits so that 201803 comes before 201812 if character and if you use a numeric 201812 it will come after 20191.

Reeza
Super User
Echoing the concern over dates, you should just need a single field for each date, which then can link to a date dimension with all the other months/days as necessary. I have some code to generate date dimensions if needed.
Tom
Super User Tom
Super User

Do you want to treat each variable separately or all of them together.  So if TASK and FORMAT both have the value 'XX' should they get the same number to replace it? Or is it ok if XX in TASK is mapped to a different number than XX in FORMAT?

AJChamberlain
Obsidian | Level 7

Thanks to everyone who's helped with this. Where I've got to is I am going to create a numeric variable with a numeric value for each distinct value in my original data. What I need now is to be able to show a comparison between the two variables clearly.

 

So for example, if I have a table that includes this:

 

Var 1Var 2
abc1231
abc1231
abc1231
def3452
def3452
def3452
zxc3343
zxc3343
bnm5674
bnm5674

 

Then I want to be able to create a report, that looks like this, it's a bit like a cut-down Proc Freq:

 

Var 1Var 2
abc1231
def3452
zxc3343
bnm5674

 

Can anyone help me with a way to create this kind of report?

 

Thanks

A

Reeza
Super User
PROC SORT with NODUPKEY or PROC SQL with SELECT DISTINCT or as you mentioned, use PROC FREQ.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 783 views
  • 0 likes
  • 4 in conversation