Solved: PLM restore not scoring because of variable format

Negarev · Posted 05-03-2021 04:31 PM

Hi,

Using SAS EG 9.4, I'm trying to score a dataset using PLM restore. Never had any issues with this procedure until now. My problem is that SAS issues a warning related to one variable required to score the dataset and that procedure does not come through - all predictions are missing.

Here is the warning:

Apparently, this is a bug fixed in release 9.3:

http://support.sas.com/kb/42/902.html

They say in the link above that this warning can be circumvented by avoiding to specify a user-defined format. However, that specific variable was not user-defined - it is a character var of length 10 that comes from an Excel doc that I imported into SAS. I haven't tinkered around that variable at all.

Anybody have the same issue and fixed it?

Negarev · Posted 05-04-2021 04:51 PM

Hi All,

OK quick update: my trick has worked. It seems that SAS has attributed some custom format to my character variables during the import. This custom format is then detected as incompatible when PLM RESTORE tries to score the data.

Thanks everyone for your help!

@ballardw : Could you please give me a detailed example on how to import a CSV onto SAS via a DATA step?

View solution in original post

Tom · Posted 05-03-2021 04:59 PM

Not sure that the fact that you are using Enterprise Guide to submit the SAS code has anything to do with this. Unless you are using some feature of Enterprise Guide to generate the PROC PLM code for you.

What version of SAS are using? Look at the value of SYSVLONG macro variable.

What is the code you submitted? Show the SAS LOG messages for the full PROC PLM step.

Reading data from Excel files is a good way to have variables convert seemingly randomly between character and numeric. Since Excel is a spreadsheet and not a database a single column and contain any type of data. Especially if humans are involved in entering the values.

So perhaps you have attached the wrong formats for the type of variables you actually got this time when you read the Excel file into a SAS dataset?

Are there any FORMAT statements in your code?

Negarev · Posted 05-03-2021 05:16 PM

Thanks for your reply @Tom !

OK so the value for SYSVLONG is 9.04.01M5P091317

I use the PLM restore procedure in a macro, because I have half a dozen different segments to score:

%macro scoring(type,seg,optional,scoredat);
proc plm restore=PARMS.&type._&optional._&seg.;
score data= &scoredat. out=&type._&scoredat._Pred; /* evaluate the model on new data */
run;
%mend;

/*Scoring segment 1 out of 6*/
%scoring(MV,PAS,No_Outl,STD_WHK_PAS)

Yes, I'm aware of the shortcomings of Excel, but this variable is made only of characters, so I don't see how it could be converted into a numeric variable.

There is no format involved in my code for this variable - I just import the mappings (one per Excel sheet):

PROC IMPORT DATAFILE="&INF.\NZ_STD_FILE_2021_nocount.xlsx" DBMS=XLSX
OUT=PAS REPLACE;
SHEET=PAS;
RUN;
PROC IMPORT DATAFILE="&INF.\NZ_STD_FILE_2021_nocount.xlsx" DBMS=XLSX
OUT=SUV REPLACE;
SHEET=SUV;
RUN;
PROC IMPORT DATAFILE="&INF.\NZ_STD_FILE_2021_nocount.xlsx" DBMS=XLSX
OUT=PU_CC REPLACE;
SHEET=PU_CC;
RUN;
PROC IMPORT DATAFILE="&INF.\NZ_STD_FILE_2021_nocount.xlsx" DBMS=XLSX
OUT=Van_Bus REPLACE;
SHEET=Van_Bus;
RUN;

PROC IMPORT DATAFILE="&INF.\NZ_STD_FILE_2021_nocount.xlsx" DBMS=XLSX
OUT=Location REPLACE;
SHEET=Location;
RUN;PROC IMPORT DATAFILE="&INF.\NZ_STD_FILE_2021_nocount.xlsx" DBMS=XLSX
OUT=Channel REPLACE;
SHEET=Channel;
RUN;

Then I make a quick adjustment on one of the variables and finally I append all of the mappings together:

DATA Van_Bus;
SET Van_Bus;
RENAME STD_Segment=Segment;
run;

DATA STD_Combined;
Length STD_MAKE $25
STD_MODEL $40
V_MAKE $40
Segment $15
V_MODEL $40
Correct_VehicleTypeCode $2
AU_VFACT_CATEGORY $20;
set Van_Bus PAS SUV PU_CC Location Channel;
run;

The format statements do not involve the variable I'm having an issue with. Without these format statements, I get a warning that these variable have several lengths and that it can lead to truncation.

ballardw · Posted 05-03-2021 05:29 PM

And just what does Proc Contents say about the variable STAT_MAKE in the data sets used by the proc?

I am no longer amazed by what can come out of Excel that "doesn't seem possible" because the column is supposed to be some description.

This sounds like a process that is repeated enough that reliance on Proc Import should have been removed long ago.

Negarev · Posted 05-03-2021 06:02 PM

Thanks @Reeza and @ballardw for your replies!

OK so the type and format of STAT_MAKE in all of the source files is the same. It is also the same in the final, appended file:

@ballardw : what alternatives to proc import are there?

ballardw · Posted 05-03-2021 07:12 PM

@Negarev wrote:

Thanks @Reeza and @ballardw for your replies!

OK so the type and format of STAT_MAKE in all of the source files is the same. It is also the same in the final, appended file:

@ballardw : what alternatives to proc import are there?

Basically if reading multiple files of the same structure that should have the same variable types, name, lengths, formats and informats things come down to a data step to impose order. One way is to save each of the sheets as a separate CSV, or other delimited format, and write a data step to read them. Another is to use a Libname statment for an Excel file. Then parse the existing columns of sheets with a data step to standardize types.

I wonder just how many variables you are using in your model. The above picture of the proc contents where it shows 4 of 14 (sort of shown variables) as $1 makes me wonder about the other variables because $1 is the typical setting when the first rows of a spreadsheet have blank cells. You may be getting odd behaviors if other model variables have values that did not appear when the scoring data was made though I would expect different error messages for that case.

Negarev · Posted 05-03-2021 07:30 PM

Thanks for sharing, I had no idea it was possible to read a CSV through a data step! Would you mind elaborating on that please? PROC IMPORT is actually the standard taught in Programming 1, so I didn't know there were other ways to do that - maybe that could solve my current problem...

Reeza · Posted 05-03-2021 08:36 PM

A data step is good for CSVs but unfortunately Excel files with many sheets are problematic. Excel doesn't enforce types so you can get lucky and it can work sometimes but I wouldn't bet my job on it.

Reeza · Posted 05-03-2021 08:35 PM

You need to show this for the original data set you used to build the model and for the data set you're trying to score. They need to match. Seeing one isn't confirmation. The reason you'll get requests for screenshots/logs from us, is the fact that EM is a lesser known product so many of us won't have access to it, even if we've worked with it historically.

@Negarev wrote:

Thanks @Reeza and @ballardw for your replies!

OK so the type and format of STAT_MAKE in all of the source files is the same. It is also the same in the final, appended file:

@ballardw : what alternatives to proc import are there?

Tom · Posted 05-03-2021 06:33 PM

You have not shown the LOG.

You have not shown enough code. Where is the code that produced the item store that the error message is complaining about?

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_plm_overview.htm

The PLM procedure performs postfitting statistical analyses for the contents of a SAS item store that was previously created with the STORE statement in some other SAS/STAT procedure. An item store is a special SAS-defined binary file format used to store and restore information with a hierarchical structure.

Negarev · Posted 05-03-2021 11:30 PM

No, the error is not caused by the project that builds the model. The reason is that I am able to score that exact same model if I just use my old mapping file. This old mapping file works, but has a number of problems that I won't go into much detail here. This is the reason I'm building a new mapping file.

I think I have made some progress. In my original mapping file, the character variables have a different format. If I just open the dataset and right click on the variable > Properties, this is what I get:

The format is blank.

However if I go to my new mapping file:

The format is not blank and is the same as the informat.

I've found out that if I just duplicate the variable and drop the old version like so:

DATA STD_Combined2;
set STD_Combined;
STAT_MAKE2=STAT_MAKE;

run;



DATA STD_Combined3;
set STD_Combined2;
DROP STAT_MAKE;

RENAME STAT_MAKE2=STAT_MAKE;

run;

The format becomes blank as well (same as the informat):

If I try to run the PROC PLM RESTORE now, the warning shifts to another variable of the model (STD_LOCATION):

So I reckon if I just do this with every character variable used in the model, it should work...

I know this is very strange, but if anyone has a different solution I'd be glad to try it.

Tom · Posted 05-04-2021 12:17 AM

Another "feature" of PROC IMPORT. It has a nasty habit of attaching the $ format to character variables for no good reason. You use a FORMAT to tell SAS that you want to display the values of the field in a special way. SAS does not need special instructions for displaying character variables.

You can use the FORMAT statement with a list of variables without any format specification after them to remove the formats from a variable. You can use the special _CHARACTER_ variable list to specify all of the character variables.

So to remove the formats from all of the character variables use something like:

DATA STD_Combined2;
  set STD_Combined;
  format _character_ ;
run;

You can use PROC DATASETS to modify the format without having to re-write the whole dataset.

proc datasets nolist lib=work;
  modify STD_Combined;
    format _character_ ;
  run;
quit;

You could even try just add the FORMAT statement to your PROC PLM step.

%macro scoring(type,seg,optional,scoredat);
proc plm restore=PARMS.&type._&optional._&seg.;
  format _character_ ;
score data= &scoredat. out=&type._&scoredat._Pred; /* evaluate the model on new data */
run;
%mend;

If you know the formats attached to all of the vairables in the original analysis then perhaps you can include those in the format statement also. Just remember that when using the format statement to remove formats those vairables need to be last ones listed. You can have multiple format statements. In the same vairable appears multiple times The last format "wins".

Negarev · Posted 05-04-2021 04:51 PM

Hi All,

OK quick update: my trick has worked. It seems that SAS has attributed some custom format to my character variables during the import. This custom format is then detected as incompatible when PLM RESTORE tries to score the data.

Thanks everyone for your help!

@ballardw : Could you please give me a detailed example on how to import a CSV onto SAS via a DATA step?

Reeza · Posted 05-04-2021 04:58 PM

Run a proc import in Base SAS on a CSV and see the code generated in the log. You probably usually skip it over, but it is there in detail. FYI - When you use PROC IMPORT it has to scan the full file first to determine lengths and formats which usually takes the most time to import the data. A data step will be significantly faster.

ballardw · Posted 05-04-2021 06:15 PM

@Negarev wrote:

Hi All,

OK quick update: my trick has worked. It seems that SAS has attributed some custom format to my character variables during the import. This custom format is then detected as incompatible when PLM RESTORE tries to score the data.

Thanks everyone for your help!

@ballardw : Could you please give me a detailed example on how to import a CSV onto SAS via a DATA step?

One way is to use proc import on the csv with the GUESSIONGROWS=MAX; option set. The Log will contain data step code to read the data. You can copy that code from the log and paste into the editor. Clean up by removing line numbers if present in the log. Look at the INFORMAT statements at the top of the code. Those are setting the length and type of the variables. If the informat does not make sense: reading a column that should be numeric as character or vice versa set the type by changing the informat. For character values set the informat long enough for the longest expected value. If a value should be date, time or datetime value make sure the informat matches your expected layout. Remove most of the formats except for date, time or datetime variables.

Change variable names to a "nicer" or standard version.

If you have data that is mostly numeric but has stuff like "NA" "NULL" or some other text that basically means missing you can create a custom informats to read that data assigning a missing value for those text strings and read the rest as 12. or similar informat.

If you have values that should be dates but are missing delimiters to tell SAS that 01212021 should be 21JAN2021 look up the various date, time or datetime INFORMATS. There will usually be one that will work. Caution 6 digit strings with 2-digit years need to be watched closely as 010203 could represent as many as 6 different dates depending on which century you expect the 2-digits to represent.

Add labels and maybe if you have common data issues you can include preliminary data checking.

Other ideas: you can write custom informats for variables that should have a limited range of values that write error messages to the log for you.

proc format ;
invalue codevalue 
'1','2','3','4'=_same_
'NA' = .
other=_error_
;
run;

data example;
   input x :codevalue.;
datalines;
1
NA
2
3
4
9
88
;

I incorporated an example of how to read text NA as missing without being an error.

The _same_ instruction says to just treat the values in the list as the "same" value. Since this is a numeric informat that means the values are the numbers 1 to 4. Anything other than NA or those digits is invalid data and appears as such in the log. So you could go to the source to find out why these are appearing or take other action.

I use a lot of these formats as I have some data sources that are manually entered and are constantly getting the wrong code values (which can be LOTS longer than shown) in fields.

The default INFORMAT statements will be in column order as will the Input statement. The Input statement must match the column order of the data. The Informat or other statments like ATTRIB that could be used to set the property do not.

Once you have the data step working for one file then you only need to change the input file name and output data set(if needed) and rerun it. There are multiple posts on this forum on ways to read multiple identical structure text files if that is useful.

If you have a source that randomly changes column order but not variables you can adjust the INPUT statement to match the changed order.

PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

Re: PLM restore not scoring because of variable format

SAS Innovate 2025: Call for Content