Solved: Re: proc transpose: duplicate error related to underscores

dcortell · Posted 06-22-2022 11:01 AM

Hi all

I performed the following:

214        proc transpose data=rank_score out=rank_score_tr
215        (drop=_name_ _suffix_ _label_);
216        by url;
217        id _term_;
218        var N;
219        run;

ERROR: The ID value "_a" occurs twice in the same BY group.
ERROR: The ID value "_s" occurs twice in the same BY group.
ERROR: The ID value "u_" occurs twice in the same BY group.
ERROR: The ID value "y_" occurs twice in the same BY group.

Getting the error related to the above "supposed" duplicates

I checked, no duplicate records for any of the above IDs

Also, performing the above in Viya does not provide error. Running it in sas 9.4 batch environment trigger the error.

Any idea about what it could be related?

Tom · Posted 06-22-2022 11:29 AM

Looks like some of the values of the variable you listed in the ID statement are NOT valid SAS names. So PROC TRANSPOSE had to convert them into something that would be a valid SAS name.

Also consider the case where you have lowercase and uppercase versions of the same string. That will look like two different values to PROC SORT or a simple BY processing, but would both be the same variable name since SAS variable names are NOT case sensitive.

You could try running it with the option VALIDVARNAME set to ANY, but then you will end up with a dataset with ridiculous variable names.

What is the purpose of the PROC TRANSPOSE step anyway?

If you are just trying to make a tabular report then use PROC REPORT or PROC TABULATE.

proc report data=rank_score ;
  column url 'N'n,_term_;
  define url / group;
  define _term_ / across ' ';
  define 'N'n / ' ';
run;

View solution in original post

dcortell · Posted 06-22-2022 11:17 AM

View of sample data already sorted before the proc transpose

Tom · Posted 06-22-2022 11:29 AM

Looks like some of the values of the variable you listed in the ID statement are NOT valid SAS names. So PROC TRANSPOSE had to convert them into something that would be a valid SAS name.

Also consider the case where you have lowercase and uppercase versions of the same string. That will look like two different values to PROC SORT or a simple BY processing, but would both be the same variable name since SAS variable names are NOT case sensitive.

You could try running it with the option VALIDVARNAME set to ANY, but then you will end up with a dataset with ridiculous variable names.

What is the purpose of the PROC TRANSPOSE step anyway?

If you are just trying to make a tabular report then use PROC REPORT or PROC TABULATE.

proc report data=rank_score ;
  column url 'N'n,_term_;
  define url / group;
  define _term_ / across ' ';
  define 'N'n / ' ';
run;

dcortell · Posted 06-22-2022 11:41 AM

The transpose is an intermediate step needed to have all the unique column "ID" values transposed horizontally, which later will be stored in a macro var, which will be called iteratively later in the process.

I checked, and no, example: "_a" and "_A" values were stored in the ID column

And if I'm not wrong, the fact that a variable name is starting with an underscore is accepted by SAS. So not sure what is causing this in sas 9.4, while in Viya no problem rise for the same procedure

Tom · Posted 06-22-2022 12:36 PM

@dcortell wrote:

The transpose is an intermediate step needed to have all the unique column "ID" values transposed horizontally, which later will be stored in a macro var, which will be called iteratively later in the process.

I checked, and no, example: "_a" and "_A" values were stored in the ID column

And if I'm not wrong, the fact that a variable name is starting with an underscore is accepted by SAS. So not sure what is causing this in sas 9.4, while in Viya no problem rise for the same procedure

It is not the underscores. It is the DUPLICATES. You might have created duplicates you didn't think you had by giving it two distinct values that map to the same actual variable name. Here are some examples.

data have;
  input id name $char40. ;
  dummy=1;
cards;
1 _a
1 _A
2 /B
2 _B
3 /c
3 *C
4 VAR45678901234567890123456789012aaa
4 VAR45678901234567890123456789012xxx
;

180  proc transpose data=have out=want;
181    by id;
182    id name;
183    var dummy;
184  run;

ERROR: The ID value "_B" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
      id=2
ERROR: The ID value "VAR45678901234567890123456789012" occurs twice in the same BY group.
NOTE: The above message was for the following BY group:
      id=4
WARNING: 2 BY groups omitted due to earlier errors.
WARNING: Variable _A already exists on file WORK.WANT.
WARNING: Variable _C already exists on file WORK.WANT.
NOTE: There were 8 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 2 observations and 6 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

Notice that you get two different types of notes in the log depending on whether you have exact name matches or just situations where the case of the names are different. But really both are a problem.

Why would you need to TRANSPOSE the data to get a list of the VALUES into macro varaibles?

proc sql noprint;
  select distinct nliteral(upcase(name)) into :mvar separated by '|' from have;
quit;

dcortell · Posted 06-22-2022 12:48 PM

I'm getting confused. Why the system should consider duplicates situations like "

VAR45678901234567890123456789012aaa

and

VAR45678901234567890123456789012xxx

if they are two different IDs, if only in the last three characters?

The same for "/B" or "_B" ?

Tom · Posted 06-22-2022 12:53 PM

A valid SAS name can be a maximum of 32 bytes long.

It can only contain letters, digits and underscores.

It cannot start with a digit.

In both cases two different values of the name string are mapped to the same SAS variable name.

In the case of the long name strings the first 32 bytes of the two different strings are the same.

In the case of the name strings with the invalid characters in them when the invalid characters are replaced with underscores to create valid names they end up with the same name.

dcortell · Posted 06-22-2022 12:54 PM

Terrific, all clear now. In debt Tom

dcortell · Posted 06-22-2022 01:43 PM

Ok I investigated a little more and I still need to transpose the columns cause the transposed "bigrams" will be needed as column for modelling purposes. Each column will represent a variable to be included as input variable in a model which need the vars in column. Any chance to still be able to transpose the bigrams, also we with initial characters like " ' " ?

Tom · Posted 06-22-2022 02:54 PM

@dcortell wrote:

Ok I investigated a little more and I still need to transpose the columns cause the transposed "bigrams" will be needed as column for modelling purposes. Each column will represent a variable to be included as input variable in a model which need the vars in column. Any chance to still be able to transpose the bigrams, also we with initial characters like " ' " ?

Why do you need the data in that form to do the analysis? What is the analysis you are doing?

Why do you need to use bigram values as the variable names for the analysis? Just use generic names for the variables.

For example give each bigram a distinct number and distinct valid SAS variable name.

proc sort data=have;
  by bigram ;
run;
data make_names;
  set have;
  by bigram ;
  bigram_num + first.bigram;
  length name $32 ;
  name = cats('bigram',bigram_num);
run;

If you want you could use the IDLABEL statement in PROC TRANSPOSE to use the BIGRAM value as the LABEL of the generated variables instead of trying to use it as the NAME of the variable.

dcortell · Posted 06-22-2022 12:28 PM

Just adding the validvarname=any option solved the topic in sas 9.4. I think the error was related to the values like " 'a" or " 'b", despite the log was pointing to the "_a" or "_b" values of the ID variable.Thanks.

Tom · Posted 06-22-2022 12:44 PM

@dcortell wrote:

Just adding the validvarname=any option solved the topic in sas 9.4. I think the error was related to the values like " 'a" or " 'b", despite the log was pointing to the "_a" or "_b" values of the ID variable.Thanks.

So the reason it worked in VIYA is someone had accidentally set VALIDVARNAME to ANY in your VIYA session.

Ksharp · Posted 06-23-2022 09:33 AM

Try LET option if you know that is right.

proc transpose data=rank_score out=rank_score_tr LET .............;

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away