Hello all,
I have a dataset in SAS, which was imported from excel (unfortunately a SAS download from the master database is not an option). Also, I would prefer to manipulate this data in SAS as opposed to excel. Anyhow...
I have cells with multiple codes within them... usually like so 384849495959.... They are usually in groups of 4, with atleast 2 separate groups in a cell, and potentially up to 10 or more groups.
When I break them up using the substring command. This seems to work well... However, SAS seems confused about the number groups beyond the first one. For example.
data adt.all2;
set adt.all1;
if ap =2 then psic1 = substrn(var3,1,4);
if ap =2 then psic2 = substrn(var3,5,6);
run;
Cell 1 after code newcell1 newcell2
35554555 --------> 3555 4555
I have to specific the psic2 length at 6 even though it is really 4 or it unecessarily cuts off the number i want in the cell..
From this, I can reformat newcell1 as a numeric cell, while newcell2 will give me problems. Ive tried the to multiply, as well as compress, left, put and input functions to try and get newcell2 into a 'numeric variable.' However, I get errors everytime. From my reading it seems as though this has something to do with the hexadecimal representation. I get that it likely has to do with that, but that doesnt really help me in resolving the problem so that I have a numeric variable in newcell2 to work with. And re-inputting data is not really an option.
Anyhow, thanks for your help - advice is certainly appreciated.
Wonderful, thank you!
So, at least these five examples all have hex characters 0D0A -- carriage return, line feed -- as "delimiters" between the first two 4-digit codes. So, to get rid of these (and only these) unwanted characters, you can use:
var3_new=compress(var3,'0D0A'x);
Edit: Or even better: var3=...
Hello @r4321,
If your codes consist of a known set of characters, e.g., only digits 0, ..., 9, you could apply the COMPRESS function with the K ("keep") modifier.
Example:
var3c=compress(var3,,'kd'); /* removes all characters but digits */
If you're interested in the "offending" non-numeric characters that are contained in VAR3, you could post the output (5 lines) of the following PROC PRINT step:
proc print data=adt.all1(obs=5);
where verify(var3, ' 0123456789');
format var3 $hex44.;
var var3;
run;
This will show the hex codes of the first (up to) 22 characters of the first 5 VAR3 values in dataset ADT.ALL1 which contain characters other than blanks or digits. Possibly, these first 5 values are representative for the dataset.
Edit: As soon as we know the offending characters, we can specify them in the second argument of the COMPRESS function (then without the K modifier) in order to remove them. But the "keep" approach might be more elegant in your case.
Freelance, thanks for your help --- Ive tried a couple things based on your recommendations. Still stuck, however.
One more
@r4321 wrote:
One more
[Screenshot (19).png]
Thanks for posting the screenshots.
This last one is easy: PSIC2 contains 4-digit codes such as 3534. In this case, the result of input(psic2,hex.) is the integer represented by the hexadecimal number 3534, which is 13620. BYTE(13620) requests the 13621th character of the ASCII collating sequence, which contains only 256 characters, hence the "Invalid argument" notes.
Are your 4-digit codes hexadecimal values? This could at least explain why not all of them can be converted to numeric values using standard numeric informats.
I think, the best way to find out what's going on is to look (with a $HEXw. format) at the original long character strings (before they are cut into pieces), especially those which cause problems.
This is why I had suggested the PROC PRINT step applying $HEX44. format to (a few values of) variable VAR3 in dataset ADT.ALL3.
Your screenshot no. 18 shows that no observations in ADT.ALL3 have VAR3C values which contain non-blank, non-numeric characters. I am surprised to see that your permanent dataset ADT.ALL3 contains a variable named VAR3C. I had used this name in my example code
var3c=compress(var3,,'kd');
Therefore, I suspect that you newly created VAR3C in ADT.ALL3 using the above line of code. In this case, the result "No observations ..." is only logical, because this VAR3C, by definition, contains only digits if anything.
So, to obtain a non-trivial result from the abovementioned PROC PRINT step, please replace "var3c" by the name of one of the variables containing those raw "multiple codes" (variable length $70 or so).
Thanks for your comments.. heres a couple screenshots based on them
Heres another...which shows what the digits actually are for the first 5 observations in the dataset. Thanks.
Wonderful, thank you!
So, at least these five examples all have hex characters 0D0A -- carriage return, line feed -- as "delimiters" between the first two 4-digit codes. So, to get rid of these (and only these) unwanted characters, you can use:
var3_new=compress(var3,'0D0A'x);
Edit: Or even better: var3=...
Freelance,
Not quite sure how the you figured that out! but it works!
Thanks a lot, I unfortunately spent most of the day trying to get this figured out...So I really appreciate your patience and help!!
This command works great
data adt.all2;
set adt.all1;
var3_new =compress(var3,'0D0A'x);
run;
But did you mean try this instead?
data adt.all2;
set adt.all1;
var3=...;
run;
Or...
data adt.all2;
set adt.all1;
var3_new =compress(var3=...);
run;
?
The first one worked, but not sure what you mean by the var3=...
Anyhow, youre great, thanks!
@r4321 wrote:
But did you mean try this instead?
data adt.all2;
set adt.all1;
var3=...;
run;
You're welcome. Sorry for being too lazy (in the middle of the night, CET) to write out the alternative code:
var3=compress(var3,'0D0A'x);
i.e., the suggesition to overwrite VAR3 rather than create a new variable VAR3_NEW (so as to "save" the original value with the "precious" '0D0A'x characters). The three periods "..." were meant to denote an ellipsis! 🙂
Please provide some additional details about your variables. I am unable to duplicate your issue with numeric or a string variable that that starts with the characters shown.
Since you mention up to 10 groups I have to assume that var3 is character of at least 40 characters. Do you explicitly assign a length to the psic variables? a format?
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.