Hi Everyone,
I have a character variable (p1q1_living) for which data was entered inconsistently and some respondents have picked two age ranges (sample SAS dataset attached). I need to find a way to extract the start and last years of the reported year ranges and put them together using either hyphen or "to". I couldn't come up with a proper string function. So, I tried the attached manual conversion but it at least didn't work for people who reported more than one year ranges.
Thank you for your help in advance.
Data PRISM.p1q1_rawdata4;
Set PRISM.p1q1_rawdata3;
If p1q1_living = "18 to 27 years" then p1q1_living= "18-27"; If p1q1_living = "19 to 35 years" then p1q1_living= "19-35";
If p1q1_living = "19 to 42 years" then p1q1_living= "19-42"; If p1q1_living = "19 to 50 years" then p1q1_living= "19-50";
If p1q1_living = "20 to 32 years" then p1q1_living= "20-32"; If p1q1_living = "20 to 39 years" then p1q1_living= "20-39";
If p1q1_living = "20 to 40 years" then p1q1_living= "20-40"; If p1q1_living = "21 to 30 years" then p1q1_living= "21-30";
If p1q1_living = "19-33 years, 36-52 years" then p1q1_living= "19-33" AND p1q1_living= "36-52";
If p1q1_living = "22-39, 39-41" then p1q1_living= "22-39" AND p1q1_living= "39-41";
If p1q1_living = "21 to 39 years" then p1q1_living= "21-39"; If p1q1_living = "22 to 46 years" then p1q1_living= "22-46";
If p1q1_living = "23 to 46 years" then p1q1_living= "23-46"; If p1q1_living = "24 to 38 years" then p1q1_living= "24-38";
If p1q1_living = "24 to 43 years" then p1q1_living= "24-43"; If p1q1_living = "25 to 35 years" then p1q1_living= "25-35";
If p1q1_living = "26 to 40 years" then p1q1_living= "26-40"; If p1q1_living = "26 to 56 years" then p1q1_living= "26-56";
If p1q1_living = "27 to 46 years" then p1q1_living= "27-46"; If p1q1_living = "29 to 41 years" then p1q1_living= "29-41";
If p1q1_living = "29 to 49 years" then p1q1_living= "29-49"; If p1q1_living = "32 to 40 years" then p1q1_living= "32-40";
If p1q1_living = "32 to 43 years" then p1q1_living= "32-43"; If p1q1_living = "32 to 47 years" then p1q1_living= "32-47";
If p1q1_living = "32 to 48 years" then p1q1_living= "32-48"; If p1q1_living = "34 to 43 years" then p1q1_living= "34-43";
If p1q1_living = "35 to 45 years" then p1q1_living= "35-45"; If p1q1_living = "45 to 62 years" then p1q1_living= "45-62";
If p1q1_living = "18 to 28" then p1q1_living= "18-28"; If p1q1_living = "22 to 42" then p1q1_living= "22-42";
If p1q1_living = "23 to 43" then p1q1_living= "23-43"; If p1q1_living = "24 to 38" then p1q1_living= "24-38";
If p1q1_living = "25 to 37" then p1q1_living= "25-37"; If p1q1_living = "28 to 43" then p1q1_living= "28-43";
If p1q1_living = "30 to 43" then p1q1_living= "30-43"; If p1q1_living = "31 to 51" then p1q1_living= "31-51";
If p1q1_living = "ASKU" or p1q1_living= "NA" OR p1q1_living= "REF" then p1q1_living= " ";
Run;
ID | p1q1_living | ||||||||
12007c | 17-35 | ||||||||
12024c | 18 to 27 years | ||||||||
12025c | 18-25 | ||||||||
12032c | 18-30 | ||||||||
12033c | 18-45 | ||||||||
12043c | 19 to 50 years | ||||||||
12044 | 19-27 | ||||||||
12045c | 19-33 years, 36-52 years | ||||||||
12048c | 20 to 40 years | ||||||||
12049c | 20-34 | ||||||||
12314c | 21 to 39 years |
/*Same idea*/ data have; infile datalines truncover dlm='|'; input ID:$10. p1q1_living:$30.; datalines; 12007c|17-35 12024x| 12024c|18 to 27 years 12025c|18-25 12032c|18-30 12033c|18-45 12043c|19 to 50 years 12044|19-27 12045c|19-33 years, 36-52 years 12048c|20 to 40 years 12049c|20-34 12314c|21 to 39 years ; data want; set have; n=countw(p1q1_living,,'kd'); if mod(n,2)=0 and n ne 0 then do; do i=1 to n by 2; want=catx('-',scan(p1q1_living,i,,'kd'),scan(p1q1_living,i+1,,'kd')); output; end; end; else output; drop n i; run; proc print noobs;run;
Hi Everyone,
I have a character variable (p1q1_living) for which data was entered inconsistently and some respondents have picked two age ranges (sample SAS dataset attached). I need to find a way to extract the start and last years of the reported year ranges and put them together using either hyphen or "to". I couldn't come up with a proper string function. So, I tried the attached manual conversion but it at least didn't work for people who reported more than one year ranges.
Thank you for your help in advance.
Data p1q1_rawdata4;
Set PRISM.p1q1_rawdata3;
If p1q1_living = "18 to 27 years" then p1q1_living= "18-27"; If p1q1_living = "19 to 35 years" then p1q1_living= "19-35";
If p1q1_living = "19 to 42 years" then p1q1_living= "19-42"; If p1q1_living = "19 to 50 years" then p1q1_living= "19-50";
If p1q1_living = "20 to 32 years" then p1q1_living= "20-32"; If p1q1_living = "20 to 39 years" then p1q1_living= "20-39";
If p1q1_living = "20 to 40 years" then p1q1_living= "20-40"; If p1q1_living = "21 to 30 years" then p1q1_living= "21-30";
If p1q1_living = "19-33 years, 36-52 years" then p1q1_living= "19-33" AND p1q1_living= "36-52";
If p1q1_living = "22-39, 39-41" then p1q1_living= "22-39" AND p1q1_living= "39-41";
If p1q1_living = "21 to 39 years" then p1q1_living= "21-39"; If p1q1_living = "22 to 46 years" then p1q1_living= "22-46";
If p1q1_living = "23 to 46 years" then p1q1_living= "23-46"; If p1q1_living = "24 to 38 years" then p1q1_living= "24-38";
If p1q1_living = "24 to 43 years" then p1q1_living= "24-43"; If p1q1_living = "25 to 35 years" then p1q1_living= "25-35";
If p1q1_living = "26 to 40 years" then p1q1_living= "26-40"; If p1q1_living = "26 to 56 years" then p1q1_living= "26-56";
If p1q1_living = "27 to 46 years" then p1q1_living= "27-46"; If p1q1_living = "29 to 41 years" then p1q1_living= "29-41";
If p1q1_living = "29 to 49 years" then p1q1_living= "29-49"; If p1q1_living = "32 to 40 years" then p1q1_living= "32-40";
If p1q1_living = "32 to 43 years" then p1q1_living= "32-43"; If p1q1_living = "32 to 47 years" then p1q1_living= "32-47";
If p1q1_living = "32 to 48 years" then p1q1_living= "32-48"; If p1q1_living = "34 to 43 years" then p1q1_living= "34-43";
If p1q1_living = "35 to 45 years" then p1q1_living= "35-45"; If p1q1_living = "45 to 62 years" then p1q1_living= "45-62";
If p1q1_living = "18 to 28" then p1q1_living= "18-28"; If p1q1_living = "22 to 42" then p1q1_living= "22-42";
If p1q1_living = "23 to 43" then p1q1_living= "23-43"; If p1q1_living = "24 to 38" then p1q1_living= "24-38";
If p1q1_living = "25 to 37" then p1q1_living= "25-37"; If p1q1_living = "28 to 43" then p1q1_living= "28-43";
If p1q1_living = "30 to 43" then p1q1_living= "30-43"; If p1q1_living = "31 to 51" then p1q1_living= "31-51";
If p1q1_living = "ASKU" or p1q1_living= "NA" OR p1q1_living= "REF" then p1q1_living= " ";
Run;
Example data as data step, or at least plain text so we can create a data set.
You do not say what you expect to have for output for the "people who reported more than one year ranges".
Doesn't work is awful vague.
Are there errors in the log?: Post the code and log in a code box opened with the "</>" to maintain formatting of error messages.
No output? Post any log in a code box.
Unexpected output? Provide input data in the form of data step code pasted into a code box, the actual results and the expected results. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the "</>" icon or attached as text to show exactly what you have and that we can test code against.
I bet that you have some conversion of numeric to character notes. Probably from this line as an example which is why request logs.
If p1q1_living = "19-33 years, 36-52 years" then p1q1_living= "19-33" AND p1q1_living= "36-52";
421 data example; 422 p1q1_living = "19-33 years, 36-52 years"; 423 If p1q1_living = "19-33 years, 36-52 years" then p1q1_living= "19-33" AND p1q1_living= 423! "36-52"; 424 run; NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 423:66 NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 423:74 NOTE: Invalid numeric data, '19-33' , at line 423 column 66. p1q1_living=0 _ERROR_=1 _N_=1 NOTE: The data set WORK.EXAMPLE has 1 observations and 1 variables.
Column 66 is the "19-33" after the "then p1q1_living=" and happens because the AND that follows is treated as a LOGICAL comparison, which in SAS are numeric 1/0 values for true/false. The second conversion at column 74 is at the AND as trying to get the types to match.
This is where you have to describe and better show us what you expect for a result.
This code
If p1q1_living = "19-33 years, 36-52 years" then p1q1_living= "19-33" AND p1q1_living= "36-52";
If p1q1_living = "22-39, 39-41" then p1q1_living= "22-39" AND p1q1_living= "39-41";
obviously does not work as intended. You cannot assign two values to the same variable at the same time. What it does is this: It assigns (or tries to assign) a boolean value. On the left side of the AND is the character value "22-39" which SAS tries to convert to a number, returning a missing value. On the right side is the expression p1q1_living= "39-41", which evaluates to 0 (false). So the expression evaluates to . AND 0, which again evaluates to 0. Your variable finally gets the value "0", as the result is converted to character.
What do you really want do do? If you want both values, there are two possibilities:
If p1q1_living = "22-39, 39-41" then do;
p1q1_living= "22-39";
p1q1_living2= "39-41";
end;
If p1q1_living = "22-39, 39-41" then do;
p1q1_living= "22-39";
output;
p1q1_living= "39-41";
end;
if you go with this solution, remember to put another OUTPUT statement at the end of your datastep.
But if you just want to get the first age and the last age, (e.g. "19-33 years, 36-52 years" becomes "19-52"), use SCAN:
p1q1_living = cats(scan(p1q1_living,1,' -','A'),'-',scan(p1q1_living,-1,' -,','A'));
(by adding the 'A' modifier to the scan function, we also consider alphabetical characters delimiters, so that only the numbers are considered words)
Below how this could work for the sample data you've provided.
If there are two date ranges then the code will create two observations. If you want a different treatment for such cases then you need to tell us exactly what the desired result should be.
data have;
infile datalines truncover dlm='|';
input ID:$10. p1q1_living:$30.;
datalines;
12007c|17-35
12024x|
12024c|18 to 27 years
12025c|18-25
12032c|18-30
12033c|18-45
12043c|19 to 50 years
12044|19-27
12045c|19-33 years, 36-52 years
12048c|20 to 40 years
12049c|20-34
12314c|21 to 39 years
;
data want;
set have;
length cleansed $10;
do _i=1 by 1;
cleansed=scan(p1q1_living,_i,',');
if _i>1 and missing(cleansed) then leave;
cleansed=prxchange('s/^(\d+)[^\d]+(\d+).*[^\d]/$1-$2/oi',-1,strip(cleansed));
output;
end;
drop _i;
run;
proc print data=want;
run;
An alternative to @Patrick 's regular experssions:
data have;
input ID $ 6. p1q1_living $ 8-40;
dumy_variable= _N_; /* just for obs number */
cards;
12007c 17-35
12024c 18 to 27 years
12025c 18-25
12032c 18-30
12033c 18-45
12043c 19 to 50 years
12044 19-27
12045c 19-33 years, 36-52 years
12048c 20 to 40 years
12049c 20-34
12314c 21 to 39 years
;
run;
proc print;
run;
data want;
set have;
p1q1_living=compress(p1q1_living,', -',"KD"); /* keep digits, commas, hyphens, and spaces only */
do _N_=1 to countw(p1q1_living,",");
p1q1_living_NEW=translate(compbl(strip(scan(p1q1_living,_N_,","))),"-"," ");
output;
end;
run;
proc print;
run;
Bart
/*Same idea*/ data have; infile datalines truncover dlm='|'; input ID:$10. p1q1_living:$30.; datalines; 12007c|17-35 12024x| 12024c|18 to 27 years 12025c|18-25 12032c|18-30 12033c|18-45 12043c|19 to 50 years 12044|19-27 12045c|19-33 years, 36-52 years 12048c|20 to 40 years 12049c|20-34 12314c|21 to 39 years ; data want; set have; n=countw(p1q1_living,,'kd'); if mod(n,2)=0 and n ne 0 then do; do i=1 to n by 2; want=catx('-',scan(p1q1_living,i,,'kd'),scan(p1q1_living,i+1,,'kd')); output; end; end; else output; drop n i; run; proc print noobs;run;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.