Hello all,
I have a very messy dataset, and I need to split one column into many based on a ',' delimiter.
RISK FACTORS
H1, H5, H6, H10 & Scoring sheet
H1, H6, H10, C1; R4 & Scoring sheet
H1,2,3,5,6,10; C1,3,R3,5
H1-2,5-6,10;C1;R4
H1-2,5-6,9-10;C1-2,4;R3-5
H:5,6,8,10
H; 1,5,6,7,8,10; C3,5; H:4,5
HA, H#, H6, H*, R4, R5, & scoring sheet
The data needs to be in H1, H2, H3, H4, etc, format.
I think the scan() command would be best for this data to split it up once it is cleaned, but I have no idea how to force the format that I need.
Any ideas? Thanks in advance.
IF the code that I showed for parsing one type of values works then perhaps use that a pattern for "re-entering" data.
That would be a time to remove/ignore things like "Screening Codes" or other not obviously useful text.
What is the name of the current variable holding the value?
What names do you want the values placed in? Which variable gets which value for each of your "example" rows?
You should show how you want the result to look as it is not very clear. I am afraid that "The data needs to be in H1, H2, H3, H4, etc, format." does not make much sense when your data looks like "HA, H#, H6, H*, R4, R5, & scoring sheet"
The variable name is Risk_Factors.
I am also confused.
The data should look like H1, H5, H6, H10 & Scoring sheet, with a letter before the number. Disregard Scoring Sheet, I don't know what that is.
Once the column is reformatted, I will create flag variables for H1, H2, H3, H5, H6, H8, H10, C1, C3, R3, and R5.
I imagine that the data would look like the attached document once it is finished.
A recent example of code that will split text strings into separate records based upon a comma as a delimiter
Your PDF shows 3 columns, H C and R. How does that relate to data like HA, H#, H6, H*, R4, R5, & scoring sheet? That is what you have said is in your data.
Help us help you.do you mean that H1, H5, H6, H10 & Scoring sheet are the names of your variables? What are the entire range of possible letter number combinations in the data?
If you show an actual result there is likely to be not need for separate steps to "reformat" and then "add flags". If the goal is to create flags, then show what the flag results should be.
Incomplete rules:
Consider this line:
H1-2,5-6,9-10;C1-2,4;R3-5
does a - mean that a continuous sequence is involved? Such as R3-5 means that at the end there should be an R3 R4 R5? If there is to be a "flag" then what is the value of the flag.
Also, did you retype the example data? The line
H1,2,3,5,6,10; C1,3,R3,5
does not show a ; before the R that appears in other lines (very helpful if the ; is actually consistent).
How many H, C and R flags are to be created for each?
Someone needs to go to a class on data entry. It appears, which you should state in your rules, that some of the values are entered as LetterNumber combinations, some are entered are ranges in different formats: H1-2 H:5,6,8,10.
But we need some rule of what the heck to do with H# HA H* (you said letter and number #; # A and * are not numbers).
Here is one way that creates flags with values of 1 when the H, C and R a delimited by semicolons. That is only one case of your data. I create the flags here because the steps to make columns, especially with H , C and R values on the same "row" as your PDF shows, is actually harder.
data example; Risk_Factors ="H1-2,5-6,9-10;C1-2,4;R3-5"; array h(10); array c(10); array r(10); length tword rpiece $ 20; /* process semicolon delimited*/ /* this breaks the string down to pieces to process "like" values */ do i= 1 to countw(risk_factors,';'); tword = scan(risk_factors,i,';'); /* if the subpiece has H*/ /* separate value on commas*/ if index(tword,'H')> 0 then do j=1 to countw(tword,','); rpiece= scan(tword,j,','); /*remove Letter if present in the "range" piece*/ rpiece= compress(rpiece,'H'); if index(rpiece,'-')>0 then do; /* dash for a range*/ do k=input(scan(rpiece,1),f2.) to (input(scan(rpiece,2),f2.)); h[k]=1; end; end; else do; /* not a range, assumes single value*/ h[input(rpiece,f2.)]=1; end; end; /* similar for C and R*/ if index(tword,'C')> 0 then do j=1 to countw(tword,','); rpiece= scan(tword,j,','); rpiece= compress(rpiece,'C'); if index(rpiece,'-')>0 then do; /* dash for a range*/ do k=input(scan(rpiece,1),f2.) to (input(scan(rpiece,2),f2.)); c[k]=1; end; end; else do; /* not a range, assumes single value*/ c[input(rpiece,f2.)]=1; end; end; if index(tword,'R')> 0 then do j=1 to countw(tword,','); rpiece= scan(tword,j,','); rpiece= compress(rpiece,'R'); if index(rpiece,'-')>0 then do; /* dash for a range*/ do k=input(scan(rpiece,1),f2.) to (input(scan(rpiece,2),f2.)); r[k]=1; end; end; else do; /* not a range, assumes single value*/ r[input(rpiece,f2.)]=1; end; end; end; drop i j k tword rpiece; run;
The general approach to garbage data like this is to
1) identify similar types of record
2) program an Identification piece to determine which pattern is needed
3) parse the data
4) send the protocol designer, who ever created the experiment/data collection system, back to school so that all data entry is performed to the same standard.
Note: If there are only a "few" records this may be easier done by hand. Few depends on complexity but I suspect up to a 100 or so reentering the data, or making the version you use, may be done by hand. 1000's not so much.
Where there isn't a clean delimiter between H, C and R such as the ; then you have the joy of searching for the position in the risk_factor string for each of H, C and R (Index or Find functions) subsetting the string using those values (note there are 6 possible branches to deal with: H only; C only; R only; H&C; H&R; C&R or H&C&R.
IF the code that I showed for parsing one type of values works then perhaps use that a pattern for "re-entering" data.
That would be a time to remove/ignore things like "Screening Codes" or other not obviously useful text.
This is what I was trying to avoid, but it ended up being the only way to get things done. I used if/then statements to re-write the column, then used the index statement to find and separate the various values into their own columns. Since the index statement simply returns the numerical place of the cited value, I used more if/then statements to assign 0 if the index statements returned a value of 0 (letter/number value not present in text), or a 1 if the index statement returned anything other than 0 (meaning that the letter/number value was present in the text).
It was tedious but it worked! Thank you so much for your help and ideas.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.