BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bazingarollcall
Fluorite | Level 6

Hello all, 

 

I have a very messy dataset, and I need to split one column into many based on a ',' delimiter.

 

 RISK FACTORS

H1, H5, H6, H10 & Scoring sheet

H1, H6, H10, C1; R4 & Scoring sheet

H1,2,3,5,6,10; C1,3,R3,5

H1-2,5-6,10;C1;R4

H1-2,5-6,9-10;C1-2,4;R3-5

H:5,6,8,10

H; 1,5,6,7,8,10; C3,5; H:4,5

HA, H#, H6, H*, R4, R5, & scoring sheet


The data needs to be in H1, H2, H3, H4, etc, format.

 

I think the scan() command would be best for this data to split it up once it is cleaned, but I have no idea how to  force the format that I need. 

 

Any ideas? Thanks in advance. 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

IF the code that I showed for parsing one type of values works then perhaps use that a pattern for "re-entering" data.

That  would be a time to remove/ignore things like "Screening Codes" or other not obviously useful text.

View solution in original post

7 REPLIES 7
ballardw
Super User

What is the name of the current variable holding the value?

What names do you want the values placed in? Which variable gets which value for each of your "example" rows?

You should show how you want the result to look as it is not very clear. I am afraid that "The data needs to be in H1, H2, H3, H4, etc, format." does not make much sense when your data looks like "HA, H#, H6, H*, R4, R5, & scoring sheet"

bazingarollcall
Fluorite | Level 6

The variable name is Risk_Factors.

I am also confused.

 

The data should look like H1, H5, H6, H10 & Scoring sheet, with a letter before the number. Disregard Scoring Sheet, I don't know what that is.

 

Once the column is reformatted, I will create flag variables for H1, H2, H3, H5, H6, H8, H10, C1, C3, R3, and R5.

 

I imagine that the data would look like the attached document once it is finished.

 

 

 

PaigeMiller
Diamond | Level 26

A recent example of code that will split text strings into separate records based upon a comma as a delimiter

 

https://communities.sas.com/t5/SAS-Programming/Cleaning-data-by-getting-the-value-into-the-correct-v...

--
Paige Miller
ballardw
Super User

Your PDF shows 3 columns, H C and R. How does that relate to data like HA, H#, H6, H*, R4, R5, & scoring sheet? That is what you have said is in your data.

 

Help us help you.do you mean that H1, H5, H6, H10 & Scoring sheet are the names of your variables? What are the entire range of possible letter number combinations in the data?

If you show an actual result there is likely to be not need for separate steps to "reformat" and then "add flags". If the goal is to create flags, then show what the flag results should  be.

Incomplete rules:

Consider this line:

H1-2,5-6,9-10;C1-2,4;R3-5

does a - mean that a continuous sequence is involved? Such as R3-5 means that at the end there should be an R3 R4 R5? If there is to be a "flag" then what is the value of the flag.

Also, did you retype the example data? The line

H1,2,3,5,6,10; C1,3,R3,5

does not show a ; before the R that appears in other lines (very helpful if the ; is actually consistent).

How many H, C and R flags are to be created for each?

 

Someone needs to go to a class on data entry. It appears, which you should state in your rules, that some of the values are entered as LetterNumber combinations, some are entered are ranges in different formats: H1-2  H:5,6,8,10.

But we need some rule of what the heck to do with H# HA H* (you said letter and number #; # A and * are not numbers).

 

Here is one way that creates flags with values of 1 when the H, C and R a delimited by semicolons. That is only one case of your data. I create the flags here because the steps to make columns, especially with H , C and R values on the same "row" as your PDF shows, is actually harder.

data example;
   Risk_Factors ="H1-2,5-6,9-10;C1-2,4;R3-5";
   array h(10);
   array c(10);
   array r(10);
   length tword rpiece $ 20;
   /* process semicolon delimited*/
   /* this breaks the string down to pieces
     to process "like" values
   */
   do i= 1 to countw(risk_factors,';');
      tword = scan(risk_factors,i,';');
      /* if the subpiece has H*/
      /* separate value on commas*/
      if index(tword,'H')> 0 then do j=1 to countw(tword,',');
         rpiece= scan(tword,j,',');
         /*remove Letter if present in the "range" piece*/
         rpiece= compress(rpiece,'H');
         if index(rpiece,'-')>0 then do;
            /* dash for a range*/
            do k=input(scan(rpiece,1),f2.) to (input(scan(rpiece,2),f2.));
               h[k]=1;
            end;
         end;
         else do;
           /* not a range, assumes single value*/
              h[input(rpiece,f2.)]=1;
         end;
      end;
      /* similar for C and R*/
      if index(tword,'C')> 0 then do j=1 to countw(tword,',');
         rpiece= scan(tword,j,',');
         rpiece= compress(rpiece,'C');
         if index(rpiece,'-')>0 then do;
            /* dash for a range*/
            do k=input(scan(rpiece,1),f2.) to (input(scan(rpiece,2),f2.));
               c[k]=1;
            end;
         end;
         else do;
           /* not a range, assumes single value*/
              c[input(rpiece,f2.)]=1;
         end;
      end;
      if index(tword,'R')> 0 then do j=1 to countw(tword,',');
         rpiece= scan(tword,j,',');
         rpiece= compress(rpiece,'R');
         if index(rpiece,'-')>0 then do;
            /* dash for a range*/
            do k=input(scan(rpiece,1),f2.) to (input(scan(rpiece,2),f2.));
               r[k]=1;
            end;
         end;
         else do;
           /* not a range, assumes single value*/
              r[input(rpiece,f2.)]=1;
         end;
      end;
   end;
   drop i j k tword rpiece;
run;
      
      
   

The general approach to garbage data like this is to

1) identify similar types of record

2) program an Identification piece to determine which pattern is needed

3) parse the data

 

4) send the protocol designer, who ever created the experiment/data collection system, back to school so that all data entry is performed to the same standard.

 

Note: If there are only a "few" records this may be easier done by hand. Few depends on complexity but I suspect up to a 100 or so reentering the data, or making the version you use, may be done by hand. 1000's not so much.

 

Where there isn't a clean delimiter between H, C and R such as the ; then you have the joy of searching for the position in the risk_factor string for each of H, C and R (Index or Find functions) subsetting the string using those values (note there are 6 possible branches to deal with: H only; C only; R only; H&C; H&R; C&R or H&C&R.

 

 

 

 

bazingarollcall
Fluorite | Level 6
Thanks so much for your detailed response.

I suspect that you are right, this is a much harder solution than I originally thought.

Your PDF shows 3 columns, H C and R. How does that relate to data like HA, H#, H6, H*, R4, R5, & scoring sheet? That is what you have said is in your data.



Help us help you.do you mean that H1, H5, H6, H10 & Scoring sheet are the names of your variables? What are the entire range of possible letter number combinations in the data?
-Risk_Factors is the name of the variable containing several different combinations. H1, H5, H6, H10, Scoring Sheet, etc, are all possible risk factors for whatever outcome is indicated on the observation. The range is H1 through H10, C1 through C5, and R1 through R5.


If you show an actual result there is likely to be not need for separate steps to "reformat" and then "add flags". If the goal is to create flags, then show what the flag results should be.
-The flag should indicate, using a 0 or 1, whether the observation had a risk factor of H1, H2, H3, or any of the above range of values.


Incomplete rules:

Consider this line:

H1-2,5-6,9-10;C1-2,4;R3-5

does a - mean that a continuous sequence is involved? Such as R3-5 means that at the end there should be an R3 R4 R5? If there is to be a "flag" then what is the value of the flag.
-To my understanding, you are correct. If there is a flag, then another separate column would show 0 if the observation did not indicate the presence of H8, for example, but a 1 for the presence of H1, H2, H5, H6, H9, and H10.


Also, did you retype the example data? The line

H1,2,3,5,6,10; C1,3,R3,5

does not show a ; before the R that appears in other lines (very helpful if the ; is actually consistent).
-It would be helpful, but the punctuation is incredibly inconsistent. I did not show all rows of the datasets, but some values have a ;, some have a /, some have a &, and on and on and on...


How many H, C and R flags are to be created for each?


Someone needs to go to a class on data entry. It appears, which you should state in your rules, that some of the values are entered as LetterNumber combinations, some are entered are ranges in different formats: H1-2 H:5,6,8,10.

But we need some rule of what the heck to do with H# HA H* (you said letter and number #; # A and * are not numbers).
-You are correct, I do not know what to do with the #, *, etc.! I assume, once the complete LetterNumber values are reformatted, then something can be done with those (potentially just delete them because they're seemingly useless).


Here is one way that creates flags with values of 1 when the H, C and R a delimited by semicolons. That is only one case of your data. I create the flags here because the steps to make columns, especially with H , C and R values on the same "row" as your PDF shows, is actually harder.


Note: If there are only a "few" records this may be easier done by hand. Few depends on complexity but I suspect up to a 100 or so reentering the data, or making the version you use, may be done by hand. 1000's not so much.
- I think you're right - this might be easier done in Excel or something. There's only a few hundred observations, so we'll see what the rest of my team thinks.
ballardw
Super User

IF the code that I showed for parsing one type of values works then perhaps use that a pattern for "re-entering" data.

That  would be a time to remove/ignore things like "Screening Codes" or other not obviously useful text.

bazingarollcall
Fluorite | Level 6

This is what I was trying to avoid, but it ended up being the only way to get things done. I used if/then statements to re-write the column, then used the index statement to find and separate the various values into their own columns. Since the index statement simply returns the numerical place of the cited value, I used more if/then statements to assign 0 if the index statements returned a value of 0 (letter/number value not present in text), or a 1 if the index statement returned anything other than 0 (meaning that the letter/number value was present in the text).

 

It was tedious but it worked! Thank you so much for your help and ideas.   

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3811 views
  • 0 likes
  • 3 in conversation