BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
texasmfp
Lapis Lazuli | Level 10

I have a data subscription service that, unfortunately, generates data in Excel where some fields can contain multiple values separate by a semi-colon. Here is an example:

 

Date Company Code Product Value
2022-06-16 Company A 7306199902; 7216910100; 7306199902; 7306199902; 7306199902; 7306300400 TUBE1; TUBE2; TUBE3; TUBE4; TUBE5; Tube6 79106.46; 36387.81; 120898.88; 1701.82; 138816.6; 61075.22
2022-06-15 Company A 7306300400 TUBE4 194963.13
2022-06-15 Company B 3917230499; 3917400100 TUBE1;TUBE4 24324.36; 2295.45
2022-06-15 Company C 7609000299 TUBE5 3089.67

 

While not in this example, occasionally, some values are blanks.  The blanks can be in any of the 5 fields.  If it is in one of the fields with multiple values, there is a delimiter, but the value is blank.  I'd like to create a SAS database that splits out the delimited values, one to a line, repeating the values of the non-delimited fields (Date and Company in the example).  Here is the desired SAS dataset:

 

Date Company Code Product Value
2022-06-16 Company A 7306199902 TUBE1 79106.46
2022-06-16 Company A 7216910100 TUBE2 36387.81
2022-06-16 Company A 7306199902 TUBE3 120898.88
2022-06-16 Company A 7306199902 TUBE4 1701.82
2022-06-16 Company A 7306199902 TUBE5 138816.6
2022-06-16 Company A 7306300400 TUBE6 61075.22
2022-06-15 Company A 7306300400 TUBE4 194963.13
2022-06-15 Company B 3917230499 TUBE1 24324.36
2022-06-15 Company B 3917400100 TUBE4 2295.45
2022-06-15 Company C 7609000299 TUBE5 3089.67

 

I am using Base SAS 9.4 (TS1M6) and have SAS/ACCESS Interface to PC Files.
.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

For what it might be worth, since this is a subscription service and you are presumably a paying customer I might be tempted to ask a person, if you have not already, if they can make the desired file structure. I say that because I would bet that no one in there right mind is storing their data that way and possibly that structure came about because of an early (and not terribly bright) client. It can't hurt to ask if it is possible and what the cost difference may be.

 

Because of the really stupid combined fields implying that it may be extremely difficult to make sure another approach to reading the data will work I would suggest using a Libname XLSX to connect to the file meaning that the sheet should be usable as a data set to "clean up" stuff.

 

Then use a data step with a set statement to read the data and deal with poor content. Because of the nature of Excel in general I would copy the non-problem variables into variables of your design by renaming the original column variables in the sheet so you can use the "nice" names in the output. Then parse the problem fields. The following assumes that you have successfully assigned a library to the Excel file with the libname xlib. The data set should be what is needed to reference the sheet with the data. You can find out the name after the libname is successful.

Here is a sort of dummy data step to demonstrate

data want;
   set xlib.datasheet (rename=(Company=S_company
     Code=S_code Product=S_prod Value=S_Value ) );
   /* New typical variables defined 
      set lengths of character variables 
      to longest expected (or a bit longer
      if paranoid)
   */
   length company $ 30   Code $ 10 Product $ 15;
   company = S_company;
   /* assumes 1-1 correlation between the original
   code, product and value variables. Also assumes
   you want Value to be numeric
   */
   do i = 1 to countw(s_code,';');
      code    = scan(s_code,i,';');
      product = scan(s_prod,i,';');
      value   = input(scan(s_value,i,';'),12.);
      output;
   end; 
   drop i s_: ;
run;

I used the example for the company to create a specific standard length value. One of the obnoxious things with spreadsheets is the lack of definitions such as lengths of variables and each connection to them might yield a different length. You do not want to deal with that so create a known length more than long enough to handle such. I assumed that Code should be character. If you intend to use it for arithmetic you need to let us know, but see Value for an example of that.

The Do loop is parser.

The Countw function will count "words" which are defined as what comes between delimiters and will let you specify as the 3rd parameter the delimiter to use. So the loop will execute once for each "word" or code value. A parallel behavior with the SCAN function lets you say which "word" to pull from a longer string and which character to use for a delimiter. Value we use Input to create a numeric value as the decimals make me believe that is wanted. If your Values included currency or commas you would need a different informat such as Comma.

And the final piece: the explicit OUTPUT statement says wen to write to the output set. One time for each code. Since the Date and Company would already have a value they get repeated.

Drop to remove unneeded variables after parsing. I might suggest commenting that line out until you are comfortable with the results.

 

A comment on Date: Depending who is building the Excel file you may or may not have a SAS date value (numeric length 8 with a format assigned such as yymmdd10. If your libname xlib data set does not show as a SAS date then you will need to do a similar rename/parse with an appropriate informat to read what you do get. This can be a bit of a guessing game so just warning to be prepared. If you do have to manipulate the dates you would want to do it before the DO loop similar to the Company assignment.

 

If you have not seen the list builder S_: that I used a colon following the first part of variable(s) name tells SAS to attempt to use all the variables that have a name starting with S_ . Since that yields multiple variables it works well for things like Drop or Keep statements and array definitions. The lists can also be used in some functions such a the SUM, MEAN, MIN , MAX and other data step statistic calcs by using  Sum(of S_:)  (requires all of the variables to be numeric or it will cause an error).

 

View solution in original post

2 REPLIES 2
ballardw
Super User

For what it might be worth, since this is a subscription service and you are presumably a paying customer I might be tempted to ask a person, if you have not already, if they can make the desired file structure. I say that because I would bet that no one in there right mind is storing their data that way and possibly that structure came about because of an early (and not terribly bright) client. It can't hurt to ask if it is possible and what the cost difference may be.

 

Because of the really stupid combined fields implying that it may be extremely difficult to make sure another approach to reading the data will work I would suggest using a Libname XLSX to connect to the file meaning that the sheet should be usable as a data set to "clean up" stuff.

 

Then use a data step with a set statement to read the data and deal with poor content. Because of the nature of Excel in general I would copy the non-problem variables into variables of your design by renaming the original column variables in the sheet so you can use the "nice" names in the output. Then parse the problem fields. The following assumes that you have successfully assigned a library to the Excel file with the libname xlib. The data set should be what is needed to reference the sheet with the data. You can find out the name after the libname is successful.

Here is a sort of dummy data step to demonstrate

data want;
   set xlib.datasheet (rename=(Company=S_company
     Code=S_code Product=S_prod Value=S_Value ) );
   /* New typical variables defined 
      set lengths of character variables 
      to longest expected (or a bit longer
      if paranoid)
   */
   length company $ 30   Code $ 10 Product $ 15;
   company = S_company;
   /* assumes 1-1 correlation between the original
   code, product and value variables. Also assumes
   you want Value to be numeric
   */
   do i = 1 to countw(s_code,';');
      code    = scan(s_code,i,';');
      product = scan(s_prod,i,';');
      value   = input(scan(s_value,i,';'),12.);
      output;
   end; 
   drop i s_: ;
run;

I used the example for the company to create a specific standard length value. One of the obnoxious things with spreadsheets is the lack of definitions such as lengths of variables and each connection to them might yield a different length. You do not want to deal with that so create a known length more than long enough to handle such. I assumed that Code should be character. If you intend to use it for arithmetic you need to let us know, but see Value for an example of that.

The Do loop is parser.

The Countw function will count "words" which are defined as what comes between delimiters and will let you specify as the 3rd parameter the delimiter to use. So the loop will execute once for each "word" or code value. A parallel behavior with the SCAN function lets you say which "word" to pull from a longer string and which character to use for a delimiter. Value we use Input to create a numeric value as the decimals make me believe that is wanted. If your Values included currency or commas you would need a different informat such as Comma.

And the final piece: the explicit OUTPUT statement says wen to write to the output set. One time for each code. Since the Date and Company would already have a value they get repeated.

Drop to remove unneeded variables after parsing. I might suggest commenting that line out until you are comfortable with the results.

 

A comment on Date: Depending who is building the Excel file you may or may not have a SAS date value (numeric length 8 with a format assigned such as yymmdd10. If your libname xlib data set does not show as a SAS date then you will need to do a similar rename/parse with an appropriate informat to read what you do get. This can be a bit of a guessing game so just warning to be prepared. If you do have to manipulate the dates you would want to do it before the DO loop similar to the Company assignment.

 

If you have not seen the list builder S_: that I used a colon following the first part of variable(s) name tells SAS to attempt to use all the variables that have a name starting with S_ . Since that yields multiple variables it works well for things like Drop or Keep statements and array definitions. The lists can also be used in some functions such a the SUM, MEAN, MIN , MAX and other data step statistic calcs by using  Sum(of S_:)  (requires all of the variables to be numeric or it will cause an error).

 

texasmfp
Lapis Lazuli | Level 10

Outstanding.  I just need to figure out the date formatting to deal with the SAS versus Excel date issue.

 

Super timesaver.  Thanks

 

And yes, in an ideal world I should call them and ask them to offer an alternative way to download the data.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 390 views
  • 0 likes
  • 2 in conversation