Re: Identify string and generate variables

UV_123 · Posted 05-30-2022 05:13 PM

Hi All,

I need to know how to identify a particular string and select all the strings after that and pass it as column names.

For example, I have the below data:

To be considered:

XYZ

STU

LMN

Random Data

I want to identify the string "To be Considered:" and select all the row values after the string and make them variable names followed by _1 and _2.

My output columns should be like this:

XYZ_1

XYZ_2

STU_1

STU_2

LMN_1

LMN_2

Please help.

ballardw · Posted 05-31-2022 02:36 AM

Pass what values as "column names" where? To what process?

What does the "random data" look like afterward? How do we know when the "column names" end and the "random data" starts, as in a rule that will always identify the boundary? Since that column has to hold text such as "To be considered", all the values in that "column" will be character.

Provide an actual example of your current data in the form of a data step and an example data step of what you expect the result to be. Since you don't show any values for the variables I suspect you actually aren't completely describing your problem, process or desired output. SAS variables without any values for the variables are pretty odd. So what is going to be done with the resulting data?

UV_123 · Posted 05-31-2022 02:05 PM

Hi @ballardw,

Thanks a lot for your reply.

Actually, I have attached my data, and it's a .txt file.

The Random data lines contain headings of the data and all the other text, which can be ignored.

The text after "Include..." and below "To be considered:" up to a blank line (as in the example above) should become the column headers, and I want 2 columns for each. I want these columns to hold values from the main data. Each group in the main data starts with a number, for example 123456789123, and the rows under that number belong to it, with variable names matching the column headers mentioned above.

So, I want my output to look like this:

Column1	ABC_1	ABC_2	DEF_1	DEF_2	GHI_1	GHI_2	JKL_1	JKL_2	XYZ_1	XYZ_2	STU_1	STU_2	LMN_1	LMN_2
123456789123	12	34	56	37			64	73
23455122467	67	43			12	46	87	56	45	78
5674512344	78	45	12	54							32	82	45	36

I hope I have answered all your questions. Please help.

@SASKiwi @PaigeMiller: For some reason, my previous post got deleted. This is an example of what my data looks like and what I want. Any help will be greatly appreciated.

Tom · Posted 05-31-2022 04:04 PM

So you are trying to figure out how to read a textual report?

It is a pretty simple task in SAS. But you need to know the format of the file to be able to programmatically figure out where you are while reading it.

So it looks to me like the data is organized under an ID number.

Is it safe to assume that any line that starts with a number like that is the start of a new group of values?

First let's share your sample text file in a form that anyone can run on their version of SAS.

So using the PARMCARDS4 statement, the PARMCARDS= option, and a FILENAME statement, you can make a sample file in your WORK directory.

options parmcards=txt ;
filename txt temp;
parmcards4;
Random Data
Random Data
Random Data
Include... ABC
Include... DEF
Include... GHI
Include... JKL

123456789123
ABC             ..   ..  12   34
DEF             ..   ..  56   37
JKL             ..   ..  64   73
23455122467
ABC             ..   ..  67   43
GHI             ..   ..  12   46
JKL             ..   ..  87   56
XYZ             ..   ..  45   78
5674512344
ABC             ..   ..  78   45
DEF             ..   ..  12   54
STU             ..   ..  32   82
LMN             ..   ..  45   36

To be considered:
ABC
XYZ
STU
LMN

Random Data
;;;;

Now let's write a data step to read through the file and pull out two things.

One is the list of "vars" to INCLUDE from that header text.

The other is the actual data in a simplified ROW*COL format.

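data consider (keep=varname)
     results (keep=id row varname col value)
;
   infile txt truncover length=ll column=cc ;

   /* Try to read the first word as a number; ?? suppresses invalid-data messages */
   input @1 dummy ?? @;
   if not missing(dummy) then do;
      /* A numeric first word marks the start of a new ID group */
      id = dummy;
      retain id;
      input / dummy ?? @@ ;
      /* Read detail lines until the next line that starts with a number */
      do row=1 by 1 while( missing(dummy));
         input @1 varname :$32. @ ;
         /* Read every remaining word on the line; keep only the numeric ones */
         do col=1 by 1 while(cc<= ll );
           input value ?? @;
           if not missing(value) then output results;
         end;
         input / dummy ?? @@;
      end;
   end;
   /* Header lines such as "Include... ABC" name one variable each */
   else if _infile_=:'Include' then do;
      input varname ;
      output consider;
   end;
run;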

You will probably need to tweak the logic as you figure out more about how the report is structured.

Results:

The list of variables to "include" or "consider".

Obs    varname

 1       ABC
 2       DEF
 3       GHI
 4       JKL

The actual data:

Obs         id         row    varname    col    value

  1    123456789123     1       ABC       3       12
  2    123456789123     1       ABC       4       34
  3    123456789123     2       DEF       3       56
  4    123456789123     2       DEF       4       37
  5    123456789123     3       JKL       3       64
  6    123456789123     3       JKL       4       73
  7     23455122467     1       ABC       3       67
  8     23455122467     1       ABC       4       43
  9     23455122467     2       GHI       3       12
 10     23455122467     2       GHI       4       46
 11     23455122467     3       JKL       3       87
 12     23455122467     3       JKL       4       56
 13     23455122467     4       XYZ       3       45
 14     23455122467     4       XYZ       4       78
 15      5674512344     1       ABC       3       78
 16      5674512344     1       ABC       4       45
 17      5674512344     2       DEF       3       12
 18      5674512344     2       DEF       4       54
 19      5674512344     3       STU       3       32
 20      5674512344     3       STU       4       82
 21      5674512344     4       LMN       3       45
 22      5674512344     4       LMN       4       36
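The step above only collects the names from the "Include..." lines. The names listed under "To be considered:" could be collected in a similar way. This is only a sketch: it assumes the list always ends at the first blank line, and the dataset name CONSIDER2 and the flag IN_LIST are made up for illustration. It uses INDEX() rather than the =: operator so that an empty line cannot accidentally match.

```sas
data consider2 (keep=varname);
   infile txt truncover;
   length varname $32;
   retain in_list 0;                     /* 1 while we are inside the list */
   input;                                /* load the whole line into _INFILE_ */
   if index(_infile_, 'To be considered:') = 1 then in_list = 1;
   else if in_list then do;
      if missing(_infile_) then in_list = 0;   /* assumed rule: a blank line ends the list */
      else do;
         varname = scan(_infile_, 1, ' ');     /* first word on the line */
         output;
      end;
   end;
run;
```

The comparison is case-sensitive, so it would need adjusting if the real file spells the marker differently (for example "To be Considered:").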

If you really need that unwieldy wide format (perhaps for exporting to EXCEL for someone to browse?) you can use PROC TRANSPOSE.

proc transpose data=results out=wide (drop=_name_);
  by id  notsorted;
  id varname col;
  var value;
run;
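With two ID variables and no delimiter, the new variable names are the two values run together (ABC3, ABC4, and so on). If names like ABC_1 and ABC_2 are wanted, one possible approach is to renumber the kept values within each detail row and use the DELIMITER= option. A sketch, assuming each name appears at most once per ID; RESULTS_SEQ, WIDE_NAMED and SEQ are names invented for this example:

```sas
/* Number the kept values 1, 2, ... within each ID and detail row */
data results_seq;
   set results;
   by id row notsorted;
   if first.row then seq = 0;
   seq + 1;
run;

/* DELIMITER=_ puts an underscore between the VARNAME and SEQ values */
proc transpose data=results_seq out=wide_named (drop=_name_) delimiter=_;
   by id notsorted;
   id varname seq;
   var value;
run;
```

The columns come out in order of first appearance in the data, so a RETAIN or an ATTRIB statement in a later step may be needed if a specific column order matters.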

UV_123 · Posted 05-31-2022 05:23 PM

Thanks a lot for the reply.

Just a quick question: what does dummy ?? @ signify in the program?

Tom · Posted 05-31-2022 05:48 PM

It is part of an INPUT statement.

DUMMY is a variable that SAS will create as numeric since it has not been defined before. Since it is not included in either of the KEEP= dataset option lists, it will not be written to either dataset. So it is just a dummy variable used to test if the first word on the line is a number.

?? is an input modifier that says don't issue any errors if the text being read is not a valid number.

Trailing @ means to stay on the current line for this iteration of the data step.

Trailing double @ means to stay on the current line even across iterations of the data step.
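A small standalone illustration of ?? and the trailing @, using in-line data instead of the report file. The dataset name DEMO and the variables FIRST_NUM and WORD are made up for this example:

```sas
data demo;
   /* Try the first word as a number; ?? keeps the log quiet if it is not one */
   input @1 first_num ?? @;
   /* The trailing @ held the line, so it can be re-read as text */
   if missing(first_num) then input @1 word :$20.;
datalines;
12345
hello
;
run;
```

The first line gives a numeric FIRST_NUM and a blank WORD; the second gives a missing FIRST_NUM and WORD='hello', with no invalid-data note in the log.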

UV_123 · Posted 05-31-2022 06:07 PM

How do I define the length and column? In your case it's 11 and column=cc.

Tom · Posted 05-31-2022 11:31 PM

The LENGTH= and COLUMN= options on the INFILE statement define the NAMES of two variables: one that will contain the length of the current line from the file, and one that will contain the location of the column pointer, that is, the position on that line where SAS will read next.

In that program the names were LL (line length) and CC (current column).
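To watch those two variables change, something like the following could be run against the same TXT fileref created earlier. It is only a sketch for looking at the log; WORD is a made-up variable name:

```sas
data _null_;
   infile txt truncover length=ll column=cc;
   input @;                  /* load the line and hold it; LL now has its length */
   put _n_= ll= cc=;
   input word :$32. @;       /* read one word; CC moves past it */
   put '   after one word: ' cc=;
run;
```

Neither LL nor CC is written to an output dataset; they exist only while the step runs.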

UV_123 · Posted 05-31-2022 08:57 PM

Also, suppose if there is some string present in front of a number for example: view 00001 no #123456789123 then can we define dummy using a substring function to get the desired number that is 123456789123?

Tom · Posted 05-31-2022 11:34 PM

@UV_123 wrote:

Also, suppose if there is some string present in front of a number for example: view 00001 no #123456789123 then can we define dummy using a substring function to get the desired number that is 123456789123?

In that case you will need to modify the logic used to detect when a new ID number has appeared. The current method of just reading the first word on the line into a numeric variable will not work. But you could perhaps use the other method of comparing the start of the current line to a string literal that was also used in that example program to detect the lines you had that listed the variables to INCLUDE.

So perhaps something like:

if _infile_ =: 'view' then do;
   input @ '#' id ;
end;
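Put into a complete step, that idea might look something like the sketch below. It assumes every group starts with a line beginning with 'view' that has the ID after a '#', as in your example, and it reads one line per data step iteration instead of looping inside the group. The "Include..." handling is left out to keep it short.

```sas
data results (keep=id row varname col value);
   infile txt truncover length=ll column=cc;
   retain id row;
   input @;                               /* load the line into _INFILE_ and hold it */
   if _infile_ =: 'view' then do;         /* assumed header line of a new group */
      input @'#' id;                      /* read the number that follows the '#' */
      row = 0;
   end;
   else if not missing(id) and not missing(_infile_) then do;
      row + 1;
      input @1 varname :$32. @;
      do col = 1 by 1 while (cc <= ll);
         input value ?? @;                /* non-numeric words become missing */
         if not missing(value) then output results;
      end;
   end;
run;
```

The initial INPUT @ matters here: _INFILE_ is empty until some INPUT statement has read the line, so the test against 'view' cannot come first.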

UV_123 · Posted 06-01-2022 12:40 AM

Hi @Tom ,

Sorry for the trouble. So, in this case, the code will look somewhat like this:

data consider (keep=varname)
     results (keep=id row varname col value)
;
   infile txt truncover length=ll column=cc ;

 if _infile_ =: 'view' then do;
   input @ '#' id ;
end;
   /*if not missing('view') then do;
      id = dummy;
      retain id;
      input / dummy ?? @@ ;*/
      do row=1 by 1 while( missing('view'));
         input @1 varname :$32. @ ;
         do col=1 by 1 while(cc<= ll );
           input value id @;
           if not missing(value) then output results;
         end;
         input @ '#' id @@;
      end;
   end;
   else if _infile_=:'Include' then do;
      input varname ;
      output consider;
   end;
run;

I am not sure if I have done it correctly. Please do correct me. Thanks and sorry again.

UV_123 · Posted 05-29-2022 12:09 PM

Hi All,

I need help generating columns in SAS. Below is a sample of what my data looks like:

Random Data

Include... ABC

Include... DEF

Include... GHI

Include... JKL

123456789123

ABC $..$ ^..^ 12 34

DEF $..$ ^..^ 56 37

JKL $..$ ^..^ 64 73

23455122467

ABC $..$ ^..^ 67 43

GHI $..$ ^..^ 12 46

JKL $..$ ^..^ 87 56

XYZ $..$ ^..^ 45 78

5674512344

ABC $..$ ^..^ 78 45

DEF $..$ ^..^ 12 54

STU $..$ ^..^ 32 82

LMN $..$ ^..^ 45 36

To be considered:

XYZ

STU

LMN

Random Data

I want to generate the final result as below:

Column1	ABC_1	ABC_2	DEF_1	DEF_2	GHI_1	GHI_2	JKL_1	JKL_2	XYZ_1	XYZ_2	STU_1	STU_2	LMN_1	LMN_2
123456789123	12	34	56	37			64	73
23455122467	67	43			12	46	87	56	45	78
5674512344	78	45	12	54							32	82	45	36

Any help will be greatly appreciated.

@SASKiwi @PaigeMiller : This is a more precise version of what I have and what I want. I appreciate your help.

SASKiwi · Posted 05-31-2022 04:10 PM

@UV_123 - Looks like something happened to your previous post, as it is completely blank. Can you post your input data as DATA step DATALINES, please? Otherwise we will have to spend time doing it. Is that the actual layout of your input data file? Your data doesn't appear to be well organised. Please supply the actual layout of your input data file as DATALINES.

UV_123 · Posted 05-31-2022 04:35 PM

Hi @SASKiwi ,

Please find attached data for your ref.

mkeintz · Posted 06-01-2022 01:50 AM

What is the significance of the "include" rows? Should data for identifiers NOT in an "include" row be ignored?
Is every row prior to the first "include" to be ignored?
What is the significance of the "to be considered" row?
It appears that, for the "data" rows, you want only the rightmost two values, no matter how much content falls between the content identifier ("ABC", "DEF", etc) and those two values. Is that correct?
Are the data values always numeric?
Are all the content identifiers three letters long? If they can be longer, what is the maximum length (remember the ultimate variable names are limited to a length of 32 characters)?

This looks like a two step program: (1) get id's, content identifiers (future variable names), and data values in a long format, (2) transpose to a wide format. What have you tried so far to read in the data - prior to transpose?
