Splitting a very long and delimited string character variable into Mul...

ruysolo · Posted 05-06-2021 03:43 PM

Hello SAS Community,

I have a dataset (let's call it 'HAVE') with a single character variable called "line". This variable is very long and is delimited by '|'. Below are the first 3 rows of this dataset:

<row 1>

3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||AXEX|Exposure Analysis Data Set||AD9999|Dataset %Domain% not validated|Warning|16|1|1|17|Yes|19|97|100|0|0|100|56||

<row 2>

3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||ADLB|Lab analysis SI units & perf lab limits||AD0225|Calculation issue: PCHG != (AVAL - BASE)/BASE * 100|Error|16|1|351|5111|Yes|19|97|100|0|0|100|56||

<row 3>

3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||ADSL|Subject-Level Analysis|ITTFL|TS0026|Analysis Preferable variable %Variable% not found|Warning|16|1|1|16|Yes|19|97|100|0|0|100|56||

I tried using the solution provided by data_null__JADE in this post to break this record into multiple columns. However, when I use _infile_=line, _infile_ gets truncated.

Here is the program used to read "HAVE"

data have;
input line $305.;
cards;
3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||AXEX|Exposure Analysis Data Set||AD9999|Dataset %Domain% not validated|Warning|16|1|1|17|Yes|19|97|100|0|0|100|56||
3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||ADLB|Lab analysis SI units & perf lab limits||AD0225|Calculation issue: PCHG != (AVAL - BASE)/BASE * 100|Error|16|1|351|5111|Yes|19|97|100|0|0|100|56||
3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||ADSL|Subject-Level Analysis|ITTFL|TS0026|Analysis Preferable variable %Variable% not found|Warning|16|1|1|16|Yes|19|97|100|0|0|100|56||
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
run;

proc print;
run;

Here is my code, trying to break each record into 41 different columns following the solution provided by data_null__jade.

You can copy and paste my code, if you want to replicate the (lack of) results and the log.

data want;
infile cards dsd dlm='|' ;
if _n_ eq 1 then input @@;
set have;
_infile_ = line;

put _infile_;
input @1

job_id
validated_gmt:anydtdtm21.
submitter:$CHAR40.
submitter_email:$CHAR45.
project_id
study_id
data_package_id project:$CHAR23.
study:$CHAR35.
protocol:$CHAR23.
data_package:$CHAR47.
standard:$CHAR13.
meddra
snomed:anydtdte12.
whodrug:$CHAR13.
loinc
unii:anydtdte12.
ndf_rt:anydtdte12.
sdtm_ct:anydtdte12.
adam_ct:anydtdte12.
send_ct:anydtdte12.
dataset:$CHAR46.
label:$CHAR106.
variable:$CHAR10.
rule:$CHAR7.
message:$CHAR216.
severity:$CHAR7.
subjects
failures
datapoints
records
define_included:$CHAR.
score
score_compliance
score_ct
score_regulatory
score_metadata
score_quality
score_analysis
active_explanation_1:$CHAR200.
active_explanation_1_dt:anydtdtm21.
@@
;

format

job_id best12.
validated_gmt datetime.
submitter $40.
submitter_email $45.
project_id best12.
study_id best12.
data_package_id best12.
project $23.
study $35.
protocol $23.
data_package $47.
standard $13.
meddra best12.
snomed date9.
whodrug $13.
loinc best4.
unii date9.
ndf_rt date9.
sdtm_ct date9.
adam_ct date9.
send_ct date9.
dataset $46.
label $106.
variable $10.
rule $7.
message $216.
severity $7.
subjects best12.
failures best12.
datapoints best12.
records best12.
define_included $3.
score best12.
score_compliance best12.
score_ct best12.
score_regulatory best12.
score_metadata best12.
score_quality best12.
score_analysis best12.
active_explanation_1 $200.
active_explanation_1_dt datetime.;
cards;
Necessary evil
run;

And here is the log. Notice that the first 7 variables get populated as desired. But starting from the 8th variable, all get null /missing values. Notice the results from the put _input_; statement:

3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je

NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
121 run;
line=3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||AXEX|Exposure Analysis Data Set||AD9999|Dataset %Domain% not validated|Warning|16|1|1|17|Yes|19|97|100|0|0|100|56||

job_id=3142

validated_gmt=09JUL18:03:24:00

submitter=xxxxxx xxxxx - Network

submitter_email=xxxx_xxx@work.com
project_id=41

study_id=266

data_package_id=337

project=i6a-je

study=

protocol=

data_package=

standard=

meddra=.

snomed=.
whodrug=

loinc=.

unii=.

ndf_rt=.

sdtm_ct=.

adam_ct=.

send_ct=.

dataset=

label=

variable=

rule=

message=

severity=

subjects=.
failures=.

datapoints=.

records=.

define_included=

score=.

score_compliance=.

score_ct=.

score_regulatory=.

score_metadata=.
score_quality=.

score_analysis=.

active_explanation_1=

active_explanation_1_dt=.

_ERROR_=1

_INFILE_=

_N_=1
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: There were 1 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 0 observations and 42 variables.

ballardw · Posted 05-06-2021 05:22 PM

First thing, post long lines of code or log entries into a text box opened on the forum with the </> icon at the top of the message window. The main message windows will reformat text making it quite likely that the pasted result is not what you started with. So we actually can't properly test your code as pasted.

If this text started out as text file it would likely be easier to read that, just use the INPUT with no @@ and such.

The LOST card and the "NOTE: SAS went to a new line when INPUT statement reached past the end of a line." are telling you that you are attempting to read more variables from a single line of text than are available.

You can try to fix that by adding the MISSOVER option to the INFILE so that the "extra" variables don't attempt to read from the next line.

However your CODE shown does not match the error.

The error message shows

line=3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||AXEX|Exposure Analysis Data Set||AD9999|Dataset %Domain% not validated|Warning|16|1|1|17|Yes|19|97|100|0|0|100|56||
job_id=3142
validated_gmt=09JUL18:03:24:00
submitter=xxxxxx xxxxx - Network
submitter_email=xxxx_xxx@work.com
project_id=41
study_id=266
data_package_id=337
project=i6a-je
study=

Your code on the input statement shows

data_package:$CHAR47.
standard:$CHAR13.
meddra

The variable reading the value in the error message after data_package is Project (and used a shorter than $char13. informat), then Study. The INPUT statement shows variables Standard, which is what would have read that "16a-je" value and Meddra. So make sure the errors come from the code matching the invalid data.

I am not quite sure why all that convoluted stuff was the "solution" but your example doesn't need any of that.

I would start with:

data junk;
infile cards dsd dlm='|' missover;
  input
      job_id
      validated_gmt:anydtdtm21.
      submitter:$CHAR40.
      submitter_email:$CHAR45.
      project_id
      study_id
      data_package_id project:$CHAR23.
      study:$CHAR35.
      protocol:$CHAR23.
      data_package:$CHAR47.
      standard:$CHAR13.
      meddra
      snomed:anydtdte12.
      whodrug:$CHAR13.
      loinc
      unii:anydtdte12.
      ndf_rt:anydtdte12.
      sdtm_ct:anydtdte12.
      adam_ct:anydtdte12.
      send_ct:anydtdte12.
      dataset:$CHAR46.
      label:$CHAR106.
      variable:$CHAR10.
      rule:$CHAR7.
      message:$CHAR216.
      severity:$CHAR7.
      subjects
      failures
      datapoints
      records
      define_included:$CHAR.
      score
      score_compliance
      score_ct
      score_regulatory
      score_metadata
      score_quality
      score_analysis
      active_explanation_1:$CHAR200.
      active_explanation_1_dt:anydtdtm21.
   ;
   format
      validated_gmt datetime.
      snomed date9.
      unii date9.
      ndf_rt date9.
      sdtm_ct date9.
      adam_ct date9.
      send_ct date9.
      active_explanation_1_dt datetime.
   ; 
cards;
3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||AXEX|Exposure Analysis Data Set||AD9999|Dataset %Domain% not validated|Warning|16|1|1|17|Yes|19|97|100|0|0|100|56||
3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||ADLB|Lab analysis SI units & perf lab limits||AD0225|Calculation issue: PCHG != (AVAL - BASE)/BASE * 100|Error|16|1|351|5111|Yes|19|97|100|0|0|100|56||
3142|2018-07-09 03:24|xxxxxx xxxxx - Network|xxxx_xxx@work.com|41|266|337|i6a-je-cbbh|Shared_cbbh||CBBH ADaM|ADaM-IG 1.0|21.0||||||2016-06-24|2016-03-25||ADSL|Subject-Level Analysis|ITTFL|TS0026|Analysis Preferable variable %Variable% not found|Warning|16|1|1|16|Yes|19|97|100|0|0|100|56||
;

And since that works just fine instead of parsing your data as is, write it out to a temp text file and then read that text file with the data step:

data _null_;
    file "c:\somefolder\data.txt";
   put linevariablename;
run;

Data want;
   infile  "c:\somefolder\data.txt" dlm='|' missover;
  <the input statement from my previous code that runs just fine>
;

There is not reason to supply format statements for character variables like: Format somevar $35.; unless you intend to display it with fewer characters than they occupy. Similar with BEST12 numeric format, generally that is the SAS default and need not be set.

Sajid01 · Posted 05-06-2021 07:56 PM

Hello @ruysolo
I understand your question as follows

1.You a single character variable named line.

2.This variable is a collection of number of text tokens or strings delimited by "|"

3.It is desired to split this line at each of the delimiters into new variables.

4.You are aware of the number of the text strings/variables in the line.

Under such a scenario, i would use the following sample/proof of concept code..

Make appropriate changes as needed.
It is my understanding that this will serve your purpose.
If any of these new string variable contains number you should be able to convert it to number using input function.

data test;
length line $ 250;
input line $;
datalines;
asdf1|asdf2|asdf3|asdf4|asdf5
lkjh1|lkjh2|lkjh3|lkjh4|lkjh5
;
run;
data test2 (drop=line);
set test;
test1=scan(line,1,"|");
test2=scan(line,2,"|");
test3=scan(line,3,"|");
test4=scan(line,4,"|");
test5=scan(line,5,"|");
run;

The output dataset will be like this

Please let me know if you have questions.

Tom · Posted 05-06-2021 08:39 PM

Issues.

1) Use the TRUNCOVER option so that you don't go past the end of the line.

2) You need to get the LRECL for the infile to be at least as large as the length of the character variable you are using to replace it. That is hard with in-line (aka CARDS aka DATALINES). The LRECL is always a multiple of 80 and the smallest needed for that actual data. Use a real file.

3) Make sure all INPUT statements use the double trailing @ .

filename dummy temp;
data _null_;
  file dummy;
  put 'place holder line';
run;
data want;
  infile dummy truncover dsd dlm='|' lrecl=32767;
  if _n_ eq 1 then input @@;
  set have;
  _infile_ = line;
  input @1 
  ....
    @@;
run;

Splitting a very long and delimited string character variable into Multiple Columns

Re: Splitting a very long and delimited string character variable into Multiple Columns

Re: Splitting a very long and delimited string character variable into Multiple Columns

Re: Splitting a very long and delimited string character variable into Multiple Columns

Splitting a very long and delimited string character variable into Multiple Columns

Re: Splitting a very long and delimited string character variable into Multiple Columns

Re: Splitting a very long and delimited string character variable into Multiple Columns

Re: Splitting a very long and delimited string character variable into Multiple Columns

Registration is open

SAS Training: Just a Click Away