Solved: Re: How do I parse out a text field that contains multiple fields

Areksoo · Posted 11-29-2018 05:54 PM

For example, the data looks like this (CSV):

"Client_NO", "Client_Name", "Unique_Fields"

123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"

222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"

423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"

The Unique_Fields contains 3 additional fields. I can read them into a data set, but how do I parse out Unique_Fields in a data step?

Reeza · Posted 11-29-2018 08:32 PM

Here's the approach I would take. It doesn't really matter if the terms are different, they'll be set as missing in the data set if it doesn't exist. I've used this approach to parse XML files before and it's pretty robust.

*Sample data - assuming you already have this read in;
data have;
    informat ID $3. Type $12. Random $50.;
    infile cards dlm="|" dsd;
    input ID $ Type $ Random $;
    cards;
123|"Muffin"|"{"Category":"DOG","Age":3,"License":1111}"
222|"Sprinkles"|"{"Category":"CAT","Age":12,"License":1122}"
423|"Larry"|"{"Category":"CHICKEN","Age":2,"License":1133}"
;
    ;
    ;;
run;

*flip structure to long to account for differing types;
data long;
    set have;
    
    *count the number of : whichis number of terms;
    nwords=countc(Random, ':');
    *remove {} from variable;
    random=compress(random, "{}");

    *loop over each number of term;
    do i=1 to nwords;
        *isolate each term alone;
        term=scan(Random, i, ",");
        *get name;
        VarName=scan(term, 1, '":');
        *get value;
        VarValue=scan(term, 2, '":');
        *output to data set;
        output;
    end;
    
    *Keep only variables needed;
    keep ID Type VarName VarValue;
run;

*flip to wide format for standard data set;
proc transpose data=long out=want (drop=_name_);
    by ID Type;
    ID VarName;
    Var VarValue;
run;

@Areksoo wrote:

For example, the data looks like this (CSV):

"Client_NO", "Client_Name", "Unique_Fields"

123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"

222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"

423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"

The Unique_Fields contains 3 additional fields. I can read them into a data set, but how do I parse out Unique_Fields in a data step?

View solution in original post

Astounding · Posted 11-29-2018 06:03 PM

Parsing comes second. First step is making decisions about what the goal is. For example:

Do you want to create three variables named Category, Age, and License? Will that always be true, or will it depend on what is inside Unique_Fields?
Which new variables should be character and which should be numeric?

Areksoo · Posted 11-29-2018 06:20 PM

I didn't want to complicate it, but sometimes the values in Unique_Fields is completely different with different variable names and values. For example:

"{"Color":"RED","Height":25}"

There's only 3 different sets fields in the Unique_Field, so if I can do one of them, I would just make IF statements with different logic depending on the first few characters. The variable names can be kept the same, but if the text has quotes, then it remains text, if it has no quotes it should become a number. Between field name and value, there's a colon and a comma between each set of field name and value.

There might an elegant solution to read multiple different variables, but I would just be happy with a solution that works for 1 case. The trouble I'm having is how do I read from the colon to the comma and then jump to the next variable and read from the colon to the comma.

Reeza · Posted 11-29-2018 06:41 PM

Is the source JSON or XML originally?

ballardw · Posted 11-29-2018 07:17 PM

For your limited example and note about changing fields this works:

data junk;
   infile datalines dlm=',{}' dsd;
   informat Client_NO $5. Client_Name $15. category $10. Age best. license $5.;
   informat color $10. height best.;
   input  Client_NO  Client_Name @;
   if index(_infile_,'"Category":')> 0 then 
      input    @'"Category":' category    @'"Age":' age @'"License":' license;
   ;                             
   if index(_infile_,'"Color"')> 0 then 
      input    @'"Color":' color    @'"Height":' height;
   ;                             
datalines;
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
423,"Larry","{"Color":"RED","Height":25}""
;
run;

The first @ on the input holds the line for different readings. _INFILE_ is an automatic variable containing the contents of the INFILE buffer that can be searched like any other string.

The @'string' tells SAS to read from the position that value is found. the ugly code of mixed single and double quotes is because that source actually has unmatched quotes for most purposes.

The index function searches for a hopefully key value that identifies which record format you need to read

Reeza · Posted 11-29-2018 08:32 PM

Here's the approach I would take. It doesn't really matter if the terms are different, they'll be set as missing in the data set if it doesn't exist. I've used this approach to parse XML files before and it's pretty robust.

*Sample data - assuming you already have this read in;
data have;
    informat ID $3. Type $12. Random $50.;
    infile cards dlm="|" dsd;
    input ID $ Type $ Random $;
    cards;
123|"Muffin"|"{"Category":"DOG","Age":3,"License":1111}"
222|"Sprinkles"|"{"Category":"CAT","Age":12,"License":1122}"
423|"Larry"|"{"Category":"CHICKEN","Age":2,"License":1133}"
;
    ;
    ;;
run;

*flip structure to long to account for differing types;
data long;
    set have;
    
    *count the number of : whichis number of terms;
    nwords=countc(Random, ':');
    *remove {} from variable;
    random=compress(random, "{}");

    *loop over each number of term;
    do i=1 to nwords;
        *isolate each term alone;
        term=scan(Random, i, ",");
        *get name;
        VarName=scan(term, 1, '":');
        *get value;
        VarValue=scan(term, 2, '":');
        *output to data set;
        output;
    end;
    
    *Keep only variables needed;
    keep ID Type VarName VarValue;
run;

*flip to wide format for standard data set;
proc transpose data=long out=want (drop=_name_);
    by ID Type;
    ID VarName;
    Var VarValue;
run;

@Areksoo wrote:

For example, the data looks like this (CSV):

"Client_NO", "Client_Name", "Unique_Fields"

123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"

222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"

423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"

The Unique_Fields contains 3 additional fields. I can read them into a data set, but how do I parse out Unique_Fields in a data step?

Areksoo · Posted 12-03-2018 01:40 PM

Apologies for the late reply... flu hit me hard on Friday.

This is fantastic and worked perfectly!

Ksharp · Posted 11-30-2018 09:29 AM

data have;
infile cards dsd truncover;
input id name :$20. temp $100.;
n+1;
do i=1 to countw(temp,',');
 t=scan(temp,i,',');
 x=scan(t,1,'"{}:');
 y=scan(t,2,'"{}:');
 output;
end;
drop i t temp;
cards;
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
;
run;

proc transpose data=have out=want;
by n id name;
var y;
id x;
run;
proc print;run;

SAS Innovate 2025: Call for Content

Classroom Training Available!