BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Areksoo
Calcite | Level 5

For example, the data looks like this (CSV):

 

"Client_NO", "Client_Name", "Unique_Fields"

123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"

222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"

423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"

 

The Unique_Fields contains 3 additional fields.  I can read them into a data set, but how do I parse out Unique_Fields in a data step?  

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Here's the approach I would take. It doesn't really matter if the terms are different, they'll be set as missing in the data set if it doesn't exist. I've used this approach to parse XML files before and it's pretty robust. 

 

*Sample data - assuming you already have this read in;
data have;
    informat ID $3. Type $12. Random $50.;
    infile cards dlm="|" dsd;
    input ID $ Type $ Random $;
    cards;
123|"Muffin"|"{"Category":"DOG","Age":3,"License":1111}"
222|"Sprinkles"|"{"Category":"CAT","Age":12,"License":1122}"
423|"Larry"|"{"Category":"CHICKEN","Age":2,"License":1133}"
;
    ;
    ;;
run;

*flip structure to long to account for differing types;
data long;
    set have;
    
    *count the number of : whichis number of terms;
    nwords=countc(Random, ':');
    *remove {} from variable;
    random=compress(random, "{}");

    *loop over each number of term;
    do i=1 to nwords;
        *isolate each term alone;
        term=scan(Random, i, ",");
        *get name;
        VarName=scan(term, 1, '":');
        *get value;
        VarValue=scan(term, 2, '":');
        *output to data set;
        output;
    end;
    
    *Keep only variables needed;
    keep ID Type VarName VarValue;
run;

*flip to wide format for standard data set;
proc transpose data=long out=want (drop=_name_);
    by ID Type;
    ID VarName;
    Var VarValue;
run;

@Areksoo wrote:

For example, the data looks like this (CSV):

 

"Client_NO", "Client_Name", "Unique_Fields"

123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"

222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"

423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"

 

The Unique_Fields contains 3 additional fields.  I can read them into a data set, but how do I parse out Unique_Fields in a data step?  

 

 


 

View solution in original post

7 REPLIES 7
Astounding
PROC Star

Parsing comes second.  First step is making decisions about what the goal is.  For example:

 

  • Do you want to create three variables named Category, Age, and License?  Will that always be true, or will it depend on what is inside Unique_Fields?
  • Which new variables should be character and which should be numeric?

 

 

Areksoo
Calcite | Level 5

I didn't want to complicate it, but sometimes the values in Unique_Fields is completely different with different variable names and values.  For example:

 

"{"Color":"RED","Height":25}"

 

There's only 3 different sets fields in the Unique_Field, so if I can do one of them, I would just make IF statements with different logic depending on the first few characters.  The variable names can be kept the same, but if the text has quotes, then it remains text, if it has no quotes it should become a number.  Between field name and value, there's a colon and a comma between each set of field name and value.

 

There might an elegant solution to read multiple different variables, but I would just be happy with a solution that works for 1 case.  The trouble I'm having is how do I read from the colon to the comma and then jump to the next variable and read from the colon to the comma.

Reeza
Super User
Is the source JSON or XML originally?
ballardw
Super User

For your limited example and note about changing fields this works:

data junk;
   infile datalines dlm=',{}' dsd;
   informat Client_NO $5. Client_Name $15. category $10. Age best. license $5.;
   informat color $10. height best.;
   input  Client_NO  Client_Name @;
   if index(_infile_,'"Category":')> 0 then 
      input    @'"Category":' category    @'"Age":' age @'"License":' license;
   ;                             
   if index(_infile_,'"Color"')> 0 then 
      input    @'"Color":' color    @'"Height":' height;
   ;                             
datalines;
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
423,"Larry","{"Color":"RED","Height":25}""
;
run;

 

The first @ on the input holds the line for different readings. _INFILE_ is an automatic variable containing the contents of the INFILE buffer that can be searched like any other string.

The @'string' tells SAS to read from the position that value is found. the ugly code of mixed single and double quotes is because that source actually has unmatched quotes for most purposes.

The index function searches for a hopefully key value that identifies which record format you need to read

Reeza
Super User

Here's the approach I would take. It doesn't really matter if the terms are different, they'll be set as missing in the data set if it doesn't exist. I've used this approach to parse XML files before and it's pretty robust. 

 

*Sample data - assuming you already have this read in;
data have;
    informat ID $3. Type $12. Random $50.;
    infile cards dlm="|" dsd;
    input ID $ Type $ Random $;
    cards;
123|"Muffin"|"{"Category":"DOG","Age":3,"License":1111}"
222|"Sprinkles"|"{"Category":"CAT","Age":12,"License":1122}"
423|"Larry"|"{"Category":"CHICKEN","Age":2,"License":1133}"
;
    ;
    ;;
run;

*flip structure to long to account for differing types;
data long;
    set have;
    
    *count the number of : whichis number of terms;
    nwords=countc(Random, ':');
    *remove {} from variable;
    random=compress(random, "{}");

    *loop over each number of term;
    do i=1 to nwords;
        *isolate each term alone;
        term=scan(Random, i, ",");
        *get name;
        VarName=scan(term, 1, '":');
        *get value;
        VarValue=scan(term, 2, '":');
        *output to data set;
        output;
    end;
    
    *Keep only variables needed;
    keep ID Type VarName VarValue;
run;

*flip to wide format for standard data set;
proc transpose data=long out=want (drop=_name_);
    by ID Type;
    ID VarName;
    Var VarValue;
run;

@Areksoo wrote:

For example, the data looks like this (CSV):

 

"Client_NO", "Client_Name", "Unique_Fields"

123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"

222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"

423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"

 

The Unique_Fields contains 3 additional fields.  I can read them into a data set, but how do I parse out Unique_Fields in a data step?  

 

 


 

Areksoo
Calcite | Level 5

Apologies for the late reply... flu hit me hard on Friday.  

 

This is fantastic and worked perfectly!

Ksharp
Super User
data have;
infile cards dsd truncover;
input id name :$20. temp $100.;
n+1;
do i=1 to countw(temp,',');
 t=scan(temp,i,',');
 x=scan(t,1,'"{}:');
 y=scan(t,2,'"{}:');
 output;
end;
drop i t temp;
cards;
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
;
run;

proc transpose data=have out=want;
by n id name;
var y;
id x;
run;
proc print;run;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1446 views
  • 2 likes
  • 5 in conversation