For example, the data looks like this (CSV):
"Client_NO", "Client_Name", "Unique_Fields"
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
The Unique_Fields contains 3 additional fields. I can read them into a data set, but how do I parse out Unique_Fields in a data step?
Accepted Solutions
Here's the approach I would take. It doesn't really matter if the terms differ from record to record; any term that doesn't exist for a record simply ends up missing in the final data set. I've used this approach to parse XML files before and it's pretty robust.
*Sample data - assuming you already have this read in;
data have;
   informat ID $3. Type $12. Random $50.;
   infile cards dlm="|" dsd;
   input ID $ Type $ Random $;
   cards;
123|"Muffin"|"{"Category":"DOG","Age":3,"License":1111}"
222|"Sprinkles"|"{"Category":"CAT","Age":12,"License":1122}"
423|"Larry"|"{"Category":"CHICKEN","Age":2,"License":1133}"
;
run;
*flip structure to long to account for differing types;
data long;
   set have;
   *count the number of : which is the number of terms;
   nwords=countc(Random, ':');
   *remove {} from variable;
   random=compress(random, "{}");
   *loop over each term;
   do i=1 to nwords;
      *isolate each term;
      term=scan(Random, i, ",");
      *get name;
      VarName=scan(term, 1, '":');
      *get value;
      VarValue=scan(term, 2, '":');
      *output to data set;
      output;
   end;
   *keep only variables needed;
   keep ID Type VarName VarValue;
run;
*flip to wide format for standard data set;
proc transpose data=long out=want (drop=_name_);
   by ID Type;
   id VarName;
   var VarValue;
run;
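If some of the parsed values should end up numeric rather than character (the transpose leaves every value as text), a follow-up step along these lines could be added. This is only a sketch, and the names Age and License are an assumption based on the sample data:
*Sketch only - assumes WANT has character variables Age and License from the transpose;
data want_num;
   set want(rename=(Age=Age_c License=License_c));
   Age     = input(Age_c, best32.);
   License = input(License_c, best32.);
   drop Age_c License_c;
run;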
@Areksoo wrote:
For example, the data looks like this (CSV):
"Client_NO", "Client_Name", "Unique_Fields"
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
The Unique_Fields contains 3 additional fields. I can read them into a data set, but how do I parse out Unique_Fields in a data step?
Parsing comes second. The first step is deciding what the goal is. For example:
- Do you want to create three variables named Category, Age, and License? Will that always be true, or will it depend on what is inside Unique_Fields?
- Which new variables should be character and which should be numeric?
I didn't want to complicate it, but sometimes the value in Unique_Fields is completely different, with different variable names and values. For example:
"{"Color":"RED","Height":25}"
There are only 3 different sets of fields in Unique_Fields, so if I can handle one of them, I can just write IF statements with different logic depending on the first few characters. The variable names can be kept the same; if a value has quotes it stays text, and if it has no quotes it should become a number. There's a colon between each field name and its value, and a comma between each name/value pair.
There might be an elegant solution that reads several different sets of variables, but I would be happy with a solution that works for one case. The trouble I'm having is how to read from the colon to the comma, then jump to the next variable and read from the colon to the comma again.
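As an aside, here is a minimal sketch of that colon-to-comma idea in isolation, using a hard-coded placeholder string rather than the real data: once the braces and quotes are stripped, SCAN with both ':' and ',' as delimiters sees alternating name/value words, so the value of the i-th pair is word 2*i.
*Sketch only - hard-coded placeholder string, not the real input;
data _sketch;
   str='{"Category":"DOG","Age":3,"License":1111}';
   str=compress(str, '{}"');    * now Category:DOG,Age:3,License:1111 ;
   Age=scan(str, 4, ':,');      * 2nd pair, so the value is word 2*2 = 4 ;
   put Age=;
run;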
For your limited example and the note about changing fields, this works:
data junk;
   infile datalines dlm=',{}' dsd;
   informat Client_NO $5. Client_Name $15. category $10. Age best. license $5.;
   informat color $10. height best.;
   input Client_NO Client_Name @;
   if index(_infile_,'"Category":') > 0 then
      input @'"Category":' category @'"Age":' age @'"License":' license;
   if index(_infile_,'"Color"') > 0 then
      input @'"Color":' color @'"Height":' height;
datalines;
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
423,"Larry","{"Color":"RED","Height":25}""
;
run;
The first @ on the INPUT statement holds the line so it can be read more than once. _INFILE_ is an automatic variable containing the contents of the input buffer, and it can be searched like any other string.
The @'string' pointer control tells SAS to start reading from the position where that string is found. The ugly mix of single and double quotes is needed because the source data effectively contains unmatched quotes.
The INDEX function searches for a (hopefully unique) key value that identifies which record layout you need to read.
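A stripped-down illustration of that held-record pattern, using toy data rather than the original file:
*Toy illustration only - not the original data;
data _demo;
   infile datalines truncover;
   input @;   * trailing @ holds the record for re-reading ;
   if index(_infile_, 'Age:') > 0 then
      input @'Age:' age;   * move the pointer past the search string and read the value ;
datalines;
name:Muffin,Age:3
name:Sprinkles,Age:12
;
run;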
Apologies for the late reply... flu hit me hard on Friday.
This is fantastic and worked perfectly!
data have;
   infile cards dsd truncover;
   input id name :$20. temp $100.;
   * row counter so each input record is its own BY group for the transpose;
   n+1;
   * loop over the comma-separated name:value pairs;
   do i=1 to countw(temp,',');
      t=scan(temp,i,',');
      * with quotes, braces and colons as delimiters, word 1 is the name and word 2 the value;
      x=scan(t,1,'"{}:');
      y=scan(t,2,'"{}:');
      output;
   end;
   drop i t temp;
cards;
123,"Muffin","{"Category":"DOG","Age":3,"License":1111}"
222,"Sprinkles","{"Category":"CAT","Age":12,"License":1122}"
423,"Larry","{"Category":"CHICKEN","Age":2,"License":1133}"
;
run;
proc transpose data=have out=want;
   by n id name;
   var y;
   id x;
run;
proc print;run;