Solved: Re: import dataset

France · Posted 01-23-2019 09:31 AM

dear all,

could you please give me some suggestion to import the below text file into sas. the sample of it has been added in the attachment.

three of the information is below,

{"company_number":"09145694","data":{"address":{"address_line_1":"St. Andrews Road","country":"England","locality":"Henley-On-Thames","postal_code":"RG9 1HP","premises":"2"},"ceased_on":"2018-05-14","country_of_residence":"England","date_of_birth":{"month":2,"year":1977},"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d","kind":"individual-person-with-significant-control","links":{"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"},"name":"Mrs Nga Thanh Wildman","name_elements":{"forename":"Nga","middle_name":"Thanh","surname":"Wildman","title":"Mrs"},"nationality":"Vietnamese","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}

{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":9,"year":1947},"etag":"45f9c9e5494b574eb52abc3990a49bd96fe09df3","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/RgR9Zhc7yGhV0SBys8_WJ6H9O1o"},"name":"Mr Stephen Robert Charles Davies","name_elements":{"forename":"Stephen","middle_name":"Robert Charles","surname":"Davies","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","ownership-of-shares-25-to-50-percent-as-firm"],"notified_on":"2016-06-30"}}

{"company_number":"08581893","data":{"address":{"address_line_1":"High Street","address_line_2":"Wendover","country":"England","locality":"Aylesbury","postal_code":"HP22 6EA","premises":"14a","region":"Buckinghamshire"},"ceased_on":"2016-07-01","country_of_residence":"England","date_of_birth":{"month":6,"year":1965},"etag":"d55168c49f85ab1ef38a12ed76238d68f79f5a01","kind":"individual-person-with-significant-control","links":{"self":"/company/08581893/persons-with-significant-control/individual/-6HQmkhiomEBXJI2rgHccU67fpM"},"name":"Mr Quentin Colin Maxwell Solt","name_elements":{"forename":"Quentin","middle_name":"Colin Maxwell","surname":"Solt","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2016-06-30"}}

thanks in advance

Reeza · Posted 01-24-2019 10:36 PM

Agreed. Seems like it's designed to be used via an API and someone is just appending results in every day for each company. Interesting method.

@France this should get you started. It'll get you each line into a separate observation and you can work on separating it out using SCAN().

Note the OBS option at the moment which is limiting the number of records. You'll want to remove that, but for testing, it should get you started.

filename psc '/folders/myfolders/psc-snapshot-2019-01-24_1of13.json' termstr=LF;

data psc1;
length myVar $30000. companyName $100.;
infile psc lrecl=30000 obs=1000;
input;

myVar = _infile_;

companyName = dequote(scan(myVar, 2, ":,"));


run;

View solution in original post

Reeza · Posted 01-23-2019 10:43 AM

Is it identified as XML or JSON originally?

France · Posted 01-23-2019 03:33 PM

Dear Reeza,

thanks for your message,

sorry, I do not understand the meaning of XML and JSON. besides, the dataset is download by following the website (http://download.companieshouse.gov.uk/en_output.html)

Reeza · Posted 01-23-2019 03:52 PM

When I download the zip file, that's been separated I get a clean well formatted CSV file, though it maybe in a different encoding.

data demo;
infile "..\BasicCompanyData-2019-01-01-part1_6\BasicCompanyData-2019-01-01-part1_6.csv" lrecl=32000 obs=100;
file 'c:\_localdata\demo.csv';
input ;
put _infile_;
run;

This creates another text file with the first 100 rows for testing, because the file is too big to open manually.

The first 100 records are attached.

Is this from a specific file in the link?

@France wrote:
Dear Reeza,

thanks for your message,

sorry, I do not understand the meaning of XML and JSON. besides, the dataset is download by following the website (http://download.companieshouse.gov.uk/en_output.html)

France · Posted 01-23-2019 08:05 PM

Dear Reeza,

apology for my mistake. the dataset is downloaded from the below website.
http://download.companieshouse.gov.uk/en_pscdata.html

Reeza · Posted 01-23-2019 08:37 PM

It’s a JSON file so try the JSON libname.

https://blogs.sas.com/content/sasdummy/2016/12/02/json-libname-engine-sas/

France · Posted 01-24-2019 08:56 PM

dear Reeza,

Thanks for your kind advice.

But I do not understand the meaning of it. Could you please explain it for me by using the following example.

{

"company_number":"09145694",

"data":

{

"address":

{

"address_line_1":"St. Andrews Road",

"country":"England"

,"locality":"Henley-On-Thames",

"postal_code":"RG9 1HP",

"premises":"2"

},

"ceased_on":"2018-05-14",

"country_of_residence":"England",

"date_of_birth":

{

"month":2,

"year":1977

},

"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d",

"kind":"individual-person-with-significant-control",

"links":

{

"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"

},

"name":"Mrs Nga Thanh Wildman",

"name_elements":

{

"forename":"Nga",

"middle_name":"Thanh",

"surname":"Wildman",

"title":"Mrs"

},

"nationality":"Vietnamese",

"natures_of_control":

["ownership-of-shares-50-to-75-percent"],

"notified_on":"2016-04-06"

}

this is a group of record in my dataset. could you please give me some suggestion about this?

ChrisNZ · Posted 01-24-2019 09:24 PM

@Reeza pointed you to the answer.

Time for you to:

1. Read about JSON files.

2. Read the link provided.

One line in that page is

libname space JSON fileref=resp;

Start from there.

High-Performance SAS Coding - Third Edition

Reeza · Posted 01-24-2019 09:24 PM

JSON is a pain and the files there are too big for the JSON libname in my example. Because of that you have to manually parse the file. It's not easy, I agree, and it is a significant amount of work to test and get it right. When I did this for a different open data set, I think it ended up taking two days to get it fully working, testing and automated. Unfortunately I don't have that amount of time to spend on this. I'm not saying it's going to take that long - I was trying the JSON libname here, in R and python - and was trying to use a more 'modern' method. In the end, I gave up and used SCAN() , FIND() and INDEX() to get the data I needed.

You do have a learning curve here for sure. Sorry I can't help further. Good Luck.

@France wrote:

dear Reeza,

Thanks for your kind advice.

But I do not understand the meaning of it. Could you please explain it for me by using the following example.

{

"company_number":"09145694",

"data":

{

"address":

{

"address_line_1":"St. Andrews Road",

"country":"England"

,"locality":"Henley-On-Thames",

"postal_code":"RG9 1HP",

"premises":"2"

},

"ceased_on":"2018-05-14",

"country_of_residence":"England",

"date_of_birth":

{

"month":2,

"year":1977

},

"etag":"3b8caf795c03af63921e381f7bb8300a51ebb73d",

"kind":"individual-person-with-significant-control",

"links":

{

"self":"/company/09145694/persons-with-significant-control/individual/bIhuKnMFctSnjrDjUG8n3NgOrlU"

},

"name":"Mrs Nga Thanh Wildman",

"name_elements":

{

"forename":"Nga",

"middle_name":"Thanh",

"surname":"Wildman",

"title":"Mrs"

},

"nationality":"Vietnamese",

"natures_of_control":

["ownership-of-shares-50-to-75-percent"],

"notified_on":"2016-04-06"

}

}

this is a group of record in my dataset. could you please give me some suggestion about this?

Background on JSON: https://www.w3schools.com/js/js_json_intro.asp

ChrisNZ · Posted 01-24-2019 09:35 PM

@Reeza

Interesting. I did't know the libname engine had size limitations.

Do you know if PROC DS2's JSON package has similar constraints?

High-Performance SAS Coding - Third Edition

Reeza · Posted 01-24-2019 09:50 PM

After further checking that doesn't seem to be the issue. It seems to be a non standard JSON file. R only read the first two rows for some reason and that's the same place SAS is complaining about. I'm guessing a different EOL or encoding, since it's European data?

@ChrisNZ wrote:

@Reeza

Interesting. I did't know the libname engine had size limitations.

Do you know if PROC DS2's JSON package has similar constraints?

ChrisNZ · Posted 01-24-2019 10:19 PM

After checking, each record is an independent JSON structure that looks like the below, with a LF to delimit the record.

So manual parsing seems like the way to go.

{"company_number":"03100413"
,"data":{"etag":"e0c7cf72c430d379d849c89225a4d6bad8face98"
        ,"kind":"persons-with-significant-control-statement"
        ,"links":{"self":"/company/03100413/persons-with-significant-control-statements/E9L3Ktzi07T3zgnWCSSuZ5-OjJE"}
        ,"notified_on":"2016-08-16"
        ,"statement":"no-individual-or-entity-with-signficant-control"}
}
{"company_number":"SC287552"
,"data":{"ceased_on":"2017-10-13"
        ,"etag":"5aba35df221cc2bfc1e5fa3f571390522098d755"
        ,"kind":"persons-with-significant-control-statement"
        ,"links":{"self":"/company/SC287552/persons-with-significant-control-statements/VxNqbL-lSWIGfhgeaHSy2qYV1hw"}
        ,"notified_on":"2016-07-15"
        ,"statement":"psc-details-not-confirmed"}
}

High-Performance SAS Coding - Third Edition

Reeza · Posted 01-24-2019 10:36 PM

Agreed. Seems like it's designed to be used via an API and someone is just appending results in every day for each company. Interesting method.

@France this should get you started. It'll get you each line into a separate observation and you can work on separating it out using SCAN().

Note the OBS option at the moment which is limiting the number of records. You'll want to remove that, but for testing, it should get you started.

filename psc '/folders/myfolders/psc-snapshot-2019-01-24_1of13.json' termstr=LF;

data psc1;
length myVar $30000. companyName $100.;
infile psc lrecl=30000 obs=1000;
input;

myVar = _infile_;

companyName = dequote(scan(myVar, 2, ":,"));


run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away