Jump to content

Extracting UK postcodes and other information from an address field with inconsistent structure


Avi Levy

Recommended Posts

Hello everyone,

I hope you are all well during these times!

I wanted to know if I could get some help with something I've been trying to do for a bit now, without any success.

I have a field which is named"Address" and I am trying to extract the postcode from it so I can use that postcode to identify the facility. The issue is that because the sytstem allowed free text entry for the addresses, they are not consistent in format. These are UK postcodes, so they will have 2 parts, such as "AB1 2CD" (with 6 characters)"B1 2CD" (with 5), or more.These could have spaces at the wrong place, or extra commas before it starts. Some examples are below:

"Royal Hospital, 1 John Smith Road, London, United Kingdom of Great Britain and Northern Ireland, AB1 2CD"

"1 John Smith Road, London, United Kingdom of Great Britain and Northern Ireland, A1 2CD"

"1 John Smith Road, London, United Kingdom of Great Britain and Northern Ireland, AB12CD"

"1 John Smith Road, London, United Kingdom of Great Britain and Northern Ireland, AB123CD"

"1 John Smith Road, London, United Kingdom of Great Britain and Northern Ireland, AB12 3CD"

"1 John Smith Road, London, United Kingdom of Great Britain and Northern Ireland,, AB 1 2CD"

 

Additionally, for those rows where no postcode has been entered at the end, I wanted to extract the hospital name from earlier on in the address. For example:

"Royal Hospital, 1 John Smith Road, London, United Kingdom of Great Britain and Northern Ireland, AB1 2CD" --> Royal Hospital

"Barts Health NHS Trust, Whipps Cross Hospital,Whipps Cross Road, Clinical Research Unit, Junction 9, lower ground floor,Greater London, London, london, Redbridge, United Kingdom of Great Britain and Northern Ireland" -->Whipps Cross Hospital

 

Any help would be greatly appreciated,

Cheers!

Link to comment
Share on other sites

Hello,

 

This sounds like a job for regular expressions (if you haven't used that before, there are lots of great resources online for learing more).I like using https://regex101.com/(online regex tester)for creating and trying out my expressions first, before using them in Spotfire. Ithis case, you may then want to add a calculated column in Spotfire, using the RXExtract function which can be used to extract a part of a string (your address) using a regular expression.

I found the following discussion about what regular expression to use to match UK postal codes:

https://stackoverflow.com/questions/164979/regex-for-matching-uk-postcodes

The regular expression (I used the one from the top voted answer) was:

([Gg][ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]))))s[0-9][A-Za-z]{2})

NOTE: you need to escape special characters, such as ,with when using them in your expression in Spotfire. The postcode expression contained "s"so I escaped that ("\s") in the Spotfire expression below.

Your calculated column in Spotfire would have the following expression (in this example, assuming your column is named "Address")

RXExtract([Address],"([Gg][ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]))))\s[0-9][A-Za-z]{2})",1)

Note: using this expression did not give a match for one of the values in your example - "AB 1 2CD". So, you may want to tweak the expression to capture everything you consider to be valid.

 

For finding hospital names, you could again use a calculated columns with a regular expression. I don't know exactly what format the names might be in but maybe this can help you get started:

RXExtract([Address],"[^,]([\sA-Za-z0-9]+[Hh]ospital)\s*[$,]",1)

I did not find a nice, direct way to only get the matching capturing group (the part within parentesis) so to get rid of some leading/trailing text, I simply added another RXExtract so my final column expression was this:

 

RXExtract(RXExtract([Address],"[^,]([\sA-Za-z0-9]+[Hh]ospital)\s*[$,]",1),"[A-Za-z0-9][A-Za-z0-9\s]+[A-Za-z0-9]",1)

 

Best Regards

Fredrik

Link to comment
Share on other sites

Hi Fredrik,

 

Thank you very much for looking into this - the expression you posted:

 

RXExtract([Address],"([Gg][ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]))))\s[0-9][A-Za-z]{2})",1)

 

did really well, and out of ~4000 or so rows less than 300 weren't found. That link you shared from stackoverflow and regex is a good place, so thanks for that too, and I will modify it as I learn more.

 

The hospital expression I am working on, so thank you for posting this. (I think I've realised I need to learn alot more about this function in spotfire!)

 

Regards,

 

Avi

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...