Jump to content

Common data periods


Dagfinn Veiberg

Recommended Posts

I have two datasets:

 

One dataset containing the start end end dates for each well drilled. It also contains the license names where the wells were drilled and the well names

One dataset containing the start end end dates when different companies were owners of the licenses. Blank end dates should be set to current date. It also contains the license- and company names

 

I would like to know which companies were license owners when the the different wells were drilled.

Attached are the two datasets and an illustration showing what I want to achieve.

Link to comment
Share on other sites

I took a closer look at this issuesince I have to do similar joins pretty often, but always with SQL.

First, I used a data function with the suggested 'fuzzyjoin' package. The results were fine, but the calculation was really long (>2h). My laptop runout of memory here...

Then I used another data function with a regular left join (on licence) and filtering (your condition of overlap) using 'dplyr'. This was really fast (1-2sec). But if you want to get a correct left join it takes some extra effort. On the other hand the difference of the inner join is just one row. (One well that never started)

Then I explored the 'sqldf' package. It allows you to write SQL like code in your data function. This is really nice. You can write very complex conditions so easily here. The performance was much better than using 'fuzzyjoin'. (25sec). And you get a regular left join.

The fastest option for overlap joins seems to be the foverlaps() function from 'data.table'. But it took me some time to understand how to use this function. And moreover, the function worked fine when the code run on the server, but it failed when using a TERR function. The TERR worked fine as long as the merge was performed only on the date columns, but it failed when the licence condtion was added. The performance of foverlaps() was more than 100 times faster then sqldf(). (0.4sec)

Since your data has an additional join on licenceand the data size is still not too big, a regular join with dplyr ordata.table plus additional filtering will give the results in reasonable time. I also like the sqldf solution a lot, but performancewise nothing can beat foverlaps() from data.table

I know you are looking for a no-code solution. And this is possible as well following the same approach as discussed above. First join you tables on the licence column, then removerows where the dates don't overlap. With just those two operations you will get the merged table. For more details please check the transformations in the data canvas in the attached dxp file.

PS: On the data import I had to do some data cleaning (some strings had trailing spaces; replacing empty end dates), and I renamed some columns for my data function approach. (Not included in the dxp).

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...