Common data periods

Dagfinn Veiberg · February 2, 2022

I have two datasets:

One dataset containing the start and end dates for each well drilled. It also contains the names of the licenses where the wells were drilled and the well names.

One dataset containing the start and end dates of the periods when different companies were owners of the licenses. Blank end dates should be set to the current date. It also contains the license and company names.

I would like to know which companies were license owners when the different wells were drilled.

Attached are the two datasets and an illustration showing what I want to achieve.

Fabian Duerr · February 2, 2022

Can you please check the following discussions:

https://community.spotfire.com/questions/overlapping-dates

https://community.spotfire.com/questions/join-tables-date-range

https://community.spotfire.com/questions/range-joining-two-data-tables-spotfire

Then let us know if you need further help.

Dagfinn Veiberg · February 2, 2022

Thanks for the info.

Since I am not a capable programmer myself, it would be nice to have a setup that solves my specific issue.

Fabian Duerr · February 5, 2022

I took a closer look at this issue since I have to do similar joins pretty often, but always with SQL.

First, I used a data function with the suggested 'fuzzyjoin' package. The results were fine, but the calculation took very long (>2 h), and my laptop ran out of memory.

Then I used another data function with a regular left join (on licence) followed by filtering (your overlap condition) using 'dplyr'. This was really fast (1-2 sec). However, filtering after the join drops wells without a matching owner, so getting a correct left join takes some extra effort. On the other hand, the difference from the inner join is just one row (one well that never started).
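To illustrate the join-then-filter idea, here is a minimal sketch in plain Python rather than dplyr. The sample rows, the column order, and the function name `owners_during_drilling` are made up for illustration and are not taken from the attached datasets. Two periods are treated as overlapping when each one starts on or before the day the other ends.

```python
from datetime import date

# Hypothetical sample rows (assumption: not the attached data).
# wells: (well, licence, drill_start, drill_end)
wells = [
    ("W-1", "PL001", date(2010, 3, 1), date(2010, 6, 30)),
    ("W-2", "PL001", date(2015, 1, 10), date(2015, 4, 20)),
    ("W-3", "PL002", date(2012, 5, 1), date(2012, 9, 15)),
]
# owners: (licence, company, owner_start, owner_end); None = blank end date
owners = [
    ("PL001", "Company A", date(2005, 1, 1), date(2012, 12, 31)),
    ("PL001", "Company B", date(2011, 1, 1), None),
    ("PL002", "Company C", date(2013, 1, 1), None),
]


def owners_during_drilling(wells, owners, today=None):
    """Join wells to owners on licence, then keep only overlapping periods."""
    today = today or date.today()
    result = []
    for well, licence, d_start, d_end in wells:
        for o_licence, company, o_start, o_end in owners:
            if o_licence != licence:
                continue  # join condition: same licence
            o_end = o_end or today  # blank end date -> current date
            # overlap condition: each period starts before the other ends
            if o_start <= d_end and d_start <= o_end:
                result.append((well, licence, company))
    return result


matches = owners_during_drilling(wells, owners, today=date(2022, 2, 5))
for row in matches:
    print(row)
```

In this sketch W-3 has no overlapping owner and therefore disappears from the result, which is the inner-join behaviour described above.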

Then I explored the 'sqldf' package. It allows you to write SQL-like code in your data function, which is really nice: even very complex join conditions are easy to write. The performance was much better than with 'fuzzyjoin' (25 sec), and you get a regular left join.
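The SQL for such an overlap join could look roughly like the query below. This is only a sketch: it runs the query through Python's built-in sqlite3 module instead of sqldf, and the table names, column names, and sample rows are assumptions made for illustration. Dates are stored as ISO strings so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows (assumption: not the attached data).
conn.executescript("""
CREATE TABLE wells (well TEXT, licence TEXT, drill_start TEXT, drill_end TEXT);
CREATE TABLE owners (licence TEXT, company TEXT, owner_start TEXT, owner_end TEXT);
INSERT INTO wells VALUES
  ('W-1', 'PL001', '2010-03-01', '2010-06-30'),
  ('W-2', 'PL001', '2015-01-10', '2015-04-20'),
  ('W-3', 'PL002', '2012-05-01', '2012-09-15');
INSERT INTO owners VALUES
  ('PL001', 'Company A', '2005-01-01', '2012-12-31'),
  ('PL001', 'Company B', '2011-01-01', NULL),
  ('PL002', 'Company C', '2013-01-01', NULL);
""")

# Left join on licence plus the overlap condition.
# COALESCE replaces a blank (NULL) end date with the current date.
query = """
SELECT w.well, w.licence, o.company
FROM wells AS w
LEFT JOIN owners AS o
  ON o.licence = w.licence
 AND o.owner_start <= w.drill_end
 AND w.drill_start <= COALESCE(o.owner_end, date('now'))
ORDER BY w.well, o.company
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the overlap condition sits in the ON clause rather than in a WHERE clause, a well without any matching owner (W-3 here) is kept with an empty company.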

The fastest option for overlap joins seems to be the foverlaps() function from 'data.table'. But it took me some time to understand how to use this function. Moreover, the function worked fine when the code ran on the server, but it failed in a TERR data function. TERR worked fine as long as the merge was performed only on the date columns, but it failed when the licence condition was added. foverlaps() was far faster than sqldf() (0.4 sec versus 25 sec, roughly 60 times faster).

Since your data has an additional join on licence and the data size is still not too big, a regular join with dplyr or data.table plus additional filtering will give the results in reasonable time. I also like the sqldf solution a lot, but performance-wise nothing can beat foverlaps() from data.table.

I know you are looking for a no-code solution. This is possible as well, following the same approach as discussed above. First join your tables on the licence column, then remove the rows where the dates don't overlap. With just those two operations you will get the merged table. For more details please check the transformations in the data canvas in the attached dxp file.

PS: During the data import I had to do some data cleaning (trimming trailing spaces from some strings; replacing empty end dates), and I renamed some columns for my data function approach. (Not included in the dxp.)
