-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
894 lines (673 loc) · 41.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE, error=TRUE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE,
comment = "#>",
collapse = TRUE,
error = TRUE,
warning = FALSE,
message = FALSE,
fig.align = 'center',
autodep=FALSE
)
```
```{r, echo=FALSE}
proj_root <- rprojroot::find_rstudio_root_file()
```
```{r, out.width = "900px", fig.cap="Step at 1 day", echo=FALSE}
# knitr::include_graphics(file.path(proj_root, "img/volve_north_view_start_end.png"))
knitr::include_graphics("./img/volve_north_view_start_end.png")
```
# A reproducible comparison of the Volve reservoir model
# Introduction
Continuing with the previous article [The fabrication of an artificial intelligence agent for reservoir history matching from the Volve dataset](http://blog.oilgainsanalytics.com/publications/the-fabrication-of-an-artificial-intelligence-agent-for-reservoir-history-matching-from-the-volve-dataset/), and the generation of a master dataset for an AI agent to perform history matching of reservoir models, we will extract additional data from the output of the Volve reservoir model, the `PRT` text file. This is the output "as-is", as we found it. No additional simulation runs have been performed over this model.
In addition to the PRT file there are binary files that Eclipse generates as part of the output. We still don't have a reader for those files but we are close to take a look at some code written in Python by a Reservoir Engineer who I have been in touch for the past few days. Working with binary files should be your more preferred option than dealing with the painful extraction of variables using regular expressions (regex). At any rate, we could take these last two posts as an exercise of applying regex to text files when we don't have access to the binary files format.
## Data blocks in the simulator output
In the previous article, we extracted the __field totals__, or cumulatives, from the text blocks with the keyword __BALANCE AT__ in the `PRT` file; produced volumes of oil, gas and water as well as the COIP, or currently oil in place. There is another block of interest in the text file, that we could call the __STEP__ block. This block contains instantaneous production variables such as GOR, watercut, WGR and the average bottomhole pressure (PAV). WHat we are looking here is to improve the current dataset from containing cumulative volumes to include these new production variables. In other other words, we want a more comprehensive dataset that we could use as part of a machine learning algorithm later.
## Challenges ahead
There are some challenges though while doing this:
* The number of observations (rows) of the variables extracted from the __STEP__ block is greater than the observations of the field totals dataset.
* Since the data originated from the __STEP__ block is more granular than that of the __BALANCE-AT__ block, we will face the challenge of mismatched dates between both. So, a more complete dataset of production variables comes with its costs. Merging two dataframes will have to be carefully performed.
* Additionally, we may not getting all the variables from the __PRT__ text file as from the binary files generated by Eclipse. As one of the reservoir engineers, who read the previous article, noticed there is an inconsistency in the __produced water__ between the PRT file values I got and the binary files he was able to read. This is one of the drawbacks of reading the __PRT__ text file and mining for data; it is difficult to find all the output variables from the simulation, and some of them -if not properly identified- could be misleading.
* While the extraction of data from a text file serves a purpose, when a reader of binary files from the simulation is not available, we should exercise caution with those extracted from a text file.
* It is nice to get a good match of volumes between simulation and the real world. But we also have to be prepared for the bad news: when one of the reservoir fluids is far, far away from the match.
# Read the simulation output data
## Read the PRT text file
As we did in the previous step, we start by reading the reservoir simulation output file: the text file __VOLVE_2016.PRT`__. It is a relatively big file: 228 megabytes. I have zipped it in order to save some disk space and prevent Github from complaining about the size of the file. Maximum size of a file in Github is 100 megabytes.
This data operation was shown in the previous article with the only difference that I used Google drive instead. In a second case, I used Zenodo, a service that allows sharing datasets up to 50 gigabytes per dataset. I am providing the links to the datasets living in Zenodo at the end of this article.
```{r read-prt-file}
library(dplyr)
library(ggplot2)
# read the Eclipse PRT output report
# had to zip the PRT file because it's 225 MB and too big for Github
volve_2016_zip <- file.path(proj_root, "inst/rawdata", "VOLVE_2016.zip")
temp <- tempdir()
volve_2016_txt <- readLines(unzip(volve_2016_zip, exdir = temp))
```
Once with the contents of `VOLVE_2016.PRT` loaded in the object __volve_2016_txt__, we proceed to perform the extraction.
We start by extracting few rows after the __STEP__ keyword.
```{r}
# get a list of rows from " STEP"
# find the rows where we find the word " STEP"
step_rows <- grep("^ STEP", volve_2016_txt)
# add rows ahead to where the keyword was found
step_info_range <- lapply(seq_along(step_rows), function(x)
c(step_rows[x], step_rows[x]+1:2)) # add two extra row indices
step_info_range[[1]] # sample for report page 1 only
```
These extra row indices are lines of text where the report keeps more information of the evolution of the simulation. Here is a couple of screenhots.
Step at day __1__:
```{r, out.width = "600px", fig.cap="Step at 1 day", echo=FALSE}
# knitr::include_graphics(file.path(proj_root, "img/step_0001.png"))
knitr::include_graphics("./img/step_0001.png")
```
Step at day __3,197__:
```{r, out.width = "600px", fig.cap="Step at 3197 days", , echo=FALSE}
knitr::include_graphics("./img/step_3197.png")
```
Now, knowing the row indices for the text we need to extract from the `PRT` file, we can proceed to extracting those lines of text and putting them in a list, one page or one step, per list element. We do this to later iterate through all the steps and extract the data we want. The object `steps_info_txt_pages` is a list containing the row indices that are of our interest.
```{r}
# get the text from all pages and put them in a list
steps_info_txt_pages <- lapply(seq_along(step_info_range), function(x)
volve_2016_txt[step_info_range[[x]]])
```
For example, this is an example of the first page for step #1.
```{r}
steps_info_txt_pages[1]
```
## Extracting step data from the text file
### Extract the days from the STEP block
Although we could extract all the data we require from the text file in one go, it is better to see one or two examples of seeing __regular expressions__ or __regex__ at work. Regular expressions are practically available to all programming languages: C++, Java, JavaScript, Python, Perl, R, etc.
In this first example, we will extract the number of days at the current simulation step. If this is the first step page:
```
STEP 1 TIME= 1.00 DAYS ( +1.0 DAYS INIT 5 ITS) (1-JAN-2008)
PAV= 329.6 BARSA WCT= 0.00 GOR= 0.00000 SM3/SM3 WGR= 0.00000 SM3/SM3
```
to extract the days we have to provide a regex pattern that detects a real number like `1.00`, which is `".*?(\\d+.\\d.)+.*"`.
__Explanation__
* `.*?` will match any characters. lazy matching.
* `(\\d+.\\d.)` capturing group.
* `\\d+` matches any number of digits
* `\\d.` matches a digit and then any character
Here we iterate through the list _steps_info_txt_pages_, extract the step-page, extract the number of days from the text. After that, we convert the vector to a dataframe.
```{r rows.print=20}
# iterate through the list of STEP pages
days_dfs <- lapply(seq_along(steps_info_txt_pages), function(x) {
page <- steps_info_txt_pages[[x]] # put all pages text in a list
days_row_txt <- page[1] # get 1st row of page
days_value <- sub(".*?(\\d+.\\d.)+.*", "\\1", days_row_txt,
perl = TRUE) # extract the days
# dataframe; days as double; no factors.
data.frame(days = as.double(days_value), stringsAsFactors = FALSE)
})
days_df <- do.call("rbind", days_dfs)
```
### Extract the days
Let's see a sample of the first ten and last ten rows for the dataframe _days_df_ just extracted:
```{r}
rbind(head(days_df, 10), tail(days_df, 10)) # show the first 10 and last 10 rows
```
This is just 20 rows of data. The dataframe _days_df_ has `r nrow(days_df)` observations or rows.
### Extract the simulator running date
This is the second example of data extraction using _regex_.
The following regular expresion pattern extracts the date from the current text line.
__Explanation__
* `.*?(\\d{1,2}-[A-Z]{3}-\\d{4}).` entire regex pattern
* `(\\d{1,2}-[A-Z]{3}-\\d{4})` parenthesis indicate a group to extract the date
* `.*?` match any character
* `\\d{1,2}` match one or two digits (day)
* `-[A-Z]{3}` match a dash followed by three letters (month)
* `-\\d{4}` match four digits (year)
Again, we iterate through the list _steps_info_txt_pages_, select a step-page, and then extract the date. Notice the difference between the previous _regex_ and this one for capturing the date.
```{r rows.print=20}
# iterate through the list of pages: dates
date_dfs <- lapply(seq_along(steps_info_txt_pages), function(x) {
page <- steps_info_txt_pages[[x]] # put all pages text in a list
date_row_txt <- grep(" STEP", page) # get row index at word STEP
date_value <- sub(".*?(\\d{1,2}-[A-Z]{3}-\\d{4}).", "\\1", page[date_row_txt])
# dataframe; no factors
data.frame(date = date_value, stringsAsFactors = FALSE)
})
date_df <- do.call("rbind", date_dfs)
# size of the dataframe: rows by columns
dim(date_df)
rbind(head(date_df, 10), tail(date_df, 10)) # show the first 10 and last 10 rows
```
This second dataframe _date_df_ also has `r nrow(date_df)` rows. You see a kind of a pattern here, right? We are extracting columns with the same number of observations (rows).
### Extract all the values from the __STEP__ block
After showing this pair of examples, we continue with the extraction of the rest of the values. If you take a look at the PRT file you will recognize these as the variables to be extracted:
* `STEP` simulation step number
* `TIME` number of days elapsed at the simulation step
* `date` current date at the simulation run
* `PAV` average pressure
* `WCT` watercut
* `GOR` gas oil ratio
* `WGR` water gas ratio
The mission here is to extract all the variables that are made available by the simulator in the __STEP__ block. As shown above, they are seven variables. The two previous examples were showing the work for two of these variables.
The following is an __R__ script that extracts all the variables from all the occurrences of the __STEP__ block in the __PRT__ file.
Something that we need to know: the steps are not entirely sequential. They may skip a day, or more, or could have been generated after "few hours" in the simulation, and they do not necessarily match the date in the _field totals_ dataframe. This is something to consider. Both dataframes will have different number of rows.
What is new here is that I am extracting several values from the text in one shot: __step__ (group 1), __days__ (group 2), and __date__ (group 4). Then assign them to their respective memory objects. A second thing that we do here -and very common in text files-, is the correction of the short name of the month. I am not sure what the reservoir guys were thinking about, but they decided to baptize the month of __July__ as __JLY__, when the standard practice is to short-name it __JUL__. Anyway, the side effect of this is that at the moment of converting character to date formats, the date using __JLY__ will be translated as _NULL_, since the date converter doesn't know anything about a month short-called __JLY__. I didn't know this in advance, but examining the output I found some gaps in the output that didn't show up in the raw data. So, be mindful of these events.
```{r rows.print=50}
# script that extracts production variables from the simulator output
library(lubridate)
# get the row indices where we find the keyword " STEP"
step_rows <- grep("^ STEP", volve_2016_txt)
# get rows ahead range. by block of text or per page
# in the case of the STEP block we are only interested in the next two rows
step_info_range <- lapply(seq_along(step_rows), function(x)
c(step_rows[x], step_rows[x]+1:2))
# get the text from all STEP pages and store each in a list element
steps_info_txt_pages <- lapply(seq_along(step_info_range), function(x)
volve_2016_txt[step_info_range[[x]]])
# iterate through the list of pages for the STEP blocks in the report
step_info_dfs <- lapply(seq_along(steps_info_txt_pages), function(x) {
page <- steps_info_txt_pages[[x]] # load a STEP block/page
# this is line 1
row_txt <- grep(" STEP", page) # line 1 starts with STEP
# pattern extraction for 1st line of text: STEP, TIME, date
line_1_pattern <- ".*?(\\d+)+.*?(\\d+.\\d+)+.*?(\\d+)+.*?(\\d{1,2}-[A-Z]{3}-\\d{4})+.*"
step_value <- sub(line_1_pattern, "\\1", page[row_txt], perl = TRUE) # extract step
days_value <- sub(line_1_pattern, "\\2", page[row_txt], perl = TRUE) # extract days
date_value <- sub(line_1_pattern, "\\4", page[row_txt], perl = TRUE) # extract date
date_value <- sub("JLY", "JUL", date_value) # change JLY by JUL
# this is line 2
row_txt <- grep(" PAV", page) # line 2 starts with PAV=
# pattern extraction for 2nd line of text: PAV, WCT, GOR, WGR
line_2_pattern <- ".*?(\\d+.\\d+)+.*?(\\d+.\\d+)+.*?(\\d+.\\d+)+.*?(\\d+.\\d+).*"
pav_value <- sub(line_2_pattern, "\\1", page[row_txt], perl = TRUE) # Get avg pres
wct_value <- sub(line_2_pattern, "\\2", page[row_txt], perl = TRUE) # get WCT
gor_value <- sub(line_2_pattern, "\\3", page[row_txt], perl = TRUE) # get GOR
wgr_value <- sub(line_2_pattern, "\\4", page[row_txt], perl = TRUE) # get WGR
# dataframe;
data.frame(step = as.integer(step_value),
date = dmy(date_value),
time_days = as.double(days_value),
pav_bar = as.double(pav_value),
wct_pct = as.double(wct_value),
gor_m3m3 = as.double(gor_value),
wgr_m3m3 = as.double(wgr_value),
stringsAsFactors = FALSE)
})
step_info <- do.call("rbind", step_info_dfs) # put together all dataframes in list
# show a summary of the dataframe
glimpse(step_info)
```
Finally, we get all the vectors for __step__ number, __date__, __date__, average __pressure__, __watercut__, __GOR__ and __WGR__ in a dataframe.
Then, we show the dataframe as a _tibble_, which is an elegant way of presenting long dataframes. This dataframe _step_info_, in particular, has `r nrow(step_info)` rows and `r ncol(step_info)` columns.
```{r}
# show as a tibble
(step_info <- as_tibble(step_info))
```
These are the names of the variables in the dataframe:
```{r}
names(step_info)
```
You can see that the __STEP__ block does not carry any data regarding cumulative production.
### A Sample of the _step_info_ dataframe
Let's test the first day and last day of the simulation:
```{r}
tail(step_info$date,1) - head(step_info$date,1)
```
The simulation runs for `r tail(step_info$date,1) - head(step_info$date,1)` days of the reservoir life.
### Save to data files
Now, let's save the data as a _.Rdata_ file (readable from R) and as as _CSV_ file (capable of being imported practically by any software).
```{r}
# data from STEP block.
data_folder <- file.path(proj_root, "data") # project folder
# full filename, including path
save(step_info, file = file.path(data_folder, "data_from_step.Rdata"))
write.csv(step_info, file = file.path(data_folder,
"data_from_step.CSV"),
row.names = FALSE)
```
### Plots from the step block
Next, we proceed to show some of the data as visualization output. Here we will use the R package __ggplot2__. This is a flexible and sophisticated visualization tool using the Grammar of Graphics. These will be very simple plots of the variables we just extracted.
#### Plot pressure vs time
Let's take a look at the pressure over the life of the field, from the simulator perspective.
```{r}
ggplot(step_info, aes(x =date, y = pav_bar)) +
geom_line(color = "red") +
labs(title = "Pressure over time", subtitle = "Simulator output",
y = "Average Pressure (PAV), bar")
```
#### Plot watercut vs time
```{r}
# plot from PRT simulator output (STEP block)
ggplot(step_info, aes(x =date, y = wct_pct)) +
geom_line(color = "blue") +
labs(title = "Field watercut over time", subtitle = "Simulator output",
y = "Watercut, percent")
```
### Plot all variables from `STEP` block
To prevent cluttering our report from lots of plots, we will show all plots in one figure using __facets__. In order to achieve this, we will have to transform our data in __tidy format__.
```{r}
library(tidyr)
# convert to tidy format
step_info_gather <-
step_info %>%
select(-c(step, time_days)) %>%
gather(key = var, value, pav_bar:wgr_m3m3) %>%
print()
# plot production variables from the STEP block in the PRT file
# change the name of the facet labels
facet_labels <- c(`gor_m3m3` = "GOR, m3/m3", `pav_bar` = "Avg.Pres, bar",
`wct_pct` = "Watercut, %", `wgr_m3m3` = "Water Gas Ratio, m3/m3")
# plot the facets with free y-axis.
ggplot(step_info_gather, aes(x = date, y = value)) +
geom_line() +
facet_wrap(.~var, scales = "free_y", labeller = as_labeller(facet_labels)) +
labs(title = "Production variables from STEP block",
subtitle = "Simulator output", y = "", x = "")
```
## Merge cumulative oil with simulator steps
Next step is combining the data from the _steps_ dataframe with the _balance-at_ production cumulatives. This is not an straight operation since both dataframes have different date references.
1. First, we will extract the production cumulatives (field totals) from the PRT file. This is something we already did in the previous article. We will load that script.
2. Save the production cumulatives dataframe as a CSV file.
3. Join the field_totals and step_info dataframe by a common variable.
4. Plot variables
5. Calculate cumulatives
6. Fix the dates in the merge dataframe so they are continuous from 2018 till October 2016.
7. Calculate cumulatives and volumes by month
## Extract Field cumulatives from the `BALANCE-AT` block
```{r rows.print=25}
# load a script with functions
r_folder <- file.path(proj_root, "R")
r_script <- file.path(r_folder, "extract_data_from_prt.R")
source(r_script)
# run function to extract field totals or cumulative production
field_totals <- extract_field_totals(prt_file_content = volve_2016_txt)
field_totals
```
#### Save field totals to data files
```{r}
# save the field totals from BALANCE block.
data_folder <- file.path(proj_root, "data")
save(field_totals, file = file.path(data_folder, "field_totals_balance.Rdata"))
write.csv(field_totals, file = file.path(data_folder,
"field_totals_balance.CSV"),
row.names = FALSE)
```
We quickly plot the water outflow over the years.
> Note. This data might not correct. The PRT file does not explicitely assign a variable for the water outflow. Besides, this volume comes with a negative sign, which indicates injection rather than water production.
```{r}
# plot from PRT simulator output (BALANCE AT block)
ggplot(field_totals, aes(x =date, y = wat_otw)) +
geom_line(color = "blue") +
labs(title = "Field Water Outflow Through Wells",
subtitle = "Simulator output, BALANCE block",
y = "Water volume, sm3")
```
Now, we know that the STEP dataframe has more rows than the BALANCE-AT dataframe. What we want is to correlate the step with the oil cumulatives.
These are the rows in the _step_info_ dataframe:
```{r}
dim(step_info)
```
These are the number of rows and colums of the _field_totals_ dataframe.
```{r}
dim(field_totals)
```
## Join field totals and step dataframes
We merge both tables, steps and field cumulatives. The resultant dataframe we call it _step_totals_.
```{r rows.print=50}
# join both tables by the common variable "date"
step_totals <-
left_join(step_info, field_totals, by = "date") %>%
na.omit() %>%
select(date, time_days, days, everything()) %>%
as_tibble() %>%
print
```
The dataframe step_totals has `r nrow(step_totals)` rows or observations and `r ncol(step_totals)` columns or variables.
Looking at this dataframe closer we observe that the date intervals is not uniform; it is not weekle, bi-weekly or monthly because it is a merged dataframe. What we will do next is transforming this dataframe to a monthly summary or where the date intervals is a month separation.
### Plot outflow through wells from simulator
Without additional data transformations we plot the cumulatives of the merged dataset.
```{r}
ggplot(step_totals, aes(x = date, y = oil_otw)) +
geom_line(color = "dark green", size = 1.1) +
ggtitle("Cumulative Oil, sm3", subtitle = "Simulator")
```
This ggplot for the cumulative gas.
```{r}
ggplot(step_totals, aes(x = date, y = gas_otw)) +
geom_line(color = "orange", size = 1.1) +
labs(title = "Cumulative Gas", subtitle = "Simulator",
y = "Cumulative Gas, sm3")
```
And this another one for cumulative water.
Let's just keep in mind that this curve may not be right because we are extracting from an incorrect labeled column (water outflow from wells).
```{r}
ggplot(step_totals, aes(x = date, y = wat_otw)) +
geom_line(color = "blue", size = 1.1) +
labs(title = "Cumulative Water", subtitle = "Simulator",
y = "Cumulative Water, sm3")
```
### Calculate cumulatives for oil, gas and water for monthly period
This is where we transform the dataframe from an heterogeneous date steps to monthly periods. We use the _group_by_ function operating on the year and the month. This operation, in the end, brings us to a more organized table with a periodic variation. The extra work is recalculating the volumes and cumulatives on the new monthly basis. All these series of transformations are very common when you merge dataframes with different number of rows. What is happening here is:
* calculating the volume at the current period, which is the difference between the current row and the previous one. Wee do that with the _lag_ function.
* decomposing the date in year and month
* grouping by year and month
* summarize the data calculating the volumes per month
* convert the (year, month) to (year, month, day) date format
```{r rows.print=50}
# step field totals
sim_cumulatives <-
step_totals %>%
select(date, oil_otw, gas_otw, wat_otw) %>%
mutate(oil_this_period = oil_otw - lag(oil_otw, default = 0)) %>%
mutate(gas_this_period = gas_otw - lag(gas_otw, default = 0)) %>%
mutate(wat_this_period = wat_otw - lag(wat_otw, default = 0)) %>%
mutate(year = year(date), month = month(date)) %>%
group_by(year, month) %>%
summarize(vol_oil = sum(oil_this_period),
vol_gas = sum(gas_this_period),
vol_wat = sum(wat_this_period)) %>%
ungroup() %>%
mutate(date = ymd(paste(year, month, "01", sep = "-"))) %>%
# mutate(source = "simulator") %>%
mutate(cum_oil = cumsum(vol_oil),
cum_gas = cumsum(vol_gas),
cum_wat = cumsum(vol_wat)) %>%
select(date, year, month, everything()) %>%
print()
```
### Filling the date gaps with periodic dates
In this case, at first sight, it seems that we got periodic dates for the production of the field. But what if we are missing a month or two? One way to ensure we have all months accounted for is building a known sequence of dates from _2016-01-01_ until _2008-10-01_. That is what we do with the function `seq.Date()`:
`seq.Date(as.Date("2008-01-01"), as.Date("2016-10-01"), by = "month")`
```{r}
# create a dataframe with complete dates from 2008 until Oct-2016
# this will fill any holes in the dates of any of the two dataframes
dates_complete <- as_tibble(data.frame(date= seq.Date(as.Date("2008-01-01"),
as.Date("2016-10-01"), by = "month"),
cum_oil = 0, cum_gas = 0, cum_wat = 0))
dates_complete
```
So, there are `r nrow(dates_complete)` months from start to end of production.
Next, we proceed to merge the dataframe of known dates (above) with the dataframe _sim_cumulatives_ that we obtained above. Finally, we calculate the volumes and cumulatives.
```{r}
# simulator production
# merge incomplete dataframe and fill with complete dates
# there will be blank rows or NAs where previously was not data
sim_cumulatives_dt <-
left_join(dates_complete, sim_cumulatives, by = "date") %>%
# remove NAs from the cumulatives .y
tidyr::replace_na(list(cum_oil.y = 0, vol_oil = 0,
cum_gas.y = 0, vol_gas = 0,
cum_wat.y = 0, vol_wat = 0)) %>% # replace NAs with zeros
# add up cumulatives .x and .y
mutate(cum_oil = cum_oil.x + cum_oil.y,
cum_gas = cum_gas.x + cum_gas.y,
cum_wat = cum_wat.x + cum_wat.y) %>% # sum cumulatives
select(date, cum_oil, cum_gas, cum_wat, vol_oil, vol_gas, vol_wat) %>%
# replace 0s with previous cumulative. these were rows that didn't exist
mutate(cum_oil = ifelse(cum_oil == 0, lag(cum_oil, default = 0), cum_oil)) %>%
mutate(cum_gas = ifelse(cum_gas == 0, lag(cum_gas, default = 0), cum_gas)) %>%
mutate(cum_wat = ifelse(cum_wat == 0, lag(cum_wat, default = 0), cum_wat)) %>%
mutate(vol_oil = ifelse(vol_oil == 0, lag(vol_oil, default = 0), vol_oil)) %>%
mutate(vol_gas = ifelse(vol_gas == 0, lag(vol_gas, default = 0), vol_gas)) %>%
mutate(vol_oil = ifelse(vol_wat == 0, lag(vol_wat, default = 0), vol_wat)) %>%
as_tibble() %>%
print
```
The negative volume of water and oil are possibly volume corrections by the operator.
```{r}
# show observations with negative volumes
sim_cumulatives_dt %>%
filter(vol_oil <0 | vol_wat < 0 | vol_gas < 0)
```
## Comparative of simulator vs historical production
The last step is reading the historical production and compare them against the numbers from the simulation. We start by reading the historical production.
### Reading the historical production
Like we did in the previous article, we read the production history from an Excel file. But this time we will be reading all the variables in the dataset.
```{r}
# load historical production from Excel file
library(xlsx) # library to read Excel files in R
# read the Excel file
proj_root <- rprojroot::find_rstudio_root_file() # get the project root folder
xl_file <- file.path(proj_root, "inst/rawdata", "Volve production data.xlsx")
# read only the monthly production
prod_hist <- as_tibble(read.xlsx(xl_file, sheetName = "Monthly Production Data"))
prod_hist
```
### Save raw production history to data files
Once we read the data from Excel we save it as a more standard format: _CSV_. But this saved data will be very raw because we haven't performed any operation yet. That's why we named it _production_history_raw.CSV_.
```{r}
# save historical data as raw
data_folder <- file.path(proj_root, "data")
save(prod_hist, file = file.path(data_folder, "production_history_raw.Rdata"))
write.csv(prod_hist, file = file.path(data_folder,
"production_history_raw.CSV"),
row.names = FALSE)
```
### Cumulatives from production history
These are the data transformations we will perform over the raw production data.
* convert from character to double, integer
* replace the NAs with zeros
* discard two columns with no meaningful data
* rename the variables to all lowercase (easier to remember)
* remove rows that have NAs
* group by year and month
* summarize by volumes
* convert date from character to date format
* sort the data by date
* keep the variables we require
* calculate the cumulatives from volumes
```{r}
hist_cumulatives <-
prod_hist %>%
mutate(Oil = as.double(as.character(Oil))) %>%
mutate(Gas = as.double(as.character(Gas))) %>%
mutate(Water = as.double(as.character(Water))) %>%
mutate(Year = as.integer(as.character(Year))) %>%
mutate(Month = as.integer(as.character(Month))) %>%
mutate(GI = as.double(as.character(GI))) %>%
mutate(WI = as.double(as.character(WI))) %>%
tidyr::replace_na(list(GI = 0, WI = 0)) %>%
select(-c(NPDCode, On.Stream)) %>%
rename(year = Year, month = Month, oil = Oil, gas = Gas, wat = Water) %>%
na.omit() %>% # remove all rows that have at least one NA
group_by(year, month) %>%
summarise(vol_oil = sum(oil), vol_gas = sum(gas), vol_wat = sum(wat),
vol_gi = sum(GI), vol_wi = sum(WI)) %>%
mutate(date = ymd(paste(year, month, "01", sep = "-"))) %>%
arrange(date) %>%
ungroup() %>%
select(date, vol_oil, vol_gas, vol_wat, vol_gi, vol_wi) %>%
mutate(cum_oil = cumsum(vol_oil), cum_gas = cumsum(vol_gas),
cum_wat = cumsum(vol_wat),
cum_gi = cumsum(vol_gi), cum_wi = cumsum(vol_wi)) %>%
print()
```
Observe that we've got `r nrow(hist_cumulatives) ` rows, which wouldn't match the rows from the simulation data. We have to fix that. The problem is primarily dates.
### Save historical cumulatives as-is
This saves the production history dataframe after the data transformations.
```{r}
# save historical data after data transformations. 104 rows.
data_folder <- file.path(proj_root, "data")
save(hist_cumulatives, file = file.path(data_folder, "hist_cumulatives_104.Rdata"))
write.csv(hist_cumulatives, file = file.path(data_folder,
"hist_cumulatives_104.CSV"),
row.names = FALSE)
```
```{r}
# create a dataframe with complete dates from 2008 until Oct-2016
df <- as_tibble(data.frame(date= seq.Date(as.Date("2008-01-01"),
as.Date("2016-10-01"), by = "month"),
cum_oil = 0, cum_gas = 0, cum_wat = 0))
df
```
Remember that we created this date sequence above. This sequence had `r nrow(dates_complete)` rows.
```{r}
dates_complete
```
### Fill the dates gap in the production history dataframe
We do almost the same data transformations as before:
* replace NAs in the variables with zeros
* sum up the additional cumulative variables generated during the merge
* replace a cumulative of zero by the previous cumulative (carry over). There cannot be cumulatives of zeros in between. That anomaly is caused by the absence of rows.
* clean up the variables for gas injection (GI) and water injection (WI) volumes
* calculate the cumulatives for gas injection and water injection
```{r}
# historical production
# merge incomplete dataframe and complete with dates
hist_cumulatives_dt <-
left_join(dates_complete, hist_cumulatives, by = "date") %>%
# replace NAs with zeros
tidyr::replace_na(list(cum_oil.y = 0, cum_gas.y = 0, cum_wat.y = 0)) %>%
tidyr::replace_na(list(vol_oil = 0, vol_gas = 0, vol_wat = 0)) %>%
tidyr::replace_na(list(vol_gi = 0, vol_wi = 0, cum_gi = 0, cum_wi = 0)) %>%
# add up the extra column .y
mutate(cum_oil = cum_oil.x + cum_oil.y) %>%
mutate(cum_gas = cum_gas.x + cum_gas.y) %>%
mutate(cum_wat = cum_wat.x + cum_wat.y) %>%
# filter(date != as.Date("2016-10-01")) %>%
# this fixes the zeros generated by adding complete dates
mutate(cum_oil = ifelse(cum_oil == 0, lag(cum_oil, default=0), cum_oil)) %>%
mutate(cum_gas = ifelse(cum_gas == 0, lag(cum_gas, default=0), cum_gas)) %>%
mutate(cum_wat = ifelse(cum_wat == 0, lag(cum_wat, default=0), cum_wat)) %>%
mutate(cum_gi = ifelse(cum_gi == 0, lag(cum_gi, default=0), cum_gi)) %>%
mutate(cum_wi = ifelse(cum_wi == 0, lag(cum_wi, default=0), cum_wi)) %>%
select(date, cum_oil, cum_gas, cum_wat, vol_oil, vol_gas, vol_wat,
vol_gi, vol_wi, cum_gi, cum_wi) %>%
as_tibble() %>%
print
```
Great! We got the dataframe _hist_cumulatives_dt_, with `r nrow(hist_cumulatives_dt)` rows and `r ncol(hist_cumulatives_dt)` columns.
### Plot production rates
If we want to plot several variables in one figure we have to convert the dataframe into tidy data format, and then use facets.
```{r}
hist_cumulatives_dt_gather <-
hist_cumulatives_dt %>%
select(date, vol_oil, vol_gas, vol_wat) %>%
gather(key = var, value, vol_oil:vol_wat) %>%
print()
```
And this is the plot of the cumulative variables from the production history.
```{r fig.height=8, fig.width=6}
# plot production variables from the STEP block in the PRT file
# order of the plots
hist_cumulatives_dt_gather$var_f <-
factor(hist_cumulatives_dt_gather$var,
levels = c("vol_oil", "vol_gas", "vol_wat"))
# change the name of the facet labels
facet_labels <- c(`vol_oil` = "Oil", `vol_gas` = "Gas", `vol_wat` = "Water")
ggplot(hist_cumulatives_dt_gather, aes(x = date, y = value, color = var)) +
geom_line(size = 1) +
facet_grid(var_f ~., scales = "free_y",
labeller = as_labeller(facet_labels)) +
labs(title = "Monthly fluid Volumes, m3",
subtitle = "Historical Production", y = "", x = "")
```
### Save historical cumulatives with fixed dates
We save the date fixed production history in CSV format file.
```{r}
# save historical data with complete dates. 106 rows
data_folder <- file.path(proj_root, "data")
save(hist_cumulatives_dt, file = file.path(data_folder, "hist_cumulatives_dt.Rdata"))
write.csv(hist_cumulatives_dt, file = file.path(data_folder, "hist_cumulatives_dt.CSV"),
row.names = FALSE)
```
### Plot historical cumulatives of oil, gas and water
```{r fig.height=8, fig.width=5}
library(gridExtra)
# cumulative oil from historical production
p1 <- ggplot(hist_cumulatives_dt, aes(x = date, y = cum_oil)) +
geom_line() +
geom_col(color = "dark green", fill = "dark green", alpha = 0.35) +
ggtitle("Cumulative Oil, sm3", subtitle = "Historical Production")
# cumulative gas from historical production
p2 <- ggplot(hist_cumulatives_dt, aes(x = date, y = cum_gas)) +
geom_line() +
geom_col(color = "orange", fill = "orange", alpha = 0.35) +
ggtitle("Cumulative Gas, sm3", subtitle = "Historical Production")
# cumulative water from historical production
p3 <- ggplot(hist_cumulatives_dt, aes(x = date, y = cum_wat)) +
geom_line() +
geom_col(color = "blue", fill = "blue", alpha = 0.35) +
ggtitle("Cumulative Water, sm3", subtitle = "Historical Production")
grid.arrange(p1, p2, p3, ncol =1)
```
### Rename the variables according to source
Finally, because we want to combine the simulation results with the measured production for each fluid, on the same plot, we will rename the variables to something is directly recognizable as the source: simulator or historical.
These are the data transformations:
* rename the variables in the _sim_cumulatives_dt_ dataframe
* rename the variables in the _hist_cumulatives_dt_ dataframe
* join the resulting dataframe by the common variable _date_
```{r}
# rename the simulation cumulatives
sim_cumulatives_src <-
sim_cumulatives_dt %>%
select(date, cum_oil, cum_gas, cum_wat) %>%
rename(cum_oil_sim = cum_oil, cum_gas_sim = cum_gas, cum_wat_sim = cum_wat)
# rename historical cumulatives according to source
hist_cumulatives_src <-
hist_cumulatives_dt %>%
select(date, cum_oil, cum_gas, cum_wat) %>%
rename(cum_oil_hist = cum_oil, cum_gas_hist = cum_gas, cum_wat_hist = cum_wat)
# combine simulator and historical dataframes. common variable is "date"
cumulatives_all <- full_join(hist_cumulatives_src, sim_cumulatives_src, by = "date")
cumulatives_all
```
That's it. Our final comparison dataframe with `r nrow(cumulatives_all)` rows and `r ncol(cumulatives_all)` columns.
## Observations
In order to observe how close was the history matching, we plot the cumulative production from the simulator versus that of the historical production. Additionally, we want to shade the are between the two curves to show the difference.
## How close cumulative productions are
### Cumulative oil. Historical vs simulator
```{r}
# Volve reservoir model dataset
# plot historical vs simulator cum_oil
# manual assignment of colors in the legend
cols <- c("simulator"="red", "historical"="blue") # legend: colors and names
ggplot(cumulatives_all) +
# shade the area between the curves
geom_ribbon(aes(x = date, ymin= cum_oil_sim, ymax= cum_oil_hist),
fill = "dark green", alpha = 0.35) +
geom_line(aes(x = date, y = cum_oil_sim, color = "simulator")) +
geom_line(aes(x = date, y = cum_oil_hist, color = "historical")) +
labs(title = "Volve reservoir model. Comparison Cumulative Oil",
subtitle = "Historical vs Simulator",
y = "cumulative oil, sm3") +
scale_color_manual(name = "Curve", values = cols) # manual legend
```
### Cumulative gas. Historical vs simulator
```{r}
# Volve reservoir model dataset
# plot historical vs simulator cum_gas
cols <- c("simulator"="red", "historical"="blue") # legend: colors and names
ggplot(cumulatives_all) +
# shade the area between the curves
geom_ribbon(aes(x = date, ymin= cum_gas_sim, ymax= cum_gas_hist),
fill = "orange", alpha = 0.35) +
geom_line(aes(x = date, y = cum_gas_sim, color = "simulator"), size = 1) +
geom_line(aes(x = date, y = cum_gas_hist, color = "historical"), size = 1) +
labs(title = "Volve reservoir model. Comparison Cumulative Gas",
subtitle = "Historical vs Simulator",
y = "cumulative gas, sm3") +
scale_color_manual(name = "Curve", values = cols) # manual legend
```
### Cumulative water. Historical vs simulator
```{r}
# Volve reservoir model dataset
# plot historical vs simulator cumulative water variable
cols <- c("simulator"="red", "historical"="blue") # legend: colors and names
ggplot(cumulatives_all) +
# shade the area between the curves
geom_ribbon(aes(x = date, ymin= cum_wat_sim, ymax= cum_wat_hist),
fill = "cyan", alpha = 0.35) +
geom_line(aes(x = date, y = cum_wat_sim, color = "simulator"), size = 1) +
geom_line(aes(x = date, y = cum_wat_hist, color = "historical"), size = 1) +
labs(title = "Volve reservoir model. Comparison Cumulative Water",
subtitle = "Historical vs Simulator",
y = "cumulative water, sm3") +
scale_color_manual(name = "Curve", values = cols) # manual legend
```
__Note.__ This last plot for the cumulative water has been observed different when another reservoir engineer [Konstantin Sermyagin](https://www.linkedin.com/in/konstantin-sermyagin/?lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BMkGp%2FisTS7ehrKLYwnd%2Frw%3D%3D&licu=urn%3Ali%3Acontrol%3Ad_flagship3_messaging-view_profile) was able to read directly from the Eclipse __binary__ files. I have checked the cumulative water from the simulator and it has been properly extracted and transformed. Still, I have some doubts about the extraction of the cumulative water because the outflow doesn't seem to originate from connate water but from injection water as well. If you download the Eclipse __PRT__ file, please, take a look at the __BALANCE AT__ blocks. Try to identify the variable corresponding to connate water only. I am not sure why Eclipse does not make a distinction between the two sources of water.
## Datasets
* [Volve 2016 dataset for reservoir simulation model. Compressed PRT output as zip file](https://zenodo.org/record/2586212). Zenodo
* [Volve 2016 reservoir simulation PRT output file](https://zenodo.org/record/2586209). Zenodo.
* [data_from_step.CSV](https://github.com/f0nzie/volve-reservoir-model-evolution/raw/master/data/data_from_step.CSV): data from STEP block.
* [field_totals_balance.CSV](https://github.com/f0nzie/volve-reservoir-model-evolution/raw/master/data/field_totals_balance.CSV): save the field totals from BALANCE block.
* [hist_cumulatives_104.CSV](https://github.com/f0nzie/volve-reservoir-model-evolution/raw/master/data/hist_cumulatives_104.CSV): historical data after data transformations. 104 rows.
* [hist_cumulatives_dt.CSV](https://github.com/f0nzie/volve-reservoir-model-evolution/raw/master/data/hist_cumulatives_dt.CSV): historical data with complete dates. 106 rows.
* [production_history.CSV](https://github.com/f0nzie/volve-reservoir-model-evolution/raw/master/data/production_history_raw.CSV): Production history read from Excel file. Raw.