Distance from an ocean point to the nearest shore

I started an open source project to create a new dataset of the pH of the Earth's oceans.

I started with an open dataset from NOAA and created a dataset of 2.45 million rows with these columns:

colnames(NOAA_NODC_OSD_SUR_pH_7to9)
# [1] "Year"  "Month" "Day"   "Hour"  "Lat"   "Long"  "Depth" "pH"

Method document HERE.

Dataset HERE.

Now my goal is to "qualify" each of the 2.45 million rows ... and for this I need to calculate the distance from each Lat/Long point to the nearest shore.

So I'm looking for a method that takes a Lat/Long pair as input and returns the distance to the coast in km.

At the same time, this lets me qualify whether a data point may be affected by coastal pollution, for example from urban effluents.
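Roughly, I am picturing an interface like the sketch below (dist_to_coast_km is a made-up name for whatever function ends up doing the work, and 10 km is just an example threshold):

# hypothetical interface, not working code
# dist.km <- dist_to_coast_km(lat = 45.5, long = -30.2)   # -> distance to nearest shore in km
# coastal <- dist.km < 10                                 # e.g. flag points within 10 km of shore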

I have searched for a method to do this, but everything I find seems to need packages or software that I don't have.

If anyone wants to help, I would be grateful. Or, if you know of a simple (free) way to achieve this, please let me know ...

I can work in R and shell scripts, but I'm no expert in either ...

1 answer

So there are a few things going on here. First, your dataset appears to record pH by depth. So while there are ~2.45 million rows, there are only ~200,000 rows with Depth = 0 - which is still a lot.
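For example, a quick sketch (assuming the column names from the question and the OSD_All.csv file name used further down) to check that count:

 df <- read.csv("OSD_All.csv")   # the question's dataset
 nrow(df)                        # ~2.45 million rows in total
 sum(df$Depth == 0)              # ~200,000 surface rows (Depth == 0)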

Second, to get the distance to the nearest coast, you need a shapefile of coastlines. Fortunately, one is available here, on the excellent Natural Earth website.
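If you would rather fetch it from R than through the browser, something like the following should work (the download URL is an assumption based on the current Natural Earth link pattern and may change; downloading manually from the site is equally fine):

 url <- "https://naturalearth.s3.amazonaws.com/10m_physical/ne_10m_coastline.zip"   # assumed URL
 download.file(url, "ne_10m_coastline.zip", mode = "wb")
 unzip("ne_10m_coastline.zip")   # extracts ne_10m_coastline.shp and its sidecar files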

Third, your data are in long/lat (so units = degrees), but you want the distance in km, so you need to transform your data (the coastline data above are also in long/lat and need to be transformed too). One problem with transformations is that your data are evidently global, and any global transformation will necessarily be non-planar, so the accuracy will depend on the actual location. The right way to do this would be to grid your data and then use a set of planar transformations matched to whichever grid cell your points fall in. That is beyond the scope of this question, though, so we will use a global transformation (Mollweide) just to give you an idea of how this is done in R.

 library(rgdal)   # for readOGR(...); loads package sp as well
 library(rgeos)   # for gDistance(...)

 setwd(" < directory with all your files > ")

 # WGS84 long/lat
 wgs.84    <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
 # ESRI:54009 world Mollweide projection, units = meters
 # see http://www.spatialreference.org/ref/esri/54009/
 mollweide <- "+proj=moll +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"

 df        <- read.csv("OSD_All.csv")
 sp.points <- SpatialPoints(df[df$Depth==0, c("Long","Lat")], proj4string=CRS(wgs.84))

 coast      <- readOGR(dsn=".", layer="ne_10m_coastline", p4s=wgs.84)
 coast.moll <- spTransform(coast, CRS(mollweide))
 point.moll <- spTransform(sp.points, CRS(mollweide))

 set.seed(1)   # for reproducible example
 test   <- sample(1:length(sp.points), 10)   # random sample of ten points
 result <- sapply(test, function(i) gDistance(point.moll[i], coast.moll))
 result/1000   # distance in km
 # [1]   0.2185196   5.7132447   0.5302977  28.3381043 243.5410571 169.8712255   0.4182755  57.1516195 266.0498881 360.6789699

 plot(coast)
 points(sp.points[test], pch=20, col="red")

So this reads your dataset, extracts the rows where Depth == 0, and converts them to a SpatialPoints object. Then we read the coastline database, downloaded from the link above, into a SpatialLines object. We transform both to the Mollweide projection using spTransform(...), and then use gDistance(...) from the rgeos package to calculate the minimum distance between each point and the nearest coastline.

Again, it is important to remember that, despite all the decimal places, these distances are approximate.
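If you did want to pursue the "grid your data, then use planar transformations" route mentioned above, one rough sketch of the idea (a sketch only, not part of this answer's workflow) is to assign each point a UTM zone from its longitude and transform zone by zone:

 utm.zone <- floor((df$Long + 180) / 6) + 1                 # UTM zone 1-60 from longitude
 utm.epsg <- ifelse(df$Lat >= 0, 32600, 32700) + utm.zone   # EPSG codes: 326xx north, 327xx south
 head(utm.epsg)                                             # e.g. 32617 = WGS84 / UTM zone 17N
 # each zone's subset of points could then be transformed with
 # spTransform(..., CRS(paste0("+init=epsg:", zone.code))) and measured in that planar CRS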

One very big problem is speed: this process takes about 2 minutes per 1000 distances (on my system), so running all ~200,000 distances would take about 6.7 hours. One option, theoretically, would be to use a lower-resolution coastline database.
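For example, a sketch only (not benchmarked here; it assumes you also download the coarser ne_110m_coastline shapefile from Natural Earth and unpack it into the same directory):

 coast.110      <- readOGR(dsn=".", layer="ne_110m_coastline", p4s=wgs.84)   # much coarser coastline
 coast.110.moll <- spTransform(coast.110, CRS(mollweide))
 result.110     <- sapply(test, function(i) gDistance(point.moll[i], coast.110.moll)) / 1000   # km, faster but less accurate near shore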

The code below will calculate all 201,000 distances.

 ## not run
 ## estimated run time ~ 7 hours
 # use the projected objects so gDistance(...) returns meters, as in the sample above
 result <- sapply(1:length(point.moll), function(i) gDistance(point.moll[i], coast.moll))

EDIT: The OP's comment about cores got me thinking that this could be a case where the improvement from parallelization is worth the effort. So, here is how you could run this (on Windows) using parallel processing.

 library(foreach)   # for foreach(...)
 library(snow)      # for makeCluster(...)
 library(doSNOW)    # for registerDoSNOW(...)

 cl <- makeCluster(4, type="SOCK")   # create a 4-processor cluster
 registerDoSNOW(cl)                  # register the cluster

 get.dist.parallel <- function(n) {
   foreach(i=1:n, .combine=c, .packages="rgeos", .inorder=TRUE,
           .export=c("point.moll","coast.moll")) %dopar% gDistance(point.moll[i], coast.moll)
 }
 get.dist.seq <- function(n) sapply(1:n, function(i) gDistance(point.moll[i], coast.moll))

 identical(get.dist.seq(10), get.dist.parallel(10))   # same result?
 # [1] TRUE

 library(microbenchmark)   # run "benchmark"
 microbenchmark(get.dist.seq(1000), get.dist.parallel(1000), times=1)
 # Unit: seconds
 #                     expr       min        lq      mean    median        uq       max neval
 #       get.dist.seq(1000) 140.19895 140.19895 140.19895 140.19895 140.19895 140.19895     1
 #  get.dist.parallel(1000)  50.71218  50.71218  50.71218  50.71218  50.71218  50.71218     1

Using 4 cores improves processing speed by roughly a factor of 3. So, since 1000 distances take about a minute in parallel, 100,000 should take a little less than 2 hours.

Note that using times=1 is really an abuse of microbenchmark(...), since the whole point is to run the process several times and average the results, but I just didn't have the patience.
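Finally, a sketch of how the full job might be wired up with the helper above and attached back to the surface rows (the variable names dist.km and df.surface are mine, and the full run is left commented out):

 ## not run - same ~2 hour estimate as above
 # dist.km    <- get.dist.parallel(length(point.moll)) / 1000   # all surface points, in km
 # df.surface <- df[df$Depth == 0, ]
 # df.surface$Dist.km <- dist.km
 stopCluster(cl)   # shut down the 4 workers when finished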
