
Part 2: Ingesting ocean data

  • Writer: Aleksandr Kolmakov
  • Aug 22
  • 3 min read

In the previous article I outlined several sources of data: GBIF occurrences, OBIS, the WoRMS, IUCN and ISSG species lists, and PADI dive sites.


Let’s start by understanding what they are and how to ingest them properly.


GBIF - Google public datasets

The main way to access this data is through the main GBIF portal. However, the dataset is large enough to rule out any attempt to work with it locally.

There is a workaround for this: public datasets accessible through BigQuery.

But be warned - a single naive query against this table costs more than the whole free tier allowance for BigQuery.

Why?


At the time of writing, the whole dataset weighs almost 1.6 TB and is constantly growing.

BigQuery charges for the number of bytes processed from the referenced tables.

Because LIMIT is applied quite late in the query plan, the full dataset is still scanned, processed and billed.

It would have been nice to know that in advance, but I admit - that is a lesson I only learned now.
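In fairness, BigQuery can tell you the damage up front: a dry run estimates the bytes a query would process without actually running or billing it. Here is a minimal sketch with the official Python client - the table name `bigquery-public-data.gbif.occurrences` and the column names are my assumptions about how the public GBIF snapshot is exposed, so verify them in your own console first.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Dry run: BigQuery estimates bytes processed without executing (or billing) the query.
dry_run_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Even with LIMIT, the whole table is scanned, so the estimate stays enormous.
sql = """
    SELECT gbifid, species, decimallatitude, decimallongitude
    FROM `bigquery-public-data.gbif.occurrences`
    LIMIT 10
"""

job = client.query(sql, job_config=dry_run_cfg)
print(f"This query would process {job.total_bytes_processed / 1e9:.1f} GB")
```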

To avoid losing all the money on a hobby project, there is a better way called TABLESAMPLE SYSTEM:
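A hedged sketch of how the sampled query could look - the sampling percentage is a placeholder (the roughly 1000x drop below implies a much smaller sample), and the table and column names are the same assumptions as above:

```python
from google.cloud import bigquery

client = bigquery.Client()

# TABLESAMPLE SYSTEM reads only a random subset of the table's storage blocks,
# so only that subset is billed. 1 PERCENT is a placeholder - tune it to the
# sample size your analysis actually needs.
sampled_sql = """
    SELECT gbifid, species, decimallatitude, decimallongitude
    FROM `bigquery-public-data.gbif.occurrences` TABLESAMPLE SYSTEM (1 PERCENT)
"""

sample_df = client.query(sampled_sql).to_dataframe()
print(f"Pulled {len(sample_df)} sampled occurrence rows")
```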

Bytes processed went from 1.6TB to 1.7GB, almost a 1000x improvement on cloud costs.

Now I just need to bring the rest of the datasets to GCP.


OBIS - partitioned parquet files


The Ocean Biodiversity Information System accumulates and aggregates data from many sources and lets you download the full database in one go.

And unlike most common scientific data sources, they provide a properly partitioned Parquet dataset.


Which is really nice, because it allows anyone to process this data locally with minimal setup.

Since my network is not the best (I do love to work from libraries from time to time) and there are no checksums available, I can at least pull the file size from the response headers. That way, if some chunks are missing, the size mismatch will tell me:
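A rough sketch of that check with `requests`; the export URL below is a placeholder, since the real link comes from the OBIS download page.

```python
import os
import requests

# Placeholder URL - the real link to the full export comes from the OBIS website.
OBIS_EXPORT_URL = "https://obis.org/exports/obis_export.zip"
LOCAL_PATH = "obis_export.zip"

# No checksums are published, but the server reports Content-Length,
# so at least a truncated download can be detected.
expected_size = int(
    requests.head(OBIS_EXPORT_URL, allow_redirects=True).headers["Content-Length"]
)

with requests.get(OBIS_EXPORT_URL, stream=True) as response, open(LOCAL_PATH, "wb") as out:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=1 << 20):  # 1 MB chunks
        out.write(chunk)

actual_size = os.path.getsize(LOCAL_PATH)
assert actual_size == expected_size, (
    f"Download looks incomplete: got {actual_size} bytes, expected {expected_size}"
)
```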

The downloaded data is a zipped folder of partitioned Parquet files. And while loading it all at once is a viable strategy for someone with more than 15 GB of RAM, I am not that person. Instead I need something that can read partitioned Parquet without extra headache.


DuckDB is an awesome little package that does exactly that.


My end goal is to understand which species can inhabit particular dive sites, which is why not all columns of the data are relevant for me:
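Something like the following sketch keeps the memory footprint small. The folder layout and column names (Darwin Core style, lowercase) are assumptions about the OBIS export, so check the actual schema with a `DESCRIBE` first.

```python
import duckdb

con = duckdb.connect()  # an in-memory database is enough here

# To see which columns actually exist in the export:
# con.execute("DESCRIBE SELECT * FROM read_parquet('obis_export/**/*.parquet')").df()

# DuckDB scans the partitioned folder lazily, so only the referenced columns
# and row groups are ever read - no need to fit the whole export into RAM.
occurrences = con.execute("""
    SELECT scientificname, species, decimallatitude, decimallongitude
    FROM read_parquet('obis_export/**/*.parquet')
    WHERE species IS NOT NULL
""").df()

print(occurrences.shape)
```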

And just like that, the largest fauna occurrence datasets are ready for future use.


WoRMS, ISSG, IUCN - Darwin Core archives for species understanding


Next are the lists of species: marine (WoRMS), endangered (IUCN Red List) and invasive (ISSG GISD). Using them, I can narrow my search to marine fauna only and then mark which species are invasive or endangered.

But first let me introduce their data format - Darwin Core Archive.

That is how Gemini sees it - love it.


Since the logic to download and unzip files is ready for reuse, let's check what is inside the GISD archive:
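A quick way to peek inside without extracting anything (the archive name is illustrative):

```python
import zipfile

# List the contents of the downloaded Darwin Core Archive.
with zipfile.ZipFile("gisd.zip") as archive:
    for name in archive.namelist():
        print(name)
```

A typical archive holds a meta.xml descriptor, an eml.xml metadata file and one or more delimited data files.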


This, in combination with the official documentation, means I need a little bit of additional logic to extract data from these archives properly.

And of course, the official python-dwca-reader already allows creating a dataframe from a DwCA:
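A minimal usage sketch, assuming the GISD archive was saved as `gisd.zip`:

```python
from dwca.read import DwCAReader

# python-dwca-reader handles the meta.xml bookkeeping and hands back a DataFrame.
with DwCAReader("gisd.zip") as dwca:
    # core_file_location points at the core data file declared in meta.xml
    core_df = dwca.pd_read(dwca.core_file_location, parse_dates=True)

print(core_df.columns.tolist())
```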

I just need to repeat this for each of the three sources above. IUCN and ISSG are fully open to the public, while WoRMS asked me to register and answer some questions before granting access.
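Once all three lists are loaded, the plan from the start of this section boils down to set membership checks. A sketch, assuming `species` / `scientificName` columns that may well be named differently in the real files:

```python
import pandas as pd

def flag_species(occurrences: pd.DataFrame,
                 marine: set,
                 endangered: set,
                 invasive: set) -> pd.DataFrame:
    """Keep only marine species and mark the endangered / invasive ones."""
    marine_only = occurrences[occurrences["species"].isin(marine)].copy()
    marine_only["is_endangered"] = marine_only["species"].isin(endangered)
    marine_only["is_invasive"] = marine_only["species"].isin(invasive)
    return marine_only

# e.g. flag_species(occurrences,
#                   marine=set(worms_df["scientificName"]),
#                   endangered=set(iucn_df["scientificName"]),
#                   invasive=set(gisd_df["scientificName"]))
```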


PADI - Dive sites data

PADI (the Professional Association of Diving Instructors) is one of the largest scuba diving organizations in the world, and their site lets you view dive sites around the globe. That is perfect.

Sometimes web-scraping can be a viable way to go.

But sometimes you can just peek at the website's network requests to understand how it loads the required data. And since the PADI website loads dive site data from open endpoints on the server, I will go with the second option.

DISCLAIMER: If there is a proper way to obtain this information, or you are a PADI representative who wants to discuss a different approach - please let me know. It gives me no pleasure to hit your website with hundreds of async requests.

The only problem is that coordinates and metadata for the dive sites come from different endpoints, but there is a unique id for every entity that allows joining both datasets.

We will just fetch all the metadata until pagination says there is nothing left, then slice the map into segments and load them all at once using async requests:
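Here is a sketch of that flow. The endpoint URLs and the JSON shapes are stand-ins - the real ones come from the browser's network tab and are not documented - so treat this as the overall pattern rather than working endpoints.

```python
import asyncio

import aiohttp
import pandas as pd
import requests

# Hypothetical endpoints standing in for whatever the PADI map actually calls.
META_URL = "https://example.com/api/dive-sites"           # paginated metadata
COORDS_URL = "https://example.com/api/dive-sites/search"  # lookup by bounding box


def fetch_all_metadata() -> pd.DataFrame:
    """Walk the paginated metadata endpoint until it runs dry."""
    rows, page = [], 1
    while True:
        batch = requests.get(META_URL, params={"page": page}).json()["results"]
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return pd.DataFrame(rows)


async def fetch_segment(session: aiohttp.ClientSession, bbox: tuple) -> list:
    """One map segment = one bounding-box request returning a list of sites."""
    params = dict(zip(("min_lat", "min_lon", "max_lat", "max_lon"), bbox))
    async with session.get(COORDS_URL, params=params) as resp:
        return await resp.json()


async def fetch_all_coordinates(bboxes: list) -> pd.DataFrame:
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_segment(session, b) for b in bboxes))
    return pd.DataFrame([site for segment in results for site in segment])


# Slice the globe into coarse tiles and pull them concurrently.
bboxes = [(lat, lon, lat + 10, lon + 10)
          for lat in range(-90, 90, 10) for lon in range(-180, 180, 10)]

metadata = fetch_all_metadata()
coordinates = asyncio.run(fetch_all_coordinates(bboxes))

# The shared unique id is what lets the two responses be joined back together.
dive_sites = metadata.merge(coordinates, on="id")
```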

This begs for a rewrite with a custom pipeline built on DLT - and it will happen! Stay tuned.


After merging both dataframes into one by ID, data for thousands of dive sites is acquired.


Ingestion is complete - the next step is to orchestrate and automate it. For that I will need some magic: Mage.AI. Stay tuned!

