Building Virtual Knowledge Graphs with Ontop and Apache Iceberg
What’s So Special About Apache Iceberg?
Apache Iceberg is one of the most fascinating technologies when it comes to standardized access to large analytic tables.
Apache Iceberg also combines very well with the idea of virtual knowledge graphs. Together, these technologies allow us to create huge knowledge graphs almost instantaneously. Why should we do this? Because we can apply a comprehensible, shared vocabulary and model to describe and access data, and because it becomes easier to connect different datasets. All this makes it easier for both LLMs and humans to work with the data.
The following is a hands-on example that gives you a feeling for why this combination is so exciting. There has been a lot of buzz around Apache Iceberg. What I find really fascinating is its great flexibility in acting as a catalog for large standardized data files, e.g. tabular data in Parquet files. For good reason it is frequently described as "Git for data": it lets you version data files, go back in time, and so on. And all of this is done in a standardized, highly transparent manner.
The other aspect that makes me believe in this technology is the broad support from many vendors in the data lake area. Have a look at the growing list of vendors supporting this standard.
Our Building Blocks
For our experiment, we will use Nessie for the Iceberg catalog.
Data files are usually kept on some kind of object storage, with S3 and S3-compatible systems being very popular choices. For our small example, I preferred to have everything self-hosted, using Minio as an S3 implementation. But everything can be easily adapted to your preferred cloud object storage.
The third component we will rely on is Trino, which we use as a query engine over the data files in S3. Again, there are many other options, both self-hosted and in the cloud, that could easily be used here instead.
On top of these, we will use Ontop for creating the virtual knowledge graph.
Putting The Pieces Together
It sounds like a lot of work to set up the whole system. Luckily, somebody already did most of the work for us (thank you, ChungTing Wu!). I simply updated some parts and added support for virtual knowledge graphs; you will find all the files on GitHub.
Clone the repository, pull the docker images and start up the services:
git clone https://github.com/ontopic-vkg/ontop-trino-iceberg-playground.git
cd ontop-trino-iceberg-playground
docker compose pull
docker compose up
Check the files in Minio at http://localhost:9001, using admin and password as credentials. The two buckets are empty. The Iceberg catalog at http://localhost:19120 is still pretty empty as well.
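If you prefer the command line, you can also ask the Nessie REST API for its references; at this point only the default branch exists and it holds no tables yet (the exact API path may depend on the Nessie version shipped in the compose file):
curl http://localhost:19120/api/v2/trees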
Things become more interesting once we start loading some data. We will do that with Trino. Let's use the Trino client application, running it from the Trino container:
docker exec -ti trino trino
Have a look at the available catalogs:
SHOW CATALOGS;
You can spot our Iceberg catalog “example”, which was created from the post-init.sql script.
The catalogs tpcds and tpch contain sample data in line with the TPC-H and TPC-DS industry-standard benchmarks for evaluating database system performance, and they fit our purpose here well. We will simply copy some data from tpch into our Iceberg-hosted catalog example.
The schema of this popular benchmark dataset is described in more detail on the TPC site or, for example, on the Snowflake site.
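By the way, you can inspect the columns of any of these tables directly from the Trino client before copying them:
DESCRIBE tpch.sf1.customer;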
Have a look at what is already there:
SHOW SCHEMAS FROM example;
USE example.example_s3_schema;
SHOW TABLES;
The easiest way to load data into our Iceberg-managed schema is to copy it from the catalogs that ship with the standard Trino installation.
USE example.example_s3_schema;
CREATE TABLE customer AS SELECT * FROM tpch.sf1.customer;
CREATE TABLE lineitem AS SELECT * FROM tpch.sf1.lineitem;
CREATE TABLE nation AS SELECT * FROM tpch.sf1.nation;
CREATE TABLE orders AS SELECT * FROM tpch.sf1.orders;
CREATE TABLE part AS SELECT * FROM tpch.sf1.part;
CREATE TABLE partsupp AS SELECT * FROM tpch.sf1.partsupp;
CREATE TABLE region AS SELECT * FROM tpch.sf1.region;
CREATE TABLE supplier AS SELECT * FROM tpch.sf1.supplier;
This creates the new tables in our Iceberg environment. As expected, the physical copies of the data are in our S3 bucket, while the Nessie catalog now contains the pointers to these data chunks.
The S3 bucket has filled up as well:
Click around to find data and metadata. There is plenty of information about how these files are used in Iceberg. A good introduction to this topic is given in an article on Dremio’s site.
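If you prefer SQL over clicking, Trino's Iceberg connector also exposes hidden metadata tables next to each table, so the snapshots and data files we just produced can be inspected directly from the Trino client. A sketch, using the orders table created above:
SELECT snapshot_id, committed_at, operation
FROM example.example_s3_schema."orders$snapshots";
SELECT file_path, record_count, file_size_in_bytes
FROM example.example_s3_schema."orders$files";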
Adding The Virtual Knowledge Graph
Now, let’s turn our attention to the knowledge graph part and see how we can create a virtual knowledge graph. Let’s start with a subset of the tables taken from TPC-H; indeed, it’s good practice to start such mapping projects with a subset of the data and increase the coverage step by step. We’ll focus on orders, taking the tables customer, orders, lineitem, supplier and part.
You might notice that our tables are lacking information about columns with unique values. Without primary keys or other unique constraints, Ontop cannot take advantage of some important optimizations. The Ontop documentation has more information on this topic. We can add this information using a powerful feature of Ontop called lenses. Have a look at the lenses directly in the GitHub repo.
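To give an idea of what a lens looks like, here is a minimal, hypothetical sketch that declares custkey as a unique column of the customer table; the actual lens definitions used in this example are the ones in the repository:
{
  "relations": [
    {
      "name": ["example", "example_s3_schema", "customer_lens"],
      "type": "BasicLens",
      "baseRelation": ["example", "example_s3_schema", "customer"],
      "uniqueConstraints": {
        "added": [
          { "name": "uc_customer", "determinants": ["custkey"] }
        ]
      }
    }
  ]
}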
You might remember that in the world of knowledge graphs the terms used to describe data really matter: they are crucial for making data interoperable. For this example we will use what is very likely the most widely used vocabulary, schema.org. We will not use the full vocabulary, just the terms we need. The classes are Order, OrderItem, Organization and Product; the relationships between them are described by orderedItem, customer and vendor. The following diagram summarizes the final structure of our virtual knowledge graph:
With this schema it becomes easy to query the knowledge graph. Note how the query follows the connections from one class to another along the edges that link them.
PREFIX schema: <https://schema.org/>
SELECT
  ?productName
  (COUNT(?order) AS ?numOrders)
  (SUM(?orderQuantity) AS ?totalOrderQuantity)
WHERE {
  ?order a schema:Order ;
         schema:orderedItem ?orderItem .
  ?orderItem schema:orderedItem ?product ;
             schema:orderQuantity ?orderQuantity .
  ?product schema:name ?productName .
}
GROUP BY ?product ?productName
ORDER BY DESC(?numOrders)
LIMIT 10
The relationship between the schema of the virtual knowledge graph and the underlying data source is defined by a so-called “mapping”. Mappings can be specified in a simple text file following the Ontop mapping file format. The mapping we use here is also on GitHub.
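To give a flavour of the format, a single, hypothetical mapping entry that turns rows of the orders table into schema:Order individuals could look roughly like this (IRI template and prefixes are made up, and the table name is simplified; the real mapping is the one in the repository):
[PrefixDeclaration]
:		http://example.org/resource/
schema:		https://schema.org/

[MappingDeclaration] @collection [[
mappingId	orders
target		:order/{orderkey} a schema:Order ; schema:orderNumber {orderkey} .
source		SELECT orderkey FROM orders
]]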
Again, everything is packed into a single Docker Compose file. In case they are still running, shut down the previously started containers with
docker compose down
Then fire up the complete stack, which also includes Ontop, with
docker compose -f docker-compose.full.yaml up -d
You can now access the Ontop portal with its SPARQL console by pointing your browser to http://localhost:9090. Enter the query quoted above and play around. Maybe even extend the mapping and the lenses in the directory ontop/input.
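The console is the easiest way to experiment, but the same service also answers standard SPARQL-over-HTTP requests, so any SPARQL client works. Assuming the endpoint is exposed under /sparql on the same port as the portal, a query can be sent for example with curl:
curl -G http://localhost:9090/sparql \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=PREFIX schema: <https://schema.org/> SELECT ?order WHERE { ?order a schema:Order } LIMIT 5'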
Summing Up
Combining Apache Iceberg with virtual knowledge graphs allows us to create highly interoperable and easily queryable data very quickly. Virtualization lets us query the data source directly, leveraging the scalability of powerful cloud data platforms, and avoids transforming and loading large volumes of data into a target database, keeping the redundancy of data and infrastructure at a minimum. We have only scratched the surface of the exciting topic of combining the expressivity of knowledge graphs with the scalability and flexibility of cloud-native technologies such as object storage, flexible data catalogs and powerful query engines. There is plenty of documentation out there to go much deeper into any of the mentioned technologies. The creation of the lenses and the mappings can be done in a scalable and comfortable environment such as Ontopic Studio. Interested in putting this into production? Contact us at inquiry@ontopic.ai.
Notes
In this article we preferred the OBDA file format for the mappings over the R2RML syntax because of its better readability. The W3C standard, however, is R2RML, which Ontop accepts just as well.
You can convert from OBDA to R2RML syntax and vice versa with Ontop.
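With the Ontop command line tool this conversion is a one-liner; a sketch with hypothetical file names:
ontop mapping to-r2rml -i mapping.obda -o mapping.ttl
ontop mapping to-obda -i mapping.ttl -o mapping.obda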
You might also be interested in the article about creating virtual knowledge graphs directly on Apache Parquet files.
Get a demo access of Ontopic Studio
Ready to do mapping with a no-code approach? Let us help you. Get a demo access: