Make data in Apache Parquet Files AI ready with Virtual Knowledge Graphs
by Peter Hopfgartner, last update: 5 February 2025 (10 min read)


Apache Parquet is an important part of the mainstream data stack. It provides a space-efficient, widely supported way to exchange tabular data that can be used directly by various query engines. In fact, you can execute SQL queries directly against Parquet files.
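
DuckDB, which we will use below, is a good example: it can run SQL directly over a Parquet file, with no loading step. Here is a minimal sketch, assuming the userdata.parquet sample file used later in this article is in the current directory:

-- DuckDB queries the Parquet file in place; nothing is ingested beforehand.
SELECT country, count(*) AS registrations
FROM 'userdata.parquet'
GROUP BY country
ORDER BY registrations DESC
LIMIT 10;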

For these and many other good reasons, storing data in Parquet files has become common practice in many organizations, for example in data lakes. Some public data providers have also switched to Parquet for distributing their data, including the well-known New York City Taxi and Limousine Commission and the Open Targets Platform.

Can virtual knowledge graphs be built directly from Parquet files?

Building Virtual Knowledge Graphs from Parquet files

An easy way to create Virtual Knowledge Graphs (VKGs) directly from Parquet files is to use DuckDB, a fairly recent and highly efficient in-process analytical database.

  1. Download the Ontop CLI archive (https://github.com/ontop/ontop/releases) and unzip it
  2. Download the DuckDB JDBC driver (https://duckdb.org/docs/installation/?version=stable&environment=java&download_method=direct) and place it in the jdbc folder inside the Ontop folder
  3. Download a sample Parquet file, for example userdata.parquet, and place it in the Ontop folder
  4. Create a minimal mapping file for Ontop. You can use the following, which maps the source data to the schema.org vocabulary; save it as userdata.obda:
[PrefixDeclaration]
:   http://www.example.org/
xsd:    http://www.w3.org/2001/XMLSchema#
schema:   https://schema.org/

[MappingDeclaration] @collection [[
mappingId     Registration
target        :userdata/register_action/{id} a schema:RegisterAction; schema:startTime {registration_dttm}^^xsd:dateTime; schema:location :userdata/country/{country}; schema:agent :userdata/user/{id} .
source        SELECT id, registration_dttm, country FROM 'userdata.parquet'

mappingId     User
target        :userdata/user/{id} a schema:Person; schema:givenName {first_name}^^xsd:string; schema:familyName {last_name}^^xsd:string; schema:birthDate {birthdate}^^xsd:date; schema:gender {gender}^^xsd:string; schema:hasOccupation :userdata/role/{title}/{id} .
source        SELECT id, first_name, last_name, birthdate, gender, title FROM 'userdata.parquet'

mappingId     Role
target        :userdata/role/{title}/{id} a schema:Role; schema:hasOccupation :userdata/occupation/{title}; schema:estimatedSalary {salary}^^xsd:decimal.
source        SELECT id, title, salary FROM 'userdata.parquet'

mappingId     Occupation
target        :userdata/occupation/{title} a schema:Occupation; schema:name {title}^^xsd:string .
source        SELECT title FROM 'userdata.parquet'

mappingId     Country
target        :userdata/country/{country} a schema:Country; schema:name {country}^^xsd:string .
source        SELECT country FROM 'userdata.parquet'

mappingId     Comment
target        :userdata/comment/{id} a schema:Comment; schema:text {comments}^^xsd:string; schema:about :userdata/register_action/{id} .
source        SELECT id, comments FROM 'userdata.parquet'
]]
  5. Also create the properties file parquet.properties. The empty jdbc:duckdb: URL starts DuckDB as an in-memory database (the Parquet file itself is read directly from disk when queried), and the last property allows Ontop to retrieve metadata for the black-box views, i.e. the SQL queries in the mapping, from the database.
jdbc.url=jdbc:duckdb:
jdbc.driver=org.duckdb.DuckDBDriver
ontop.allowRetrievingBlackBoxViewMetadataFromDB=true
  6. Start Ontop: ontop endpoint -m userdata.obda -p parquet.properties
  7. Point your browser to http://localhost:8080 to access the SPARQL console. The Virtual Knowledge Graph is immediately available and can be queried right away, for example with the query sketched below.
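
As a first test, you can run a query such as the following sketch in the console. It follows the schema.org mapping defined above and lists a few users together with the country associated with their registration:

PREFIX schema: <https://schema.org/>

SELECT ?user ?name ?country
WHERE {
  ?registration a schema:RegisterAction ;
    schema:agent ?user ;
    schema:location ?location .
  ?user schema:givenName ?name .
  ?location schema:name ?country .
}
LIMIT 10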

With this, we enriched the structure of the Parquet file:

schema of userdata.parquet

to something more elaborate, like

schema of the knowledge graph

If you take a closer look, you’ll notice that a number of aspects have been disambiguated or made explicit. For example, Country is the country associated with the registration, i.e. the country where the user was located at that moment, not the one they were born in. This additional structure clearly helps humans and LLMs to better grasp the main information conveyed by this dataset.

You now have a Knowledge Graph without having to ingest any data. We have added rich structure to a flat Parquet file with very little effort. userdata.parquet happens to be quite small, but it could just as well have been hundreds of gigabytes in size. Again, the fact that no data needs to be ingested lets you create a knowledge graph in no time.

Once the data is mapped, the VKG is immediately available. This can be a big deal, especially in the context of data that is delivered or updated periodically without changes in structure.

Large Parquet Files

The first approach works well for small to medium-sized datasets. Unfortunately, the Parquet file does not convey some information that Ontop needs to query it efficiently. The most important missing piece concerns UNIQUE constraints. If you want to know more about this, I suggest you look at the Ontop documentation on constraints. Ontop has a very flexible way of adding this and many other pieces of information about datasets, called “lenses”. You can even use them to do lightweight data cleaning; see the documentation about lenses.

In the following lens, I simply declare that each value in the id column is unique. With this little piece of information, Ontop creates highly efficient queries and even large datasets can be virtualized. Save the lens definition in a JSON file and pass it to Ontop when starting the endpoint; see the Ontop documentation for the corresponding option.

{
  "relations" : [
    {
      "name" : [
        "\"lenses\"",
        "\"userdata\""
      ],
      "query" : "SELECT * FROM 'userdata.parquet'",
      "uniqueConstraints" : {
        "added" : [
          {
            "name" : "uc2",
            "determinants" : [
              "\"id\""
            ],
            "isPrimaryKey" : false
          }
        ]
      },
      "type" : "SQLLens"
    }
  ]
}

In the mapping, we have to account for the changed table name, replacing 'userdata.parquet' with lenses.userdata:

[PrefixDeclaration]
:   http://www.example.org/
xsd:    http://www.w3.org/2001/XMLSchema#
schema:   https://schema.org/
rdf:    http://www.w3.org/1999/02/22-rdf-syntax-ns#

[MappingDeclaration] @collection [[
mappingId     Registration
target        :userdata/register_action/{id} a schema:RegisterAction; schema:startTime {registration_dttm}; schema:location :userdata/country/{country}; schema:agent :userdata/user/{id} .
source        SELECT id, registration_dttm, country FROM lenses.userdata

mappingId     User
target        :userdata/user/{id} a schema:Person; schema:givenName {first_name}; schema:familyName {last_name}; schema:birthDate {birthdate}; schema:gender {gender}; schema:hasOccupation :userdata/role/{title}/{id} .
source        SELECT id, first_name, last_name, birthdate, gender, title FROM lenses.userdata

mappingId     Role
target        :userdata/role/{title}/{id} a schema:Role; schema:hasOccupation :userdata/occupation/{title}; schema:estimatedSalary {salary}.
source        SELECT id, title, salary FROM lenses.userdata

mappingId     Occupation
target        :userdata/occupation/{title} a schema:Occupation; schema:name {title} .
source        SELECT title FROM lenses.userdata

mappingId     Country
target        :userdata/country/{country} a schema:Country; schema:name {country} .
source        SELECT country FROM lenses.userdata

mappingId     Comment
target        :userdata/comment/{id} a schema:Comment; schema:text {comments}; schema:about :userdata/register_action/{id} .
source        SELECT id, comments FROM lenses.userdata
]]

Conclusion

Virtualization significantly lowers the barrier to entry for using Knowledge Graphs, making it easier to experiment with Knowledge Graph design, not to mention the huge advantage of reducing the need for ETL on large datasets.

This makes it easy to add rich context to datasets, rendering them much more accessible to humans and computers alike, with particular benefits for LLMs.

Curious about Virtual Knowledge Graphs and the mainstream data stack in an enterprise environment? Keep an eye out for upcoming product announcements from Ontopic.

Apache Parquet + DuckDB + Ontop = perfect match

Notes

In this article we preferred the OBDA file format for the mappings over the R2RML syntax because of its better readability. The W3C standard, however, is R2RML, which Ontop accepts as well.

You can convert from OBDA to R2RML syntax and vice versa with Ontop.
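
The conversion is done with the mapping subcommand of the Ontop CLI. The following sketch shows both directions; the option names are my assumption, so please check the help output of your Ontop release for the exact spelling:

# OBDA -> R2RML (option names assumed, verify with your Ontop version)
ontop mapping to-r2rml -i userdata.obda -o userdata.ttl
# R2RML -> OBDA (option names assumed, verify with your Ontop version)
ontop mapping to-obda -i userdata.ttl -o userdata.obda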

Get demo access to Ontopic Studio

Ready to do mapping with a no-code approach? Let us help you: get in touch for demo access.

