Google Cloud launches BigLake, a new cross-platform data storage engine

Enterprises can now more easily analyze the data stored in their data warehouses and data lakes, thanks to the preview launch of BigLake, which Google announced at its Cloud Data Summit this morning.

The idea behind BigLake is to build on what Google has done so far with BigQuery and extend it to data lakes on Google Cloud Storage, combining the best of both into a single service that abstracts away the underlying storage formats and systems.

It is worth noting that this data can sit in BigQuery, or live on AWS S3 or Azure Data Lake Storage Gen2. Through BigLake’s consistent storage engine, developers will be able to query the underlying data stores through a single system, without having to move or duplicate data.

As Gerrit Kazmaier, Google Cloud VP and GM of Databases, Data Analytics and Business Intelligence, points out in today’s announcement, managing data across disparate lakes and warehouses creates silos and adds risk and cost. BigLake unifies data warehouses and lakes so that users need not worry about the underlying storage format or system, and because data does not have to be duplicated or moved from its source, cost and inefficiency are reduced.

BigLake’s policy tags let administrators set up security policies at the table, row, and column level. BigQuery Omni, Google’s multi-cloud analytics solution, enables these security controls for data stored in Google Cloud Storage as well as in the two third-party systems it supports. The controls also ensure that only the appropriate data flows into tools like Spark, Presto, Trino, and TensorFlow. An integration with Google’s Dataplex is available for additional data management capabilities.
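To make the idea of row- and column-level controls concrete, here is a minimal sketch in plain Python. It is not BigLake’s API: the function `apply_policy`, its parameters, and the sample records are all hypothetical, and the sketch only illustrates the general principle that a policy is applied before data reaches a downstream engine.

```python
from typing import Callable, Iterable

Row = dict[str, object]


def apply_policy(
    rows: Iterable[Row],
    allowed_columns: set[str],
    row_filter: Callable[[Row], bool],
) -> list[Row]:
    """Return only the rows and columns a caller is allowed to see.

    Hypothetical helper for illustration; real systems enforce this
    inside the storage engine, not in client code.
    """
    return [
        # Column-level control: drop every column not on the allow list.
        {col: val for col, val in row.items() if col in allowed_columns}
        for row in rows
        # Row-level control: keep only rows that satisfy the filter.
        if row_filter(row)
    ]


# Made-up sample data.
customers = [
    {"region": "EU", "customer": "a", "email": "a@example.com"},
    {"region": "US", "customer": "b", "email": "b@example.com"},
]

# An analyst who may see EU rows only, and never the email column.
visible = apply_policy(
    customers,
    allowed_columns={"region", "customer"},
    row_filter=lambda row: row["region"] == "EU",
)
print(visible)
```

Because the filtering happens before the data is handed over, whatever consumes `visible` never sees the restricted rows or columns, which is the property the article describes for engines such as Spark or Presto.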

Google said that BigLake will provide fine-grained access controls, with an API that spans Google Cloud, open file formats such as the column-oriented Apache Parquet, and open-source processing engines like Apache Spark.

In today’s release, Google Cloud software engineer Justin Levandoski and product manager Gaurav Saxena note that “The volume of valuable data that organizations have to manage and analyze is growing at an incredible rate.” This data is increasingly spread across many locations, including data lakes and NoSQL data stores. As an organization’s data grows more complex and proliferates across different data environments, silos form, increasing risk and cost, especially when that data needs to be moved. Customers, they write, have made it clear that they need help.

Google also announced today that, in addition to BigLake, its globally distributed SQL database Spanner will get a new feature dubbed “change streams.” Using these, database users can easily keep track of any changes to a database, whether they are inserts, updates, or deletions. According to Kazmaier, “This ensures that clients always have access to the freshest data as they can simply duplicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior using Pub/Sub, or save changes in Google Cloud Storage (GCS) for compliance.”
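As a rough illustration of what consuming a stream of changes involves, the following Python sketch replays a list of insert, update, and delete records against an in-memory copy of a table. It does not use Spanner’s actual change stream API: the record layout (`op`, `key`, `values`) and the function `apply_changes` are assumptions made up for this example.

```python
def apply_changes(replica: dict, changes: list[dict]) -> dict:
    """Replay change records, in order, against a keyed replica.

    Hypothetical record shape: {"op": ..., "key": ..., "values": {...}}.
    """
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge the new column values over whatever the replica holds.
            replica[key] = {**replica.get(key, {}), **change["values"]}
        elif op == "delete":
            # Deleting a missing key is treated as a no-op.
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return replica


# Made-up change records for a small "users" table.
changes = [
    {"op": "insert", "key": 1, "values": {"name": "Ada", "plan": "free"}},
    {"op": "update", "key": 1, "values": {"plan": "pro"}},
    {"op": "insert", "key": 2, "values": {"name": "Bob", "plan": "free"}},
    {"op": "delete", "key": 2},
]

replica = apply_changes({}, changes)
print(replica)
```

The same replay loop could just as well forward each record to an analytics table, a message queue, or an archive, which mirrors the three uses Kazmaier lists.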

Google Cloud also made Vertex AI Workbench, a platform for managing the complete lifecycle of a data science project, generally available today, alongside Connected Sheets for Looker and the ability to access Looker data models in its Data Studio BI tool.