Type: object
File match: *.sl.yml
Schema URL: https://catalog.lintel.tools/schemas/schemastore/starlake-data-pipeline/latest.json
Source: https://www.schemastore.org/starlake.json

Validate with Lintel:

npx @lintel/lintel check

JSON Schema for Starlake Data Pipeline

Properties

version integer required
Values: 1

All of

1. StarlakeV1Base object

Definitions

ConvertibleToString string | boolean | number | integer | null
MergeOnV1 const: "TARGET" | const: "SOURCE_AND_TARGET"
PrimitiveTypeV1 string | boolean | number | integer | null
TrimV1 string | boolean | number | integer | null
TableSync string | boolean | number | integer | null
TableDdlV1 object

DDL used to create a table

createSql string | boolean | number | integer | null required
pingSql string | boolean | number | integer | null
selectSql string | boolean | number | integer | null
Materialization string | boolean | number | integer | null
TableTypeBase string | boolean | number | integer | null
TableTypeV1 string | boolean | number | integer | null
TypeV1 object

Custom type definition. Custom types are defined in the types/types.sl.yml file

name string | boolean | number | integer | null required
pattern string | boolean | number | integer | null required
primitiveType const: "string" | const: "long" | const: "int" | const: "short" | const: "double" | const: "boolean" | const: "byte" | const: "date" | const: "timestamp" | const: "decimal" | const: "variant" | const: "struct"

Define the value type

zone string | boolean | number | integer | null
sample string | boolean | number | integer | null
comment string | boolean | number | integer | null
ddlMapping Record<string, string | boolean | number | integer | null>

Map of string
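
As an illustrative sketch (the top-level `types:` list wrapper and all concrete values below are assumptions, not taken from this schema extract), a custom type in types/types.sl.yml could combine these properties like so:

```yaml
# Hypothetical custom type entry in types/types.sl.yml.
types:
  - name: "email"                     # required
    pattern: "^[^@\\s]+@[^@\\s]+$"    # required: regex that valid values must match
    primitiveType: "string"           # underlying value type
    sample: "user@example.com"
    comment: "Loosely validated email address"
    ddlMapping:                       # per-target DDL type names (assumed keys)
      bigquery: "STRING"
      snowflake: "VARCHAR"
```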

PositionV1 object

First and last char positions of an attribute in a fixed length record

first number required

Zero based position of the first character for this attribute

last number required

Zero based position of the last character to include in this attribute
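
Since first and last are both zero based and inclusive, the extraction semantics can be sketched in Python (the record layout below is invented for illustration):

```python
def extract_field(line: str, first: int, last: int) -> str:
    """Extract a fixed-width field; positions are zero-based and inclusive."""
    # "last" is inclusive, so the slice must extend to last + 1.
    return line[first:last + 1]

# Hypothetical 11-character record: id in positions 0-3, name in positions 4-10.
record = "0042JohnDoe"
print(extract_field(record, 0, 3))   # -> 0042
print(extract_field(record, 4, 10))  # -> JohnDoe
```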

ConnectionV1 object

Connection properties for a data warehouse.

type string | boolean | number | integer | null required
sparkFormat string | boolean | number | integer | null
loader string | boolean | number | integer | null
quote string | boolean | number | integer | null
separator string | boolean | number | integer | null

Map of string

DagGenerationConfigV1 object

Dag configuration.

template string | boolean | number | integer | null required
filename string | boolean | number | integer | null required
comment string | boolean | number | integer | null

Map of string
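
A minimal sketch of a DAG generation config; the `dag:` wrapper key, the template reference and the filename pattern are illustrative assumptions only:

```yaml
# Hypothetical DAG generation config.
dag:
  template: "airflow_scheduled_load"   # assumed template reference
  filename: "{{domain}}_dag.py"        # assumed substitution syntax
  comment: "Generated load DAG"
```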

RowLevelSecurityV1 object

Row level security policy to apply to the output data.

name string | boolean | number | integer | null required
grants ConvertibleToString[] required

Users / groups / service accounts to which this security level is applied, e.g. user:[email protected],group:[email protected],serviceAccount:[email protected]

predicate string | boolean | number | integer | null
description string | boolean | number | integer | null
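
A hypothetical row-level security entry combining these properties (the `rls:` list wrapper, the predicate and the grantees are invented for illustration):

```yaml
# Hypothetical row-level security policy: only US rows visible to these grantees.
rls:
  - name: "us_rows_only"
    predicate: "country = 'US'"       # rows a grantee is allowed to see
    description: "Restrict analysts to US data"
    grants:
      - "group:analysts@example.com"
      - "user:jane@example.com"
```
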
AccessControlEntryV1 object

Column level security policy to apply to the attribute.

role string | boolean | number | integer | null required
grants ConvertibleToString[] required

Users / groups / service accounts to which this security level is applied, e.g. user:[email protected],group:[email protected],serviceAccount:[email protected]

name string | boolean | number | integer | null
FormatV1 string | boolean | number | integer | null
MapString Record<string, string | boolean | number | integer | null>

Map of string

MapConnectionV1 Record<string, object>

Map of jdbc engines

MapJdbcEngineV1 Record<string, object>

Map of jdbc engines

MapTableDdlV1 Record<string, object>

Map of table ddl

JdbcEngineV1 object

Jdbc engine

tables Record<string, object> required

Map of table ddl

quote string required

How to quote identifiers

strategyBuilder string required

Override the default strategy builder used to write data. A strategy is a folder located under metadata/templates/write-strategies/[strategyBuilder]

viewPrefix string

When creating views, how they should be prefixed. Some databases like Redshift require view names to be prefixed by the character '#'. This is not required for other databases like Snowflake or BigQuery. Default is an empty string.

preActions string

SQL statements to execute immediately after the database connection is opened (e.g., SET commands)

partitionBy string

Keyword used to partition the table. Default is PARTITION BY

clusterBy string

Keyword used to cluster the table. Default is CLUSTER BY

columnRemarks string

How to get column remarks

tableRemarks string

How to get table remarks
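
An illustrative JDBC engine definition; the `jdbcEngines` wrapper key and all concrete values are assumptions, not taken from this schema extract:

```yaml
# Hypothetical JDBC engine definition for Redshift.
jdbcEngines:
  redshift:
    quote: "\""              # identifier quoting character
    viewPrefix: "#"          # Redshift requires view names prefixed with '#'
    partitionBy: "PARTITION BY"
    clusterBy: "CLUSTER BY"
    tables:
      audit:
        createSql: "CREATE TABLE IF NOT EXISTS audit (id VARCHAR(64))"
        pingSql: "SELECT 1"
```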

PrivacyV1 object

Map of string

InternalV1 object

Configure Spark internal options

cacheStorageLevel string | boolean | number | integer | null
intermediateBigqueryFormat string | boolean | number | integer | null
temporaryGcsBucket string | boolean | number | integer | null
substituteVars boolean

Internal use. Do not modify.

bqAuditSaveInBatchMode boolean

Should BigQuery audit logs be saved in batch or interactive mode? Interactive by default (false)

AccessPoliciesV1 object
apply boolean

Should access policies be enforced?

location string | boolean | number | integer | null
database string | boolean | number | integer | null
taxonomy string | boolean | number | integer | null
SparkSchedulingV1 object
maxJobs integer

Max number of Spark jobs to run in parallel, default is 1

poolName string | boolean | number | integer | null
mode string | boolean | number | integer | null
file string | boolean | number | integer | null
ExpectationsConfigV1 object
path string | boolean | number | integer | null
active boolean

Should expectations be executed?

failOnError boolean

Should load / transform fail on expectation error?

ExpectationItemV1 object
expect string | boolean | number | integer | null
failOnError boolean

Should load / transform fail on expectation error?

MetricsV1 object
path string | boolean | number | integer | null
discreteMaxCardinality integer

Max number of unique values accepted for a discrete column. Default is 10

active boolean

Should metrics be computed?

AllSinksV1 object
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should we require a partition filter on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? false by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

WriteStrategyTypeBase string | boolean | number | integer | null
WriteStrategyTypeV1 string | boolean | number | integer | null
OpenWriteStrategyTypeV1 string | boolean | number | integer | null
WriteStrategyV1 object

Write strategy type including custom strategies. Allows predefined strategies or custom strategy names

types Record<string, ConvertibleToString>

Map of connection type to write strategy. Allows different strategies per target database

List of columns to use as key(s) for the target table. This is used to update existing records in the target table.

timestamp string | boolean | number | integer | null
queryFilter string | boolean | number | integer | null
on const: "TARGET" | const: "SOURCE_AND_TARGET"
startTs string | boolean | number | integer | null
endTs string | boolean | number | integer | null
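
A sketch of a write strategy block; note that the strategy name and the key-column property name below (`type` and `key`) are assumptions, since this extract only lists the remaining properties:

```yaml
# Hypothetical write strategy: upsert matching on a key, keeping the most
# recent row per key according to a timestamp column.
writeStrategy:
  type: "UPSERT_BY_KEY_AND_TIMESTAMP"   # assumed property and value
  key: ["id"]                           # assumed property: merge key column(s)
  timestamp: "updated_at"
  on: "SOURCE_AND_TARGET"
```
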
MetadataV1 object
format const: "DATAFRAME" | const: "DSV" | const: "POSITION" | const: "JSON" | const: "JSON_ARRAY" | const: "JSON_FLAT" | const: "XML" | const: "TEXT_XML" | const: "KAFKA" | const: "KAFKASTREAM" | const: "GENERIC" | const: "PARQUET"

DSV by default. Supported file formats are:
- DSV: Delimiter-separated values file. The delimiter value is specified in the "separator" field.
- POSITION: Fixed-format file where values are located at an exact position in each line.
- JSON_FLAT: For optimisation purposes, we differentiate JSON with top-level values from JSON with deep-level fields. JSON_FLAT files are JSON files with top-level fields only.
- JSON: Deep JSON file. Use only when your JSON documents contain sub-documents; otherwise prefer JSON_FLAT since it is much faster.
- XML: XML files

encoding string | boolean | number | integer | null
multiline boolean

Are JSON objects on a single line or multiple lines? Single line by default. false means single line; false is also faster.

array boolean

Is the JSON stored as a single object array? false by default, meaning one JSON document per line.

withHeader boolean

Does the dataset have a header? true by default

separator string | boolean | number | integer | null
quote string | boolean | number | integer | null
escape string | boolean | number | integer | null
sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should we require a partition filter on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? false by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

directory string | boolean | number | integer | null

Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions will be moved to the stage folder.

ack string | boolean | number | integer | null

Map of string

loader string | boolean | number | integer | null
emptyIsNull boolean

Treat empty columns as null in DSV files. Defaults to false

dagRef string | boolean | number | integer | null
freshness object
2 nested properties
warn string | boolean | number | integer | null
error string | boolean | number | integer | null
nullValue string | boolean | number | integer | null
schedule string | boolean | number | integer | null
writeStrategy object
8 nested properties

Write strategy type including custom strategies. Allows predefined strategies or custom strategy names

types Record<string, ConvertibleToString>

Map of connection type to write strategy. Allows different strategies per target database

List of columns to use as key(s) for the target table. This is used to update existing records in the target table.

timestamp string | boolean | number | integer | null
queryFilter string | boolean | number | integer | null
on const: "TARGET" | const: "SOURCE_AND_TARGET"
startTs string | boolean | number | integer | null
endTs string | boolean | number | integer | null
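
Tying the metadata properties together, an illustrative block for a semicolon-separated CSV load (all concrete values, including the freshness duration syntax, are assumptions):

```yaml
# Hypothetical metadata for a delimiter-separated load.
metadata:
  format: "DSV"
  separator: ";"
  withHeader: true
  encoding: "UTF-8"
  emptyIsNull: true        # treat empty DSV columns as null
  freshness:
    warn: "2d"             # assumed duration syntax
    error: "7d"
```
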
AreaV1 object
incoming string | boolean | number | integer | null
stage string | boolean | number | integer | null
unresolved string | boolean | number | integer | null
archive string | boolean | number | integer | null
ingesting string | boolean | number | integer | null
replay string | boolean | number | integer | null
hiveDatabase string | boolean | number | integer | null
FreshnessV1 object
warn string | boolean | number | integer | null
error string | boolean | number | integer | null
TableV1 object

Table Schema definition.

name string | boolean | number | integer | null required
pattern string | boolean | number | integer | null required
attributes AttributeV1[] required

Attributes parsing rules.

metadata object
20 nested properties
format const: "DATAFRAME" | const: "DSV" | const: "POSITION" | const: "JSON" | const: "JSON_ARRAY" | const: "JSON_FLAT" | const: "XML" | const: "TEXT_XML" | const: "KAFKA" | const: "KAFKASTREAM" | const: "GENERIC" | const: "PARQUET"

DSV by default. Supported file formats are:
- DSV: Delimiter-separated values file. The delimiter value is specified in the "separator" field.
- POSITION: Fixed-format file where values are located at an exact position in each line.
- JSON_FLAT: For optimisation purposes, we differentiate JSON with top-level values from JSON with deep-level fields. JSON_FLAT files are JSON files with top-level fields only.
- JSON: Deep JSON file. Use only when your JSON documents contain sub-documents; otherwise prefer JSON_FLAT since it is much faster.
- XML: XML files

encoding string | boolean | number | integer | null
multiline boolean

Are JSON objects on a single line or multiple lines? Single line by default. false means single line; false is also faster.

array boolean

Is the JSON stored as a single object array? false by default, meaning one JSON document per line.

withHeader boolean

Does the dataset have a header? true by default

separator string | boolean | number | integer | null
quote string | boolean | number | integer | null
escape string | boolean | number | integer | null
sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should we require a partition filter on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? false by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

directory string | boolean | number | integer | null

Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions will be moved to the stage folder.

ack string | boolean | number | integer | null

Map of string

loader string | boolean | number | integer | null
emptyIsNull boolean

Treat empty columns as null in DSV files. Defaults to false

dagRef string | boolean | number | integer | null
freshness object
2 nested properties
warn string | boolean | number | integer | null
error string | boolean | number | integer | null
nullValue string | boolean | number | integer | null
schedule string | boolean | number | integer | null
writeStrategy object
8 nested properties

Write strategy type including custom strategies. Allows predefined strategies or custom strategy names

types Record<string, ConvertibleToString>

Map of connection type to write strategy. Allows different strategies per target database

List of columns to use as key(s) for the target table. This is used to update existing records in the target table.

timestamp string | boolean | number | integer | null
queryFilter string | boolean | number | integer | null
on const: "TARGET" | const: "SOURCE_AND_TARGET"
startTs string | boolean | number | integer | null
endTs string | boolean | number | integer | null
comment string | boolean | number | integer | null

attach streams to table (Snowflake only)

Reserved for future use.

List of SQL requests to execute after the table has been loaded.

Set of string to attach to this Schema

Row level security on this schema.

expectations ExpectationItemV1[]

Expectations to check after Load / Transform has succeeded

List of columns that make up the primary key

Map of rolename -> List[Users].

rename string | boolean | number | integer | null
sample string | boolean | number | integer | null
filter string | boolean | number | integer | null
patternSample string | boolean | number | integer | null
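
An illustrative table definition combining the required name, pattern and attributes (the `table:` wrapper key and all concrete values are assumptions):

```yaml
# Hypothetical table schema: incoming files matching "pattern" are loaded
# into this table using the given attribute parsing rules.
table:
  name: "customers"
  pattern: "customers-.*\\.csv"
  metadata:
    format: "DSV"
    separator: ","
  attributes:
    - name: "id"
      type: "string"
      required: true
    - name: "signup_date"
      type: "date"
      rename: "signupDate"   # column name in the target table
```
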
MetricTypeV1 string | boolean | number | integer | null
AttributeV1 object
name string | boolean | number | integer | null required
type string | boolean | number | integer | null
array boolean

Is this attribute an array/list of values? Default is false

required boolean

Should this attribute always be present in the source? Defaults to true.

privacy string | boolean | number | integer | null
comment string | boolean | number | integer | null
rename string | boolean | number | integer | null
sample string | boolean | number | integer | null
metricType const: "DISCRETE" | const: "CONTINUOUS" | const: "TEXT" | const: "NONE"

Used to compute metrics on column values.

attributes AttributeV1[]

List of sub-attributes (valid for JSON and XML files only)

position object

First and last char positions of an attribute in a fixed length record

2 nested properties
first number required

Zero based position of the first character for this attribute

last number required

Zero based position of the last character to include in this attribute

default string | boolean | number | integer | null

Tags associated with this attribute

trim const: "LEFT" | const: "RIGHT" | const: "BOTH" | const: "NONE"

How to trim the input string

script string | boolean | number | integer | null
foreignKey string | boolean | number | integer | null
ignore boolean

Should this attribute be ignored on ingestion? Defaults to false

accessPolicy string | boolean | number | integer | null
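
For POSITION files, an attribute can carry its fixed-width coordinates via the position object; an invented example:

```yaml
# Hypothetical attribute for a fixed-width (POSITION format) record.
- name: "zipcode"
  type: "string"
  trim: "BOTH"        # strip padding on both sides
  position:
    first: 10         # zero-based, inclusive
    last: 14
```
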
AutoTaskDescV1 object
name string | boolean | number | integer | null
sql string | boolean | number | integer | null

attach streams to task (Snowflake only)

List of columns that make up the primary key for the output table

database string | boolean | number | integer | null
domain string | boolean | number | integer | null
table string | boolean | number | integer | null

List of columns used for partitioning the output.

List of SQL requests to execute before the main SQL request is run

List of SQL requests to execute after the main SQL request is run

sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should we require a partition filter on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? false by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

expectations ExpectationItemV1[]

Expectations to check after Load / Transform has succeeded

Map of rolename -> List[Users].

comment string | boolean | number | integer | null
freshness object
2 nested properties
warn string | boolean | number | integer | null
error string | boolean | number | integer | null
attributes AttributeV1[]

Attributes

python string | boolean | number | integer | null

Set of string to attach to the output table

writeStrategy object
8 nested properties

Write strategy type including custom strategies. Allows predefined strategies or custom strategy names

types Record<string, ConvertibleToString>

Map of connection type to write strategy. Allows different strategies per target database

List of columns to use as key(s) for the target table. This is used to update existing records in the target table.

timestamp string | boolean | number | integer | null
queryFilter string | boolean | number | integer | null
on const: "TARGET" | const: "SOURCE_AND_TARGET"
startTs string | boolean | number | integer | null
endTs string | boolean | number | integer | null
schedule string | boolean | number | integer | null
dagRef string | boolean | number | integer | null
taskTimeoutMs integer

Number of milliseconds before a communication timeout.

parseSQL boolean

Should we parse this SQL and update the table according to the write strategy, or just execute it as-is?

connectionRef string

Used when the default connection ref present in the application.sl.yml file is not the one to use to run the SQL request for this task.

syncStrategy const: "NONE" | const: "ADD" | const: "ALL"

Should this YAML table schema be synchronized with the source table?

dataset_triggering_strategy string

Dataset triggering strategy to determine when this task should be executed based on dataset changes. The & and | operators are allowed, e.g. (dataset1 & dataset2) | dataset3

LockV1 object
path string | boolean | number | integer | null
timeout integer

reserved

pollTime integer

Default 5 seconds

refreshTime integer

Default 5 seconds

AuditV1 object
path string | boolean | number | integer | null
sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should we require a partition filter on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? false by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

maxErrors string | boolean | number | integer | null
database string | boolean | number | integer | null
domain string | boolean | number | integer | null
domainExpectation string | boolean | number | integer | null
domainRejected string | boolean | number | integer | null
detailedLoadAudit boolean

Create an individual audit entry for each ingested file instead of a global one. Default: false

active boolean

Enable or disable audit logging. Default is true

sql string | boolean | number | integer | null
DomainV1 object

A schema in JDBC database or a folder in HDFS or a dataset in BigQuery.

name string | boolean | number | integer | null
metadata object
20 nested properties
format const: "DATAFRAME" | const: "DSV" | const: "POSITION" | const: "JSON" | const: "JSON_ARRAY" | const: "JSON_FLAT" | const: "XML" | const: "TEXT_XML" | const: "KAFKA" | const: "KAFKASTREAM" | const: "GENERIC" | const: "PARQUET"

DSV by default. Supported file formats are:
- DSV: Delimiter-separated values file. The delimiter value is specified in the "separator" field.
- POSITION: Fixed-format file where values are located at an exact position in each line.
- JSON_FLAT: For optimisation purposes, we differentiate JSON with top-level values from JSON with deep-level fields. JSON_FLAT files are JSON files with top-level fields only.
- JSON: Deep JSON file. Use only when your JSON documents contain sub-documents; otherwise prefer JSON_FLAT since it is much faster.
- XML: XML files

encoding string | boolean | number | integer | null
multiline boolean

Are JSON objects on a single line or multiple lines? Single line by default. false means single line; false is also faster.

array boolean

Is the JSON stored as a single object array? false by default, meaning one JSON document per line.

withHeader boolean

Does the dataset have a header? true by default

separator string | boolean | number | integer | null
quote string | boolean | number | integer | null
escape string | boolean | number | integer | null
sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should we require a partition filter on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? false by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

directory string | boolean | number | integer | null

Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions will be moved to the stage folder.

ack string | boolean | number | integer | null

Map of string

loader string | boolean | number | integer | null
emptyIsNull boolean

Treat empty columns as null in DSV files. Defaults to false

dagRef string | boolean | number | integer | null
freshness object
2 nested properties
warn string | boolean | number | integer | null
error string | boolean | number | integer | null
nullValue string | boolean | number | integer | null
schedule string | boolean | number | integer | null
writeStrategy object
8 nested properties

Write strategy type including custom strategies. Allows predefined strategies or custom strategy names

types Record<string, ConvertibleToString>

Map of connection type to write strategy. Allows different strategies per target database

List of columns to use as key(s) for the target table. This is used to update existing records in the target table.

timestamp string | boolean | number | integer | null
queryFilter string | boolean | number | integer | null
on const: "TARGET" | const: "SOURCE_AND_TARGET"
startTs string | boolean | number | integer | null
endTs string | boolean | number | integer | null
comment string | boolean | number | integer | null

Set of string to attach to this domain

rename string | boolean | number | integer | null
database string | boolean | number | integer | null
AutoJobDescV1 object
name string | boolean | number | integer | null
comment string | boolean | number | integer | null
default object
27 nested properties
name string | boolean | number | integer | null
sql string | boolean | number | integer | null

attach streams to task (Snowflake only)

List of columns that make up the primary key for the output table

database string | boolean | number | integer | null
domain string | boolean | number | integer | null
table string | boolean | number | integer | null

List of columns used for partitioning the output.

List of SQL requests to execute before the main SQL request is run

List of SQL requests to execute after the main SQL request is run

sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should we require a partition filter on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? false by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should we coalesce them into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

expectations ExpectationItemV1[]

Expectations to check after Load / Transform has succeeded

Map of rolename -> List[Users].

comment string | boolean | number | integer | null
freshness object
2 nested properties
warn string | boolean | number | integer | null
error string | boolean | number | integer | null
attributes AttributeV1[]

Attributes

python string | boolean | number | integer | null

Set of string to attach to the output table

writeStrategy object
8 nested properties

Write strategy type including custom strategies. Allows predefined strategies or custom strategy names

types Record<string, ConvertibleToString>

Map of connection type to write strategy. Allows different strategies per target database

List of columns to use as key(s) for the target table. This is used to update existing records in the target table.

timestamp string | boolean | number | integer | null
queryFilter string | boolean | number | integer | null
on const: "TARGET" | const: "SOURCE_AND_TARGET"
startTs string | boolean | number | integer | null
endTs string | boolean | number | integer | null
schedule string | boolean | number | integer | null
dagRef string | boolean | number | integer | null
taskTimeoutMs integer

Number of milliseconds before a communication timeout.

parseSQL boolean

Should we parse this SQL and update the table according to the write strategy, or just execute it as-is?

connectionRef string

Used when the default connection ref present in the application.sl.yml file is not the one to use to run the SQL request for this task.

syncStrategy const: "NONE" | const: "ADD" | const: "ALL"

Should this YAML table schema be synchronized with the source table?

dataset_triggering_strategy string

Dataset triggering strategy that determines when this task should be executed, based on dataset changes. The & and | operators are allowed, e.g. (dataset1 & dataset2) | dataset3
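As a hedged illustration of how these task-level properties combine in a transform file, the sketch below uses invented domain, table, and column names, and assumes UPSERT_BY_KEY_AND_TIMESTAMP is one of the predefined write strategy names:

```yaml
# my_domain/my_task.sl.yml -- illustrative sketch, all names are invented
version: 1
task:
  name: customer_summary
  domain: analytics
  table: customer_summary
  writeStrategy:
    type: UPSERT_BY_KEY_AND_TIMESTAMP   # assumed predefined strategy name
    key: [customer_id]                  # columns used to match existing records
    timestamp: updated_at
    on: TARGET
  schedule: "0 4 * * *"                 # cron expression, daily at 04:00
  taskTimeoutMs: 600000                 # time out after 10 minutes
  dataset_triggering_strategy: "(orders & customers) | backfill_signal"
```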

JDBCTableV1 object
name string | boolean | number | integer | null
sql string | boolean | number | integer | null
columns ConvertibleToString | object[]

List of columns to extract. All columns by default.

minItems=1
partitionColumn string | boolean | number | integer | null
numPartitions integer

Number of data partitions to create. Scope: Data extraction.

connectionOptions Record<string, string | boolean | number | integer | null>

Map of string

fetchSize integer

Number of rows to be fetched from the database when additional rows are needed. By default, most JDBC drivers use a fetch size of 10, so if you are reading 1000 objects, increasing the fetch size to 256 can significantly reduce the time required to fetch the query's results. The optimal fetch size is not always obvious. Scope: Data extraction.

fullExport boolean

If true, extract all data from the table. Scope: Data extraction.

filter string | boolean | number | integer | null
stringPartitionFunc string | boolean | number | integer | null
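To make JDBCTableV1 concrete, here is a hedged fragment; the table and column names are invented, and the surrounding extract configuration that would wrap this `tables` list is assumed:

```yaml
# Fragment of an extract config -- illustrative only
tables:
  - name: CUSTOMERS
    columns: [ID, NAME, UPDATED_AT]   # all columns are extracted if omitted
    partitionColumn: ID
    numPartitions: 4                  # parallelize extraction into 4 partitions
    fetchSize: 1000                   # larger fetch size reduces round trips
    fullExport: false                 # true would extract all data from the table
    filter: "UPDATED_AT > '2024-01-01'"
```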
OutputV1 object

Output configuration for a domain

encoding string | boolean | number | integer | null
withHeader boolean

If true, writes the names of columns as the first line.

separator string | boolean | number | integer | null
quote string | boolean | number | integer | null
escape string | boolean | number | integer | null
nullValue string | boolean | number | integer | null
datePattern string | boolean | number | integer | null
timestampPattern string | boolean | number | integer | null
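A hedged OutputV1 sketch for a semicolon-separated CSV export; the values are illustrative choices, and the placement of the `output` key within a domain config is an assumption:

```yaml
# Illustrative output settings for a CSV-style export
output:
  encoding: UTF-8
  withHeader: true            # write column names as the first line
  separator: ";"
  quote: "\""
  escape: "\\"
  nullValue: ""               # how null values are rendered
  datePattern: "yyyy-MM-dd"
  timestampPattern: "yyyy-MM-dd HH:mm:ss"
```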
JDBCSchemaBase object
catalog string | boolean | number | integer | null
schema string | boolean | number | integer | null
tableRemarks string | boolean | number | integer | null
columnRemarks string | boolean | number | integer | null
tableTypes TableTypeV1[]

One or many of the predefined table types. Scope: Schema and Data extraction.

template string | boolean | number | integer | null
pattern string | boolean | number | integer | null
numericTrim const: "LEFT" | const: "RIGHT" | const: "BOTH" | const: "NONE"

How to trim the input string

partitionColumn string | boolean | number | integer | null
numPartitions integer

Number of data partitions to create. Scope: Data extraction.

connectionOptions Record<string, string | boolean | number | integer | null>

Map of string

fetchSize integer

Number of rows to be fetched from the database when additional rows are needed. By default, most JDBC drivers use a fetch size of 10, so if you are reading 1000 objects, increasing the fetch size to 256 can significantly reduce the time required to fetch the query's results. The optimal fetch size is not always obvious. Scope: Data extraction.

stringPartitionFunc string | boolean | number | integer | null
fullExport boolean

Defines whether the entire table should be fetched or not. If not, the maximum value of partitionColumn seen during the last extraction is used to fetch incremental data. Scope: Data extraction.

sanitizeName boolean

Sanitize the domain's name by keeping alphanumeric characters only. Scope: Schema and Data extraction.

DefaultJDBCSchemaV1 object
catalog string | boolean | number | integer | null
schema string | boolean | number | integer | null
tableRemarks string | boolean | number | integer | null
columnRemarks string | boolean | number | integer | null
tableTypes TableTypeV1[]

One or many of the predefined table types. Scope: Schema and Data extraction.

template string | boolean | number | integer | null
pattern string | boolean | number | integer | null
numericTrim const: "LEFT" | const: "RIGHT" | const: "BOTH" | const: "NONE"

How to trim the input string

partitionColumn string | boolean | number | integer | null
numPartitions integer

Number of data partitions to create. Scope: Data extraction.

connectionOptions Record<string, string | boolean | number | integer | null>

Map of string

fetchSize integer

Number of rows to be fetched from the database when additional rows are needed. By default, most JDBC drivers use a fetch size of 10, so if you are reading 1000 objects, increasing the fetch size to 256 can significantly reduce the time required to fetch the query's results. The optimal fetch size is not always obvious. Scope: Data extraction.

stringPartitionFunc string | boolean | number | integer | null
fullExport boolean

Defines whether the entire table should be fetched or not. If not, the maximum value of partitionColumn seen during the last extraction is used to fetch incremental data. Scope: Data extraction.

sanitizeName boolean

Sanitize the domain's name by keeping alphanumeric characters only. Scope: Schema and Data extraction.
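The JDBCSchemaBase / DefaultJDBCSchemaV1 properties above might combine as follows; the connection name and schema are invented, and the extract/jdbcSchemas wrapper is assumed:

```yaml
# Hedged extraction sketch -- names are illustrative
extract:
  connectionRef: my_postgres        # hypothetical connection defined elsewhere
  jdbcSchemas:
    - schema: public
      tableTypes: [TABLE]           # restrict to plain tables
      fetchSize: 1000
      numPartitions: 4
      partitionColumn: id
      fullExport: false             # incremental: resume from last max(id)
      sanitizeName: true            # keep alphanumeric characters only
```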

JDBCSchemaV1 object
JDBCSchemasV1 object
OpenAPIObjectSchemasV1 object

List of regexes used to include OpenAPI schemas (#/components/schemas). Defaults to ['.*']. 'Includes' is evaluated before 'excludes'

minItems=1

List of regexes used to exclude OpenAPI schemas (#/components/schemas). Defaults to []

OpenAPIRouteObjectExplosionV1 object
on string

Explode the route's object into additional object definitions. Uses the object's path combined with the route path as the final name. Defaults to ALL

Any of: const: "ALL" (keep properties of type object or array); const: "OBJECT" (keep properties of type object, don't dive into array types); const: "ARRAY" (keep properties of type array; when an object is encountered, dive deeper)

Filter out on field path. Fields are separated by _. Defaults to []

Regex applied to the object path. If it matches, the given name is used; otherwise falls back to route_path + object path as the final name

OpenAPIRoutesV1 object

List of regexes used to include OpenAPI paths. Defaults to ['.*']

minItems=1
as string | boolean | number | integer | null
operations const: "GET" | const: "POST"[]

List of operations to retrieve schema from. Defaults to ['GET']. Supported values are GET and POST.

minItems=1

List of regexes used to exclude OpenAPI paths. Defaults to []

minItems=1
excludeFields ConvertibleToString[]

List of regexes used to exclude fields. Fields and their subfields are separated by _.

minItems=1
explode object
3 nested properties
on string

Explode the route's object into additional object definitions. Uses the object's path combined with the route path as the final name. Defaults to ALL

Any of: const: "ALL" (keep properties of type object or array); const: "OBJECT" (keep properties of type object, don't dive into array types); const: "ARRAY" (keep properties of type array; when an object is encountered, dive deeper)

Filter out on field path. Fields are separated by _. Defaults to []

Regex applied to the object path. If it matches, the given name is used; otherwise falls back to route_path + object path as the final name

OpenAPIDomainV1 object
name string | boolean | number | integer | null required
basePath string | boolean | number | integer | null
schemas object
2 nested properties

List of regexes used to include OpenAPI schemas (#/components/schemas). Defaults to ['.*']. 'Includes' is evaluated before 'excludes'

minItems=1

List of regexes used to exclude OpenAPI schemas (#/components/schemas). Defaults to []

Describe what to fetch from data connection. Scope: Schema and Data extraction.

minItems=1
OpenAPIV1 object
basePath string | boolean | number | integer | null
formatTypeMapping Record<string, string | boolean | number | integer | null>

Map of string

Describe what to fetch from data connection. Scope: Schema and Data extraction.

minItems=1
OpenAPIsV1 object
ExtractV1Base object
sanitizeAttributeName string
Any of: const: "ON_EXTRACT" const: "ON_EXTRACT", attribute name is sanitized and stored as rename property when attribute's name differs from sanitized name const: "ON_LOAD"
connectionRef string | boolean | number | integer | null
InputRefV1 object

Input for ref object

table string | boolean | number | integer | null required
database string | boolean | number | integer | null
domain string | boolean | number | integer | null
OutputRefV1 object

Output for ref object

database string | boolean | number | integer | null required
domain string | boolean | number | integer | null required
table string | boolean | number | integer | null required
RefV1 object

Describe how to resolve a reference in a transform task

input object required

Input for ref object

3 nested properties
table string | boolean | number | integer | null required
database string | boolean | number | integer | null
domain string | boolean | number | integer | null
output object required

Output for ref object

3 nested properties
database string | boolean | number | integer | null required
domain string | boolean | number | integer | null required
table string | boolean | number | integer | null required
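A RefV1 entry maps an input reference, as written in SQL, to a fully qualified output location. A hedged example with invented project and domain names:

```yaml
# Any unqualified reference to "customers" in a transform query
# resolves to my-prod-project.analytics.customers (names invented)
refs:
  - input:
      table: customers
    output:
      database: my-prod-project
      domain: analytics
      table: customers
```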
KafkaTopicConfigV1
topicName string | boolean | number | integer | null
maxRead integer

Maximum number of records to read from the topic in a single batch. Default is unlimited

List of fields to extract from Kafka messages

partitions integer

Number of partitions for the Kafka topic when creating it

replicationFactor integer

Replication factor for the Kafka topic when creating it

createOptions Record<string, string | boolean | number | integer | null>

Map of string

accessOptions Record<string, string | boolean | number | integer | null>

Map of string

headers Record<string, object>

HTTP headers to include when accessing Kafka via HTTP proxy

KafkaConfigV1 object
serverOptions Record<string, string | boolean | number | integer | null>

Map of string

topics Record<string, KafkaTopicConfigV1>

Map of topic name to topic configuration

cometOffsetsMode string | boolean | number | integer | null
customDeserializers Record<string, string | boolean | number | integer | null>

Map of string
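These Kafka properties might be assembled as below; the broker address, topic, and consumer group are invented, and the option keys under serverOptions/accessOptions are assumed to be standard Kafka client settings:

```yaml
# Hedged KafkaConfigV1 sketch -- all values illustrative
kafka:
  serverOptions:
    bootstrap.servers: "localhost:9092"
  topics:
    orders_in:
      topicName: orders
      maxRead: 100000            # cap records read per batch
      partitions: 3              # used when creating the topic
      replicationFactor: 1
      accessOptions:
        group.id: starlake-orders-loader
```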

DagRefV1 object
load string | boolean | number | integer | null
transform string | boolean | number | integer | null
GizmoV1 object
url string

Gizmo server URL. Default is 'http://localhost:10900'

apiKey string

API key for authenticating with the Gizmo server

HttpV1 object
interface string | boolean | number | integer | null
port integer

Port number for the HTTP server. Default is 8080

AppConfigV1 object
env string | boolean | number | integer | null
datasets string | boolean | number | integer | null
incoming string | boolean | number | integer | null
dags string | boolean | number | integer | null
types string | boolean | number | integer | null
macros string | boolean | number | integer | null
tests string | boolean | number | integer | null
prunePartitionOnMerge boolean

Pre-compute incoming partitions to prune partitions on merge statement

writeStrategies string | boolean | number | integer | null
loadStrategies string | boolean | number | integer | null
metadata string | boolean | number | integer | null
metrics object
3 nested properties
path string | boolean | number | integer | null
discreteMaxCardinality integer

Max number of unique values accepted for a discrete column. Default is 10

active boolean

Should metrics be computed?

validateOnLoad boolean

Validate the YAML file when loading it. If set to true, fails on any error.

rejectWithValue boolean

Add the value along with the rejection error. Not enabled by default for security reasons. Default: false

audit object
10 nested properties
path string | boolean | number | integer | null
sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should a partition filter be required on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? False by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}.

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should they be coalesced into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

maxErrors string | boolean | number | integer | null
database string | boolean | number | integer | null
domain string | boolean | number | integer | null
domainExpectation string | boolean | number | integer | null
domainRejected string | boolean | number | integer | null
detailedLoadAudit boolean

Create individual entry for each ingested file instead of a global one. Default: false

active boolean

Enable or disable audit logging. Default is true

sql string | boolean | number | integer | null
archive boolean

Should ingested files be archived after ingestion?

sinkReplayToFile boolean

Should invalid records be stored in a replay file?

lock object
4 nested properties
path string | boolean | number | integer | null
timeout integer

reserved

pollTime integer

Default 5 seconds

refreshTime integer

Default 5 seconds

defaultWriteFormat string | boolean | number | integer | null
defaultRejectedWriteFormat string | boolean | number | integer | null
defaultAuditWriteFormat string | boolean | number | integer | null
csvOutput boolean

Output files in CSV format? Default is false

csvOutputExt string | boolean | number | integer | null
privacyOnly boolean

Only generate privacy tasks. Reserved for internal use

emptyIsNull boolean

Should empty strings be considered as null values?

loader string | boolean | number | integer | null
rowValidatorClass string | boolean | number | integer | null
loadStrategyClass string | boolean | number | integer | null
grouped boolean

Should all files destined for the same table be loaded in a single task, or one by one?

groupedMax integer

Maximum number of files to be stored in the same table in a single task

scd2StartTimestamp string | boolean | number | integer | null
scd2EndTimestamp string | boolean | number | integer | null
area object
7 nested properties
incoming string | boolean | number | integer | null
stage string | boolean | number | integer | null
unresolved string | boolean | number | integer | null
archive string | boolean | number | integer | null
ingesting string | boolean | number | integer | null
replay string | boolean | number | integer | null
hiveDatabase string | boolean | number | integer | null

Map of string

connections Record<string, object>

Map of jdbc engines

jdbcEngines Record<string, object>

Map of jdbc engines

privacy object
1 nested properties

Map of string

root string | boolean | number | integer | null
internal object

configure Spark internal options

5 nested properties
cacheStorageLevel string | boolean | number | integer | null
intermediateBigqueryFormat string | boolean | number | integer | null
temporaryGcsBucket string | boolean | number | integer | null
substituteVars boolean

Internal use. Do not modify.

bqAuditSaveInBatchMode boolean

Should audit logs be saved in batch or interactive mode when using BigQuery? Interactive by default (false)

accessPolicies object
4 nested properties
apply boolean

Should access policies be enforced?

location string | boolean | number | integer | null
database string | boolean | number | integer | null
taxonomy string | boolean | number | integer | null
sparkScheduling object
4 nested properties
maxJobs integer

Max number of Spark jobs to run in parallel, default is 1

poolName string | boolean | number | integer | null
mode string | boolean | number | integer | null
file string | boolean | number | integer | null
udfs string | boolean | number | integer | null
expectations object
3 nested properties
path string | boolean | number | integer | null
active boolean

Should expectations be executed?

failOnError boolean

Should load / transform fail on expectation errors?

sqlParameterPattern string | boolean | number | integer | null
rejectAllOnError string | boolean | number | integer | null
rejectMaxRecords integer

Maximum number of records to reject when an error occurs. Default is 100

maxParCopy integer

Maximum number of parallel file copy operations during import. Default is 1

kafka object
4 nested properties
serverOptions Record<string, string | boolean | number | integer | null>

Map of string

topics Record<string, KafkaTopicConfigV1>

Map of topic name to topic configuration

cometOffsetsMode string | boolean | number | integer | null
customDeserializers Record<string, string | boolean | number | integer | null>

Map of string

dsvOptions Record<string, string | boolean | number | integer | null>

Map of string

forceViewPattern string | boolean | number | integer | null
forceDomainPattern string | boolean | number | integer | null
forceTablePattern string | boolean | number | integer | null
forceJobPattern string | boolean | number | integer | null
forceTaskPattern string | boolean | number | integer | null
useLocalFileSystem string | boolean | number | integer | null
sessionDurationServe string | boolean | number | integer | null
database string | boolean | number | integer | null
tenant string | boolean | number | integer | null
connectionRef string | boolean | number | integer | null
loadConnectionRef string | boolean | number | integer | null
transformConnectionRef string | boolean | number | integer | null
schedulePresets Record<string, string | boolean | number | integer | null>

Map of string

maxParTask integer

How many jobs to run simultaneously in dev mode (experimental)

refs RefV1[]

Reference mappings for resolving table references in SQL queries across different environments

dagRef object
2 nested properties
load string | boolean | number | integer | null
transform string | boolean | number | integer | null
forceHalt boolean

Force the application to stop even when there are pending threads.

jobIdEnvName string | boolean | number | integer | null
archiveTablePattern string | boolean | number | integer | null
archiveTable boolean

Enable table archiving before overwrite operations. Default is false

version string | boolean | number | integer | null
autoExportSchema boolean

Automatically export table schemas after load/transform operations. Default is false

longJobTimeoutMs integer

Timeout in milliseconds for long-running jobs. Default is 3600000 (1 hour)

shortJobTimeoutMs integer

Timeout in milliseconds for short-running jobs. Default is 300000 (5 minutes)

createSchemaIfNotExists boolean

Automatically create database schema/dataset if it does not exist. Default is true

http object
2 nested properties
interface string | boolean | number | integer | null
port integer

Port number for the HTTP server. Default is 8080

timezone string | boolean | number | integer | null
maxInteractiveRecords integer

Maximum number of records to return in interactive query mode. Default is 1000

duckdbMode boolean

Is DuckDB mode active?

duckdbExtensions string

Comma-separated list of DuckDB extensions to load. Default is spatial, json, httpfs

duckdbPath string | boolean | number | integer | null
testCsvNullString string | boolean | number | integer | null
hiveInTest string | boolean | number | integer | null
spark object

Map of string

extra object

Map of string

duckDbEnableExternalAccess boolean

Allow DuckDB to load / save data from / to external sources. Defaults to true

syncSqlWithYaml boolean

Update attributes in the YAML file when the SQL is updated. Defaults to true

syncYamlWithDb boolean

Update the database with the YAML when a transform is run. Defaults to true

onExceptionRetries integer

Number of retries on transient exceptions

pythonLibsDir string

Directory containing python libraries to use instead of pip install

gizmosql object
2 nested properties
url string

Gizmo server URL. Default is 'http://localhost:10900'

apiKey string

API key for authenticating with the Gizmo server

StarlakeV1Base object
types TypeV1[]
dag object

Dag configuration.

4 nested properties
template string | boolean | number | integer | null required
filename string | boolean | number | integer | null required
comment string | boolean | number | integer | null

Map of string

load object

A schema in JDBC database or a folder in HDFS or a dataset in BigQuery.

6 nested properties
name string | boolean | number | integer | null
metadata object
20 nested properties
format const: "DATAFRAME" | const: "DSV" | const: "POSITION" | const: "JSON" | const: "JSON_ARRAY" | const: "JSON_FLAT" | const: "XML" | const: "TEXT_XML" | const: "KAFKA" | const: "KAFKASTREAM" | const: "GENERIC" | const: "PARQUET"

DSV by default. Supported file formats are:
- DSV: Delimiter-separated values file. The delimiter value is specified in the "separator" field.
- POSITION: Fixed-format file where values are located at an exact position in each line.
- JSON_FLAT: For optimisation purposes, we differentiate JSON with top-level values from JSON with deep-level fields. JSON_FLAT files are JSON files with top-level fields only.
- JSON: Deep JSON file. Use only when your JSON documents contain sub-documents; otherwise prefer JSON_FLAT since it is much faster.
- XML: XML files

encoding string | boolean | number | integer | null
multiline boolean

Are JSON objects on a single line or multiple lines? Single by default. false means single; false is also faster.

array boolean

Is the JSON stored as a single object array? false by default, which means one JSON document per line.

withHeader boolean

Does the dataset have a header? true by default

separator string | boolean | number | integer | null
quote string | boolean | number | integer | null
escape string | boolean | number | integer | null
sink object
directory string | boolean | number | integer | null

Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions will be moved to the stage folder.

ack string | boolean | number | integer | null

Map of string

loader string | boolean | number | integer | null
emptyIsNull boolean

Treat empty columns as null in DSV files. Defaults to false

dagRef string | boolean | number | integer | null
freshness object
nullValue string | boolean | number | integer | null
schedule string | boolean | number | integer | null
writeStrategy object
comment string | boolean | number | integer | null

Set of string to attach to this domain

rename string | boolean | number | integer | null
database string | boolean | number | integer | null
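The load domain properties above can be sketched as a minimal _config.sl.yml; the domain name and freshness durations are invented, and the duration syntax is an assumption:

```yaml
# Hedged load domain sketch (sales/_config.sl.yml) -- names invented
version: 1
load:
  name: sales
  metadata:
    format: DSV
    separator: ";"
    withHeader: true
    encoding: UTF-8
    emptyIsNull: true        # treat empty DSV columns as null
    freshness:
      warn: "2h"             # assumed duration syntax
      error: "1d"
  comment: "Raw sales files landed daily"
```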
transform object
4 nested properties
name string | boolean | number | integer | null
comment string | boolean | number | integer | null
default object
27 nested properties
name string | boolean | number | integer | null
sql string | boolean | number | integer | null

attach streams to task (Snowflake only)

List of columns that make up the primary key for the output table

database string | boolean | number | integer | null
domain string | boolean | number | integer | null
table string | boolean | number | integer | null

List of columns used for partitioning the output.

List of SQL requests to execute before the main SQL request is run

List of SQL requests to execute after the main SQL request is run

sink object
expectations ExpectationItemV1[]

Expectations to check after Load / Transform has succeeded

Map of rolename -> List[Users].

comment string | boolean | number | integer | null
freshness object
attributes AttributeV1[]

Attributes

python string | boolean | number | integer | null

Set of string to attach to the output table

writeStrategy object
schedule string | boolean | number | integer | null
dagRef string | boolean | number | integer | null
taskTimeoutMs integer

Number of milliseconds before a communication timeout.

parseSQL boolean

Should we parse this SQL and update the table according to the write strategy, or just execute it as-is?

connectionRef string

Used when the default connection ref present in the application.sl.yml file is not the one to use to run the SQL request for this task.

syncStrategy const: "NONE" | const: "ADD" | const: "ALL"

Should this YAML table schema be synchronized with the source table?

dataset_triggering_strategy string

Dataset triggering strategy that determines when this task should be executed, based on dataset changes. The & and | operators are allowed, e.g. (dataset1 & dataset2) | dataset3

task object
27 nested properties
name string | boolean | number | integer | null
sql string | boolean | number | integer | null

attach streams to task (Snowflake only)

List of columns that make up the primary key for the output table

database string | boolean | number | integer | null
domain string | boolean | number | integer | null
table string | boolean | number | integer | null

List of columns used for partitioning the output.

List of SQL requests to execute before the main SQL request is run

List of SQL requests to execute after the main SQL request is run

sink object
15 nested properties
connectionRef string | boolean | number | integer | null

FS or BQ: List of attributes to use for clustering

days number

BQ: Number of days before this table is set as expired and deleted. Never by default.

requirePartitionFilter boolean

BQ: Should a partition filter be required on every request? No by default.

materializedView const: "TABLE" | const: "VIEW" | const: "MATERIALIZED_VIEW" | const: "HYBRID"

Table types supported by the Sink option

enableRefresh boolean

BQ: Enable automatic refresh of the materialized view? False by default.

refreshIntervalMs number

BQ: Refresh interval in milliseconds. Defaults to the BigQuery default value.

id string | boolean | number | integer | null
format string | boolean | number | integer | null
extension string | boolean | number | integer | null

Columns to use for sharding. The table will be named table_{sharding(0)}_{sharding(1)}.

partition string[]

FS or BQ: List of partition attributes

coalesce boolean

When outputting files, should they be coalesced into a single file? Useful when CSV is the output format.

path string

Optional path attribute if you want to save the file outside of the default location (datasets folder)

Map of string

expectations ExpectationItemV1[]

Expectations to check after Load / Transform has succeeded

Map of rolename -> List[Users].

comment string | boolean | number | integer | null
freshness object
2 nested properties
warn string | boolean | number | integer | null
error string | boolean | number | integer | null
attributes AttributeV1[]

Attributes

python string | boolean | number | integer | null

Set of string to attach to the output table

writeStrategy object
8 nested properties

Write strategy type including custom strategies. Allows predefined strategies or custom strategy names

types Record<string, ConvertibleToString>

Map of connection type to write strategy. Allows different strategies per target database

List of columns to use as key(s) for the target table. This is used to update existing records in the target table.

timestamp string | boolean | number | integer | null
queryFilter string | boolean | number | integer | null
on const: "TARGET" | const: "SOURCE_AND_TARGET"
startTs string | boolean | number | integer | null
endTs string | boolean | number | integer | null
schedule string | boolean | number | integer | null
dagRef string | boolean | number | integer | null
taskTimeoutMs integer

Number of milliseconds before a communication timeout.

parseSQL boolean

Should we parse this SQL and update the table according to the write strategy, or just execute it as-is?

connectionRef string

Used when the default connection ref present in the application.sl.yml file is not the one to use to run the SQL request for this task.

syncStrategy const: "NONE" | const: "ADD" | const: "ALL"

Should this YAML table schema be synchronized with the source table?

dataset_triggering_strategy string

Dataset triggering strategy that determines when this task should be executed, based on dataset changes. The & and | operators are allowed, e.g. (dataset1 & dataset2) | dataset3

Map of string

table object

Table Schema definition.

17 nested properties
name string | boolean | number | integer | null required
pattern string | boolean | number | integer | null required
attributes AttributeV1[] required

Attributes parsing rules.

metadata object
20 nested properties
format const: "DATAFRAME" | const: "DSV" | const: "POSITION" | const: "JSON" | const: "JSON_ARRAY" | const: "JSON_FLAT" | const: "XML" | const: "TEXT_XML" | const: "KAFKA" | const: "KAFKASTREAM" | const: "GENERIC" | const: "PARQUET"

DSV by default. Supported file formats are:
- DSV: Delimiter-separated values file. The delimiter value is specified in the "separator" field.
- POSITION: Fixed-format file where values are located at an exact position in each line.
- JSON_FLAT: For optimisation purposes, we differentiate JSON with top-level values from JSON with deep-level fields. JSON_FLAT files are JSON files with top-level fields only.
- JSON: Deep JSON file. Use only when your JSON documents contain sub-documents; otherwise prefer JSON_FLAT since it is much faster.
- XML: XML files

encoding string | boolean | number | integer | null
multiline boolean

Are JSON objects on a single line or multiple lines? Single by default. false means single; false is also faster.

array boolean

Is the JSON stored as a single object array? false by default, which means one JSON document per line.

withHeader boolean

Does the dataset have a header? true by default

separator string | boolean | number | integer | null
quote string | boolean | number | integer | null
escape string | boolean | number | integer | null
sink object
directory string | boolean | number | integer | null

Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions will be moved to the stage folder.

ack string | boolean | number | integer | null

Map of string

loader string | boolean | number | integer | null
emptyIsNull boolean

Treat empty columns as null in DSV files. Defaults to false

dagRef string | boolean | number | integer | null
freshness object
nullValue string | boolean | number | integer | null
schedule string | boolean | number | integer | null
writeStrategy object
comment string | boolean | number | integer | null

attach streams to table (Snowflake only)

Reserved for future use.

List of SQL requests to execute after the table has been loaded.

Set of string to attach to this Schema

Row level security on this schema.

expectations ExpectationItemV1[]

Expectations to check after Load / Transform has succeeded

List of columns that make up the primary key

Map of rolename -> List[Users].

rename string | boolean | number | integer | null
sample string | boolean | number | integer | null
filter string | boolean | number | integer | null
patternSample string | boolean | number | integer | null
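A table schema combining the required name, pattern, and attributes fields might look like this hedged sketch; the file pattern, column names, and types are invented, and the attribute fields shown follow an assumed AttributeV1 shape:

```yaml
# Hedged table schema sketch (sales/orders.sl.yml) -- illustrative only
version: 1
table:
  name: orders
  pattern: "orders-.*\\.csv"     # incoming filenames matched by regex
  attributes:
    - name: order_id
      type: long
      required: true
    - name: amount
      type: decimal
    - name: order_date
      type: date
  metadata:
    format: DSV
    withHeader: true
```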
refs RefV1[]
application object
90 nested properties
env string | boolean | number | integer | null
datasets string | boolean | number | integer | null
incoming string | boolean | number | integer | null
dags string | boolean | number | integer | null
types string | boolean | number | integer | null
macros string | boolean | number | integer | null
tests string | boolean | number | integer | null
prunePartitionOnMerge boolean

Pre-compute incoming partitions to prune partitions in the merge statement.

writeStrategies string | boolean | number | integer | null
loadStrategies string | boolean | number | integer | null
metadata string | boolean | number | integer | null
metrics object
3 nested properties
path string | boolean | number | integer | null
discreteMaxCardinality integer

Max number of unique values accepted for a discrete column. Default is 10

active boolean

Should metrics be computed?

validateOnLoad boolean

Validate the YAML file when loading it. If set to true, fails on any error.

rejectWithValue boolean

Add the value along with the rejection error. Not enabled by default for security reasons. Default: false.
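
Putting the validation and metrics settings above together, an `application.sl.yml` fragment might look like the following sketch (values are illustrative):

```yaml
version: 1
application:
  validateOnLoad: true          # fail on any YAML error at load time
  rejectWithValue: false        # keep rejected values out of error logs (the default)
  metrics:
    active: true                # compute column metrics
    discreteMaxCardinality: 10  # max distinct values for a discrete column (the default)
```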

audit object
10 nested properties
path string | boolean | number | integer | null
sink object
maxErrors string | boolean | number | integer | null
database string | boolean | number | integer | null
domain string | boolean | number | integer | null
domainExpectation string | boolean | number | integer | null
domainRejected string | boolean | number | integer | null
detailedLoadAudit boolean

Create individual entry for each ingested file instead of a global one. Default: false

active boolean

Enable or disable audit logging. Default is true

sql string | boolean | number | integer | null
archive boolean

Should ingested files be archived after ingestion?

sinkReplayToFile boolean

Should invalid records be stored in a replay file?

lock object
4 nested properties
path string | boolean | number | integer | null
timeout integer

reserved

pollTime integer

Default 5 seconds

refreshTime integer

Default 5 seconds

defaultWriteFormat string | boolean | number | integer | null
defaultRejectedWriteFormat string | boolean | number | integer | null
defaultAuditWriteFormat string | boolean | number | integer | null
csvOutput boolean

Output files in CSV format? Default is false.

csvOutputExt string | boolean | number | integer | null
privacyOnly boolean

Only generate privacy tasks. Reserved for internal use

emptyIsNull boolean

Should empty strings be considered null values?

loader string | boolean | number | integer | null
rowValidatorClass string | boolean | number | integer | null
loadStrategyClass string | boolean | number | integer | null
grouped boolean

Should all files targeting the same table be loaded in a single task, or one by one?

groupedMax integer

Maximum number of files to be stored in the same table in a single task

scd2StartTimestamp string | boolean | number | integer | null
scd2EndTimestamp string | boolean | number | integer | null
area object
7 nested properties
incoming string | boolean | number | integer | null
stage string | boolean | number | integer | null
unresolved string | boolean | number | integer | null
archive string | boolean | number | integer | null
ingesting string | boolean | number | integer | null
replay string | boolean | number | integer | null
hiveDatabase string | boolean | number | integer | null

Map of string

connections Record<string, object>

Map of connection name to connection configuration.

jdbcEngines Record<string, object>

Map of jdbc engines
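
A sketch of a single named connection and the `connectionRef` that selects it. The connection name, option keys, and placeholder variables are assumptions for illustration, not a definitive configuration:

```yaml
version: 1
application:
  connectionRef: "pg"      # default connection used by load and transform
  connections:
    pg:                    # hypothetical JDBC connection name
      type: "jdbc"
      options:
        url: "jdbc:postgresql://localhost:5432/mydb"   # illustrative URL
        user: "{{PG_USER}}"                            # assumed env substitution
        password: "{{PG_PASSWORD}}"
```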

privacy object
1 nested properties

Map of string

root string | boolean | number | integer | null
internal object

Configure Spark internal options.

5 nested properties
cacheStorageLevel string | boolean | number | integer | null
intermediateBigqueryFormat string | boolean | number | integer | null
temporaryGcsBucket string | boolean | number | integer | null
substituteVars boolean

Internal use. Do not modify.

bqAuditSaveInBatchMode boolean

When using BigQuery, should audit logs be saved in batch or interactive mode? Interactive by default (false).

accessPolicies object
4 nested properties
apply boolean

Should access policies be enforced?

location string | boolean | number | integer | null
database string | boolean | number | integer | null
taxonomy string | boolean | number | integer | null
sparkScheduling object
4 nested properties
maxJobs integer

Max number of Spark jobs to run in parallel, default is 1

poolName string | boolean | number | integer | null
mode string | boolean | number | integer | null
file string | boolean | number | integer | null
udfs string | boolean | number | integer | null
expectations object
3 nested properties
path string | boolean | number | integer | null
active boolean

Should expectations be executed?

failOnError boolean

Should load/transform fail on expectation errors?

sqlParameterPattern string | boolean | number | integer | null
rejectAllOnError string | boolean | number | integer | null
rejectMaxRecords integer

Maximum number of records to reject when an error occurs. Default is 100

maxParCopy integer

Maximum number of parallel file copy operations during import. Default is 1

kafka object
4 nested properties
serverOptions Record<string, string | boolean | number | integer | null>

Map of string

topics Record<string, KafkaTopicConfigV1>

Map of topic name to topic configuration

cometOffsetsMode string | boolean | number | integer | null
customDeserializers Record<string, string | boolean | number | integer | null>

Map of string

dsvOptions Record<string, string | boolean | number | integer | null>

Map of string

forceViewPattern string | boolean | number | integer | null
forceDomainPattern string | boolean | number | integer | null
forceTablePattern string | boolean | number | integer | null
forceJobPattern string | boolean | number | integer | null
forceTaskPattern string | boolean | number | integer | null
useLocalFileSystem string | boolean | number | integer | null
sessionDurationServe string | boolean | number | integer | null
database string | boolean | number | integer | null
tenant string | boolean | number | integer | null
connectionRef string | boolean | number | integer | null
loadConnectionRef string | boolean | number | integer | null
transformConnectionRef string | boolean | number | integer | null
schedulePresets Record<string, string | boolean | number | integer | null>

Map of string

maxParTask integer

How many jobs to run simultaneously in dev mode (experimental).

refs RefV1[]

Reference mappings for resolving table references in SQL queries across different environments

dagRef object
2 nested properties
load string | boolean | number | integer | null
transform string | boolean | number | integer | null
forceHalt boolean

Force the application to stop even when there are pending threads.

jobIdEnvName string | boolean | number | integer | null
archiveTablePattern string | boolean | number | integer | null
archiveTable boolean

Enable table archiving before overwrite operations. Default is false

version string | boolean | number | integer | null
autoExportSchema boolean

Automatically export table schemas after load/transform operations. Default is false

longJobTimeoutMs integer

Timeout in milliseconds for long-running jobs. Default is 3600000 (1 hour)

shortJobTimeoutMs integer

Timeout in milliseconds for short-running jobs. Default is 300000 (5 minutes)

createSchemaIfNotExists boolean

Automatically create database schema/dataset if it does not exist. Default is true

http object
2 nested properties
interface string | boolean | number | integer | null
port integer

Port number for the HTTP server. Default is 8080

timezone string | boolean | number | integer | null
maxInteractiveRecords integer

Maximum number of records to return in interactive query mode. Default is 1000

duckdbMode boolean

Is DuckDB mode active?

duckdbExtensions string

Comma-separated list of DuckDB extensions to load. Default is spatial, json, httpfs.

duckdbPath string | boolean | number | integer | null
testCsvNullString string | boolean | number | integer | null
hiveInTest string | boolean | number | integer | null
spark object

Map of string

extra object

Map of string

duckDbEnableExternalAccess boolean

Allow DuckDB to load/save data from/to external sources. Defaults to true.

syncSqlWithYaml boolean

Update attributes in the YAML file when the SQL is updated. Defaults to true.

syncYamlWithDb boolean

Update the database from the YAML when a transform is run. Defaults to true.

onExceptionRetries integer

Number of retries on transient exceptions

pythonLibsDir string

Directory containing Python libraries to use instead of pip install.

gizmosql object
2 nested properties
url string

Gizmo server URL. Default is 'http://localhost:10900'

apiKey string

API key for authenticating with the Gizmo server