
Abandon use of SQL-rows representation, expand datasets to rows by demand #13

Closed

ppKrauss opened this issue Nov 19, 2017 · 1 comment


ppKrauss commented Nov 19, 2017

(This issue applied to v0.1.0; things changed from v0.2+ on, so use the v0.1 benchmarks only as a reference.)

The benchmark (see the Wiki) shows that:

  • dataset expansion by the jsonb_array_elements() function is about as fast as per-row JSONs (see the sketch after this list);

  • extra speed is only needed when there are a lot of rows (more than ~9000). When a really fast SELECT is needed, the best option (only ~2 to 5 times faster, not much more) is a table or MATERIALIZED VIEW; an indexed one is faster still.
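
A minimal sketch of the on-demand expansion, assuming a hypothetical layout for dataset.big (the table and column names here are illustrative, not the project's actual DDL):

```sql
-- Hypothetical layout: one dataset per row, the whole content
-- stored as a single JSONb array in column "j".
CREATE TABLE IF NOT EXISTS dataset.big (
  id  serial PRIMARY KEY,
  urn text UNIQUE NOT NULL,
  j   jsonb NOT NULL
);

-- On-demand expansion of one dataset into per-row JSONb values:
SELECT t.item
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(item)
WHERE b.urn = 'datasets-br:br_city_codes';

-- When a really fast SELECT is needed, materialize the expansion
-- (hypothetical view name):
CREATE MATERIALIZED VIEW dataset.big_expanded AS
SELECT b.id AS dataset_id, t.item
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(item);
```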

On the other hand, there is some demand to add "any other free JSON" dataset into dataset.big.

Conclusion: the best is "1 dataset per row" in dataset.big, a free-JSONb dataset with "some JSON Schema" controlled by dataset.meta. We named this scheme "JSON Type Definition" (JTD): a set of labels indicating standard structures. Only some JTDs are tabular; others need a rule to join/split them into rows... The best is to keep everything as "1 dataset per row".
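
For illustration only, a sketch of how dataset.meta might label each dataset with its JTD (all names below are assumptions, not the project's real schema):

```sql
-- Hypothetical metadata table: one row per dataset, with a JTD label
-- telling consumers how to expand the free JSONb into rows.
CREATE TABLE IF NOT EXISTS dataset.meta (
  id      serial PRIMARY KEY,
  urn     text UNIQUE NOT NULL,
  jtd     text NOT NULL,  -- e.g. 'tab-aoa' for tabular array-of-arrays
  kx_cols text[]          -- column names, when the JTD is tabular
);
```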


Reference Benchmark of v0.1.0

15 datasets, loaded as "many rows per dataset". In terms of disk usage, the whole dataset schema sums to a table_size of 6080 kB. If all datasets of dataset.big were translated into usual SQL tables, the cost would be lower (!), ~75% or less.

| id | urn | pkey | lang | n_cols | n_rows |
|----|-----|------|------|--------|--------|
| 15 | datasets-br:br_city_codes | state/lexLabel | pt | 9 | 5570 |
| 14 | datasets-br:br_city_synonyms | state/lexLabel/synonym | pt | 5 | 26 |
| ... | (see more at Wiki) | ... | ... | ... | ... |

| nspname | n_tables | total_bytes | table_bytes | table_size |
|---------|----------|-------------|-------------|------------|
| dataset | 4 | 8552448 | 6225920 | 6080 kB |
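
A per-schema summary like the one above can be obtained with a catalog query along these lines (a sketch; the exact query behind the Wiki numbers may differ):

```sql
SELECT n.nspname,
       count(*)                                     AS n_tables,
       sum(pg_total_relation_size(c.oid))           AS total_bytes,
       sum(pg_relation_size(c.oid))                 AS table_bytes,
       pg_size_pretty(sum(pg_relation_size(c.oid))) AS table_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'           -- ordinary tables only
  AND n.nspname = 'dataset'
GROUP BY n.nspname;
```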

Comparing JSON and JSONb

A test with fewer rows (before "datasets:world_cities" was added), but with good results, for the total usage of dataset.big:

  • JSONb with 16138 rows, table disk-usage 3496 kB.
  • JSON with 16138 rows, same datasets, table disk-usage 3432 kB.

So there is no significant advantage in using JSON, with only a ~2% gain. Better to stay with JSONb, a "first-class citizen".

Disk-usage reduction when all rows in one JSONb-array

Checking table usage for "datasets-br:br_city_codes":

  • All rows in one JSONb array: 8192 bytes ~ 8kB
  • One JSONb row per SQL row: 736 kB

Conclusion: "all rows in one" is ~10 times better!

Checking performance

No loss of performance, according to PostgreSQL's EXPLAIN ANALYZE: as shown in the Wiki, the jsonb_array_elements() function is very fast.
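
For example, a timing check of the expansion could look like this (a sketch over the hypothetical layout above; the field names are illustrative):

```sql
EXPLAIN ANALYZE
SELECT t.item->>'state'    AS state,
       t.item->>'lexLabel' AS lexlabel
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(item)
WHERE b.urn = 'datasets-br:br_city_codes';
```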


ppKrauss commented Nov 19, 2017

Now, with v0.2+, this has changed. Benchmarking the same 15 datasets, but now stored as the "tab-aoa" JTD:

| id | urn | pkey | jtd | n_cols | n_rows |
|----|-----|------|-----|--------|--------|
| 1 | (2)lexml:autoridade | id | tab-aoa | 9 | 601 |
| 4 | (2)lexml:evento | id | tab-aoa | 9 | 14 |
| ... | (see more at Wiki) | ... | ... | ... | ... |

Disk-usage:

| nspname | n_tables | total_bytes | table_bytes | table_size |
|---------|----------|-------------|-------------|------------|
| dataset | 4 | 1474560 | 73728 | 72 kB |

Total disk-usage reduction from 6080 kB to 72 kB, a factor of ~84 (6080 ≈ 72 × 84.4), i.e. ~8440%! Practically no disk cost for datasets!
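
As a sketch, a "tab-aoa" (tabular array-of-arrays) dataset expands into named columns by positional access (the layout and column positions below are illustrative):

```sql
-- Each element of the outer JSONb array is itself an array (one row);
-- the ->> operator maps array positions to named, text-typed columns.
SELECT a->>0 AS col1,
       a->>1 AS col2
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(a)
WHERE b.urn = '(2)lexml:autoridade';
```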

ppKrauss changed the title from "Abandon use of rows, all datasets will be expanded by demand, and any other JSON can be added" to "Abandon use of SQL-rows representation, expand datasets to rows by demand" on Nov 20, 2017