
Abandon use of SQL-rows representation, expand datasets to rows by demand #13

Closed

ppKrauss opened this issue Nov 19, 2017 · 1 comment


ppKrauss commented Nov 19, 2017

(This issue applied to v0.1.0; things changed from v0.2+ on, so use the v0.1 benchmarks only as a reference.)

The benchmark (see the Wiki) shows that:

  • dataset expansion by the jsonb_array_elements() function is about as fast as per-row JSONs (see the sketch after this list);

  • extra speed is only needed when there are a lot of rows (more than ~9000). When a really fast SELECT is needed, the best option (only ~2 to 5 times faster, not much more) is a table or MATERIALIZED VIEW; an indexed one is faster still.
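
A minimal sketch of the on-demand expansion, assuming a hypothetical layout for dataset.big (the table and column names here are illustrative, not the project's actual DDL):

```sql
-- Hypothetical layout: one dataset per row, the whole content
-- stored as a single JSONb array in column "j".
CREATE TABLE IF NOT EXISTS dataset.big (
  id  serial PRIMARY KEY,
  urn text UNIQUE NOT NULL,
  j   jsonb NOT NULL
);

-- On-demand expansion of one dataset into per-row JSONb values:
SELECT t.item
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(item)
WHERE b.urn = 'datasets-br:br_city_codes';

-- When a really fast SELECT is needed, materialize the expansion
-- (hypothetical view name):
CREATE MATERIALIZED VIEW dataset.big_expanded AS
SELECT b.id AS dataset_id, t.item
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(item);
```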

On the other hand, there is some demand to add "any other free JSON" dataset into dataset.big.

Conclusion: the best is "1 dataset per row" in dataset.big, a free-JSONb dataset with "some JSON Schema" controlled by dataset.meta. We named this scheme "JSON Type Definition" (JTD): a set of labels indicating standard structures. Only some JTDs are tabular; others need a rule to join/split them into rows... The best is to keep everything as "1 dataset per row".
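
For illustration only, a sketch of how dataset.meta might label each dataset with its JTD (all names below are assumptions, not the project's real schema):

```sql
-- Hypothetical metadata table: one row per dataset, with a JTD label
-- telling consumers how to expand the free JSONb into rows.
CREATE TABLE IF NOT EXISTS dataset.meta (
  id      serial PRIMARY KEY,
  urn     text UNIQUE NOT NULL,
  jtd     text NOT NULL,  -- e.g. 'tab-aoa' for tabular array-of-arrays
  kx_cols text[]          -- column names, when the JTD is tabular
);
```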


Reference Benchmark of v0.1.0

15 datasets, loaded as "many rows per dataset". In terms of disk usage, the whole dataset schema sums to a table_size of 6080 kB. If all datasets of dataset.big were translated into usual SQL tables, the cost would be lower (!), ~75% or less.

| id | urn | pkey | lang | n_cols | n_rows |
|----|-----|------|------|--------|--------|
| 15 | datasets-br:br_city_codes | state/lexLabel | pt | 9 | 5570 |
| 14 | datasets-br:br_city_synonyms | state/lexLabel/synonym | pt | 5 | 26 |
| ... | (see more at Wiki) | ... | ... | ... | ... |

| nspname | n_tables | total_bytes | table_bytes | table_size |
|---------|----------|-------------|-------------|------------|
| dataset | 4 | 8552448 | 6225920 | 6080 kB |
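
A per-schema summary like the one above can be obtained with a catalog query along these lines (a sketch; the exact query behind the Wiki numbers may differ):

```sql
SELECT n.nspname,
       count(*)                                     AS n_tables,
       sum(pg_total_relation_size(c.oid))           AS total_bytes,
       sum(pg_relation_size(c.oid))                 AS table_bytes,
       pg_size_pretty(sum(pg_relation_size(c.oid))) AS table_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'           -- ordinary tables only
  AND n.nspname = 'dataset'
GROUP BY n.nspname;
```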

Comparing JSON and JSONb

A test with fewer rows (before "datasets:world_cities" was added), but with good results, for the total usage of dataset.big:

  • JSONb with 16138 rows, table disk-usage 3496 kB.
  • JSON with 16138 rows, same datasets, table disk-usage 3432 kB.

So there is no significant advantage in using JSON, with only a ~2% gain. Better to stay with JSONb, a "first-class citizen".

Disk-usage reduction when all rows in one JSONb-array

Checking table usage for "datasets-br:br_city_codes":

  • All rows in one JSONb array: 8192 bytes ~ 8kB
  • One JSONb row per SQL row: 736 kB

Conclusion: "all rows in one" is ~10 times better!

Checking performance

No loss of performance, according to PostgreSQL's EXPLAIN ANALYZE: as shown in the Wiki, the jsonb_array_elements() function is very fast.
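
For example, a timing check of the expansion could look like this (a sketch over the hypothetical layout above; the field names are illustrative):

```sql
EXPLAIN ANALYZE
SELECT t.item->>'state'    AS state,
       t.item->>'lexLabel' AS lexlabel
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(item)
WHERE b.urn = 'datasets-br:br_city_codes';
```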


ppKrauss commented Nov 19, 2017

Now, with v0.2+, this has changed. Benchmarking the same 15 datasets, but now stored as the "tab-aoa" JTD:

| id | urn | pkey | jtd | n_cols | n_rows |
|----|-----|------|-----|--------|--------|
| 1 | (2)lexml:autoridade | id | tab-aoa | 9 | 601 |
| 4 | (2)lexml:evento | id | tab-aoa | 9 | 14 |
| ... | (see more at Wiki) | ... | ... | ... | ... |

Disk-usage:

| nspname | n_tables | total_bytes | table_bytes | table_size |
|---------|----------|-------------|-------------|------------|
| dataset | 4 | 1474560 | 73728 | 72 kB |

Total disk-usage reduction from 6080 kB to 72 kB, a factor of ~84 (6080 ≈ 72 × 84.4), i.e. ~8440%! Practically no disk cost for datasets!
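
As a sketch, a "tab-aoa" (tabular array-of-arrays) dataset expands into named columns by positional access (the layout and column positions below are illustrative):

```sql
-- Each element of the outer JSONb array is itself an array (one row);
-- the ->> operator maps array positions to named, text-typed columns.
SELECT a->>0 AS col1,
       a->>1 AS col2
FROM dataset.big b,
     LATERAL jsonb_array_elements(b.j) AS t(a)
WHERE b.urn = '(2)lexml:autoridade';
```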

ppKrauss changed the title from "Abandon use of rows, all datasets will be expanded by demand, and any other JSON can be added" to "Abandon use of SQL-rows representation, expand datasets to rows by demand" on Nov 20, 2017