Hacker News | vouwfietsman's comments

except you need flatbuffers to access that blob

Not sure why this got so many upvotes, also the landing page is not great, it's better to look at the paper (see link below).

Seems to be a columnar storage format that addresses some shortcomings in parquet. Thing is, though, that of all these formats the real winning feature is compatibility, which is (obviously) very hard to improve on, as anything new immediately loses.

Parquet is unfortunately very good just by virtue of being first, and so widely supported. The most widely used parquet version is the oldest version from 2013 (as per the paper itself), so parquet itself couldn't even supplant parquet. If you want to improve on it, you need to bring some serious results, which I don't think f3 does.

Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

Also also, it seems to go out of its way to include a compiled wasm binary for decoding, yet requires flatbuffers to parse that blob? Kind of defeats the purpose.

Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics. F3 seems to sacrifice fast analytics for the wasm decoder. I don't get it.

Maybe I'm being too cynical. Can someone help me out here?

https://dl.acm.org/doi/epdf/10.1145/3749163


> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

This is really more of an expectation that has been put on file formats by the query engines. Spark/Datafusion/DuckDB wouldn't really know what to do with a multi-table file.

> Parquet is unfortunately very good just by virtue of being first, and so widely supported

IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.

> Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics

Fast analytics, as well as newer ML-shaped workloads, are inherently a mix of batch scans and random access.

Some of the authors of F3 previously authored another paper that goes into the details of the shortcomings of Parquet:

https://www.vldb.org/pvldb/vol17/p148-zeng.pdf

All of the newer formats that popped up recently (Vortex, Lance, F3 now) have been working on solving the problems outlined in that paper.

Lance has some interesting ideas; Vortex focuses on extensibility and performance by replacing all of Parquet's black-box encoders with fully transparent encodings. This solves the tradeoff between bulk and element decoding, allowing you to have efficient full scans and really fast random access.

E.g. Langchain recently rebuilt a system that used to be all Parquet files to use Vortex and saw a massive speedup, which they talk about more here: https://www.langchain.com/blog/introducing-smithdb

Disclaimer: I work on Vortex, so a lot of these questions about "what is the point of building a new format" are things that I have grappled with myself.


> DuckDB wouldn't really know what to do with a

Sure it would, you can attach a multi-table sqlite database in duckdb

> that does not mean just because it came first

I agree with most of your points, I am not stating my opinion but my observations. I am the target audience here, I want to use this, but I don't really care too much about the file format itself, at least not as much as I care about the data inside.

That means access, which means compatibility with my tooling.

Compatibility is hard to beat.

This is the Concorde of file formats.


That is fair.

FWIW I think if you are just doing pure analytics and nothing else, Parquet will probably continue to do the job for you just fine, and you don't need to touch your workloads at all.

These new formats I think will find a niche where people aren't just running Spark jobs, but doing lots of systems building over large tables. If you're building a PB-scale data warehouse, you care a lot about the file format b/c it is a big factor in your performance curve, and you're willing to ship new experimental codecs in response to new datatypes you want to support that the system wasn't originally designed for, or you want to use a newly invented compressor.


Yeah that point about "random access is not the point of columnar formats" fell flat for me for this same reason. Almost since the first day I started using columnar data, I've been interested in solutions that strike this balance between batch and random access. This comes up all the time (in my experience) in data science / ML, where we have use cases for both access patterns against the same data.

So I'm with you, I'm very unconvinced that parquet (and the various things that are parquet or essentially-parquet under the hood) are the end of the line here.


> > Parquet is unfortunately very good just by virtue of being first, and so widely supported

> IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.

I think you and vouwfietsman (https://news.ycombinator.com/item?id=48649412) are actually saying the same thing in different words—I think their "unfortunately" means "it is unfortunate that, by virtue of coming first, this now has a support lead that will make it difficult for anyone else to catch up."


> my main gripe with parquet (single table per file) is not even addressed

I consider that simplicity to be a feature, not a shortcoming.

I just tar a bunch of parquets if I need multiple tables. It is beautifully simple and easy to read in any language with its tar and parquet libraries.
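A minimal sketch of the tar approach, using only Python's standard `tarfile` module. The function names `pack_tables` and `read_table` are made up for illustration, and the payloads are placeholder bytes standing in for real parquet files (a parquet library would produce and parse the actual contents).

```python
import io
import tarfile


def pack_tables(tables: dict[str, bytes]) -> bytes:
    """Bundle several already-encoded table files into one uncompressed tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in tables.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_table(archive: bytes, name: str) -> bytes:
    """Pull one member back out by name; raises KeyError if it is missing."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        member = tar.extractfile(name)
        if member is None:
            raise ValueError(f"{name} is not a regular file")
        return member.read()
```

The bytes returned by `read_table` would then be handed to whatever parquet reader the language offers.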


> Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.

When I was working with parquet, I imagined a .parquetz file format which was just a zip file containing any number of uncompressed parquet files. So you could sling multiple tables around in a single file, and still use range requests to access them.
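A rough sketch of why that idea works with range requests: when members are stored uncompressed (`ZIP_STORED`), each file is a contiguous byte range inside the archive, so a reader can compute its offset and fetch only that slice. The names `build_archive` and `member_range` are hypothetical, the payloads are placeholder bytes rather than real parquet, and the whole archive is held in memory here for simplicity (a remote reader would first fetch the central directory from the end of the file).

```python
import io
import struct
import zipfile


def build_archive(tables: dict[str, bytes]) -> bytes:
    """Write each table file into a zip without compressing it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in tables.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def member_range(archive: bytes, name: str) -> tuple[int, int]:
    """Return the (start, end) byte range of a stored member's raw data."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed; its bytes are not usable as-is")
    # The local file header is 30 fixed bytes; the file name length and
    # extra field length sit at offsets 26 and 28 within it.
    name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    return start, start + info.file_size
```

Slicing `archive[start:end]` yields exactly the original member bytes, which is the same slice an HTTP range request would return.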


>Not sure why this got so many upvotes, also the landing page is not great

Frankly it's a change from the usual ChatGPT generated slop that most landing pages are these days.


I don't think you have encountered the problems that this class of formats solves. Try looking up columnar storage formats, the pros and cons are pretty well defined these days. It is not meant for video decoding, indeed.

Knowing little about cpp modules and nothing about Gabriel Dos Reis, I expect a more design-by-committee type explanation for the result: the module system was probably a victim of having to be backwards compatible, abi stable, idiomatic, zero cost abstraction, be compatible with all weird cpp features, not hurt compile time, etc etc etc

I don't think it's fair to attribute it to lack of skills or bad intent, unless there's some proof of any of it.


It's not. Any interview question where you are looking for a specific answer is already suspect, but especially if you don't properly provide context for what you expect, things become a shit show.

If you would ask someone to write a piece of code, and a part of the problem is this conversion, then you would be right to expect they reach for a library, but even if they don't you would be giving them the opportunity to explain themselves, and you could judge the explanation, not the answer. Also, if your test is "does this person reach for a library at the right time", you could be a lot less esoteric and confusing by just asking them to add 10 days to a date. If you just ask this one specific problem, it is likely they assume you are looking for them to demonstrate the skills involved in actually solving the problem, i.e. leetcode.
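For illustration, the "add 10 days to a date" version of the question has a short answer with the standard library, here sketched in Python. The helper name `add_days` is made up for the example; the point is only that month lengths and leap years are handled by `datetime`, not by the candidate.

```python
from datetime import date, timedelta


def add_days(start: date, days: int) -> date:
    """Let the standard library deal with month lengths and leap years."""
    return start + timedelta(days=days)


print(add_days(date(2024, 2, 25), 10))  # 2024-03-06 (2024 is a leap year)
print(add_days(date(2023, 2, 25), 10))  # 2023-03-07
```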

This is also why some people give you the blabla answer, because it is indeed very unlikely that someone needs to do this legitimately. This is because it's a toy problem. Someone's professional reaction to the problem in isolation should indeed be: this is weird, I've never been asked something like this, what's up?

Finally, even though the question is terrible, I would still rate the "whatsup?" response higher than the "leapyear" response. I would want a developer to triple check that this problem needs solving, before they would solve it themselves.

Finally finally, if there's one answer to one question that, when answered trivially in a way literally taught in most basic programming courses (use the standard library / a third party library), makes them a "guaranteed hire", I also have significant doubts about the level of talent you are bringing in, as any experienced interviewer will tell you that qualified people will get important questions wrong, and unqualified people will get important questions right.

I understand that this reaction might be quite harsh, and I know better than anyone that it's hard and time consuming to do good interviews, but please consider that you are rejecting people who may be left very confused and sad by this way of being rejected.


"an", not "the" alternative.

Consciousness can be not-emergent but also not metaphysical, think sci-fi-type undiscovered physics or matter.


I'm not sure why anybody would at the same time be implementing SoA AND resizing 20 arrays for a single delete, those things seem to be on opposite ends of the "I care about performance" spectrum.


The point is that a simple SoA implementation requires this - each field in the monster struct is an item in 20 different arrays. So, removing one monster means removing that item from those 20 arrays.

Now, as others have suggested, you can have a more complex implementation, where instead of removing the monster's fields from those arrays, you just mark them as "dead" or whatever and then skip them when consuming the relevant arrays, with some relatively small extra bookkeeping overhead. Of course, this comes with its own drawbacks, especially if the number of monsters is very dynamic and you are memory constrained.

The point is not to say that SoA is never good for performance; it obviously and certainly is, probably even in most cases. It's just not always best for performance, that was all.


> So, removing one monster means removing that item from those 20 arrays.

Removing from an array is not the same as resizing, which is what I commented on. Resizing is a very deliberate, bad, choice.

If you need to support deletes, you can do this without resizing an array. Either by tracking object lifetimes and inserting tombstones, or by swapping to fill in deleted objects. Both of them retain good performance characteristics. Both of them are easy.
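A small sketch of both options, written in Python for brevity (a real SoA layout would more likely use flat arrays in a systems language). The class names and the `hp`/`x` fields are invented for the example. Swap-remove moves the last element into the deleted slot, so indices are not stable; the tombstone variant keeps indices stable and reuses dead slots.

```python
class SwapRemoveStore:
    """One list per field; delete by overwriting with the last element."""

    def __init__(self):
        self.hp, self.x = [], []

    def add(self, hp, x):
        self.hp.append(hp)
        self.x.append(x)
        return len(self.hp) - 1

    def swap_remove(self, i):
        last = len(self.hp) - 1
        for col in (self.hp, self.x):
            col[i] = col[last]  # fill the hole with the last element
            col.pop()           # drop the tail; nothing is shifted


class TombstoneStore:
    """One list per field plus an alive flag and a free list of dead slots."""

    def __init__(self):
        self.hp, self.x, self.alive, self.free = [], [], [], []

    def add(self, hp, x):
        if self.free:  # reuse a dead slot before growing
            i = self.free.pop()
            self.hp[i], self.x[i], self.alive[i] = hp, x, True
            return i
        self.hp.append(hp)
        self.x.append(x)
        self.alive.append(True)
        return len(self.hp) - 1

    def remove(self, i):
        self.alive[i] = False
        self.free.append(i)

    def total_hp(self):
        # Consumers skip tombstoned entries while scanning the column.
        return sum(h for h, a in zip(self.hp, self.alive) if a)
```

Neither variant shifts the remaining elements or reallocates on delete; the trade-off is unstable indices in the first case and a per-element liveness check in the second.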

This is not "simple vs complex", this is "I misunderstand vs understand SoA".


Although this is possibly true, at any time the dev team could've gone: "loading is slow man, can we just profile it and see if there's anything obvious?". To someone with access to the source code and a debugger, that's probably less than 30 minutes of time to go from zero to hero.

I've done this kind of stuff many times, and something like a json array taking minutes to parse would likely be very very obvious when looking at a trace.


The dev team is usually under immense pressure to deliver. This probably slipped under the radar to let all the other features get through.

The maintenance team probably maintains other games as well.

Then you add in the knowledge loss through the churning of developers and over time the organization forgets how things should be.


Again, you're not wrong, but none of those things are insurmountable. If the team is really so stressed it cannot spend 30 minutes, over the many years of the existence of that bug, that seems like a development environment close to hell.

Knowledge loss is precisely my point: there is very little a-priori knowledge needed to solve this, the guy who found the bug proves that.


Yea development environments are close to hell in game development, won't argue that.


Hint: maybe they would if they would unionize themselves.


It would be prudent if you would dig a bit deeper into how unions came to be. Long story short, capitalism can easily create situations where you agree to be exploited to some degree, to avoid being exploited to another, worse, degree.

