Usually you would benchmark on difficult tasks, not the easiest one.
Batch IO operations are much faster than random IO and can easily saturate the network.
This benchmark uses a large object size, 64MB, to test. There is nothing new here. Most common file systems can easily do the same.
The difficult task is to read and write lots of small files. There is a term for it: LOSF (lots of small files). I work on SeaweedFS, https://github.com/chrislusf/seaweedfs , which is designed to handle LOSF, and of course it has no problem with large files at all.
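To make the contrast concrete, here is a rough sketch of a small-object vs large-object throughput test against any S3-compatible endpoint. It assumes boto3, a placeholder endpoint and credentials, and a bucket that already exists; it is not any particular benchmark's methodology.

    # Compare throughput of many small PUTs vs a few large PUTs.
    import os
    import time
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",   # placeholder S3-compatible endpoint
        aws_access_key_id="ACCESS_KEY",          # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )
    BUCKET = "bench"                             # assumed to already exist

    def put_objects(count, size):
        payload = os.urandom(size)
        start = time.time()
        for i in range(count):
            s3.put_object(Bucket=BUCKET, Key=f"obj-{size}-{i}", Body=payload)
        elapsed = time.time() - start
        print(f"{count} x {size} B: {count * size / 1e6 / elapsed:.1f} MB/s, "
              f"{count / elapsed:.0f} ops/s")

    put_objects(10_000, 4 * 1024)        # LOSF-style workload: many 4 KiB objects
    put_objects(16, 64 * 1024 * 1024)    # benchmark-style workload: a few 64 MB objects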
pradeepchhetri
Thank you for SeaweedFS. Curious if you have benchmarked SeaweedFS against LeoFS (https://github.com/leo-project) It seems like both have similar motivation.
chrislusf
Many systems look similar: Ceph, MooseFS, etc. The differences are in the details. It is not possible to benchmark everything, and each use case has a different access pattern, CPU, memory, and network. It's better to benchmark with your own use cases.
LeoFS's last release was about 3 years ago (v1.4.3, February 20th, 2019), and it seems more complicated. SeaweedFS is still growing and is being released on a weekly basis. Just ask if you need any new feature.
Also, think about how you will manage it, for example growing capacity. For SeaweedFS, you just need to add one server and point it to the master. That is it!
pradeepchhetri
Great. Thank you for the response. Curious if you recommend any particular filesystem for running SeaweedFS.
chrislusf
No preference actually.
SeaweedFS is portable, both the data files and the metadata.
pradeepchhetri
Thank you for answering my questions.
zurn
There are benchmarks for different purposes, but for users, if you're going to run just one benchmark, the best one is the workload you have, or anticipate.
twoodfin
I don’t live in this space, so maybe I have unreasonable expectations (or I’m reading the benchmark wrong), but is 10GB/sec of 64MB objects streaming out of each 96(48+HT?)-core hefty AWS server particularly impressive?
wmf
10GB/s is about the most you can expect from a 100G NIC (I know line rate is 12.5) and AWS makes you buy the whole machine if you want 100G so maybe they don't need all those cores.
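The back-of-the-envelope math, with the ~10% protocol overhead being just a rough assumption rather than a measured figure:

    # What a 100 Gbit/s NIC can realistically deliver.
    line_rate_gbit_per_s = 100
    raw_gbyte_per_s = line_rate_gbit_per_s / 8        # 12.5 GB/s on the wire
    assumed_overhead = 0.10                           # rough guess for Ethernet/IP/TCP/HTTP framing
    usable_gbyte_per_s = raw_gbyte_per_s * (1 - assumed_overhead)
    print(usable_gbyte_per_s)                         # ~11.2 GB/s, so ~10 GB/s observed is close to NIC-bound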
tyingq
You could compare to their December 2019 blog post:
https://blog.min.io/scaling-minio-more-hardware-for-higher-s...
It reads like they roughly doubled the read (GET) throughput for 32 nodes. Though I don't know how much of that would be AWS improvements vs MinIO improvements.
y4m4b4
AWS drive speed has not changed much, and neither has their network for the same instance type and model.
bharrisonit
I lost my mind for about 300ms trying to figure out how fast 1Tpbs was.
3np
Transaction / bit-second. The unit is normalized for bit-seconds to adjust for the fact that even with the same aggregate size, a single large transaction is different to a large number of small operations.
It was only when GCS reached 1.0 Tpbs that we started shifting workloads over.
However, these benchmarks are only relevant within the same region. As you move further away, roundtrips and speed-of-light will become significant enough that I've seen some people do benchmarks in Tpbms.
somebodythere
Terapit ber second.
MisterTea
If it involves Tp there's no better expert to ask than The Great Cornholio.
unlocksmith
lol, result of typing fast i guess
iJohnDoe
We use MinIO in production as a Git LFS storage backend for GitLab. Knock on wood, works well.
oskenso
We do the same, but with gitea as a backend, works well!
didip
Anyone have production war stories comparing MinIO and Ceph?
kakarotto
In my experience (I don't know if this is comparable, and I haven't kept notes): I tried min.io in December and switched to SeaweedFS a few weeks ago. My use case was a transition from local file storage to a DFS, and also enabling our developers to move from the local filesystem to S3. Since my resources are limited (vSphere VMs, 3 hosts plus different disks), I tried to set up a 3-VM cluster with MinIO first. After doing some research on different systems (Ceph, longhorn.io, ..), I wanted a system that was easy to set up and supports S3. I relied a lot on what other people had measured and chose min.io first because it supported mounting via S3.
Then I tried to copy over about 34 million files (mostly a few bytes each, but some up to 1GB), about 4.2TB in total. I tried different methods (rsync, cp, cp with parallelism, ..) and it took me about 3 days to copy over 300GB of data at best. I also found out that it was impossible to list files: we have one single folder with over 300k project GUIDs beneath it (and growing). After that I gave SeaweedFS a shot. The reason I did not use it first was that the documentation was a bit confusing and did not give me all the answers I needed as quickly as MinIO's did.
Now my SeaweedFS setup is a 3-VM cluster with 3 disks (1TB each) per VM. I configured a wireguard mesh (https://github.com/k4yt3x/wg-meshconf) between the VMs and configured the master and volume servers to talk to each other securely via wireguard IPs. I also configured ufw to only allow communication between the http/gRPC ports. I also configured a filer (using leveldb3) to use the wireguard IPs (master and volumes) and let it communicate with some specific servers on the outside (ufw).
After that I mounted the filer via weed.mount on that specific server and tried to copy over the same files/folders. After 2 days I had copied over about 1.5TB of the data via rsync. There was also no problem with file listing or with accessing the filer from different machines while uploading stuff. But there is an overhead when reading and creating lots of small files. File listing is even faster than local btrfs file listing.
Chris is also very nice and fast at fixing bugs.
chrislusf
Thanks for the details! 2 days with 1.5TB vs 3 days with 300GB, that is good to know! :)
btw: If you use "weed filer.copy", it should be much faster. Rsync needs to go through FUSE mount.
londons_explore
Is a 64MB object size really typical of the usecase most people will be using this for?
I would imagine 1kb objects to be far more common...
manquer
At 1KB or less I would likely store it as a blob in a traditional RDBMS or NoSQL store. It is not worth the extra network call to use object storage at 1KB size.
mastazi
I suppose it depends on the type of data? If those are e.g. video files, I don't suppose there are many 1kB files...
chrislusf
Aha, I replied above and you are saying the same thing!
Small objects, 4KB ~ 4MB, are very common in machine learning (audio, video, text), surveillance, game assets, etc. SeaweedFS sees a lot of usage in these areas.
dikei
No, object stores work better with a small number of large objects than with a large number of small objects.
1KB objects would be terribly inefficient if you have too many of them. It's slow even with a regular file system; copying 1 million 1KB files would take forever. For small objects, it's best to pack them into larger blocks before storage, like a database does with its data.
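As a toy sketch of that packing idea (not any particular system's actual on-disk format): one big append-only block file plus a small index mapping key -> (offset, size), so each read is a single seek.

    # Toy illustration of packing small objects into one large append-only block.
    class PackedStore:
        def __init__(self, path):
            self.path = path
            self.index = {}                    # key -> (offset, size)
            open(path, "ab").close()           # create the block file if missing

        def put(self, key, data: bytes):
            with open(self.path, "ab") as f:
                offset = f.tell()              # append position = current end of file
                f.write(data)
            self.index[key] = (offset, len(data))

        def get(self, key) -> bytes:
            offset, size = self.index[key]
            with open(self.path, "rb") as f:
                f.seek(offset)
                return f.read(size)

    store = PackedStore("block-0001.dat")
    for i in range(1000):
        store.put(f"obj-{i}", b"x" * 512)      # 1000 tiny objects, one growing file
    print(len(store.get("obj-42")))            # 512

Deletes would then be handled with tombstones plus a later compaction pass.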
chrislusf
Actually SeaweedFS does just that: it appends small objects into larger blocks during writes or updates, and runs garbage collection during compaction.
dikei
SeaweedFS sounds very interesting indeed, though I'm a bit concerned about the complexity. For example, there are so many filer metadata backend options that it's difficult for me to pick one.
I think the documentation should have clearer guidelines about best practices for deploying a SeaweedFS cluster in different situations: small cluster, large cluster, H/A cluster, etc.
chrislusf
This is a fair complaint. :)
For filer metadata, you should just pick the one you are most familiar with.
There is a wiki page for production setup: https://github.com/chrislusf/seaweedfs/wiki/Production-Setup
That's common for partitioned data. If you have a table or log of a couple hundred GB, you can split it into a bunch of smaller blocks for parallel processing. I think 64MB tends to be a good middle ground between the overhead of tons of tiny files and the inability to parallelize/shard with fewer, larger ones.
moreati
Possibly a difference in terminology. Minio is in the same space as S3, Ceph, and HDFS - sometimes called blob store or block store. Each object might be part of a log, video or other large file/stream.
I would call an object store in which the objects are mostly < 1 kiB a key value store.
FridgeSeal
What are you storing in an object store that fits into 1kb?
64MB is far, far closer to our usecase than 1kb. I don’t even think our schema files/metadata which are arguably the smallest part fit into 1kb.
cpursley
That's pretty impressive. Anyone using minio in production? What's the backup story?
rjzzleep
I've done Kubernetes consulting with some other people, and for their on-prem solutions I always recommend just buying a TrueNAS with its S3 API.
The other consultants I tend to end up working with always try to sell people on MinIO + rook-ceph and then offer support along the way. So basically you buy their Kubernetes deployment and then you have to pay them for the rest of your life for troubleshooting. The TrueNAS seems cheaper to me.
I don't see a good backup story, but maybe I just don't know it.
moondev
TrueNAS just packages MinIO, so it would seem to be the same result but with less flexibility and visibility into the cluster. Although if the strategy is to not hyperconverge storage, that could be desired, I guess.
rjzzleep
You are indeed correct, but MinIO actually has a lot of components. In the case of TrueNAS, they handle most of the storage-related components, with a MinIO object gateway running on top.
A pure Kubernetes deployment is more complex (although it's all part of the same binary, I think).
I could be completely wrong though.
moondev
In TrueNAS SCALE (based on Linux, not FreeBSD) they actually run most integrated and 3rd-party "apps" (including MinIO) inside a k3s cluster running on TrueNAS itself.
rjzzleep
Is TrueNAS Scale stable yet?
moondev
It seems to be under pretty active development but I do like their strategy of releasing early and often. I haven't run into any issues yet with it in my lab.
My biggest wishes for Truenas Scale:
1. NVMe-oF support (over RDMA or TCP)
2. aarch64 uefi iso
I could see Truenas Scale overtaking Proxmox soon in the KVM space, their API and UI are already more enjoyable to use IMO.
willis936
They're in RC now. In terms of data loss, I'm not concerned with what I've seen. Some features are still missing, such as SED control. Most of the things I wish were there aren't present on Core either, such as a VM VNC clipboard, solutions to the web UI logging out often when 2FA is enabled, and some mysterious directory-locking issue when pulling from Dropbox.
nwmcsween
So ZFS on TrueNAS/FreeBSD is in some weird twilight zone between great stability and horribly broken; we ended up with VM corruption after removing an L2ARC cache drive on TrueNAS.
I do agree that Ceph is much more complex, which is why I hope OpenEBS implements zfs-localpv send/recv for a poor man's replication.
carterschonwald
Could you elaborate more? I've been evaluating ZFS for my new home workstation and would like to understand what failure modes I need to be aware of.
nwmcsween
ZFS is still the best FS out there, but a good rule of thumb is to look over a project's open and closed issues and see how problems are handled, a sort of issue-driven assessment.
carterschonwald
Ah, like the hibernate/suspend and swap file issue for mobile computers?
ochoseis
Truenas with S3 piqued my interest. Does that use minio under the hood to serve objects? This doc indicates it might at least use the browser component:
https://www.truenas.com/docs/core/services/s3/
Havoc
Pretty sure it’s minio for everything.
Mainly use it for GitLab object storage currently: logs, Docker images, etc.
ignoramous
> I don't see a good backup story, but maybe I just don't know it.
Where do data-management SaaS companies like Rubrik.com and Druva.com fit in? Are they not a popular enough solution for securing min.io deployments?
hardwaresofton
Have you ever seen people using rook-ceph and the built-in object gateway Ceph provides? AFAIK Ceph's object store powers some official cloud solutions out there, like DigitalOcean Spaces.
rjzzleep
Yes, one of my clients uses it. It seems okay, but it's all pretty low volume and most people have no real concept of disaster recovery. I think backups and DR are things that most people take for granted and don't really think about.
hardwaresofton
Yep, reasonable -- most people these days don't have to, since the clouds do the hard work for them. The knowledge of how to run those kinds of systems becomes harder to find (in the community) by the year.
Also, by the time most people account for 2-3 copies of the data plus 1 completely offsite backup, I think their eyes might start to water at the cost if you want reasonable performance as well. I experiment with this kind of stuff a lot and am always surprised at how much more than you'd expect it costs to get drive-, node-, and region-level redundancy with backups. You essentially need a minimum of 3TB of raw storage for every 1TB of usable storage, assuming regular RAID1.
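The rough arithmetic behind that figure, assuming a RAID1 mirror plus one full offsite copy:

    # Rough math behind "at least 3TB raw per 1TB usable":
    usable_tb = 1
    raid1_copies = 2          # mirrored pair on the primary cluster
    offsite_copies = 1        # one full offsite backup copy
    raw_tb = usable_tb * raid1_copies + usable_tb * offsite_copies
    print(raw_tb)             # 3, before snapshots, growth headroom, or a second region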
willis936
That's the cost of privacy until there's a BYOPK cloud service.
bogomipz
I'm curious: is the use case for Kubernetes and MinIO mainly just backups, or are there other good use cases for Kubernetes and object storage that you are seeing?
ckdarby
Hedging bets against an S3 outage. A cloud-agnostic solution with an abstraction over object storage. Reducing the cost of S3 by using MinIO as a cache.
Plenty of good use-cases.
fnord123
Devs can also spin up a local object store to test with instead of hitting some common remote bucket.
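A minimal sketch of that workflow, assuming a MinIO container running locally on port 9000 with its stock dev credentials (adjust for your setup). Drop endpoint_url and use real credentials and the same boto3 code talks to S3 proper:

    # Point ordinary S3 client code at a local MinIO for tests.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",   # local MinIO, not AWS
        aws_access_key_id="minioadmin",          # default dev credentials, an assumption
        aws_secret_access_key="minioadmin",
    )
    s3.create_bucket(Bucket="test-bucket")
    s3.put_object(Bucket="test-bucket", Key="fixture.json", Body=b'{"ok": true}')
    print(s3.get_object(Bucket="test-bucket", Key="fixture.json")["Body"].read())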
outworlder
Of course it's used in production. We use it heavily.
It doesn't have to use local disks. A big part of what it does is provide an S3-compatible API, which you can use with any number of backends, even other cloud providers. Say you want to be able to deploy on prem, on AWS, and on GCP. You can put MinIO in front of all of these and your app won't care.
Of course, if you are writing to an actual disk you'll have to figure out the backup part.
71a54xd
I've only heard horror stories from people using this on real apps. The last one was a bug where MinIO wouldn't actually assert a file had been updated or something. Maybe they've improved since then, but I wouldn't call this "production ready" for the time being.
I was also surprised by many of the comments. I had only played with MinIO a bit, and was considering using it. The durability comments were concerning, and there were other issues too.
One example is this bug: https://github.com/minio/minio/issues/8873
Basically, it was treating object names with '//' in them, like foo//bar, the same as foo/bar, except for sharding, which treated them as different.
Their fix was just to disallow '//' in an object name, even though other S3-like implementations allow it.
mastazi
Given that MinIO seems to be the canonical way to add S3 API compatibility to Azure Blob Storage[1], this is not very encouraging.
We have a lot of products that use the S3 API, and Azure was the only cloud that did not offer that out of the box. At first I could not believe this was the case, because even smaller providers like Linode or DigitalOcean offer S3 API compatibility. But then I found the post linked below.
https://github.com/gaul/s3proxy
https://ventral.digital/posts/2020/10/11/s3-api-compatibilit...
It's good that there are more alternatives; this seems like it might be a lighter-weight alternative compared to MinIO.
Still, I wish Azure just added a built-in compatibility layer, like virtually every other cloud provider. I'm not a big fan of having to spin up a container just for this reason.
[1] https://cloudblogs.microsoft.com/opensource/2017/11/09/s3cmd...
upbeat_general
What is the primary use case for storage this fast — distributed training? Even for that is storage throughput a common bottleneck above say 10GiB/s?
nijave
It's also used in big data processing stacks. I think it's currently common to dump all your data in object storage in a structured format and use that as the data store with a query engine on top.
Even for something like just querying logs, you can churn through a few TB pretty quick by searching in parallel
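A rough sketch of that pattern, assuming an S3-compatible endpoint, a bucket of plain-text log objects, and placeholder credentials:

    # Grep a bucket of log objects in parallel instead of downloading serially.
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",   # placeholder S3-compatible endpoint
        aws_access_key_id="ACCESS_KEY",          # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )
    BUCKET, NEEDLE = "logs", b"ERROR"

    def grep_object(key):
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        return key, body.count(NEEDLE)

    keys = [obj["Key"]
            for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET)
            for obj in page.get("Contents", [])]

    with ThreadPoolExecutor(max_workers=32) as pool:
        for key, hits in pool.map(grep_object, keys):
            if hits:
                print(key, hits)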
manquer
If you are running an S3-compatible service in your stack at reasonable scale, you will end up serving 10s-100s of requests/sec in parallel with even 1000s of concurrent users, which could overwhelm a system if reads/writes are not fast enough.
Not everyone can use S3 directly, for various reasons: a mix of legacy/existing NAS systems, compliance/regulatory requirements, or not being on the AWS stack. But most libraries/clients support the S3 API, and even if you use GCP/Azure you can have an S3-compatible API with MinIO.
AWS S3 recently launched strong consistency.