Upcoming webinar: Go beyond text with multimodal AI evaluations

13 d 04 h 47 m

How We Scaled Data Quality at Galileo

How we scaled data quality
Ben Epstein
Ben EpsteinFounding Software Engineer
15 min readDecember 08 2022

At Galileo, we had a simple goal: enable machine learning engineers to easily and quickly surface critical issues in their datasets. This data-centric approach to model development made sense to us but came with a unique challenge that other model-centric ML tools did not face: data is big. Really big. While other tracking tools were logging *meta-*data information such as hyper-parameters, weights, and biases, we were logging embedding and probability vectors, sometimes in the near thousand-dimension space, and sometimes many per input.

To add to that, early on in the life of our platform, we made an important and opinionated decision about the workflow of our users: Users want insights on all of their data, not random samples. And the types of insights we wanted to surface were intelligent, not simply averages and pre-built summary statistics. We wanted to be able to apply the research of cutting-edge papers in data-centric AI, in some cases doing our own research, often heavily based in NumPy, and scale that to users’ datasets. We needed our ML-research team to feel confident that the research they worked hard to achieve would not be stifled by scalability concerns, no matter the complexity.

Finally, we were certain that we wanted to surface these data-centric insights to users via a web UI and simple Python API client. When surfacing data via a UI and API server, responses must be fast. <100ms for nearly all requests.

To recap our requirements:

  1. We had to log lots of data (up to tens of millions of high dimensional vectors per dataset)
  2. We wanted to surface insights on all of the users’ data, not samples
  3. We wanted to build custom insights that would derive true value to our users, often based on new research
  4. We needed these insights to be available with low latency, < 1 second, typically <100ms.

This is an interesting set of requirements and posed a very exciting engineering challenge. How do you scale computations to millions of data points in high-dimensional space with exceedingly low latency?

Note: There were other architectural challenges handling this data size, but in this article we’ll be focusing on computation.

Now, it should be noted that there are many tools out there to handle big data: notably Dask and Spark. These tools are strong, scalable, and (sometimes) easy to use via high-level APIs, but they have a critical flaw for our use case: they are latent. Try averaging some compound, aggregate calculation with Spark, it is slow. Not (typically) on the order of minutes, but on the order of seconds. And when you are trying to surface insights to a UI, seconds are unacceptable.

It was clear we needed a different approach.

Out of Core data

The first idea was to use a tool that could handle out-of-core data. If we could operate on the dataset using a [memory-mappable](https://www.mathworks.com/help/matlab/import_export/overview-of-memory-mapping.html#:~:text=Memory-mapping is a mechanism,way it accesses dynamic memory.) file format (like arrow or hdf5) and quickly process the data in chunks, we could bridge the gap between experimentation and production quite nicely.

Pandas

The most basic approach was to use pandas with chunking. This would allow us to easily scale the familiar syntax of pandas and NumPy to out-of-core datasets. But this idea soon proved too slow. Firstly, pandas doesn’t have native ways to apply lazy execution or multithreading. And even with custom multi-threading code, pandas is implemented mostly in python (vs C++) and so was locked by the GIL.

Pyarrow

With pandas out, we moved to the source and tried pyarrow. The stated goal of this project “is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process and move data fast.” - docs In terms of performance, arrow proved to be quite fast at mapping through a native file, and even performing basic arithmetic operations. The obvious and huge benefit here is that this can all operate with zero memory. This means that even 100GB files can be processed as if the data was an in-memory dataframe without needing 100GB of ram.

The core issues we found with pyarrow were

  • Lack of documentation for custom functions (ie with NumPy)
  • Lack of community support

It was extremely difficult for us to even begin with creating and mapping through arrow/feather files, and there wasn’t much we could find in terms of community to help us with issues.

But the biggest blocker was the lack of support for multidimensional NumPy arrays. Had we known this in the onset, we could have skipped this as an option altogether. This has been on the docket for a long time, and doesn’t seem to be gaining traction (see 1, 2, 3).

At this point, we determined that a new direction was necessary. Instead of focusing on expanding pandas APIs, we should focus directly on NumPy, as this was the core tool our ML team wanted to utilize.

Numpy mmap

Our first stop was to (again) go to the source and try NumPy's memory mapping. We quickly turned away from this option because files were stored in .npy which aren’t really usable by anything else. Although NumPy was a major component of our stack, we didn’t want to be stuck.

Hdf5 and H5py

Our next first stop on this journey was hdf5 and H5py. HDF5 is a popular file format for memory mapping, with very fast performance, and some level of interoperability with NumPy. It is fast, has some support on S/O, and allows us to chunk across NumPy arrays and apply custom functions while maintaining zero-memory copies, and hdf5 is a sufficiently popular format that other tools could use as well. The challenge here was that the API was quite low level, and left a lot of management to us.

We were looking ideally for a higher-level API that provided similar functionality.

Vaex

All of this research finally led us to a little-known project, which (in my opinion at least) has huge potential; Vaex. Vaex, in their own words, is an

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀 ~ https://github.com/vaexio/vaex

That sounds pretty good. My first question was about ease of installation. Most tools harnessing out-of-core and distributed compute need clusters and configs, but Vaex’s philosophy seems to be that you shouldn't need clusters nor configs to scale to billions of data points (as mentioned in their blog comparing Dask with Vaex).

pip install vaex

Well, that is certainly easy.

So how “high level” are these APIs? Well, pretty high level. I wanted to get started quickly, so I checked out their getting started documentation which was a pretty good primer.

import vaex df = vaex.example() display(df) >>> # id x y z vx vy vz E L Lz FeH 0 0 1.2318684 -0.39692867 -0.59805775 301.15527 174.05948 27.427546 -149431.4 407.38898 333.95554 -1.0053853 1 23 -0.16370061 3.6542213 -0.25490645 -195.00023 170.47217 142.53023 -124247.95 890.24115 684.6676 -1.708667 2 32 -2.120256 3.3260527 1.7078403 -48.63423 171.6473 -2.0794373 -138500.55 372.2411 -202.17618 -1.8336141 3 8 4.715589 4.585251 2.2515438 -232.42084 -294.85083 62.85865 -60037.04 1297.6304 -324.6875 -1.4786882 4 16 7.217187 11.994717 -1.0645622 -1.6891745 181.32935 -11.333611 -83206.84 1332.799 1328.949 -1.8570484 ... ... ... ... ... ... ... ... ... ... ... ... 329,995 21 1.9938701 0.7892761 0.2220599 -216.9299 16.12442 -211.24438 -146457.44 457.72247 203.36758 -1.7451677 329,996 25 3.7180912 0.7213376 1.6415337 -185.9216 -117.250824 -105.49866 -126627.11 335.00256 -301.837 -0.9822322 329,997 14 0.36885077 13.029609 -3.6339347 -53.677147 -145.15771 76.7091 -84912.26 817.1376 645.8507 -1.7645613 329,998 18 -0.112592645 1.4529126 2.1689527 179.30865 205.7971 -68.75873 -133498.47 724.00024 -283.69104 -1.8808953 329,999 4 20.79622 -3.3313878 12.188416 42.690002 69.204796 29.542751 -65519.33 1843.0747 1581.4152 -1.1231084

Alright, easy again. How about a simple calculation? They say that the data is represented internally as NumPy arrays, so I started with some basic arithmetic.

In [4]: %time df["mult"] = df.x*df.y CPU times: user 164 µs, sys: 5 µs, total: 169 µs Wall time: 173 µs In [5]: df.mult Out[5]: Expression = mult Length: 330,000 dtype: float32 (column) --------------------------------------- 0 -0.488964 1 -0.598198 2 -7.05208 3 21.6222 4 86.5681 ... 329995 1.57371 329996 2.682 329997 4.80598 329998 -0.163587 329999 -69.2803

That’s fast. But this dataset isn’t huge, it’s 330,000 rows. I wanted to see it really scale against my Macbook pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2400 MHz DDR4). I needed to generate a lot of data. I found that the easiest way to generate huge amounts of data was actually with vaex itself.

First, I wrote the file to disk so I could memory map it, then I concatenated it together and rewrote it to disk

df.export('sample.hdf5') df_disk = vaex.open('sample.hdf5') In [9]: %time df_huge = vaex.concat([df_disk]*5000) # 1,650,000,000 rows CPU times: user 1.13 s, sys: 21.8 ms, total: 1.15 s Wall time: 1.15 s df_huge.export('big_sample.hdf5')

This was pretty quick, roughly 5 minutes. And it generated around 70GB of data, 1.65 billion rows.

Now I could really test it out.

In [2]: %time df = vaex.open('big_sample.hdf5') CPU times: user 308 ms, sys: 140 ms, total: 448 ms Wall time: 1.21 s In [3]: %time df["mult"] = df.x*df.y CPU times: user 606 µs, sys: 193 µs, total: 799 µs Wall time: 1.38 ms

That didn’t seem possible, so I went back to their documentation and read up about lazy execution and quickly realized that vaex wasn’t actually doing the calculation (other than the first and last 5 rows), it was creating an Expression that would only be evaluated when necessary. If I asked for the mean of that column, or wanted to filter by it, only the rows necessary within that column would be calculated.

In [4]: df.virtual_columns Out[4]: {'mult': '(x * y)'}

This is a powerful concept. You can create many of these expressions, chain them together, and execute them all in parallel, only when needed.

In [5]: %time df['multi_sq'] = df.mult ** 2 CPU times: user 132 µs, sys: 0 ns, total: 132 µs Wall time: 139 µs In [6]: df.virtual_columns Out[6]: {'mult': '(x * y)', 'multi_sq': '(mult ** 2)'}

But I wanted to see how fast that calculation was, so I took an average.

In [10]: %time df.multi_sq.mean() CPU times: user 14.8 s, sys: 27.5 s, total: 42.3 s Wall time: 15.7 s Out[10]: array(5785.81929326)

Not quite instant, but pretty fast nonetheless. This was the average of the square of the product of 2 columns, on...

In [11]: len(df) Out[11]: 1650000000

...1.65 billion rows, all done on a Macbook pro. Reading through their examples, tutorials, and API documentation, I learned about caching (among many other things). Vaex is pretty powerful (and magic at times) when it comes to caching, in that it caches calculations.

In [16]: %time df.multi_sq.mean() CPU times: user 5.3 ms, sys: 7 ms, total: 12.3 ms Wall time: 12 ms Out[16]: array(5785.81929326)

Note that if you want that functionality, you have to set your environment variable or run vaex.cache.memory_infinite() # cache on globally before running the first calculation.

After making that call, Expressions can be cached and return results nearly instantly. This will use some RAM, but it may be worth it for important (server-side) calculations.

Another way to make statements faster is using JIT (Just In Time) calculations using numba. It also works on GPUs so your code is nicely portable.

At this point Vaex is performing pretty well for what we need. The next requirement was using it for arbitrary NumPy functions that we created during our ML research.

The first piece of their documentation that helped here was a section called Extending Vaex. The easiest functionality here is using the vaex.register_function() decorator.

In [5]: @vaex.register_function() ...: def complex_function(a, b): ...: res = 2 ** (np.sqrt(a), np.sqrt(1-b)) ...: return res ...: In [6]: # Same as df.func.complex_function(df.x, df.y) or df.x.complex_function(df.y) ...: df['complex_function'] = complex_function(df.x, df.y) In [7]: df['complex_function'] Out[7]: Expression = complex_function Length: 1,650,000,000 dtype: float32 (column) --------------------------------------- 0 1.50797 1 nan 2 2.40071 3 nan 4 2.17844 ... 1649999995 nan 1649999996 nan 1649999997 nan 1649999998 nan 1649999999 2.54421

It’s also worth noting that for simple arithmetic you don’t even need to register functions like this. Basic arithmetic (and some NumPy functions) can be applied directly against Vaex Expressions (columns).

In [18]: df['x_1_simple'] = df.x + 1 In [19]: df.x_1_simple Out[19]: Expression = x_1_simple Length: 1,650,000,000 dtype: float32 (column) --------------------------------------------- 0 2.23187 1 0.836299 2 -1.12026 3 5.71559 4 8.21719 ... 1649999995 2.99387 1649999996 4.71809 1649999997 1.36885 1649999998 0.887407 1649999999 21.7962 In [22]: df['x1_sqrt'] = np.sqrt(df.x) In [23]: df.x1_sqrt Out[23]: Expression = x1_sqrt Length: 1,650,000,000 dtype: float32 (column) --------------------------------------------- 0 1.1099 1 nan 2 nan 3 2.17154 4 2.68648 ... 1649999995 1.41204 1649999996 1.92824 1649999997 0.607331 1649999998 nan 1649999999 4.56029

The tricky thing here is that there is not a well documented list of the NumPy functions available natively to Vaex. I reached out to the creators of Vaex (who are incredibly helpful and nice) and they explained to me that they need to rewrite every NumPy function that they want to support by hand, so it’s not an automatic guarantee. They pointed me to this issue for a list, but it’s quite old and isn’t necessarily up to date.

But for very custom functions that cannot necessarily be implemented by default (like a dot product) you would likely utilize the register_function decorator anyway to create more modular code. Where this became very important for us was with high dimensional arrays.

We noticed that vaex is fairly slow with very high dimension dataframes (ndim > 700). But Vaex was significantly faster with single columns of high dimension vectors.

ie instead of a dataframe with 1000 columns:

In [25]: data = {f'col_{i}': np.random.rand(1000) for i in range(1000)} In [26]: vaex.from_dict(data) Out[26]: # col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 col_11 col_12 col_13 col_14 col_15 col_16 col_17 col_18 col_19 col_20 col_21 col_22 col_23 col_24 col_25 col_26 col_27 col_28 col_29 col_30 col_31 col_32 col_33 col_34 col_35 col_36 col_37 col_38 col_39 col_40 col_41 col_42 col_43 col_44 col_45 col_46 col_47 col_48 col_49 col_50 col_51 col_52 col_53 col_54 col_55 col_56 col_57 col_58 col_59 col_60 col_61 col_62 col_63 col_64 col_65 col_66 col_67 col_68 col_69 col_70 col_71 col_72 col_73 col_74 col_75 col_76 col_77 col_78 col_79 col_80 col_81 col_82 col_83 col_84 col_85 col_86 col_87 col_88 col_89 col_90 col_91 col_92 col_93 col_94 col_95 col_96 col_97 col_98 col_99 ... col_900 col_901 col_902 col_903 col_904 col_905 col_906 col_907 col_908 col_909 col_910 col_911 col_912 col_913 col_914 col_915 col_916 col_917 col_918 col_919 col_920 col_921 col_922 col_923 col_924 col_925 col_926 col_927 col_928 col_929 col_930 col_931 col_932 col_933 col_934 col_935 col_936 col_937 col_938 col_939 col_940 col_941 col_942 col_943 col_944 col_945 col_946 col_947 col_948 col_949 col_950 col_951 col_952 col_953 col_954 col_955 col_956 col_957 col_958 col_959 col_960 col_961 col_962 col_963 col_964 col_965 col_966 col_967 col_968 col_969 col_970 col_971 col_972 col_973 col_974 col_975 col_976 col_977 col_978 col_979 col_980 col_981 col_982 col_983 col_984 col_985 col_986 col_987 col_988 col_989 col_990 col_991 col_992 col_993 col_994 col_995 col_996 col_997 col_998 col_999 0 0.10236124964758109 0.5093856465385784 0.4450118974027252 0.5092643521328203 0.8176919333891499 0.11300546450925608 0.21321012777326298 0.08352135721351028 0.6796178285902513 0.012929958485977444 0.22515311731098286 0.6828144130361411 0.6901917730045839 0.5560968183239383 0.6074617111999907 0.8595827647775355 0.6048585780926642 0.015329665764219902 0.6781643858236135 0.005248547247437285 0.8543775515603313 0.8042144795471172 0.7337608049214995 0.4023851879024728 0.6245408094256124 0.8222147995940808 0.5384389102720495 0.6058104107738097 0.9285931550340868 0.9563374945757057 0.9005045528860056 0.5818852493340096 0.12026368408757693 0.527849751846411 0.3214024737525264 0.48835923262224035 0.28392240587508744 0.5005358139484353 0.9651430779143012 0.686300347721664 0.8906037662317768 0.22377266201980184 0.7792339824696397 0.7355104392733579 0.6201909866512261 0.6222427632525324 0.25967195596847903 0.8783709062472626 0.10543430615315308 0.26732758935283296 0.8476189813221596 0.9011486401987662 0.26799364904684675 0.7553849794005691 0.508074687995959 0.5257981595338108 0.49133310273769126 0.5948126668053267 0.1707970886774216 0.12493764853405931 0.41833978417541584 0.5424827332476311 0.5761292592973504 0.24790773482981165 0.673096364610964 0.33920241537604645 0.6842993550391915 0.9721776573565359 0.417612073864266 0.7963551664037282 0.8797586308377476 0.07843417715007694 0.3478324039710695 0.4544032514925789 0.40421519380950177 0.6871261289874017 0.15819514129232393 0.5303505001906076 0.8336323649562504 0.6705252083849176 0.7278457692881124 0.3491395656723967 0.003311788812994876 0.9148015950274275 0.5857419584613083 0.009833469023059038 0.2955402203950933 0.5322045995658724 0.9315940100999847 0.8338430388955584 0.5186147947825339 0.5690069513860403 0.0273828247962552 0.8428522830292322 0.05001437394759567 0.32280925612044353 0.07462966625813838 0.18596872923153918 0.4255941263163725 0.9512066642287188 ... 0.18743807893198283 0.16681776558287886 0.6971753602093745 0.3723651076299924 0.6888470641463975 0.2940915067155421 0.94162804520905 0.4689981513601845 0.2184335509980626 0.6191621363650976 0.6197226231126712 0.8820401571261538 0.6379956806820923 0.7560544494480148 0.04650538637504176 0.7416414536246898 0.6883044645354635 0.5780945028524938 0.27635046247406114 0.920697183642927 0.06646840768280127 0.5523521909176711 0.6398722619947766 0.9986966479839666 0.707098940880993 0.13937526224441654 0.29067885288309503 0.9779678413470883 0.5152215919490587 0.2332653915255869 0.7236669980080881 0.5955192298304886 0.19031409102832952 0.877797245174134 0.556531695119583 0.5792936845420723 0.8964322562565852 0.1499073069727358 0.6646909109400044 0.18016769668740373 0.28175344525422064 0.4844379260815278 0.49397135167290784 0.1168269437700239 0.7003081369704629 0.1745103596493851 0.6094110431597037 0.737346457318785 0.834143682350943 0.31600821032315585 0.2497302372687329 0.3240265445009767 0.7941781141666877 0.1312120767576066 0.8746047461652912 0.1379834203733742 0.5606148379486746 0.33944711644566594 0.0719060042264682 0.19639864080048552 0.06566624962418388 0.19289410898917536 0.887643814648603 0.3654240965764409 0.18465521466813128 0.10206996494010334 0.5458093149957672 0.35453688111764614 0.12999228965960685 0.9271414354876247 0.29093774636217096 0.1494228345264782 0.13572427195072068 0.48550284525465504 0.7699659809308728 0.8634499799105949 0.30382627768890735 0.7574678832422965 0.8979207078553882 0.7361179617490662 0.6303541678411881 0.15016556439634277 0.9733750467110581 0.7937096962422229 0.17258193445465053 0.368448412632999 0.12175690675196671 0.2569569422928316 0.9627884204744751 0.9019971211768816 0.08979570288915362 0.41056231096627593 0.412638468400812 0.8637071336433553 0.04179366769332804 0.05850086991090864 0.22195314339414285 0.021956578058680454 0.3654184564492047 0.624397 1 0.7444674080840511 0.038125037314928445 0.6972398785928228 0.5642519063439951 0.7265497380796555 0.4190654123893448 0.3481386729977226 0.7830273329804776 0.35441998947542663 0.4010935483526865 0.24371354657180222 0.6571317757856668 0.0024845743552923683 0.10768457211762361 0.6775099891802148 0.9284035214101235 0.5952840630367233 0.6324285838225122 0.03053824054083476 0.32801943361424435 0.9167522090192899 0.0025995733454202696 0.8162634161081429 0.5779951215729106 0.1391775356865228 0.43010104220389744 0.3402555816188717 0.13128937473983315 0.1089700161878826 0.04698651729493453 0.5297287265909014 0.6879743884058845 0.8832868579376465 0.34204605297802226 0.12645591097340825 0.5709369840817028 0.9680280587740105 0.25384253675517754 0.9093264728098487 0.8309734450302667 0.8336294250590759 0.25179400857355094 0.5751599830003564 0.5108045056702686 0.7358526394438336 0.8760407054045165 0.24283061550066254 0.4877153309929063 0.5853468436847877 0.45871133406345554 0.39598311061714997 0.9446122661631118 0.20197095375128038 0.4103536667150721 0.9265118997263504 0.026743457754164424 0.2995502995014635 0.2391580410717976 0.9317358051552563 0.9285645948081863 0.4259724276456438 0.2964039378891329 0.11414314679088244 0.2993634349986771 0.7883457016173718 0.42920421549593424 0.7791050472345304 0.549479311182047 0.8042013315357791 0.343965678912597 0.3893568835750808 0.7797481113847672 0.9133698445297451 0.05735467353801105 0.38668220811747056 0.9064391759159385 0.9810029811347212 0.35092539807463874 0.00705815089919859 0.11591360554033348 0.5781404768517193 0.9623255018754578 0.7365343228764178 0.12115943420456432 0.7512033626239712 0.38500295849964106 0.4881583598533007 0.46931115991887096 0.8677107750737915 0.48278770362324364 0.7656320216797317 0.5778790945480846 0.3283635270309929 0.04973179426560159 0.6246895540817036 0.5422428579792903 0.41355115395513675 0.5625648143759421 0.5990605239320373 0.07558842510131436 ... 0.05667501468917058 0.5551003974631262 0.20552983713800277 0.075587693083072 0.7424717986889655 0.5713218704004218 0.14649313226891358 0.8660961728662631 0.4127393956982214 0.5266816387242591 0.6284136178213762 0.45534795948648776 0.7885228894784916 0.5214059110930684 0.778596711658364 0.46311565980995184 0.5578852198836819 0.10784486921411485 0.3551602761182733 0.7398583413699436 0.4007699175782392 0.30740222815236184 0.9125100881245967 0.41073941325080354 0.7728832113324277 0.8226690330124444 0.8788121043099925 0.39465566927080153 0.3625104939203264 0.29324001548134926 0.6818539268864521 0.060510406664049166 0.30706837148670507 0.31931813315082447 0.4082156799004223 0.7965827788481494 0.17494684859409804 0.48581762647667814 0.10132427631239682 0.9022173095917605 0.5824019617996457 0.5030642691642817 0.6998235914911094 0.6766157076430104 0.025172598979209204 0.97221959459203 0.4313110476548855 0.3902958101073807 0.8496768699717788 0.6282107982696954 0.8230652304739616 0.2740508738119244 0.418892065282214 0.11222835201475512 0.20942642453415217 0.908652359488786 0.31074657551589324 0.3529094204299944 0.12796994796996364 0.7868247239913768 0.26577819751139486 0.8694922321945354 0.18630568406518688 0.1598520449071007 0.12254827642868227 0.3804362033447326 0.7617115830426092 0.3419933338084066 0.951535522674099 0.8478866145989938 0.09611743989823773 0.7364221183045315 0.5070908698511747 0.2274880103998802 0.09690455937106868 0.4871972023901382 0.57143067880997 0.5364320020052818 0.8963898604051563 0.392596831897158 0.48557617729856517 0.7114179735777983 0.8301790371975598 0.08565365199437625 0.8552620766652509 0.8253043835019983 0.540784452658269 0.3514244066364576 0.8806503702697065 0.07425220803720955 0.9647523539823388 0.46392800229810305 0.2296719794242027 0.94237467745118 0.9362734291888001 0.6889343008041674 0.29660511488885044 0.4230581209919745 0.1717309898081565 0.707116 2 0.6315538619242901 0.36523604499405926 0.07149018246460259 0.7636607203832279 0.47330783722928715 0.06905905364542775 0.9037829073289766 0.058724618166836384 0.25493509544475956 0.987683084397542 0.35246017192200485 0.5944726526690071 0.26450237572575286 0.1008025570563944 0.7663139277444105 0.6032380115300847 0.8184673112011263 0.21558641476257467 0.4837113434875818 0.6088912443048751 0.396770528884431 0.582090226826891 0.27268301930155725 0.010150217927483718 0.5432134467412665 0.9706617771966776 0.7779513211282605 0.8475784050461468 0.6473021273980433 0.35313553241684836 0.054372719276033266 0.327774467409521 0.5358845585199206 0.06256140489240847 0.4287869938713348 0.3894715150619884 0.21010193479853756 0.2963604316149204 0.386694083560131 0.8354146095224646 0.038423354775011465 0.4117119052701489 0.5130061643552128 0.1358528536409842 0.6704381174897541 0.4579887733620954 0.16954694222602795 0.8418215730615934 0.8574668749391636 0.6134676945954285 0.12542697923766255 0.8661811932880458 0.7752323797484241 0.5976743597572747 0.45208824719666296 0.7129564079842378 0.9701120937726063 0.4508595309008526 0.4873759256852327 0.853940803313543 0.3989323113293908 0.4818228760086397 0.11817070791300721 0.7903000198909989 0.7516359003228686 0.020109102558922842 0.7967897054570408 0.5345178584580762 0.5827590598078103 0.05083109768102467 0.7236517826636262 0.9427751195206849 0.9786115243825323 0.24635757247458334 0.9005194754704431 0.9591461843013732 0.1544758629299936 0.22547464984720145 0.36762075146527484 0.7110786195377183 0.7757730955924954 0.39101668030915826 0.0066971678862367545 0.6164109989283236 0.034026787909574585 0.9817391259088636 0.981424807786541 0.823417379872224 0.25109148353614086 0.051371751005168664 0.2103759733282885 0.9784164043892631 0.07066204317074454 0.019653926493952145 0.4806583369705566 0.567580849249369 0.025866073488327368 0.6014672538916521 0.8561014325155355 0.6413802677985652 ... 0.131501963621539 0.9305816639579072 0.3840192453213156 0.5142322970829013 0.09209103990941658 0.3960126921548217 0.0534686476482743 0.47500153094527164 0.33828347289265603 0.09804251239986506 0.16544040829389994 0.14364213102495793 0.39611532470495436 0.7700472949773322 0.5211641086012984 0.23308229948765802 0.39759426550846466 0.2766471986340836 0.5748818228954073 0.7471777730439048 0.42397838423364853 0.28535999593158135 0.8827568298402996 0.7896569000312033 0.12168530843979242 0.40761234277162584 0.8780122732542995 0.9833038729804873 0.667371538075843 0.466278941691452 0.9125105131150826 0.16655982790658275 0.6162615513458957 0.329374288806754 0.5696528729472417 0.38749298458507875 0.8685431333348926 0.22153595883131927 0.8448402670467022 0.11902992827669279 0.8438802437750287 0.21203734940521446 0.36235645564322094 0.6835503569104479 0.9763791952754448 0.34499406245544095 0.6755980886121992 0.5950433716270467 0.4264578585704152 0.09642107814634748 0.7781712141093678 0.580145757608044 0.6881049228799347 0.6366706305157707 0.4226981372948674 0.9071396676959904 0.3877206157412021 0.4478225839984452 0.812290073604579 0.6611838012785275 0.09744890574811427 0.8647716376763634 0.7753416827582026 0.48947568764152083 0.11838136233159546 0.96138614318313 0.9031802837073903 0.36040283494744996 0.5173087100374859 0.30253933489711526 0.3138943108335833 0.1525223764049164 0.7969598980179063 0.5435719141269676 0.3490202984097611 0.5826674307399606 0.06406047531563297 0.8483498902948997 0.23892544756615297 0.6801566568637734 0.9242520219750973 0.2243478787504436 0.14022144749196985 0.42212664768746433 0.4007484893931045 0.714756735694014 0.590099406467585 0.5005126701102954 0.043544748826485846 0.8329202956133643 0.3048173987521392 0.5114562181161537 0.39986051204734685 0.22611830276095757 0.250396145485104 0.036371171328681506 0.4558872865118475 0.4255334408526902 0.8372237277897949 0.178605 3 0.11632951669801883 0.6092589427546389 0.9182630520327433 0.7578651309949788 0.2803800012810166 0.5531478959139593 0.4748774422855029 0.8137238878841718 0.528136706185007 0.5967839756407354 0.4169356708773473 0.005508920912039028 0.09381500143325672 0.5714901878759199 0.10822888327383973 0.5180290724155723 0.20683640977530282 0.7684807255844723 0.626255159564085 0.01891197235954667 0.489719919797987 0.8460708781749993 0.27310208491510146 0.34384575656551364 0.6055587475505917 0.4290109823195921 0.844234420442497 0.014219694683537787 0.3769108949933422 0.20973433285691867 0.7236363249981457 0.1540660968918669 0.9533219495106533 0.04645904250772259 0.8661224902801138 0.8102078196714274 0.19868470768122226 0.443075375215751 0.10271859208930423 0.5945561629781723 0.07730329365117494 0.22372071541252891 0.24866718341662397 0.7503178182909354 0.6670818945527296 0.6629259976868892 0.9452784880808593 0.47568058279655867 0.5355656720919826 0.8985116320967412 0.34012410305353913 0.4340383300355648 0.05544362569331163 0.10729290827789706 0.46217332478649187 0.9269589928109284 0.1267574341983424 0.2571055486544277 0.9027815181606281 0.15247939760628848 0.760103160216365 0.4849830517943493 0.7173509395535965 0.19597885029929363 0.8482994812083132 0.7219092419177644 0.10580970428501224 0.9869843360910071 0.17628137060649363 0.9285991623341531 0.26500067460711674 0.6844708458065272 0.37497117862237883 0.1928261204563385 0.04024622614811957 0.06780846010960973 0.17139492560449665 0.881539716654944 0.6060797215436575 0.47514138891596314 0.2204244703099042 0.7786508273506784 0.4961898136968973 0.5861814310620578 0.11934578076514735 0.4625664710693329 0.02638839960632222 0.19787249918937977 0.7689995676148436 0.8229992122407902 0.970985823438296 0.11792092397585818 0.9038534972183276 0.7770959622882267 0.2628499996458207 0.5001650940936321 0.10389569076924077 0.7555572389008308 0.10071300115198367 0.04934915637920689 ... 0.44620581034422124 0.34526584365841806 0.730806145621877 0.13026112063874984 0.9803206483405563 0.15974612113060893 0.5597774006319258 0.8287455847722581 0.33381005677108444 0.5047093909862811 0.05907449632373096 0.5449283767603322 0.33552833284096095 0.9008769994934854 0.4835777386485297 0.05100677082680205 0.6721317309890518 0.6466950107360327 0.7064190511480856 0.23380306776033488 0.030289778233268727 0.3110185778140617 0.47988946209327743 0.48485780689830693 0.5700587960846786 0.12279076044898696 0.6436079944673152 0.3241691000273591 0.10706815109876766 0.2654615040052364 0.5010135103636926 0.04186628623038868 0.1375208773361405 0.5546624298865837 0.6844195698964805 0.5088546891326609 0.10082192008301527 0.30622760747953204 0.1352482594946267 0.9801050468396374 0.46532586627031336 0.21999534419334044 0.6773486778116312 0.3529270186539115 0.4509353879535154 0.8371751776195583 0.6854072564970777 0.3331430791628005 0.24870584106095472 0.00896751461040357 0.30352308174505493 0.7233002967467707 0.3425518777466613 0.9162664699540166 0.9228722612113137 0.42116147084405764 0.4546700200684083 0.09387424209804462 0.828996944699913 0.351291005459085 0.050723005772696395 0.3886685529685868 0.20463838524951117 0.4509596207730824 0.9721908762958147 0.755256883959392 0.8412909151437552 0.28330211988512777 0.031692304453429965 0.7779680441553805 0.0055680490373435365 0.8439397555530108 0.6071085652188721 0.6658636054898662 0.6796971211022772 0.05333631775725034 0.557761053665675 0.9107912566047308 0.8612664066094143 0.06327021440541525 0.3434110672805596 0.7671178579040677 0.08145931623892 0.039396257603137586 0.009328044831326987 0.5944342911609108 0.6275664492518248 0.5346512274198825 0.5248332187574428 0.5180113047205421 0.4085378970093759 0.7803530822196251 0.04372577558924762 0.29374524937522895 0.019536501637608228 0.43180458637436525 0.9905767848953536 0.6277809867383295 0.1730331399035261 0.456886 4 0.9540263082647602 0.48320835857968625 0.6651730100402153 0.5211585007186196 0.045411134106118856 0.32805951412117906 0.7781705573014294 0.08033022419901659 0.429508172352819 0.13709355852969518 0.38785870305335446 0.38133731100991264 0.6338845921617079 0.5881708096573881 0.6378566012148162 0.45278801434778326 0.23001045286977517 0.6796357645338544 0.29173135362858305 0.029659100531269855 0.03758719393685306 0.5394890969640372 0.9223806507306608 0.3730045262724503 0.02457738793233588 0.3727220917834684 0.30115399434129786 0.931438128620332 0.7930309665673373 0.6197556225722716 0.6587766510593736 0.6980627627862889 0.520014966802464 0.695021738968596 0.4838893252057611 0.03455371455831868 0.6456054571333923 0.7205363518703579 0.7879968796125904 0.5685609238643053 0.6981355947226453 0.5511444668272358 0.955149206849561 0.8655056695944527 0.2898151197218928 0.1318870305359897 0.2167335036386191 0.013335770537428071 0.9928719462290005 0.6289010313340152 0.5366906061937924 0.995283078627732 0.7373367529838363 0.7870391337132857 0.9437224262550469 0.5918659606236795 0.8603127306021661 0.2650154264610578 0.8530417108740258 0.4678051105308346 0.9954688901675107 0.4157992100672858 0.3904626339622018 0.8921191395064612 0.8806036507647462 0.5507470443181662 0.05509401283247628 0.9904372339818756 0.4131871816114543 0.13237076679808413 0.051050356508925976 0.30652142377124314 0.35396554343130704 0.05897715637587042 0.02407951804690933 0.2294997377129694 0.26265728560336543 0.3813136521789685 0.4799630477764635 0.2851791341852429 0.22305039097243884 0.9730372398463232 0.3217411912764556 0.42940646448896647 0.8483216551653435 0.46533862258119396 0.2712838444531315 0.4834814164478881 0.5456455996278523 0.17947594830766012 0.6518249007792851 0.9597188902307162 0.9857897844844354 0.017629264606785267 0.4302780420572129 0.029198785506055813 0.0615541071300616 0.9428719878619952 0.918977190542938 0.4610298125876867 ... 0.5618407565670157 0.706651860910063 0.6042922070536648 0.4021778614328432 0.6352373388374757 0.11107311694469113 0.7821503059185356 0.7481480724325673 0.4632782289437867 0.3275989302167256 0.39886304912474335 0.8128001312573995 0.123502592707883 0.72042634021588 0.583692794153407 0.37009201868783204 0.020861159110223793 0.9839354781826385 0.233766945247735 0.9321531643397426 0.349794173373823 0.7596877065313576 0.10556840676502355 0.9853877134451297 0.5469492786697298 0.536119920005695 0.4215837087934421 0.9022594962266155 0.34834915489046137 0.836008173368122 0.7391777916247735 0.2559447121534968 0.39303950210885386 0.4280928556696666 0.8355268860463818 0.9770104342509344 0.2084932844604186 0.6688076777538107 0.9004042586905554 0.4762790289500517 0.5444397084932665 0.4654680221945815 0.3925607206336803 0.08520020064776301 0.9665606954352084 0.9965064407717859 0.45630295272756705 0.13624756639806357 0.9155637187838511 0.12841753708278492 0.29099949916929235 0.36993183571642074 0.05396658962141254 0.2436841125680428 0.08633769494691712 0.9984831037224684 0.9104863227769195 0.7012226086412475 0.8853923486941151 0.2094060654205847 0.16262406466853452 0.3634628366985536 0.2589810292485263 0.4476139329989052 0.693439070099085 0.8718605576208103 0.5576191684047682 0.8387790030615047 0.8259049159454802 0.4824843583316928 0.19688778784702887 0.011046132060915093 0.5519858475577503 0.4397992635196116 0.651878064176194 0.6354581285114438 0.8763615690065594 0.23410403285907644 0.4482979195964729 0.5353507547817273 0.9872199905155876 0.6227644467538648 0.9680821641954859 0.3880760227034924 0.08653287028402612 0.07630315156232093 0.7055319003419328 0.24682316567978613 0.6793687292066626 0.6388662979894033 0.66767709706506 0.18006182355089517 0.7883001464428053 0.39364711503339656 0.25308867859113915 0.3476893507018376 0.15199265120877203 0.0560179044866328 0.3883280685434993 0.451648 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 0.95529 995 0.543414376456619 0.20993525286857806 0.2790960702529046 0.8493617327750558 0.003389708103334388 0.5837332637369641 0.3367216180757837 0.5032548528554954 0.2849344338668083 0.4073770675103616 0.5180378841332682 0.5662519700993828 0.09025556575876392 0.22871420410912002 0.8930603804352552 0.4881250548390883 0.9889227104434976 0.5567684516827847 0.2048830710088142 0.6886315332757316 0.6802859001754582 0.14525320983525303 0.836849207178502 0.12508042353757898 0.4038660809406561 0.807626130107281 0.007343244083784173 0.07523882436086204 0.1390713380328582 0.4143369178732059 0.37134087361544055 0.12584134540246705 0.46143315943874175 0.3609389981690626 0.3210945433025557 0.10530819182388595 0.2017680367707111 0.21457223651677382 0.8403604267229892 0.6971460273301798 0.03617813117569213 0.6507492690121628 0.5807719118384956 0.7469472656533214 0.15308082040313598 0.96982959704485 0.5516286861805461 0.7217069995587886 0.29233939485176463 0.3956912787334709 0.9371763947904542 0.8015930728077135 0.7287831702086038 0.49574067803410693 0.9869217387258595 0.3754766498692592 0.7656739937768676 0.4960144427809867 0.9820376182073945 0.6683487683094909 0.8030846987716477 0.42107600950172663 0.0643228738622702 0.8392651710984095 0.7607743040084002 0.205813379269013 0.9684933301774584 0.9295863705089613 0.2822490331841767 0.4965472409344521 0.4562153492120796 0.8953459461566997 0.8799915433339562 0.8802602857877909 0.9461704772640072 0.5936718866073932 0.3134021808183647 0.9444749824083322 0.9957029923743954 0.624393241440854 0.5161282624018275 0.7521096072582736 0.9812408232618473 0.05991392976358534 0.8757893516914941 0.9322834274441159 0.8150720115193776 0.8752730050896925 0.6053339937842217 0.9897025003571451 0.16508205609828563 0.05547427678278971 0.9586690362172618 0.13220674393362186 0.2172650098131318 0.05188376529604233 0.8521725128572779 0.4753866176513236 0.38201154198280074 0.9402849607246156 ... 0.7414449255967214 0.036562246982214996 0.4959512062214184 0.35700225122439744 0.6148132335483971 0.9644756675803494 0.9624657803084803 0.49176705466569814 0.0354868478913315 0.281649869868561 0.4122546627222382 0.7679961338989865 0.09815297929813716 0.7421673328384508 0.15472820992856573 0.40271278749622863 0.9352632020763769 0.7742295907235763 0.37527155663850986 0.8857487537480567

Vaex seems to perform much faster with a single, 1000 dimension vector column:

In [33]: data = np.random.rand(1000,1000) In [34]: df = vaex.from_dict({'col_1':data}) Out[34]: # col_1 0 'array([0.4130965 , 0.66934835, 0.96184307, 0.36... 1 'array([0.92401972, 0.90957336, 0.04339088, 0.88... 2 'array([0.62221928, 0.91017708, 0.972217 , 0.19... 3 'array([5.93042356e-01, 9.47436880e-01, 4.572206... 4 'array([3.78698246e-01, 7.31229353e-01, 5.399795... ... ... 995 'array([0.19432387, 0.32577784, 0.72759998, 0.02... 996 'array([2.54693033e-02, 7.42768910e-01, 8.896702... 997 'array([1.31301447e-01, 5.77822165e-01, 9.088000... 998 'array([1.54449406e-01, 8.94574582e-01, 2.915612... 999 'array([0.33798548, 0.17620267, 0.32973956, 0.32...

What’s really nice about this secondary format (besides anecdotal performance gains) is that you can really treat this column as a (massive out-of-core) NumPy array.

In [38]: df.col_1[:, 1] # Slicing the array Out[38]: Expression = getitem(col_1, 1) Length: 1,000 dtype: float64 (expression) ----------------------------------------- 0 0.669348 1 0.909573 2 0.910177 3 0.947437 4 0.731229 ... 995 0.325778 996 0.742769 997 0.577822 998 0.894575 999 0.176203

The other huge benefit with this format, as you may already be thinking, is the vaex.register_function decorator. We can create any arbitrary NumPy expression, register it with Vaex, and apply it to arbitrarily large amounts of data.

In [40]: @vaex.register_function() ...: def dot_product(a: np.ndarray, b: np.ndarray) -> np.ndarray: ...: return np.dot(a, b) ...: ...: vec = np.random.rand(1000) ...: df['dot_p'] = df.col_1.dot_product(vec) ...: df['dot_p'] Out[40]: Expression = dot_p Length: 1,000 dtype: float64 (column) ------------------------------------- 0 247.236 1 256.003 2 252.146 3 258.306 4 246.305 ... 995 263.425 996 249.267 997 255.197 998 259.378 999 255.449

If this dataframe comes from a file that is memory-mappable (like hdf5) Vaex will memory-map and chunk this calculation, ensuring you don’t go over ram limits. You can even have tighter controls by using their chunk_size attribute df.executor.chunk_size = 32000

  • [x] We had to log lots of data (up to tens of millions of high dimensional vectors per dataset)
  • [x] We wanted to surface insights on all of the users’ data, not samples
  • [x] We wanted to build custom insights that would derive true value to our users, often based in new research
  • [x] We needed these insights to be available with low latency, < 1 second, typically <100ms.

Bonus: Ease of experimentation

Because Vaex runs on a single machine, moving from production back to local testing is incredibly easy. In fact, it’s no work at all. When I’m debugging issues in our server, I immediately move to local Jupyter notebooks, where I can run the code as-is from the server, with incredibly fast results. In fact, I can even do some basic performance testing locally before pushing any code, because I can work with >100GB files and utilize Vaex’s rich progress bars to see how efficient I’m being with my implementations. This wasn’t a strict requirement, but after getting this out of the box, it’s hard to imagine how I would work without it.

It seems we’ve met all of our criteria. One thing that wasn’t strictly necessary but is worth mentioning is documentation and community support. The community for Vaex is small, and as you use the tool in a more advanced way (what we’re really doing behind the scenes at Galileo), you encounter many little “gotchas” that you wouldn’t expect. Not everything is covered in the docs and you may spend hours to find what seems like a simple question. But after a few weeks of learning, it becomes a bit like second nature.

But this curve is dramatically reduced because of the founders (Maarten, Jovan) who are absolutely incredible. I joined their slack channel and they are unbelievably helpful and responsive to my questions and bug finds. Opening github issues, PRs, and posting in Slack is a great way to get involved and learn about the tool.

Since picking our tool and building our platform, other awesome projects like polars have gained momentum. We are digging into those now to see where they may help us on our journey. But today, Vaex is still powering our platform.