According to Dmitry Kobak, some details in these figures are merely convergence artifacts and are no longer produced by more recent versions of UMAP.

Beware that the first tweet uses t-SNE, an older algorithm that UMAP tries to improve on. There's also an image from a newer version of UMAP further down: while the big squiggly-line artifacts are reduced, a lot of the structure remains, and it looks much less like the random-numbers image from the blog or the t-SNE version. Still, I think it's safe to say that any fancy structure here is more likely a product of the algorithm than of actual structure in the numbers.

nerdponx 18 days ago [-]

I think the point of using t-SNE was to suggest that if the structure were legitimate that t-SNE would find at least some of it.

This is both an interesting/fun visualization exercise and a cautionary story. Apparently UMAP has a tendency to render blobs as rings or loops!

etangent 18 days ago [-]

Of course they are artifacts -- everything in that picture is an artifact, by definition of how it is produced. But these artifacts are produced under specific conditions, which offers a certain window into the structure of what is being visualized. t-SNE is a different, older method than UMAP, and it over-emphasizes local structure at the expense of global structure.

Example of what could cause a swirly chain in UMAP: A relates to B, and B relates to C, but A does not relate to C, and so on. IMO that's a valid structure to visualize as a swirly chain. If you re-run it multiple times, of course you will get that chain in different locations and so on. But it is interesting that it is there.

nerdponx 18 days ago [-]

I wonder if we can somehow figure out what those convergence artifacts actually were, and find a way to replicate them for cool digital art effects.

teo_zero 18 days ago [-]

You mean, like the Mandelbrot set?

2-718-281-828 18 days ago [-]

what are you trying to say?

teo_zero 18 days ago [-]

The structures that emerge from a series of transformations applied to an initial field (be it the natural numbers or the complex plane) could be due to the transformations, or intrinsic to the underlying field. The parent comment stated that we're in the former case, i.e. we are seeing artifacts due to the transformation itself, not some new properties of the numbers. It implied that this is a bad thing, or at least that's how I read that 'merely'. My (admittedly cryptic) reply was meant to show that such results are worth attention as well, just like the Mandelbrot set, and should not be quickly dismissed as an unwanted effect.

2-718-281-828 18 days ago [-]

so, your point is that the mandelbrot set isn't merely an artifact, if i understand you correctly.

teo_zero 18 days ago [-]

Wait... I'm not sure where this conversation is going.

I say that the beauty (or value or worthiness) of the pictures of the Mandelbrot set comes from the transformations we apply to uninteresting complex numbers.

Similarly, the beauty of the pictures in the article may come from some hidden properties of the underlying prime numbers, or from the transformations themselves, and I don't think that either case would be better than the other.

I said this in reply to a comment that seemingly stated "what a pity, these images are 'merely' due to the transformations". I was objecting to the tone of disappointment that I read in that message.

expazl 18 days ago [-]

Convergence artifacts are due to the algorithm converging poorly. This is why it's a "pity": the interesting structure you see in the picture has nothing to do with the distribution of the numbers and is just a bug in the algorithm that happens to look nice.

So no, it's not like the Mandelbrot set. It's more like if you wrote a script to visualize the Mandelbrot set, introduced a bug that by accident made part of the visual look like interesting structure, then shared your pictures with a wide audience saying "look at this interesting structure I found in the Mandelbrot set!", and then someone replied on Twitter with "you have a bug in line 124, and when I correct it the structure disappears". Which is why it's a pity.

krick 18 days ago [-]

I don't get it. So, do all these clusters mean anything at all? It feels like they should be telling us something eerily important about numbers. Granted, I don't really understand UMAP, but still, if it's any good for dimensionality reduction at all (and it appears to be), then clusters are clusters. This seems to have far too distinct a structure to be essentially just a weird artifact of UMAP itself. Or is it?

tuukkah 18 days ago [-]

If anything, it tells us about the factorizations, as those are the input dimensions being used. If you haven't seen factorization diagrams, they're worth checking out first: https://mathlesstraveled.com/factorization/

But visualisations can always deceive you into seeing something that's not there, e.g. correlation vs causation.

krick 18 days ago [-]

Uh… what does it have to do with these "factorization diagrams"? Maybe I'm missing something, but I don't even see why they are "worth checking out". As far as I can see from your link, these just arrange a number of dots into (pre-determined) shapes that are humanly recognizable. I.e., these are literally just caveman technologies for writing a number before a more convenient (i.e. Arabic) number system was invented. If anything, that combination of squares and triangles is a less readable way to write 2²×3³×5, and it can be constructed only for "convenient" numbers (well, you can arrange 23×11×2 and 23×13×2 into blocks too, but good luck telling them apart). That's just silly, and tells me absolutely nothing about relationships between numbers.

However, all numbers from 1 to 1 000 000 forming distinct clusters when mapped to 2 dimensions with UMAP… I don't know. It might be nothing (a representation of something trivial, like the observation that multiples of 100003 are less common than multiples of 3 among the first 1 000 000 integers), and all these clusters may just disappear (converge) as we go closer to infinity. But there's definitely something a bit eerie about the possibility of it not being "nothing". Normally, you wouldn't expect any patterns to form like that.

Or, well, it may be more than a "nothing" but less than "interesting" to a mathematician: maybe there actually is some pattern that becomes more visible in this visualization, but it's already well known among number theorists. I just have no idea; that's why I'm asking. It's just weird to see any clustering at all here.

(And, yeah, BTW, there's no such thing as "correlation vs causation" in number theory.)

nerdponx 18 days ago [-]

It's the same because both are visualizations of prime factorizations. And if you check out the Twitter thread posted in the comments here, you'll see that the "loops" are probably convergence artifacts (they should just be blobs) and that the clusters seem to correspond to the number of prime factors and the largest prime factor (https://twitter.com/hippopedoid/status/1318917905736716288), which makes sense because those are the input features to the algorithm.

eximius 18 days ago [-]

I wonder why they chose to represent everything with, well, not unit vectors, but vectors with no entry higher than 1.

Why should 2 and 4 both be [1 0 ...] instead of [2 0 ...], etc?

jerpint 18 days ago [-]

It might have to do with UMAP computing dot products and making assumptions about the inputs. If everything is 0s and 1s, the vectors will have a roughly normal distribution of magnitudes. Otherwise the magnitudes will just explode, and I don't think UMAP will work.

nerdponx 18 days ago [-]

It's also very common in general to use this "one-hot encoding" in statistics and machine learning.

In many cases using all 2s or all -10s would produce the exact same result in theory, but with more work by the optimizing algorithm, possibly with adverse results as described above.

It's easy to reason about mathematically, too. If the input vector is all 1s and 0s, it's easy to read off the result of multiplying that vector with another vector. Norms of binary vectors and dot products between binary vectors are super-easy, and have a nice correspondence with counting the appearances of elements.
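A toy sketch of that correspondence (my own illustration, not from the article), using indicator vectors over a hypothetical prime basis [2, 3, 5]:

```python
# With 0/1 vectors, a dot product just counts shared elements; here the
# "elements" are distinct prime factors over the (made-up) basis [2, 3, 5].

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

v12 = [1, 1, 0]  # 12 = 2^2 * 3   -> distinct prime factors {2, 3}
v30 = [1, 1, 1]  # 30 = 2 * 3 * 5 -> distinct prime factors {2, 3, 5}

shared = dot(v12, v30)   # 2: the count of shared prime factors (2 and 3)
norm_sq = dot(v30, v30)  # 3: the number of distinct prime factors of 30
```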

It also corresponds to an array of Boolean values which is conceptually appealing, because that's basically how it's constructed.

With a couple of specific exceptions, there's little reason not to use all 1s.

vanderZwan 18 days ago [-]

That kind of sounds like normalisation in other contexts ("hey let's just say lightspeed is 1, simplifies the equations"), is that a fair analogy to make?

nerdponx 18 days ago [-]

Normalization of that kind is also used in machine learning and statistics, but it's not quite the same thing. A related technique is "standardization", where you subtract the mean and divide by the standard deviation, yielding unitless quantities of "standard deviations away from the mean".
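A minimal sketch of standardization (my own example; it uses the population standard deviation for simplicity):

```python
from statistics import mean, pstdev

def standardize(xs):
    """Rescale values to 'standard deviations away from the mean'."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

# This sample has mean 5 and population SD 2, so e.g. 9 standardizes to 2.0.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```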

However this encoding technique is a bit more like choosing 1 and 0 to represent Boolean values in C: it's convenient, it's easy to reason about, it's mathematically simpler than any other option, and there's no compelling reason to choose anything else anyway.

For binary variables in particular, you sometimes see people using -1 and 1 instead of 0 and 1, to get symmetry around 0. I think this was mostly only used for encoding the labels/outputs of SVM models, where it's mathematically appealing as representing two sides of a hyperplane.
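That relabeling is just an affine shift of the 0/1 encoding (a trivial sketch of my own):

```python
def to_plus_minus_one(bits):
    # Map 0 -> -1 and 1 -> +1, giving labels symmetric around 0.
    return [2 * b - 1 for b in bits]

labels = to_plus_minus_one([0, 1, 1, 0])  # [-1, 1, 1, -1]
```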

There are a handful of other schemes for encoding "categorical" or "nominal" data of this kind, used in certain statistical applications such as the design and analysis of experiments. These encoding schemes are called contrasts in the stats literature, because they emphasize the differences (the "contrasts") between categories.

tunnuz 19 days ago [-]

This is one of the most beautiful things I have seen in 2022 on the Internet, and there were some very good contenders. This provokes in me an immediate sense of beauty, and I'm compelled to read and understand as much of it as I can. Thanks for sharing this, I'm fascinated and amazed.

quakeguy 18 days ago [-]

Sadly, merely artifacts.

glotchimo 18 days ago [-]

Can’t those artifacts be beautiful in and of themselves? Whether or not this represents some deep reality about dimensions, it’s cool to see something like this emerge from a program when working with a bunch of numbers.

expazl 18 days ago [-]

Sure, but that's no more interesting than an ML-generated picture of a cat with a speech bubble saying "I can has numbers?"

jerpint 18 days ago [-]

What about the section that shows how dull random numbers are?

nerdponx 18 days ago [-]

I still think the t-SNE blobs are beautiful and interesting!

aamoscodes 18 days ago [-]

Artifacts are a natural state of our universe

adolph 19 days ago [-]

The gist has comments with some additional pretty graphs: https://gist.github.com/johnhw/dfc7b8b8519aac530ac97da226c17...

And it maps that high-dimensional space back down to two dimensions (using some technique I haven't dug into yet). Colors are assigned by some scheme; later images help illustrate how the particular clusterings happen, like one where primes are rendered in white.

nerdponx 18 days ago [-]

The UMAP and t-SNE algorithms both use fancy math to find a 2-dimensional representation of the data that tries to preserve local structure as much as possible. So blobs in the original space should be represented as blobs in the reduced space, but the relative positions of any two blobs in the reduced space are not meaningful.

This isn’t exactly it, but roughly: make a vector space with primes as a basis (instead of “x, y, z, …” use “2, 3, 5, …”). For a number, find its prime factors. Make a vector and set the elements corresponding to the prime factors of that number to 1. Apply some algorithm to map the high-dimensional (more than 2 elements) vectors into a two-dimensional image where color has some significance.
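A rough sketch of that construction (my own reconstruction, not the article's actual code; the helper names are made up):

```python
def prime_factors(n):
    """Return the set of distinct prime factors of n by trial division."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors

def encode(numbers):
    """Map each number to a 0/1 vector over the primes that occur as factors."""
    primes = sorted(set().union(*(prime_factors(n) for n in numbers)))
    index = {p: i for i, p in enumerate(primes)}
    vectors = []
    for n in numbers:
        v = [0] * len(primes)
        for p in prime_factors(n):
            v[index[p]] = 1
        vectors.append(v)
    return primes, vectors

primes, vectors = encode([2, 4, 6, 15])
# primes == [2, 3, 5]; note that 2 and 4 get the same vector, [1, 0, 0].
```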

jerpint 18 days ago [-]

Numbers are placed in the image according to their prime factors

2-718-281-828 18 days ago [-]

2 and 4 are mapped to the same vector?

_nalply 18 days ago [-]

Yes. I wondered about that, too.

nerdponx 18 days ago [-]

Yes, 4 and 2 have the same set of prime factors: 2.

LanceH 19 days ago [-]

The whole thing seemed overly formally described, and that just lessened the value of the cool graph.

alanbernstein 19 days ago [-]

"I have a friend who’s an artist and has sometimes taken a view which I don’t agree with very well. He’ll hold up a flower and say “look how beautiful it is,” and I’ll agree. Then he says “I as an artist can see how beautiful this is but you as a scientist take this all apart and it becomes a dull thing,” and I think that he’s kind of nutty. First of all, the beauty that he sees is available to other people and to me too, I believe. Although I may not be quite as refined aesthetically as he is … I can appreciate the beauty of a flower. At the same time, I see much more about the flower than he sees. I could imagine the cells in there, the complicated actions inside, which also have a beauty."

- Richard Feynman

It's beautiful. When I think about the infinitely many numbers and dimensions in this representation, the form of the universe appears before my eyes.

jerpint 18 days ago [-]

I wonder what would happen if you do the same thing but using an autoencoder instead of UMAP

maCDzP 19 days ago [-]

It reminds me of the first illustration of a cell I saw in school. Beautiful.

tyronely 18 days ago [-]

If an artificial intelligence could open its pipes and vomit onto a newly-allocated two-dimensional array, I imagine its undigested bits would look very much like this.

nerdponx 18 days ago [-]

This should be easily doable! Grab any model, pull off the output layer, run data through it to get a dataset of vector embeddings, and then run UMAP on it.
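A sketch of that recipe with a made-up toy "model" (two dense layers in plain Python); the shapes, weights, and the final UMAP call are all assumptions of mine, not anything from the article:

```python
import math
import random

random.seed(0)

IN, HID, OUT = 8, 4, 2  # made-up layer sizes for the toy model
W1 = [[random.gauss(0, 1) for _ in range(HID)] for _ in range(IN)]
W2 = [[random.gauss(0, 1) for _ in range(OUT)] for _ in range(HID)]

def matvec(W, x):
    """Multiply x (len == rows of W) through W, giving one value per column."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def embed(x):
    # Forward pass up to, but not including, the output layer W2:
    # these hidden activations are the "vector embeddings".
    return [math.tanh(h) for h in matvec(W1, x)]

data = [[random.random() for _ in range(IN)] for _ in range(100)]
embeddings = [embed(x) for x in data]
# embeddings is now a 100 x 4 dataset; the next step would be something like
# umap.UMAP(n_components=2).fit_transform(embeddings) with the umap-learn package.
```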

SeriousM 18 days ago [-]

This is based on our common base-10 number system. I wonder how this would look in base 12!

teo_zero 18 days ago [-]

I don't read anything suggesting that. Primes are primes in whichever base.
