> The theory of finite fields is based on the theory of prime numbers, because the finite fields are sets of residues modulo a prime number or modulo a power of a prime number.
It is not quite correct that the finite field of order p^k is the set of residues modulo p^k when k > 1. Instead, this field is obtained as an extension of the field of order p (which is the set of residues mod p), for example as the splitting field of x^(p^k) - x over that field.
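To make the distinction concrete, here is a small self-contained sketch (my own toy illustration in Python, not from the comment above): Z/4Z fails to be a field because 2 has no multiplicative inverse, while F_2[x]/(x^2 + x + 1) gives a genuine field with 4 elements.

    # Z/4Z is not a field: 2 has no multiplicative inverse mod 4.
    print([x for x in range(4) if (2 * x) % 4 == 1])  # -> []

    # GF(4) = F_2[x]/(x^2 + x + 1): represent a + b*x as the pair (a, b).
    def mul(p, q):
        a, b = p
        c, d = q
        # (a + b x)(c + d x) = ac + (ad + bc) x + bd x^2, with x^2 = x + 1
        return ((a * c + b * d) % 2, (a * d + b * c + b * d) % 2)

    # Every nonzero element has a multiplicative inverse, so this is a field.
    nonzero = [(a, b) for a in (0, 1) for b in (0, 1) if (a, b) != (0, 0)]
    for p in nonzero:
        assert any(mul(p, q) == (1, 0) for q in nonzero)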
This idea of giving "meaning" to a set of axioms is precisely captured by the notion of "interpretation" in logic [1]. The rough idea is to map the symbols of the formal language to some pre-existing objects. As you say, this gives one way of formalizing truth: a sentence (a string of symbols that respects the syntax of your language) is true if it holds for the objects the sentence refers to (via a chosen interpretation). This notion of truth is sometimes referred to as semantic truth.
An alternative approach is purely syntactic and sees a logical system as a collection of valid transformation rules that can be applied to the axioms. In this view, a sentence is true if it can be obtained from the axioms by applying a sequence of valid transformation rules. This purely syntactic notion of truth is known as “provability”.
Then the key question is whether the two notions coincide: one way to state Gödel's first incompleteness theorem is that it shows the two notions do not coincide.
> Then the key question is whether the two notions coincide: one way to state Gödel's first incompleteness theorem is that it shows the two notions do not coincide
It's even more subtle than that. They do coincide in a sense, which is proven by Gödel's completeness theorem (well, at least in first-order logic). That one just says that a sentence is provable from some axioms exactly when it's true in every interpretation that satisfies the axioms.
So one thing that Gödel's first incompleteness theorem shows is that it's impossible to uniquely characterise even a simple structure such as the natural numbers by some "reasonable"[0] axioms - precisely because there will always be sentences that are true in some interpretations but not in others.
Unless you use second-order logic - in which case the whole enterprise breaks down for different reasons (because completeness doesn't hold for second-order logic).
[0] "reasonable" basically means that it must be possible to effectively verify whether a sentence is an axiom or not; otherwise you could just say that "every true sentence is an axiom"
Agreed, thanks for the clarifications. Another result worth mentioning, which also shows that you cannot hope to uniquely characterize a structure by "reasonable" axioms, is the Löwenheim–Skolem theorem, which predates Gödel's incompleteness theorems (although the history of these results is somewhat convoluted).
There, the obstacle is in some sense of the simplest nature: if your set of axioms admits a countably infinite model, then it admits models of every infinite cardinality. In other words, it shows that there is something fundamentally impossible in trying to capture an infinite structure (like numbers) by finite means (e.g. recursively axiomatizable).
> In other words, it shows that there is something fundamentally impossible in trying to capture an infinite structure (like numbers) by finite means (e.g. recursively axiomatizable).
Let me just point out for other readers that Löwenheim-Skolem applies to ANY first-order theory in a countable language (or in an uncountable language, if stated in the form that a theory with an infinite model of cardinality at least that of the language has models in all cardinalities at least as big as that of the language); it doesn't care how complex the axioms are from a computability point of view.
You're correct, but I think the spirit of what GP wrote is still true.
You can't capture an infinite structure fully by "finitary" methods: either you use FOL, and then you run into L-S, or you use higher-order logic (for which L-S doesn't apply), but then you don't have a complete proof system anymore.
To tie it all together, L-S and incompleteness are about different flavours of "not being able to capture something". L-S is about models of different cardinalities; the models it produces still all satisfy exactly the same sentences (they are elementarily equivalent). Incompleteness is about different models actually satisfying different sentences.
The article is oversimplifying a bit in summarizing the axiomatic crisis as being a problem with sentences like “this statement is false”.
This being said, your intuition is absolutely correct, the crux of the issue is with ‘this’. What mathematicians realized is that if you are not careful with your choice of axioms, the resulting logical system becomes too “powerful” in the sense that it becomes self-referential: you can construct sentences that refer to themselves in a self-defeating manner.
As others have mentioned, this is the idea underlying Gödel's incompleteness theorem but also, to some extent, Russell's paradox that came before and is what the article is referring to. In Russell's paradox, the contradiction comes from constructing the set of all sets that do not contain themselves.
Maybe in analogy to Russell's paradox and how you can "fix" it by distinguishing sets and classes, you can "fix" a Gödel sentence by adding it as an axiom, but then you'll just get a new Gödel sentence... and so on.
Isn't this what is suggested below the second picture?
> Really, it’s harder than this picture suggests, because many experiences are based on other students. If I want you as my project partner but you want to forget I exist, then something has to give.
The realization that functions can be treated as elements in an abstract vector space (with infinitely many dimensions) is a turning point in the history of mathematics that led to the emergence of the sub-field known as functional analysis.
The significance of this paradigm shift is that it allowed mathematicians to apply some of the geometric intuition developed from the study of finite-dimensional spaces (such as the 3D Euclidean space) to difficult questions involving functions, such as the existence of solutions to certain differential equations.
The history of this change of perspective is absolutely fascinating and can be traced back to the end of the 19th century and beginning of the 20th century. At the time, work on axiomatic foundations of mathematics was driving a systematization of the study of mathematical objects by capturing their structure with a concise list of axioms. This is for example how the concept of an abstract vector space was born, encompassing not only Euclidean spaces but also infinite-dimensional spaces of functions.
An early reference already demonstrating this change of perspective, albeit in a primitive form, is a memoir by Vito Volterra from 1889 [1]. The PhD thesis of Maurice Fréchet from 1906 [2] is arguably the work that was most influential in crystallizing the new paradigm and presenting it in a modern form that served as a key reference for the first half of the 20th century. Of course, these are only two among a multitude of works around that time. Looking at later developments in the 20th century, it is hard not to also mention the book by Stefan Banach from 1932 [3].
My friend, you don't even need it to be in a vector space for functional analysis. Truly, what is needed is just an inner product. I will grant you that the inner product must be linear, and hence in a vector space.
I don't understand the point of this comment. You obviously need a vector space before you can define an inner product. Inner product spaces are very special examples of vector spaces.
I agree. The GP comment contains some inaccuracies: most of the spaces of functions considered in functional analysis do not have an inner product defined on them, but are still vector spaces. The existence of an inner product presupposes a vector space structure, but the converse is not true…
Perhaps the most famous example is provided by the Lp spaces [1], consisting of functions whose p-th power is absolutely integrable. For p ≥ 1, these spaces are Banach spaces (complete normed spaces), but it is only when p = 2 that the norm is induced by an inner product.
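As a quick numerical sanity check (a toy illustration assuming only numpy, nothing from the thread), the parallelogram law, which holds for a norm exactly when it comes from an inner product, can be tested for a few values of p:

    import numpy as np

    # Parallelogram law: ||x+y||^2 + ||x-y||^2 == 2||x||^2 + 2||y||^2.
    # A norm satisfies it if and only if it is induced by an inner product.
    x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    for p in (1, 2, 3):
        norm = lambda v: np.linalg.norm(v, ord=p)
        lhs = norm(x + y) ** 2 + norm(x - y) ** 2
        rhs = 2 * norm(x) ** 2 + 2 * norm(y) ** 2
        print(p, lhs, rhs)  # equal only for p == 2

Only p = 2 satisfies the identity, matching the fact that L2 is the only Hilbert space among the Lp spaces.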
Not saying that the vector space bit isn't neat, but it's called functional analysis because you can take limits of various forms and define (semi-) continuity, have completions of spaces, and all that has nice properties.
So to me, a crucial thing is that these vector spaces are indeed topological.
Among many other things, SUB Göttingen maintains the Göttingen Center for Retrospective Digitization (GDZ) [1] which hosts a very impressive collection of digitized documents. It seems that the managerial changes that prompted the open letter could pose a serious threat to the project.
Regarding stochastic gradient descent, I think there has been an increased understanding in recent years that the randomness introduced by the random sampling/batching is helpful not only in reducing the computational cost (compared to computing the full gradient) but also in providing noise that helps escape local minima. Some variants of stochastic gradient descent in fact add additional random noise to amplify this latter effect, and some theoretical guarantees have started to emerge.
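To illustrate, here is a toy sketch in Python (assuming numpy; the noise_scale term is my illustrative stand-in for the extra noise such variants inject, not any specific published method):

    import numpy as np

    # Toy least-squares problem.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

    w = np.zeros(5)
    lr, noise_scale = 0.01, 0.001
    for step in range(5000):
        i = rng.integers(len(X))                # sample a single example
        grad = (X[i] @ w - y[i]) * X[i]         # gradient at that example only
        w -= lr * grad                          # the usual SGD step
        w += noise_scale * rng.normal(size=5)   # extra injected noise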
I agree that "continuous" is a poor choice of word. Reading the article, it is clear that the author does not care about continuity in the mathematical sense of the term, but rather about allowing the design parameters to vary as a function of the user's environment parameters.
As mentioned in some other comment, a constant function is continuous but is obviously not what the article is about.
Yep, it was also known to eastern mystics. And there is also a parallel in Sefer Yetzira about "stones" (letters) building "houses" (permutations), which also refers to such generative grammars.
There are a few inaccuracies here: using automatic differentiation basically makes computing the gradient of the objective function as efficient as computing the function itself. But the main goal of the algorithms used in machine learning nowadays (stochastic gradient descent and variants thereof) is to avoid having to compute the objective function or its gradient altogether: instead, the gradient is computed at a single data point (example), which provides an approximation of the true gradient.
The important thing is that what is considered expensive is not computing the objective function or its gradient at one example, but computing them over the entire dataset. A line search would require evaluating the function (and its gradient) over the entire dataset several times, which completely defeats the purpose of stochastic gradient descent.
One could imagine using the approximate function and gradient (coming from the evaluation at a single data point) as the basis for a line search, but intuitively it does not make much sense to fine-tune the step size to a single example, and this would in any case destroy the guarantees provided by the strong Wolfe conditions.
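For reference, the strong Wolfe conditions on a step size \alpha along a descent direction p are (standard textbook formulation, not specific to this thread):

    f(x + \alpha p) \le f(x) + c_1 \alpha \nabla f(x)^T p     % sufficient decrease
    |\nabla f(x + \alpha p)^T p| \le c_2 |\nabla f(x)^T p|    % curvature

with 0 < c_1 < c_2 < 1. Checking either condition requires fresh evaluations of f and its gradient at candidate step sizes, which is exactly what is expensive over a full dataset.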
Finally, there are convergence guarantees for stochastic gradient descent with a fixed learning rate when the objective function is convex (typically convergence to within a neighborhood of the optimum whose size scales with the learning rate).
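For instance, one standard guarantee of this flavour, for a convex objective f whose stochastic gradients satisfy \mathbb{E}\|g\|^2 \le G^2, a fixed step size \eta, and the average \bar{x}_T of the first T iterates, is:

    \mathbb{E}[f(\bar{x}_T)] - f^\ast \le \frac{\|x_0 - x^\ast\|^2}{2 \eta T} + \frac{\eta G^2}{2}

so a fixed learning rate gets you to within an O(\eta) neighborhood of the optimum rather than all the way to it.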
1. Reverse mode in automatic differentiation is not as efficient as the function evaluation. Even discounting certain costs, and depending on how you count, the theoretical cost is 4-5 times a function evaluation. Practically speaking, operator-overloading approaches run somewhere between 20-40 times a function evaluation, whereas source-code-transformation tools run at 10-20 times. This is fantastic, but the function evaluation is cheaper.
2. I also don't believe that stochastic gradient descent requires the entire function and gradient to be re-evaluated in the manner that you describe. One way to view stochastic gradient descent in the context of least-squares fitting is through the use of Johnson-Lindenstrauss, which means that the data set can be randomly projected once per iteration. This means that the gradient and line-search parameters can be consistently evaluated at the per-iteration level. Practically speaking, this means we randomly add our data together and then proceed as normal, changing the randomization each iteration. As such, doing a line search should not increase the cost over the already discounted cost (see the sketch after this list).
3. As for whether the Wolfe conditions are destroyed: kind of, sort of. In order to guarantee convergence, the amount of reduction that we use must also be reduced. Meaning, we can't project down the data quite as much if we really want to achieve convergence. Practically speaking, though, I believe it matters a lot.
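To make point 2 concrete, here is a rough sketch (my own illustration in Python/numpy, not the exact scheme described above): draw a fresh Gaussian projection each iteration, then evaluate both the gradient and a simple backtracking line search against the projected problem, so the line search costs no extra passes over the full data.

    import numpy as np

    # Sketched least squares: min_x 0.5 * ||A x - b||^2.
    rng = np.random.default_rng(0)
    n, d, k = 5_000, 20, 100                       # k = sketch size << n
    A = rng.normal(size=(n, d))
    b = A @ rng.normal(size=d) + 0.01 * rng.normal(size=n)

    x = np.zeros(d)
    for it in range(50):
        S = rng.normal(size=(k, n)) / np.sqrt(k)   # fresh random projection
        As, bs = S @ A, S @ b                      # project the data once
        f = lambda z: 0.5 * np.sum((As @ z - bs) ** 2)
        g = As.T @ (As @ x - bs)                   # gradient of sketched objective
        # Backtracking line search evaluated on the *sketched* problem only.
        t = 1.0
        while f(x - t * g) > f(x) - 0.5 * t * (g @ g):
            t *= 0.5
        x -= t * g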