• 1 Post
  • 120 Comments
Joined 1 year ago
Cake day: June 8th, 2023

  • My intuition:

    • There are “genuine” hapax legomena that probably carry some semantic sense, e.g. a rare concept, a wordplay, an artistic invention, an ancient inside joke.
    • There’s also all kinds of noise: somebody let their cat on the keyboard, OCR software failed in one small spot, somebody copied data over a noisy channel without error correction, somebody had a headache and couldn’t be bothered, whatever.
    • Once a dataset is too big to be manually reviewed by experts, the amount of general noise is far, far larger than what you’re looking for. At the same time, you can’t tell the two apart using statistics alone. And if it was manually reviewed, the experts have probably published their findings, or at least told a few colleagues.
    • Transformers are VERY data-hungry. They need enormous datasets.

    So I don’t think this approach will help you a lot even for finding words and phrases. And everything I’ve said can be extended to semantic noise too, so your extended question also seems a hopeless endeavour when approached specifically with LLMs or big data analysis of text.
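To make the "statistics alone" point concrete, here is a minimal sketch in Python. The corpus is made up for illustration: "brillig" stands in for a deliberate invention and "teh" for a typo. The helper name `hapax_legomena` and the crude tokenizer are assumptions of this sketch, not anything from the original question.

```python
import re
from collections import Counter


def hapax_legomena(text):
    """Return the words that occur exactly once in `text`, sorted alphabetically."""
    # Crude tokenizer (an assumption of this sketch): lowercase runs of letters.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n == 1)


# Made-up toy corpus: "brillig" plays the artistic invention, "teh" the typo.
corpus = "the cat sat on the mat the cat sat on teh brillig mat"

print(hapax_legomena(corpus))
```

Both words come back with a count of exactly one, so frequency by itself gives no way to separate the invention from the typo. Telling them apart needs outside knowledge, which is exactly what stops scaling once nobody can review the dataset by hand.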







  • I wonder how much Beckett was inspired by this while writing Rough for Theatre II:

    B: [Hurriedly.] ‘… morbidly sensitive to the opinion of others at the time, I mean as often and for as long as they entered my awareness–’ What kind of Chinese is that?
    A: [Nervously.] Keep going, keep going!
    B: ‘… for as long as they entered my awareness, and that in either case, I mean whether such on the one hand as to give me pleasure or on the contrary on the other to cause me pain, and truth to tell–’ Shit! Where’s the verb?
    A: What verb?
    B: The main!
    A: I give up.
    B: Hold on till I find the verb and to hell with all this drivel in the middle. [Reading.] ‘… were I but … could I but …’ –Jesus!–‘… though it be … be it but…’–Christ!–ah! I have it–‘… I was unfortunately incapable …’ Done it!
    A: How does it run now?
    B: [Solemnly.] ‘… morbidly sensitive to the opinion of others at the time …’–drivel drivel drivel–‘… I was unfortunately incapable–’
    [The lamp goes out. Long pause.]






  • Because we have tons of ground-level sensors, but not a lot in the upper layers of the atmosphere, I think?

    Why is this important? Weather processes are usually modelled as a set of differential equations, and you need to know the boundary conditions in order to solve them and obtain the state of the entire atmosphere. The atmosphere has two boundaries: the lower one, which is the planet’s surface, and the upper one, which is where the atmosphere ends. And since we don’t seem to have a lot of data from the upper layers, that reduces the quality of all predictions.
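A toy illustration of why one badly known boundary hurts everywhere, and nothing like a real weather model: take the simplest possible boundary value problem, u'' = 0 on a vertical column with a fixed value at the bottom and at the top. The function name `solve_column`, the grid size, and the temperature-like numbers are all made up for this sketch.

```python
def solve_column(lower, upper, n=11, sweeps=5000):
    """Solve u'' = 0 on n evenly spaced levels with fixed end values.

    Uses Gauss-Seidel relaxation: each interior level is repeatedly
    replaced by the average of its two neighbours until it settles.
    """
    u = [lower] + [0.0] * (n - 2) + [upper]
    for _ in range(sweeps):
        for i in range(1, n - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


# Made-up numbers: the surface value is known exactly, the top value is
# off by 5 units in the second run.
true_state = solve_column(lower=15.0, upper=-55.0)
bad_state = solve_column(lower=15.0, upper=-50.0)

# Error at every level, from the surface (index 0) to the top (index -1).
errors = [bad - true for true, bad in zip(true_state, bad_state)]
print([round(e, 2) for e in errors])
```

The surface level stays exact, but every level above it inherits a share of the top-boundary error, growing linearly with height in this toy case. Real atmospheric equations are nonlinear and time-dependent, so the way the error spreads would differ, but the basic dependence on both boundaries is the same idea.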