• DaveA
    1 month ago

    That sucks. What was the novel search engine approach?

    • Onno (VK6FLAB)@lemmy.radio
      1 month ago

      Using the idea of six degrees of separation to get to any person on the planet, I came up with the idea to use a word cloud that would represent the top N words in all documents.

      When you click on a word (say “alpha”), the resulting word cloud represents the top N words across all the documents containing “alpha”.

      As you keep clicking (bravo -> charlie, etc.), the list of documents gets smaller and smaller, until only the document you need remains.

      This has several advantages: you don’t need to distinguish between words and numbers, “understand” the meaning of a word, or interpret the user’s intent.

      More importantly, the user doesn’t need to know the relevant words or vocabulary, since they’re all represented in the UI.

      Enhancements include allowing negative words, that is, excluding documents that contain a given word.
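
A minimal sketch of the drill-down described above, not the original implementation. The toy corpus `DOCS`, the whitespace tokenizer, and the function names `matching_docs` and `word_cloud` are all assumptions made for illustration; it also assumes "top N words" means total occurrences across the remaining documents.

```python
from collections import Counter

# Hypothetical toy corpus: document id -> text.
DOCS = {
    "doc1": "alpha bravo charlie",
    "doc2": "alpha bravo delta",
    "doc3": "alpha echo",
    "doc4": "foxtrot golf",
}


def tokenize(text):
    # Assumption: lowercase, whitespace-separated tokens; words and
    # numbers are treated identically.
    return text.lower().split()


def matching_docs(docs, include=(), exclude=()):
    """Ids of documents containing every included word and no excluded word."""
    result = []
    for doc_id, text in docs.items():
        words = set(tokenize(text))
        has_all = all(w in words for w in include)
        has_excluded = any(w in words for w in exclude)
        if has_all and not has_excluded:
            result.append(doc_id)
    return result


def word_cloud(docs, include=(), exclude=(), top_n=10):
    """Top N words across the remaining documents, skipping clicked words."""
    counts = Counter()
    for doc_id in matching_docs(docs, include, exclude):
        counts.update(w for w in tokenize(docs[doc_id]) if w not in include)
    return counts.most_common(top_n)


# Each click adds a word to `include` and narrows the document list.
print(matching_docs(DOCS, include=("alpha",)))
print(matching_docs(DOCS, include=("alpha", "bravo")))
print(matching_docs(DOCS, include=("alpha", "bravo", "charlie")))
# A negative word removes documents instead.
print(matching_docs(DOCS, include=("alpha",), exclude=("bravo",)))
```

Calling `word_cloud(DOCS, include=("alpha",))` gives the words to show for the next click.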

      • DaveA
        1 month ago

        Ah that sounds really interesting! Does it scale OK? I guess you could index at a word level and filter quite quickly for quick searches, but it seems you’re going to have to store the full text of every website?

        • Onno (VK6FLAB)@lemmy.radio
          1 month ago

          You store just the word count for each word on each URL.

          The search is pretty trivial in database terms since you don’t need to do any wildcard or like matching.
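
As a rough illustration of storing one count per word per URL, here is a sketch using Python's built-in `sqlite3` as a stand-in database. The table name `word_counts`, its columns, the example URLs, and the `top_words` helper are assumptions, not the schema of the project being discussed. Note that every lookup is an exact match on `word`; no wildcard or `LIKE` is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (url, word) pair with its count.
conn.execute(
    "CREATE TABLE word_counts ("
    "url TEXT, word TEXT, n INTEGER, PRIMARY KEY (url, word))"
)
conn.executemany(
    "INSERT INTO word_counts VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "alpha", 3),
        ("https://example.com/a", "bravo", 2),
        ("https://example.com/a", "charlie", 1),
        ("https://example.com/b", "alpha", 1),
        ("https://example.com/b", "bravo", 4),
        ("https://example.com/c", "alpha", 2),
        ("https://example.com/c", "delta", 5),
    ],
)


def top_words(conn, selected, top_n=10):
    """Top N words over URLs that contain every selected word."""
    selected = sorted(set(selected))
    if not selected:
        sql = (
            "SELECT word, SUM(n) AS total FROM word_counts "
            "GROUP BY word ORDER BY total DESC, word LIMIT ?"
        )
        return conn.execute(sql, [top_n]).fetchall()
    marks = ",".join("?" * len(selected))
    # A URL qualifies when it has one row for each selected word.
    sql = f"""
        SELECT word, SUM(n) AS total
        FROM word_counts
        WHERE url IN (
            SELECT url FROM word_counts
            WHERE word IN ({marks})
            GROUP BY url
            HAVING COUNT(*) = ?
        )
        AND word NOT IN ({marks})
        GROUP BY word
        ORDER BY total DESC, word
        LIMIT ?
    """
    params = selected + [len(selected)] + selected + [top_n]
    return conn.execute(sql, params).fetchall()


print(top_words(conn, []))
print(top_words(conn, ["alpha"]))
print(top_words(conn, ["alpha", "bravo"]))
```

The same table and query shape should carry over to other SQL engines, though that is untested here.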

          • DaveA
            1 month ago

            Ah of course!

            I guess one of the things Google originally solved was that the internet is full of crap and not all sites should have equal weighting. With AI spam sites these days, you’d probably also need a method of weighting results?

            • Onno (VK6FLAB)@lemmy.radio
              1 month ago

              We never got far enough to test that kind of issue. While I’ve been reimplementing it locally to search through employment advertising, I’m not at a point where I could test such a thing.

              The original implementation used a data store written by another team member and it made the original project much too complicated.

              Today I’d likely use DuckDB to implement it. My local version uses text files for a proof-of-concept implementation.

              • DaveA
                1 month ago

                It sounds like a really cool project regardless!