• Onno (VK6FLAB)@lemmy.radio
    2 months ago

    You store just the word count for each word on each URL.

    The search is pretty trivial in database terms, since you don’t need to do any wildcard or LIKE matching.
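
    A minimal sketch of that idea, not the original implementation: it assumes a single table of (word, url, count) rows and uses Python's built-in sqlite3 module. The table name `word_counts`, the helpers `build_index` and `search`, and the example URLs are all made up for illustration.

```python
import re
import sqlite3
from collections import Counter


def build_index(pages):
    """Store one row per (word, url) pair holding that word's count on that page."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE word_counts ("
        "word TEXT, url TEXT, count INTEGER, PRIMARY KEY (word, url))"
    )
    for url, text in pages.items():
        # Lowercase and split on anything that is not a letter or digit.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        con.executemany(
            "INSERT INTO word_counts VALUES (?, ?, ?)",
            [(word, url, n) for word, n in counts.items()],
        )
    return con


def search(con, word):
    """Exact-match lookup: an equality test on the word, no wildcards or LIKE."""
    rows = con.execute(
        "SELECT url, count FROM word_counts "
        "WHERE word = ? ORDER BY count DESC, url",
        (word.lower(),),
    )
    return rows.fetchall()


# Hypothetical pages, purely for illustration.
pages = {
    "https://example.com/a": "Radio operators love radio.",
    "https://example.com/b": "Amateur radio is a hobby.",
}
index = build_index(pages)
print(search(index, "radio"))
```

    Because the lookup is a plain equality test on the primary key, the database can answer it straight from the index, and a partial word such as "rad" simply returns nothing.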

    • DaveA
      2 months ago

      Ah of course!

      I guess one of the things Google originally solved was that the internet is full of crap and not all sites should have equal weighting. With AI spam sites these days, you’d probably also need a method of weighting results?

      • Onno (VK6FLAB)@lemmy.radio
        2 months ago

        We never got far enough to test that kind of issue, and while I’ve been reimplementing it locally to search through employment advertising, I’m not at a point where I’d be able to test such a thing.

        The original implementation used a data store written by another team member and it made the original project much too complicated.

        Today I’d likely use DuckDB to implement it. My local version uses text files for a proof-of-concept implementation.
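
        A rough sketch of what a text-file version could look like. The layout here is an assumption, not the actual local format: one tab-separated `word`, `url`, `count` line per entry in a single file, with hypothetical helpers `write_index` and `lookup` and made-up job ad URLs.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def write_index(path, pages):
    """Write one 'word<TAB>url<TAB>count' line per word on each page."""
    with open(path, "w", encoding="utf-8") as f:
        for url, text in pages.items():
            counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
            for word, n in sorted(counts.items()):
                f.write(f"{word}\t{url}\t{n}\n")


def lookup(path, word):
    """Scan the file for exact word matches, highest count first."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            w, url, n = line.rstrip("\n").split("\t")
            if w == word.lower():
                hits.append((url, int(n)))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical job ads, purely for illustration.
ads = {
    "https://example.com/job/1": "Python developer wanted. Python experience required.",
    "https://example.com/job/2": "Junior developer, some Python helpful.",
}
index_path = Path(tempfile.mkdtemp()) / "index.tsv"
write_index(index_path, ads)
print(lookup(index_path, "python"))
```

        A linear scan like this is fine for a proof of concept; moving the same three columns into a database table is what makes the lookup fast at scale.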

        • DaveA
          2 months ago

          It sounds like a really cool project regardless!