Run Code  | API  | Code Wall  | Users  | Misc  | Feedback  | Login  | Theme  | Privacy  | Patreon 

Full text search using Linqdb

 run  | edit  | history  | help 0

Building full text search engine without using any search specific libraries is tough. I've built stackse as a hobby project, the site searches Stack Overflow and allows special characters. The first version had a crawler (which is tough by itself) and complicated search library. The entire project was rather tangled and hard to maintain. Lessons learned were:

  • complexity is enemy, make things simple first and then complicate if necessary. Complex projects without any experience won't be done right the first time anyway.
  • complex projects must be made of small autonomous and testable components.
  • crawling is difficult to implement and is slow, parsing contents is slow, writing to disk is very slow
  • real time search of entire large data sets is slow too, most time is spent reading from disk and deserializing data. Therefore with limited hardware partial search is the solution.
  • surprisingly, cheap title search is most often the most useful one

In an unrelated project, being dissatisfied with sql server on one hand and amazed with Leveldb on the other, I've attempted to build small relational database based on Leveldb with the goal to have something efficient, yet relational, goals that are somewhat mutually exclusive. The result is Linqdb which is something in between two worlds.

At some point I've realized that Linqdb can be used to build full text search engine and it's quite simple.

So first of all I've downloaded data available here. Then a data model is needed, the final looks like this:

public class PostFragment
    public int Id { get; set; }
    public int QuestionId { get; set; }
    public string Text { get; set; }
public class AnswerFragment
    public int Id { get; set; }
    public byte[] Text { get; set; }

This is what we eventually need to get. PostFragment is small text fragment that we are actually going to search, so the search words are close to each other. We make PostFragment by scanning post text in a sliding window fashion, so for example for post of 300 words we get say 3 PostFragment: one with words 0-150, second 75-225 and third 150-300. AnswerFragment is something we return as a result when doing title search. The entire import code is here, after it we have FINAL_DATA db which for 2.4 mln posts takes 22 Gb of space and 10 hours to populate. We have 5 machines so in total 12 mln posts are indexed. So much for preparation.

Now the entire search algorithm looks like this:

public static List<ResultItem> SearchFragments(Db db, string query)

    var results = new List<ResultItem>();
    var interm = new List<IntermResult>();

    int max_step = db.Table<PostFragment>().LastStep();
    int step = 50;
    Stopwatch sp = new Stopwatch();
    for (int i = 0; i <= max_step; i += step)
        //partial search, check linqdb documentation for more info
        var res = db.Table<PostFragment>().Search(f => f.Text, query, false, i, step)
                    .Select(f => new { f.Id, f.Text, f.QuestionId })
                    .Select(f => new IntermResult
                                        Id = f.Id,
                                        Text = f.Text,
                                        QuestionId = f.QuestionId
        if (sp.ElapsedMilliseconds > 1000 || interm.GroupBy(f => f.QuestionId).Count() >= 10)
    foreach (var r in interm.GroupBy(f => f.QuestionId)
                            .OrderByDescending(f => f.First().Score) //not used
        var first_id = r.First().QuestionId;
        var item = new ResultItem()
            Fragment = r.First().Text,
            Id = r.First().QuestionId,
            Title = db.Table<WholePost>()
                      .Where(f => f.Id == first_id)
                      .Select(f => new { f.Title }).First().Title

    return results;

And title search:

public static List<ResultItem> SearchTitles(Db db, string query)
    var result_items = new List<ResultItem>();
    int count = 0;
    //WholePost is another class not mentioned above, which contains post title
    var res = db.Table<WholePost>().Search(f => f.Title, query)
                                   .OrderByDescending(f => f.Votes)
                                   .Select(f => new { f.Id, f.Title }, out count);
    foreach (var r in res)
        var afr = db.Table<AnswerFragment>()
                    .Where(f => f.Id == r.Id)
                    .Select(f => new { f.Text })
        var ri = new ResultItem()
            Title = r.Title,
            Fragment = afr != null ? Encoding.UTF8.GetString(afr.Text) : "",
            Id = r.Id


    return result_items.OrderByDescending(f => f.Score).Take(5).ToList();

And that's it! A lot could be done to improve search quality but the basics are laid and in a simple, efficient manner.

See it in action here. The code is available here. Linqdb project: here.

  by  renz, 5 years ago

Please log in to post a comment.