Run Code  | Code Wall  | Users  | Misc  | Feedback  | About  | Login  | Theme 

Full text search using Linqdb

      
            
                
            
 run  | edit  | history  | help 0

Building full text search engine without using any search specific libraries is tough. I've built stackse as a hobby project, the site searches Stack Overflow and allows special characters. The first version had a crawler (which is tough by itself) and complicated search library. The entire project was rather tangled and hard to maintain. Lessons learned were:

  • complexity is enemy, make things simple first and then complicate if necessary. Complex projects without any experience won't be done right the first time anyway.
  • complex projects must be made of small autonomous and testable components.
  • crawling is difficult to implement and is slow, parsing contents is slow, writing to disk is very slow
  • real time search of entire large data sets is slow too, most time is spent reading from disk and deserializing data. Therefore with limited hardware partial search is the solution.
  • surprisingly, cheap title search is most often the most useful one

In an unrelated project, being dissatisfied with sql server on one hand and amazed with Leveldb on the other, I've attempted to build small relational database based on Leveldb with the goal to have something efficient, yet relational, goals that are somewhat mutually exclusive. The result is Linqdb which is something in between two worlds.

At some point I've realized that Linqdb can be used to build full text search engine and it's quite simple.

So first of all I've downloaded data available here. Linqdb supports simple full text search on string properties. The search is very simple - all words in a query must be present in text. Also Linqdb supports partial search, i.e. we can search only first 1000 documents or only second 1000 documents and so on. Besides doing main post search we are also going to do title search and then combine results.

We could construct one text property where entire post would be, however such search would return results where words are far away from each other. Therefore smaller text fragments need to be constructed. The final data model looks like this:

public class PostFragment
{
    public int Id { get; set; }
    public int QuestionId { get; set; }
    public string Text { get; set; }
}
public class AnswerFragment
{
    public int Id { get; set; }
    public byte[] Text { get; set; }
}

This is what we eventually need to get. PostFragment is small text fragment that we are actually going to search, so the search words are close to each other. We make PostFragment by scanning post text in a sliding window fashion, so for example for post of 300 words we get say 3 PostFragment: one with words 0-150, second 75-225 and third 150-300. AnswerFragment is something we return as a result when doing title search. The entire import code is here. Importing 2.4 million posts results in 3 dbs (totaling 70Gb) and takes around 10 hours. Only the FINAL_DATA (22 Gb) is used for search. We have 5 machines so in total 12 million posts are indexed. So much for preparation.

Now the entire search algorithm looks like this:

public static List<ResultItem> SearchFragments(Db db, string query)
{

    var results = new List<ResultItem>();
    var interm = new List<IntermResult>();

    int max_step = db.Table<PostFragment>().LastStep();
    int step = 50;
    Stopwatch sp = new Stopwatch();
    sp.Start();
    for (int i = 0; i <= max_step; i += step)
    {
        //partial search, check linqdb documentation for more info
        var res = db.Table<PostFragment>().Search(f => f.Text, query, false, i, step)
                    .Select(f => new { f.Id, f.Text, f.QuestionId })
                    .Select(f => new IntermResult
                                    {
                                        Id = f.Id,
                                        Text = f.Text,
                                        QuestionId = f.QuestionId
                                    });
        interm.AddRange(res);
        if (sp.ElapsedMilliseconds > 1000 || interm.GroupBy(f => f.QuestionId).Count() >= 10)
        {
            break;
        }
    }
    foreach (var r in interm.GroupBy(f => f.QuestionId)
                            .OrderByDescending(f => f.First().Score) //not used
                            .Take(5))
    {
        var first_id = r.First().QuestionId;
        var item = new ResultItem()
        {
            Fragment = r.First().Text,
            Id = r.First().QuestionId,
            Title = db.Table<WholePost>()
                      .Where(f => f.Id == first_id)
                      .Select(f => new { f.Title }).First().Title
        };
        results.Add(item);
    }

    return results;
}

And title search:

public static List<ResultItem> SearchTitles(Db db, string query)
{
    var result_items = new List<ResultItem>();
    int count = 0;
    //WholePost is another class not mentioned above, which contains post title
    var res = db.Table<WholePost>().Search(f => f.Title, query)
                                   .OrderByDescending(f => f.Votes)
                                   .Take(10)
                                   .Select(f => new { f.Id, f.Title }, out count);
    foreach (var r in res)
    {
        var afr = db.Table<AnswerFragment>()
                    .Where(f => f.Id == r.Id)
                    .Select(f => new { f.Text })
                    .FirstOrDefault();
        var ri = new ResultItem()
        {
            Title = r.Title,
            Fragment = afr != null ? Encoding.UTF8.GetString(afr.Text) : "",
            Id = r.Id
        };

        result_items.Add(ri);

    }
    return result_items.OrderByDescending(f => f.Score).Take(5).ToList();
}

And that's it! There are 5 machines, each searching 2.4 million posts in this way. The master site gets results, aggregates and presents to the user. A lot could be done to improve search quality but the basics are laid and in a simple, efficient manner.

See it in action here. The code is available here. Linqdb project: here.

  by  renz, 1 months ago




Please log in to post a comment.