Wednesday, April 12, 2017

Asynchronous webcrawler: keeping track of objects across multiple websites

I'm currently writing a webcrawler for personal learning and to become familiar with asynchronous projects. I have a list of WebClients that download the specified initial URLs, using the event-based asynchronous pattern: the downloaded data comes back through a completion event and is stored in a BlockingCollection. A separate thread then pulls the data out and processes it. However, during processing I need to follow links and create new URLs from them. The problem I'm facing is this: when I create these sub-URLs and put them back into the download list, how do I keep track of the data from all the different pages that belong to one information-object?

The intermediate objects that carry information in the crawler:

public interface ICrawlingDownloadTask
{
    int Id { get; set; }
    Uri Uri { get; set; }
}

// User can modify this for a specific domain
public class CrawlingDownloadTask : ICrawlingDownloadTask
{
    public int Id { get; set; }
    public Uri Uri { get; set; }
}

public class CrawlingDownloadTaskResult
{
    public ICrawlingDownloadTask CrawlingDownloadTask { get; set; }
    public string Result { get; set; }
}
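
To make the relationship I'm after more concrete, this is roughly the aggregate I picture: one instance per information-object, keyed by the task Id, collecting the results of all sub-URLs that belong to it (a sketch only, nothing like this exists in the crawler yet; CrawlingObjectState and its members are names I made up):

using System.Collections.Concurrent;
using System.Threading;

public class CrawlingObjectState
{
    private int pendingDownloads;

    public int Id { get; set; }

    // results of all sub-urls that belong to this information-object
    public ConcurrentBag<CrawlingDownloadTaskResult> Results { get; } =
        new ConcurrentBag<CrawlingDownloadTaskResult>();

    // called once per sub-url enqueued for this object
    public void AddPending() => Interlocked.Increment( ref this.pendingDownloads );

    // records a finished sub-download; returns true when it was the last one
    public bool Complete( CrawlingDownloadTaskResult result )
    {
        this.Results.Add( result );
        return Interlocked.Decrement( ref this.pendingDownloads ) == 0;
    }
}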

The WebCrawler class (as pseudo-code):

public class WebCrawler
{
    private BlockingCollection<ICrawlingDownloadTask> CrawlingDownloadTasks { get; set; }
    private BlockingCollection<CrawlingDownloadTaskResult> CrawlingDownloadTaskResults { get; set; }
    private List<WebClient> WebClientQueue { get; set; }

    public void Start() =>
        Task.WaitAll( Task.Factory.StartNew( this.StartProcess  ) ,
                      Task.Factory.StartNew( this.StartDownload ) );

    private void StartProcess()
    {
        while( !this.HasFinished )
        {
            var crawlingDownloadTaskResult = default( CrawlingDownloadTaskResult );

            if( this.CrawlingDownloadTaskResults.TryTake( out crawlingDownloadTaskResult ) )
                Task.Factory.StartNew( () => this.OnProcessCrawlingDownloadTaskResult( crawlingDownloadTaskResult ) );
        }
    }

    private void StartDownload()
    {
        while( !this.HasFinished )
        {
            var webClient = this.WebClientQueue.FirstOrDefault( x => !x.IsBusy );
            var crawlingDownloadTask = default( ICrawlingDownloadTask );

            // only take a task when a free client is available,
            // otherwise the taken task would be lost
            if( webClient != null && this.CrawlingDownloadTasks.TryTake( out crawlingDownloadTask ) )
                webClient.DownloadStringAsync( crawlingDownloadTask.Uri , crawlingDownloadTask );
        }
    }

    // wired up to each WebClient's DownloadStringCompleted event during initialization
    public void OnDownloadStringCompleted( object sender , DownloadStringCompletedEventArgs e )
    {
        if( !e.Cancelled && e.Error == null )
            this.CrawlingDownloadTaskResults.TryAdd( new CrawlingDownloadTaskResult {
                CrawlingDownloadTask = e.UserState as ICrawlingDownloadTask ,
                Result = e.Result ,
            } );
    }

    public void OnProcessCrawlingDownloadTaskResult( CrawlingDownloadTaskResult crawlingDownloadTaskResult )
    {
        if( IsRoot( crawlingDownloadTaskResult.CrawlingDownloadTask.Uri ) ) // root of object, create new sub-urls
        {
            this.CrawlingDownloadTasks.TryAdd( new CrawlingDownloadTask {
                Uri = CreateUrl( crawlingDownloadTaskResult.CrawlingDownloadTask.Id , "img1" ) , // maybe generates links for all images 
                Id = crawlingDownloadTaskResult.CrawlingDownloadTask.Id ,
            } );
        }
    }
}
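
For context, here is how I imagine the processing side consuming such an aggregate, so that each state can be dropped as soon as its last sub-download arrives (again only a sketch; the states dictionary and this rewiring of OnProcessCrawlingDownloadTaskResult are my own placeholder additions):

private readonly ConcurrentDictionary<int, CrawlingObjectState> states =
    new ConcurrentDictionary<int, CrawlingObjectState>();

public void OnProcessCrawlingDownloadTaskResult( CrawlingDownloadTaskResult result )
{
    var state = this.states.GetOrAdd( result.CrawlingDownloadTask.Id ,
        id => new CrawlingObjectState { Id = id } );

    if( IsRoot( result.CrawlingDownloadTask.Uri ) )
    {
        // one AddPending per sub-url we enqueue
        state.AddPending();
        this.CrawlingDownloadTasks.TryAdd( new CrawlingDownloadTask {
            Uri = CreateUrl( result.CrawlingDownloadTask.Id , "img1" ) ,
            Id = result.CrawlingDownloadTask.Id ,
        } );
    }
    else if( state.Complete( result ) )
    {
        // last sub-download for this object: drop the state so it leaves memory
        this.states.TryRemove( state.Id , out var removed );
    }
}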

How I call it with the initial URLs:

public static void Main()
{
    var crawler = new WebCrawler();

    crawler.SetCrawlingDownloadTasks( Enumerable.Range( 0 , 10 )
        .Select( idx => new CrawlingDownloadTask {
            Id = idx ,
            Uri = new Uri( $"http://ift.tt/2p5sz3B{idx}" ) ,
        } ) );

    crawler.Start();
}

I do not want to store the data in a database and look it up every time I process a sub-URL; I'd like to keep it in memory for as long as needed, but as short as possible.

Is there a common pattern for adding a dependency from one object (e.g. the images of a Stack Overflow page) to its parent (e.g. the Stack Overflow root page)? And how do I notify the parent object on error/success?
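
The closest thing I have found so far for the notify-the-parent part is a TaskCompletionSource per parent object, roughly like this (a sketch of the idea, not wired into the crawler; ParentCompletion is a name I made up):

using System;
using System.Threading.Tasks;

// one instance per parent object: children report success or failure,
// whoever owns the parent simply awaits Completion
public class ParentCompletion
{
    private readonly TaskCompletionSource<CrawlingObjectState> tcs =
        new TaskCompletionSource<CrawlingObjectState>();

    // the parent's owner awaits this task
    public Task<CrawlingObjectState> Completion => this.tcs.Task;

    // called when state.Complete(...) returns true
    public void NotifySuccess( CrawlingObjectState state ) => this.tcs.TrySetResult( state );

    // called from OnDownloadStringCompleted when e.Error != null
    public void NotifyError( Exception error ) => this.tcs.TrySetException( error );
}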
