I'm currently writing a web crawler for personal learning and to become familiar with asynchronous programming. I have a list of WebClients that download the specified initial URLs, using the event-based asynchronous pattern (EAP): the downloaded data comes back in a completion event and is stored in a BlockingCollection, and a separate thread pulls the data out and processes it. However, during processing I need to follow links and create new URLs from them. The problem I'm facing is: when I create these sub-URLs and put them back into the download list, how do I keep track of the data from all the different pages that belong to one information object?
The intermediate objects that carry information in the crawler:
public interface ICrawlingDownloadTask
{
    int Id { get; set; }
    Uri Uri { get; set; }
}

// User can modify this for a specific domain
public class CrawlingDownloadTask : ICrawlingDownloadTask
{
    public int Id { get; set; }
    public Uri Uri { get; set; }
}

public class CrawlingDownloadTaskResult
{
    public ICrawlingDownloadTask CrawlingDownloadTask { get; set; }
    public string Result { get; set; }
}
The WebCrawler class (as pseudo code):
public class WebCrawler
{
    private BlockingCollection<ICrawlingDownloadTask> CrawlingDownloadTasks { get; set; }
    private BlockingCollection<CrawlingDownloadTaskResult> CrawlingDownloadTaskResults { get; set; }
    private List<WebClient> WebClientQueue { get; set; }

    public void Start() =>
        Task.WaitAll( Task.Factory.StartNew( this.StartProcess ) ,
                      Task.Factory.StartNew( this.StartDownload ) );

    private void StartProcess()
    {
        while( !this.HasFinished ) // termination flag, definition omitted
        {
            var crawlingDownloadTaskResult = default( CrawlingDownloadTaskResult );
            if( this.CrawlingDownloadTaskResults.TryTake( out crawlingDownloadTaskResult ) )
                Task.Factory.StartNew( () => this.OnProcessCrawlingDownloadTaskResult( crawlingDownloadTaskResult ) );
        }
    }

    private void StartDownload()
    {
        while( !this.HasFinished )
        {
            var webClient = this.WebClientQueue.FirstOrDefault( x => !x.IsBusy );
            var crawlingDownloadTask = default( ICrawlingDownloadTask );
            // check for a free client before taking a task, otherwise the
            // taken task would be dropped when all clients are busy
            if( webClient != null && this.CrawlingDownloadTasks.TryTake( out crawlingDownloadTask ) )
                webClient.DownloadStringAsync( crawlingDownloadTask.Uri , crawlingDownloadTask );
        }
    }

    // subscribed to each WebClient during initialization
    public void OnDownloadStringCompleted( object sender , DownloadStringCompletedEventArgs e )
    {
        if( !e.Cancelled && e.Error == null )
            this.CrawlingDownloadTaskResults.TryAdd( new CrawlingDownloadTaskResult {
                CrawlingDownloadTask = e.UserState as ICrawlingDownloadTask ,
                Result = e.Result ,
            } );
    }

    public void OnProcessCrawlingDownloadTaskResult( CrawlingDownloadTaskResult crawlingDownloadTaskResult )
    {
        // IsRoot and CreateUrl are domain-specific helpers (omitted)
        if( IsRoot( crawlingDownloadTaskResult.CrawlingDownloadTask.Uri ) ) // root of an object, create new sub-urls
        {
            this.CrawlingDownloadTasks.TryAdd( new CrawlingDownloadTask {
                Uri = CreateUrl( crawlingDownloadTaskResult.CrawlingDownloadTask.Id , "img1" ) , // maybe generates links for all images
                Id = crawlingDownloadTaskResult.CrawlingDownloadTask.Id ,
            } );
        }
    }
}
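Two pieces are only referenced above: the WebClient wiring mentioned in the "subscribed during initialization" comment, and the SetCrawlingDownloadTasks call used in Main below. Inside WebCrawler they look roughly like this (a sketch; the pool size of 10 is an arbitrary choice of mine):

// Sketch of the omitted wiring inside WebCrawler: the constructor
// subscribes each WebClient to OnDownloadStringCompleted, and
// SetCrawlingDownloadTasks seeds the initial urls.
public WebCrawler()
{
    this.CrawlingDownloadTasks = new BlockingCollection<ICrawlingDownloadTask>();
    this.CrawlingDownloadTaskResults = new BlockingCollection<CrawlingDownloadTaskResult>();
    this.WebClientQueue = Enumerable.Range( 0 , 10 ) // pool size is arbitrary
        .Select( _ =>
        {
            var webClient = new WebClient();
            webClient.DownloadStringCompleted += this.OnDownloadStringCompleted;
            return webClient;
        } )
        .ToList();
}

public void SetCrawlingDownloadTasks( IEnumerable<ICrawlingDownloadTask> crawlingDownloadTasks )
{
    foreach( var crawlingDownloadTask in crawlingDownloadTasks )
        this.CrawlingDownloadTasks.TryAdd( crawlingDownloadTask );
}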
How I call it with the initial URLs:
public static void Main()
{
    var crawler = new WebCrawler();
    crawler.SetCrawlingDownloadTasks( Enumerable.Range( 0 , 10 )
        .Select( idx => new CrawlingDownloadTask {
            Id = idx ,
            Uri = new Uri( $"http://example.com/{idx}" ) , // placeholder url
        } ) );
    crawler.Start();
}
I do not want to store the data in a DB and look it up every time I process a sub-URL; I'd like to keep it in memory for as long as needed, but as short as possible.
Is there a common pattern for adding a dependency from one object (e.g. the images of a Stack Overflow page) to its parent (e.g. the Stack Overflow root page)? And how do I notify the parent object on error/success?
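To make the kind of dependency I have in mind concrete, here is a rough sketch (all names are invented): a parent object that counts its outstanding sub-downloads and fires a callback when the last one reports back, so it can be flushed and dropped from memory:

using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch only: a root page keeps its partial data in memory and counts
// outstanding sub-downloads; the last child to report back completes it.
public class InformationObject
{
    private int pendingChildren;
    private readonly ConcurrentBag<string> childResults = new ConcurrentBag<string>();

    public int Id { get; set; }
    public string RootResult { get; set; }

    // register every child before the first one can complete,
    // e.g. before queueing the sub-download tasks
    public void RegisterChild() => Interlocked.Increment( ref this.pendingChildren );

    // called on success (result != null) or error (result == null);
    // completes the object once nothing is pending
    public void ChildCompleted( string result , Action<InformationObject> onCompleted )
    {
        if( result != null )
            this.childResults.Add( result );
        if( Interlocked.Decrement( ref this.pendingChildren ) == 0 )
            onCompleted( this ); // all children done: flush and forget
    }
}

The crawler could then hold a ConcurrentDictionary<int, InformationObject> keyed by the task Id: an entry is added when a root result is processed and removed in the completion callback, so each object stays in memory only while its sub-downloads are in flight. But I'd like to know whether there is an established pattern for this.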