vendredi 11 septembre 2020

Handling data quality issues on medium csv report. Best Practices

Need help with a better pratices question

I have an azure function that brings data form differents APIs and match them toguether to create a final csv report. I have a poblation of 60k-100k and 30 columns

For the sake of the explanation, I'm going to use a small School example.

public Student {
    string Grade {get; set;}
    Name   LegName {get; set;}
    string FatherName {get; set;}
    string TeacherId {get; set;}
    string SchoolId {get; set;}
}

public Name {
    string FirstName {get; set;}
    string LastName {get; set;}
}

Before constructing the report, I create two Dictionary with <Id, Name> from two APIs that expose Schools and Teachers information. And of course, a list of Student that comes from the Student APIs. I have no control of this trhee APIs, design, data quality, nothing.

Now, when I have all the data, I start to create the report.

string GenerateTXT(Dictionary<string, string> schools, Dictionary<string,  string> teachers, Student students){
    StringBuilder content = new StringBuilder();

    foreach(var student in students){
        content.Append($"{student.Grade}\t");
        content.Append($"{student.LegName.FirstName}\t");
        content.Append($"{student.LegName.LastName}\t");
        content.Append($"{schools.TryGetValue(student.TeacherId)}\t");
        content.Append($"{teachers.TryGetValue(student.SchoolId)}t";        
        content.Append($"{student.FatherNme}\t");
        content.AppendLine();
    }

    return content.ToString();    
}

Now here comes the problem. I started noticing data quality issues so the function started throwing exceptions. For example, students who do not have a valid school or teacher, or a student who does not have a name. I tried to solve expected scenarios and exception handling.

string GenerateTXT(Dictionary<string, string> schools, Dictionary<string,  string> teachers, Student students){
    StringBuilder content = new StringBuilder();
    var value = string.Empty;
    foreach(var student in students){
        try {
            content.Append($"{student.Grade}\t");
            content.Append($"{student.LegName.FirstName}\t");
            content.Append($"{student.LegName.LastName}\t");            
            if(teachers.TryGetValue(student.TeacherId))
                content.Append($"{teachers[student.TeacherId]}\t");
            else
                content.Append($"\t");
            if(schools.TryGetValue(student.SchoolId))
                content.Append($"{schools[student.SchoolId]}\t");
            else
                content.Append($"\t");            
            content.Append($"{student.FatherNme}\t");
            content.AppendLine();
        }
        catch(Exception ex) {
            log.Error($"Error reading worker {student.FirstName}");
        }
        
    }
    return content.ToString();
}

The problem with this is that when an unexpected error happens, I stop reading the next columns of data that maybe I have and instead jump to the next worker. Therefore, if a student for some random reason does not have a name, that row in the report will only have the grade, and nothing else, but I actually had the rest of the values. So here comes the question. I could put a try catch on each column, but remember that my real scenario has like 30 columns and could be more... so I think it's a really bad solution. Is there a pattern to solve this in a better way?

Thanks in advance!

Aucun commentaire:

Enregistrer un commentaire