Tuesday, 26 May 2015

Generate CSV test data at random from template

I am going to have to generate many CSV files that will contain random-ish data. There will be rules about the fields, such as some will be integers, some should be names picked from a particular list, some will be text generated from a Markov chain with a given source, etc.

I would like to make this flexible, so the CSV specification/template could be changed without needing to make any coding changes. My first thought is to have the template itself also be a CSV which will map the field name to the rules for generating data, so a template might be something like the following:

device id,randint,unique
description,markov,tech.source
vendor,choice,vendors.txt

And from that I could generate a CSV with 3 fields, the first being an ID made of random ints with a "unique" constraint, the second being a narrative description from a Markov chain using "tech.source" as the chain definition, and the vendor being a random selection from those defined in another text file.
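A minimal sketch of the template-plus-factory idea, assuming each template line is exactly `name,rule,argument`. The names `RULES`, `SOURCES`, `make_randint`, `make_choice` and `load_template` are hypothetical, the in-memory `SOURCES` dict stands in for files such as vendors.txt, and the Markov rule is left out to keep the example short:

```python
import csv
import io
import random

# Hypothetical stand-in for the files named in a template (e.g. vendors.txt).
SOURCES = {"vendors.txt": ["Acme", "Globex", "Initech"]}

def make_randint(arg, rng):
    """Build a generator of random ints; the argument 'unique' forbids repeats."""
    seen = set()
    def gen():
        while True:
            value = rng.randint(0, 10**9)
            if arg != "unique" or value not in seen:
                seen.add(value)
                return value
    return gen

def make_choice(arg, rng):
    """Build a generator that picks a random entry from the named source."""
    options = SOURCES[arg]
    return lambda: rng.choice(options)

# The factory: rule name from the template -> function that builds a generator.
RULES = {"randint": make_randint, "choice": make_choice}

def load_template(text, rng):
    """Map each field name to a zero-argument generator, in template order."""
    fields = {}
    for name, rule, arg in csv.reader(io.StringIO(text)):
        fields[name] = RULES[rule](arg, rng)
    return fields

TEMPLATE = "device id,randint,unique\nvendor,choice,vendors.txt\n"
rng = random.Random(0)
generators = load_template(TEMPLATE, rng)
rows = [{name: gen() for name, gen in generators.items()} for _ in range(5)]
```

Supporting a new rule then means adding one entry to `RULES`; the template format itself does not change.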

Since there could be 80 or so fields in a single generated CSV, and there might be many related CSV files that refer to elements in each other (e.g., there may be a list of people, and hardware may have an owner who must be in the list of people), I'm inclined to create a class that encapsulates each line of the CSV as generated.

That class will store a dict mapping field names to values, then those can be passed to a csv.DictWriter (this is likely to be implemented in Python) that was created with a list of the fields in the correct order. The classes will be able to implement comparison operators to help map the different types, and to track internal consistency about which software is on which host and who is responsible for it and such things.
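As a rough sketch of that row class, assuming a hypothetical `Row` wrapper and `write_rows` helper, and a people list that is made up for illustration (an in-memory buffer replaces a real output file):

```python
import csv
import io
import random

class Row:
    """One generated CSV line: a dict mapping field name to value."""
    def __init__(self, values):
        self.values = dict(values)

    def __eq__(self, other):
        # Two rows are equal when every field matches.
        return isinstance(other, Row) and self.values == other.values

def write_rows(rows, fieldnames):
    """Write rows through csv.DictWriter, with columns in the given order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row.values)
    return buffer.getvalue()

rng = random.Random(1)
people = [Row({"name": "alice"}), Row({"name": "bob"})]
# Every hardware owner is drawn from the people list, so the files stay consistent.
hardware = [
    Row({"device id": i, "owner": rng.choice(people).values["name"]})
    for i in range(3)
]
output = write_rows(hardware, ["device id", "owner"])
```

Drawing the owner from rows that were already generated is one simple way to keep related files consistent; richer comparison operators could be added to `Row` in the same way as `__eq__`.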

The main sticking point is the mapping of field names to generation rules. Using a CSV template and a factory seems to be the only option I can come up with, but something nags at me that all this could be done a bit more elegantly. Can anyone help me figure out whether there is a cleaner pattern or structure, one that is easier to extend in the future, that I should be considering?

Perhaps part of what nags at me is the work of creating the CSV templates described above. Once I have a definition of how to generate "device id", it seems like it could become painful to copy and paste or retype it every time I want to make another template. Perhaps, then, I should define the rules for generating everything in a common file, so that each template just provides the field names.
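That last idea could look something like the following sketch, where `FIELD_RULES`, the two template lists and `generate_row` are hypothetical names, and the value lists are invented for illustration:

```python
import random

rng = random.Random(2)

# Shared definitions: each field's generation rule is written exactly once.
FIELD_RULES = {
    "device id": lambda: rng.randint(1, 10**6),
    "vendor": lambda: rng.choice(["Acme", "Globex"]),
    "owner": lambda: rng.choice(["alice", "bob"]),
}

# A template is now just an ordered list of field names.
HARDWARE_TEMPLATE = ["device id", "vendor", "owner"]
INVENTORY_TEMPLATE = ["device id", "vendor"]

def generate_row(template):
    """Build one row by looking up each field's shared rule."""
    return {name: FIELD_RULES[name]() for name in template}

hardware_row = generate_row(HARDWARE_TEMPLATE)
inventory_row = generate_row(INVENTORY_TEMPLATE)
```

With this split, changing how "device id" is generated is a one-line edit to the shared definitions, and every template that lists the field picks it up.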
