Reading and writing Avro files using Apache.Avro library (C#)

Apache Avro is a data serialization library that has been ported to many languages, including C#. Examples of how to read and write objects from files with the C# library are hard to find, and advice on optimizing that code for performance is harder still. Most of the examples are in Java, and although the code is pretty similar, there are still differences.

All the code in this article can be found in: here

When using the Avro reader and writer, you can work with either GenericRecord or SpecificRecord. GenericRecord is useful when the schema is not known at compile time. You can access the fields of this record either by name or by index (index is faster). SpecificRecord, on the other hand, is ideal when you know the schema at compile time. You write a class for the record (in the C# library this means implementing the ISpecificRecord interface) and get explicit, typed access to your fields.
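The two styles can be sketched as follows. This is a minimal illustration rather than code from the article's repository: the User class, the RecordAccessSketch helper and the UserSchema field name are hypothetical, and the schema is the same one used in the writing example later in this article.

```csharp
using System;
using Avro;
using Avro.Generic;
using Avro.Specific;

// Namespace kept in line with the "namespace" entry of the schema below.
namespace myNameSpace
{
    // Hypothetical specific record: typed properties plus the positional
    // Get/Put methods that ISpecificRecord requires.
    public class User : ISpecificRecord
    {
        public static readonly Schema UserSchema = Schema.Parse(
            "{\"type\":\"record\",\"namespace\":\"myNameSpace\",\"name\":\"User\"," +
            "\"fields\":[{\"name\":\"Name\",\"type\":\"string\"}," +
            "{\"name\":\"ID\",\"type\":\"int\"}]}");

        public string Name { get; set; }
        public int ID { get; set; }

        public Schema Schema => UserSchema;

        // Field positions follow the order of "fields" in the schema.
        public object Get(int fieldPos)
        {
            switch (fieldPos)
            {
                case 0: return Name;
                case 1: return ID;
                default: throw new AvroRuntimeException("Bad index " + fieldPos + " in Get()");
            }
        }

        public void Put(int fieldPos, object fieldValue)
        {
            switch (fieldPos)
            {
                case 0: Name = (string)fieldValue; break;
                case 1: ID = (int)fieldValue; break;
                default: throw new AvroRuntimeException("Bad index " + fieldPos + " in Put()");
            }
        }
    }

    public static class RecordAccessSketch
    {
        // Builds a GenericRecord, filling one field by name and one by index.
        public static GenericRecord BuildGenericUser()
        {
            var schema = (RecordSchema)User.UserSchema;
            var record = new GenericRecord(schema);
            record.Add("Name", "user1"); // access by field name
            record.Add(1, 1234);         // access by field position
            return record;
        }
    }
}
```

With the generic record you read values back as object, for example (string)record["Name"] or (int)record.GetValue(1), while the specific class exposes user.Name and user.ID directly.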

Reading a file using GenericRecord

The code first opens a DataFileReader. Notice that we use a fairly complicated overload of OpenReader. This overload allows us to provide a factory for the DatumReader, the class responsible for reading a single object from the file. GenericDatumReader is faster than the default reader. We pass null as the reader schema, which means that the data is read using the schema it was written with.
// Requires: using System; using System.IO; using Avro; using Avro.File; using Avro.Generic;
static void Main(string[] args)
{
    using (var stream = File.OpenRead(@"weather.avro"))
    {
        using (var reader = DataFileReader<GenericRecord>.OpenReader(stream, null, (ws, rs) => new GenericDatumReader<GenericRecord>(ws, rs)))
        {
            PrintHeader(reader);

            foreach (var entry in reader.NextEntries)
            {
                Print(entry);
            }
        }
    }
}

private static void PrintHeader(IFileReader<GenericRecord> reader)
{
    var schema = (RecordSchema)reader.GetSchema();

    foreach(var field in schema.Fields)
    {
        Console.Write(field.Name + ", ");
    }
    Console.WriteLine();

}

private static void Print(GenericRecord entry)
{
    for(int i = 0; i < entry.Schema.Count; i++)
    {
        Console.Write(entry.GetValue(i) + ", ");
    }
    Console.WriteLine();
}

Writing a file using GenericRecord

To write an Avro file you first need to define a JSON schema. The code defines a schema, creates a GenericRecord, fills it with data and finally appends it to the DataFileWriter. As with the reader, we explicitly specify GenericDatumWriter, which is the faster DatumWriter.
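// File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes from a longer old file
using(var stream = File.Create(@"users.avro"))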
{
    var schemaJson = "{\"type\" : \"record\", " +
        "\"namespace\" : \"myNameSpace\", " +
        "\"name\" : \"User\", " +
        "\"fields\" : " +
        "[" +
        "{ \"name\" : \"Name\" , \"type\" : \"string\" }," +
        "{ \"name\" : \"ID\" , \"type\" : \"int\" }]}";

    var recordSchema = (RecordSchema)Schema.Parse(schemaJson);
    using (var writer = DataFileWriter<GenericRecord>.OpenWriter(new GenericDatumWriter<GenericRecord>(recordSchema), stream))
    {
        var record = new GenericRecord(recordSchema);
        record.Add(0, "user1");
        record.Add(1, 1234);
        writer.Append(record);
    }
}
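To check that the file is written correctly, you can read it back with the same reader setup used earlier. The sketch below is only an illustration under a few assumptions: the UsersRoundTrip class and WriteThenRead method are hypothetical names, and the two sample users are made up.

```csharp
using System.Collections.Generic;
using System.IO;
using Avro;
using Avro.File;
using Avro.Generic;

public static class UsersRoundTrip
{
    // Same schema as in the writing example above.
    private const string SchemaJson =
        "{\"type\":\"record\",\"namespace\":\"myNameSpace\",\"name\":\"User\"," +
        "\"fields\":[{\"name\":\"Name\",\"type\":\"string\"}," +
        "{\"name\":\"ID\",\"type\":\"int\"}]}";

    // Writes two sample users to 'path', then reads them back as text rows.
    public static List<string> WriteThenRead(string path)
    {
        var schema = (RecordSchema)Schema.Parse(SchemaJson);

        using (var stream = File.Create(path))
        using (var writer = DataFileWriter<GenericRecord>.OpenWriter(
            new GenericDatumWriter<GenericRecord>(schema), stream))
        {
            for (int i = 1; i <= 2; i++)
            {
                var record = new GenericRecord(schema);
                record.Add(0, "user" + i); // Name
                record.Add(1, 1000 + i);   // ID
                writer.Append(record);
            }
        }

        var rows = new List<string>();
        using (var stream = File.OpenRead(path))
        using (var reader = DataFileReader<GenericRecord>.OpenReader(
            stream, null, (ws, rs) => new GenericDatumReader<GenericRecord>(ws, rs)))
        {
            // Reader schema is null, so the writer's schema from the file header is used.
            foreach (var entry in reader.NextEntries)
            {
                rows.Add(entry.GetValue(0) + ", " + entry.GetValue(1));
            }
        }
        return rows;
    }
}
```

Calling UsersRoundTrip.WriteThenRead("users.avro") should return one row per appended record, in the order they were written.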
