Data today comes in many formats, such as CSV, JSON and plain text. Python provides the tools needed to parse all of them. In this article we will learn how to parse a large CSV file using Python.

We will use Python's csv package and pandas to parse the CSV file. I have taken a CSV file of baby names from here. This file contains 250K records of baby names. I am using this data to generate a user database for testing a login API I have built.

Reading with Python csv package

The goal is to read the name from the existing CSV file and create two new fields from it: an email address, built from the name plus a 4-digit random number, and a password, which is a ten-character random string. I will be using Python's string and random packages to achieve this. Let’s look at the random string function.
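A minimal sketch of such a helper, using only the string and random packages mentioned above. The choice of alphabet (ASCII letters plus digits) is my assumption; the text only says the result is a random string.

```python
import random
import string


def random_string(length=10):
    """Return a random string of ASCII letters and digits (alphabet is an assumption)."""
    alphabet = string.ascii_letters + string.digits
    # Pick one character at a time until we have `length` of them.
    return "".join(random.choice(alphabet) for _ in range(length))


print(random_string())   # a 10-character string, different on every run
print(random_string(4))  # a 4-character string
```

This is fine for throwaway test data. For real credentials, Python's secrets module would be the safer choice.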

The above method generates a random string that is 10 characters long by default. The following code iterates over the records in the CSV file and creates the fields described above using the random_string function.
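The iteration step could look roughly like the sketch below. The function name build_users and the example.com domain are hypothetical, since the text does not say which domain the emails use. The column name "name" comes from the input file, and random_string is repeated so the block runs on its own.

```python
import csv
import io
import random
import string


def random_string(length=10):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def build_users(rows, domain="example.com"):
    """Yield one user dict per input row (hypothetical helper; domain is assumed)."""
    for row in rows:
        name = row["name"]
        number = random.randint(1000, 9999)  # always four digits
        yield {
            "name": name,
            "email": f"{name.lower()}{number}@{domain}",
            "password": random_string(),
        }


# Tiny in-memory stand-in for the real file; these rows are made up.
sample = io.StringIO("year,name,percent,sex\n1880,John,0.08,boy\n1880,Mary,0.07,girl\n")
for user in build_users(csv.DictReader(sample)):
    print(user)
```

Because build_users is a generator, it handles one row at a time and never holds the whole file in memory, which matters for a file with 250K records.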

Finally, we will use the above two functions to generate the user data we need.
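One way to wire the pieces together is sketched here. The names write_users and build_users, the file paths in the usage comment and the example.com domain are all assumptions for illustration. The two helpers are restated so the block is self-contained.

```python
import csv
import random
import string


def random_string(length=10):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def build_users(rows, domain="example.com"):
    """Yield name, email and password for each input row (domain is assumed)."""
    for row in rows:
        name = row["name"]
        email = f"{name.lower()}{random.randint(1000, 9999)}@{domain}"
        yield {"name": name, "email": email, "password": random_string()}


def write_users(input_path, output_path):
    """Stream rows from input_path and write the generated users to output_path."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["name", "email", "password"])
        writer.writeheader()
        writer.writerows(build_users(reader))


# Usage (hypothetical paths):
# write_users("baby-names.csv", "users.csv")
```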

The above code reads the input CSV file and writes the name, email and password to the output CSV file. The complete source code is available on GitHub.

Reading with pandas package

pandas provides methods to read and manipulate CSV files in a few lines of code. The following snippet shows how to load a CSV file and display the top five records:
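A minimal sketch of that step, assuming pandas is installed. To keep it runnable without the real file, it reads a small in-memory sample whose rows are made up. With the actual data you would pass the file path to read_csv instead.

```python
import io

import pandas as pd

# Made-up rows with the same four columns as the input file.
SAMPLE_CSV = """year,name,percent,sex
1880,John,0.08,boy
1880,William,0.08,boy
1880,James,0.05,boy
1880,Mary,0.07,girl
1880,Anna,0.03,girl
1880,Emma,0.02,girl
"""

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# head() returns the first five rows by default.
print(df.head())
```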

The input CSV file contains four columns: year, name, percent and sex. We need only the name field, so we drop the other fields.

The drop method removes fields from the dataframe. axis=1 specifies that we want to remove columns rather than rows, and inplace=True modifies the current dataframe instead of returning a new one without the deleted columns. head displays the top 5 records in the dataframe.
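The drop call described above can be sketched as follows. The dataframe is built from made-up values so the example runs on its own. Only the column names are taken from the input file.

```python
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame(
    {
        "year": [1880, 1880, 1880],
        "name": ["John", "William", "Mary"],
        "percent": [0.08, 0.08, 0.07],
        "sex": ["boy", "boy", "girl"],
    }
)

# axis=1 targets columns; inplace=True mutates df and returns None.
df.drop(["year", "percent", "sex"], axis=1, inplace=True)

print(df.head())
```

An equivalent that avoids mutating in place would be df = df[["name"]], which keeps only the column we need.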

Now we add two new fields to the dataframe, email and password, created in the same way as described above:
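A sketch of how the two columns might be added. The helper names make_email and random_string and the example.com domain are assumptions, and the names in the dataframe are placeholders.

```python
import random
import string

import pandas as pd


def random_string(length=10):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def make_email(name, domain="example.com"):
    """Build name + 4-digit random number + assumed domain."""
    return f"{name.lower()}{random.randint(1000, 9999)}@{domain}"


# Placeholder names; in the article this dataframe comes from the CSV file.
df = pd.DataFrame({"name": ["John", "William", "Mary"]})

# apply() calls the function once per name.
df["email"] = df["name"].apply(make_email)
# One fresh password per row.
df["password"] = [random_string() for _ in range(len(df))]

print(df.head())
```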

The complete source code is available on GitHub.

