Today, data is available in various formats such as CSV, JSON, and TXT. Python provides all the tools needed to parse these files. In this article, we will learn how to parse a large CSV file using Python.
We will use Python's csv package and pandas to parse the CSV file. I have taken a CSV file of baby names from here. This file contains 250K records of baby names. I am using this data to generate a user database to test the login API I have built.
Reading with Python
The goal is to read the name from the existing CSV file and create two fields: an identifier built from the name plus a 4-digit random number, and a password, a ten-character random string. I will be using Python's random package to achieve this. Let’s look at the random string function.
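A minimal sketch of such a helper, assuming the string is drawn from ASCII letters and digits. The function name random_string and the choice of alphabet are my assumptions for illustration, not necessarily what the original listing used.

```python
import random
import string


def random_string(length=10):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


print(random_string())   # ten random characters
print(random_string(4))  # four random characters
```

For throwaway test data this is enough; for real credentials the standard library's secrets module is the safer choice.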
The above method generates a random string with a default length of 10 characters. The following code iterates over the records in the CSV file and creates the fields mentioned above.
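One way this iteration could look, sketched with csv.DictReader. The field name username, the helper generate_users, and the two sample rows are assumptions for illustration; only the name column is taken from the article.

```python
import csv
import io
import random
import string


def random_string(length=10):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def generate_users(rows):
    """Yield a (username, password) pair for each row with a 'name' field."""
    for row in rows:
        # randint(1000, 9999) always produces exactly four digits.
        username = f"{row['name']}{random.randint(1000, 9999)}"
        yield username, random_string()


# Small in-memory stand-in for the real file (made-up sample rows).
sample = io.StringIO("name,sex\nEmma,F\nLiam,M\n")
users = list(generate_users(csv.DictReader(sample)))
for username, password in users:
    print(username, password)
```

Because generate_users is a generator, records are processed one at a time, which keeps memory use flat even for a file with 250K rows.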
Finally, we use the above two functions to generate the user data we need.
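A self-contained sketch of that final step, reading names from one CSV source and writing the generated fields to another. The function write_users, the output column names, and the sample rows are assumptions; in-memory buffers stand in for the real input and output files.

```python
import csv
import io
import random
import string


def random_string(length=10):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


def write_users(infile, outfile):
    """Read names from infile and write username/password rows to outfile."""
    reader = csv.DictReader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["username", "password"])  # assumed header names
    for row in reader:
        username = f"{row['name']}{random.randint(1000, 9999)}"
        writer.writerow([username, random_string()])


# In-memory stand-ins for the input and output files (made-up sample rows).
source = io.StringIO("name,sex\nEmma,F\nLiam,M\n")
target = io.StringIO()
write_users(source, target)
print(target.getvalue())
```

With real files, both would be opened with newline="" as the csv module documentation recommends, for example open(path, newline="") for reading and open(path, "w", newline="") for writing.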
The above code reads the input CSV file and writes the generated fields, including the password, to the output CSV file. The complete source code is available on GitHub.
Reading with pandas
pandas provides methods to read and manipulate CSV files in a few lines of code. The following snippet shows how to load a CSV file and display the top five records:
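A short sketch of loading and previewing a CSV with pandas. An in-memory string stands in for the real file, the rows are made up, and the column names id and year are placeholders; only name and sex come from the article.

```python
import io

import pandas as pd

# Made-up sample data; "id" and "year" are placeholder column names.
csv_text = (
    "id,name,year,sex\n"
    "1,Emma,2015,F\n"
    "2,Liam,2015,M\n"
    "3,Olivia,2015,F\n"
    "4,Noah,2015,M\n"
    "5,Ava,2015,F\n"
    "6,Ethan,2015,M\n"
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
print(df.head())
```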
The input CSV file contains four columns, including name and sex. We need only the name field, so we drop the other fields.
The drop method removes the given fields from the dataframe, axis=1 specifies that we want to remove columns rather than rows, and inplace=True changes the current dataframe instead of returning a new one without the deleted columns. head displays the top 5 records in the dataframe.
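The steps above can be sketched as follows. Since the full list of column names is not reproduced here, this version drops every column except name instead of listing the others explicitly; id and year in the sample data are placeholders, and the rows are made up.

```python
import io

import pandas as pd

# Made-up sample data; "id" and "year" are placeholder column names.
csv_text = (
    "id,name,year,sex\n"
    "1,Emma,2015,F\n"
    "2,Liam,2015,M\n"
    "3,Olivia,2015,F\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Collect every column other than "name" and drop them.
# axis=1 targets columns; inplace=True modifies df and returns None.
other_columns = [col for col in df.columns if col != "name"]
df.drop(other_columns, axis=1, inplace=True)

print(df.head())
```

An alternative with the same result is df = df[["name"]], which selects the wanted column and avoids inplace mutation.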
Now we add two new fields to the dataframe, the generated identifier and the password, created in the same way as mentioned above:
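A sketch of adding the two columns, assuming the identifier column is called username and reusing a random_string helper like the one described earlier. Both names and the sample rows are assumptions for illustration.

```python
import random
import string

import pandas as pd


def random_string(length=10):
    """Return a random string of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


# Made-up sample data standing in for the dataframe with only "name" left.
df = pd.DataFrame({"name": ["Emma", "Liam", "Olivia"]})

# Name followed by a 4-digit random number.
df["username"] = df["name"].apply(
    lambda name: f"{name}{random.randint(1000, 9999)}"
)

# One fresh ten-character random string per row.
df["password"] = [random_string() for _ in range(len(df))]

print(df.head())
```

The list comprehension for password calls random_string once per row; assigning random_string() directly would give every row the same value.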
The complete source code is available on GitHub.