
Introduction
This blog analyzes the San Francisco Incidents dataset and extracts some meaningful insights from it using the OpenRefine tool. OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleansing it and transforming it from one format to another, such as CSV, TSV, XML, HTML table, ODF spreadsheet, and Excel. Please refer to our SFO City Crime Analysis with R blog, where we cleansed the same San Francisco Incidents dataset using R.
In OpenRefine, the following operations can be performed easily:
- Import data in various formats like TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents
- Export datasets into other formats (CSV, TSV, XML, HTML table, ODF spreadsheet, Excel) in a matter of seconds
- Apply basic and advanced cell transformations
- Cluster cell data easily
- Fetch cell data from external APIs or URLs
- Filter and partition data easily using regular expressions
- Perform advanced data operations with the General Refine Expression Language (GREL)
Use Case
Let’s reuse the same SFPD (San Francisco Police Department) Crime Incident Reporting system dataset for the calendar year 2013, as discussed in the SFO City Crime Analysis with R blog. This dataset contains close to 130K records, each describing the type of crime, the date and time of the incident, the day of the week, and the latitude and longitude of the incident.
In this use case, we cleanse the dataset with the OpenRefine tool and then extract some meaningful insights from it.
What we need to do:
- Prerequisites
- Download the Crime Incident Dataset
- Data Extraction & Exploration
- Data Manipulation
- Export cleansed datasets and Data Visualization
- How to use Macros
Solution
Prerequisites
- Install JDK 1.6+
- Download the stable version of the OpenRefine server: http://openrefine.org/download.html
- After the download completes, extract the zip file and run the OpenRefine server from the command prompt
Download the Crime Incident Dataset
- To download and understand the SFPD Crime Incident Reporting system dataset for the calendar year 2013, follow the steps in the section listed below from our previous blog “SFO City Crime Analysis with R”.
- Download Crime Incident Dataset (ignore the Install Packages subtopic).
Data Extraction & Exploration
- Open OpenRefine (http://127.0.0.1:3333/) in the browser.
- From the OpenRefine home page, select Create Project -> choose the file “sfpd_incident_2013.csv” from the download location and click on Next. This leads to the project preview page.
- On the project preview page, rename the project to “sfpd_incident_2013” and, if needed, configure the other parsing options. For this use case, nothing apart from renaming the project is required. Next, click on the “Create Project >>” button.
- After you click on the Create Project button, the CSV file is first loaded into heap memory and then stored in the workspace location.
- Once the loading process completes, the project opens in the browser.
Data Manipulation
Data Manipulation 1: Remove Duplicates – based on IncidntNum column
- To sort the IncidntNum column in ascending order:
# Sort rows based on the IncidntNum column
IncidntNum -> Sort...
Select “numbers” and “smallest first” in the popup menu, then click OK
# Arrange sorted rows permanently
Sort -> Reorder rows permanently
- To find and remove the duplicate rows based on IncidntNum:
# Blank out duplicate values
IncidntNum -> Edit Cells -> Blank down
# Separate records with blank values from records with numeric values
IncidntNum -> Facet -> Customized facets -> Facet by blank
# The facet shows two options, true and false; click on true
# Remove all selected rows
All -> Edit rows -> Remove all matching rows
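For readers who want to check the result outside OpenRefine, the sort, blank-down and remove steps above amount to keeping the first row for each incident number. The following is a minimal Python sketch of that idea, not OpenRefine's own implementation; the function name and the sample rows are made up for illustration.

```python
import csv
import io


def drop_duplicate_incidents(rows, key="IncidntNum"):
    """Sort rows by incident number and keep the first row for each number."""
    # Numeric sort, mirroring "Sort... -> numbers, smallest first".
    ordered = sorted(rows, key=lambda row: int(row[key]))
    seen = set()
    unique = []
    for row in ordered:
        if row[key] in seen:
            # Same effect as blanking down the repeat and removing the blank row.
            continue
        seen.add(row[key])
        unique.append(row)
    return unique


# Hypothetical sample data; only the column names follow the dataset.
sample_csv = """IncidntNum,Category
130000002,ASSAULT
130000001,ROBBERY
130000002,WARRANTS
"""
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))
deduped = drop_duplicate_incidents(sample_rows)
print([row["IncidntNum"] for row in deduped])
```

Because Python's sort is stable, the row that survives for a repeated incident number is the one that appeared first in the input.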
Data Manipulation 2: Create a new incident_time column with the format (HH:mm:ss) based on the existing Time column
# Create incident_time column with format (HH:mm:ss) based on the Time column
Time -> Edit column -> Add Column based on this column
Column Name: incident_time
GREL Expression: value+":00"
Data Manipulation 3: Create a new incident_date column with the format (yyyy-MM-dd) based on the existing Date column
# Create incident_date column with format (yyyy-MM-dd) based on the Date column
Date -> Edit column -> Add Column based on this column
Column Name: incident_date
GREL Expression: toString(toDate(value),"yyyy-MM-dd")
Data Manipulation 4: Create a new incident_date_time column by merging the incident_date and incident_time columns
# Create incident_date_time column by merging incident_date and incident_time
incident_date -> Edit column -> Add Column based on this column
Column Name: incident_date_time
GREL Expression: cells["incident_date"].value + " " + cells["incident_time"].value
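Data Manipulations 2 to 4 can be cross-checked with a few lines of Python. This is only a sketch under two assumptions: the raw Time value looks like HH:mm, and the raw Date value looks like MM/dd/yyyy. The second format is a guess, since GREL's toDate detects the format on its own; adjust date_format to match your file. The function name is hypothetical.

```python
from datetime import datetime


def build_incident_columns(date_text, time_text, date_format="%m/%d/%Y"):
    """Return (incident_date, incident_time, incident_date_time) for one row."""
    # Manipulation 2: append seconds, like the GREL expression value+":00".
    incident_time = time_text + ":00"
    # Manipulation 3: reparse the date and print it as yyyy-MM-dd.
    # date_format is an assumption about the raw CSV, not taken from the dataset.
    incident_date = datetime.strptime(date_text, date_format).strftime("%Y-%m-%d")
    # Manipulation 4: join both values with a single space.
    incident_date_time = incident_date + " " + incident_time
    return incident_date, incident_time, incident_date_time


print(build_incident_columns("01/31/2013", "23:45"))
```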
Data Manipulation 5: Create an Address column based on the PdDistrict and Location columns
# Create Address column in title case based on the PdDistrict and Location columns
PdDistrict -> Edit column -> Add Column based on this column
Column Name: Address
GREL Expression: cells["Location"].value.toTitlecase() + " - " + cells["PdDistrict"].value.toTitlecase()
Data Manipulation 6: Rename X and Y as longitude and latitude
# Rename the X column to longitude
X -> Edit column -> Rename this column
Column Name: longitude
# Rename the Y column to latitude
Y -> Edit column -> Rename this column
Column Name: latitude
Data Manipulation 7: Group the incidents based on the incident_time column and create a new column incident_time_tag
# Group incidents into Early Morning (00:00 – 05:59), Morning (06:00 – 11:59), Evening (12:00 – 17:59), Night (18:00 – 23:59)
incident_time -> Edit column -> Add Column based on this column
Column Name: incident_time_tag
GREL Expression:
if((value.match(/(0[0-5]:[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Early Morning",
if((value.match(/((0[6-9]|1[01]):[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Morning",
if((value.match(/(1[2-7]:[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Evening",
if((value.match(/((1[8-9]|2[0-3]):[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Night",
value))))
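The same bucketing can be expressed by looking only at the hour. The Python sketch below is an alternative formulation for checking the tags, not a translation of the regular expressions: unlike the GREL expression, it does not validate the minutes and seconds. The function name is hypothetical.

```python
def incident_time_tag(incident_time):
    """Map an HH:mm:ss string to one of the four time-of-day tags."""
    try:
        hour = int(incident_time[:2])
    except ValueError:
        # Like the GREL fallback, return the value unchanged when it cannot be tagged.
        return incident_time
    if 0 <= hour <= 5:
        return "Early Morning"
    if 6 <= hour <= 11:
        return "Morning"
    if 12 <= hour <= 17:
        return "Evening"
    if 18 <= hour <= 23:
        return "Night"
    return incident_time


for sample in ("05:59:00", "06:00:00", "17:59:00", "18:00:00"):
    print(sample, "->", incident_time_tag(sample))
```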
Data Manipulation 8: Group the Category column and create a new column crime_category
# Create new column crime_category by grouping the incident’s category
Category -> Edit column -> Add Column based on this column
Column Name: crime_category
GREL Expression:
if(contains("MISSING PERSON, KIDNAPPING",value),"KIDNAPPING",
if(contains("SEX OFFENSES, FORCIBLE, PROSTITUTION, SEX OFFENSES, NON FORCIBLE, PORNOGRAPHY/OBSCENE MAT",value),"Sex",
if(contains("DRIVING UNDER THE INFLUENCE, DRUG/NARCOTIC, DRUNKENNESS, LIQUOR LAWS",value),"DRUGS",
if(contains("FORGERY/COUNTERFEITING, FRAUD, BAD CHECKS",value),"FRAUD",
if(contains("BURGLARY, ROBBERY, STOLEN PROPERTY, EXTORTION",value),"ROBBERY",
if(contains("NON-CRIMINAL, SUICIDE",value),"NON-CRIMINAL",
if(contains("BRIBERY, DISORDERLY CONDUCT, FAMILY OFFENSES, GAMBLING, LOITERING, RUNAWAY, OTHER OFFENSES, SUSPICIOUS OCC",value),"OTHER OFFENSES",
if(contains("VANDALISM, ARSON",value),"ARSON",
if(contains("LARCENY/THEFT, VEHICLE THEFT, RECOVERED VEHICLE, EMBEZZLEMENT",value),"THEFT",
value)))))))))
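The grouping above is essentially a lookup table from category to group. The Python sketch below shows that reading of it. It assumes that “SEX OFFENSES, FORCIBLE” and “SEX OFFENSES, NON FORCIBLE” are each a single category name, and it matches whole category names exactly, whereas the GREL contains() call does a substring check. The names CRIME_GROUPS and crime_category are made up for illustration.

```python
# Group label -> categories, copied from the GREL expression in this step.
CRIME_GROUPS = {
    "KIDNAPPING": ["MISSING PERSON", "KIDNAPPING"],
    "Sex": [
        "SEX OFFENSES, FORCIBLE",
        "PROSTITUTION",
        "SEX OFFENSES, NON FORCIBLE",
        "PORNOGRAPHY/OBSCENE MAT",
    ],
    "DRUGS": ["DRIVING UNDER THE INFLUENCE", "DRUG/NARCOTIC", "DRUNKENNESS", "LIQUOR LAWS"],
    "FRAUD": ["FORGERY/COUNTERFEITING", "FRAUD", "BAD CHECKS"],
    "ROBBERY": ["BURGLARY", "ROBBERY", "STOLEN PROPERTY", "EXTORTION"],
    "NON-CRIMINAL": ["NON-CRIMINAL", "SUICIDE"],
    "OTHER OFFENSES": [
        "BRIBERY", "DISORDERLY CONDUCT", "FAMILY OFFENSES", "GAMBLING",
        "LOITERING", "RUNAWAY", "OTHER OFFENSES", "SUSPICIOUS OCC",
    ],
    "ARSON": ["VANDALISM", "ARSON"],
    "THEFT": ["LARCENY/THEFT", "VEHICLE THEFT", "RECOVERED VEHICLE", "EMBEZZLEMENT"],
}

# Invert the table once so each lookup is a single dictionary access.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in CRIME_GROUPS.items()
    for category in categories
}


def crime_category(category):
    """Return the group for a category, or the category itself if it is ungrouped."""
    return CATEGORY_TO_GROUP.get(category, category)


print(crime_category("VEHICLE THEFT"))
print(crime_category("ASSAULT"))
```

Keeping the mapping in a table makes it easier to spot a category that has been listed twice than in a long chain of nested if() calls.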
Data Manipulation 9: Remove unwanted columns
# Remove the description, PdDistrict, Location, Date and Time columns
All -> Edit columns -> Re-order / remove columns
Drag description, PdDistrict, Location, Date and Time into “Drop columns here to remove”
Export cleansed datasets and Data Visualization
- Export Data: OpenRefine can export the cleansed dataset in various file formats such as TSV, CSV, and HTML table.
- To export, select Export -> Comma-separated value (to export into another format, select the corresponding option from the drop-down menu).
- Data Visualization: OpenRefine provides a Scatterplot facet option to visualize data. This option plots columns that contain only numeric values. For this use case, we plot the latitude and longitude columns.
- To plot the visual, select IncidntNum -> Facet -> Scatterplot facet.
How to use Macros
- In this blog, we have cleansed only the SFPD Crime Incident Report for the calendar year 2013.
- We can perform the same operations on the reports for other years (e.g. SFPD Crime Report for 2011, 2012, 2014) by applying a single macro, without repeating every step.
- After completing the data manipulation, we can extract a macro covering all the steps above and apply it to other datasets with a similar structure.
- Applying the macro to a new dataset performs all the manipulations at once, which avoids repeating the steps.
To extract the macro:
- Click on the Undo/Redo tab -> Extract… on the current project page.
- Select the operations to include using the checkboxes, then copy the generated macro to the clipboard.
To apply the macro:
- Go to a new project whose dataset is similar to this one
- Click on Undo/Redo Tab -> Apply…
- Paste the macro in the text area and click on Perform Operations.
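The extracted macro is a JSON array with one object per operation, so it can be inspected before it is pasted into another project. The Python sketch below lists the steps in such an array. The two sample operations are hand-written approximations of what OpenRefine extracts, included only for illustration; real output carries additional fields, and the function name is hypothetical.

```python
import json

# Illustrative, hand-written approximation of two extracted operations.
sample_operations = [
    {
        "op": "core/column-addition",
        "description": "Create column incident_time based on column Time",
        "baseColumnName": "Time",
        "newColumnName": "incident_time",
        "expression": 'grel:value+":00"',
    },
    {
        "op": "core/column-rename",
        "description": "Rename column X to longitude",
        "oldColumnName": "X",
        "newColumnName": "longitude",
    },
]
history_json = json.dumps(sample_operations, indent=2)


def summarize_operations(history_text):
    """Return (operation id, description) pairs for each step in the macro."""
    return [
        (step.get("op"), step.get("description"))
        for step in json.loads(history_text)
    ]


for op, description in summarize_operations(history_json):
    print(op, "|", description)
```

A quick listing like this helps confirm that the macro contains only the steps meant for the other years' datasets, for example that no one-off manual edit has slipped in.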
Conclusion
The OpenRefine tool is designed especially for non-programmers and end users who need to cleanse datasets according to their own requirements. It is a popular tool capable of statistical analysis, text analysis, classification, and clustering of data, and of exporting data into other formats.
There are many public datasets, across domains such as City Management and Ethics, Transportation, and Health and Social Services, that can be used for data manipulation with the OpenRefine tool.
References
OpenRefine:
http://openrefine.org/
http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
http://casci.umd.edu/wp-content/uploads/2013/12/OpenRefine-tutorial-v1.5.pdf
Prepare SQL using OpenRefine:
http://googlerefine.blogspot.ca/2014/04/prepare-sql-query-using-openrefine.html
GREL Functions:
https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions
OpenRefine Documentation:
Developer: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Developers
Users: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users