Channel: Flamingo
Viewing all articles
Browse latest Browse all 98

SFO City Crime Analysis with OpenRefine


Introduction

This blog post analyzes the San Francisco Incidents dataset and extracts some meaningful insights using OpenRefine. OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleansing it and transforming it from one format to another, such as CSV, TSV, XML, HTML table, ODF spreadsheet, and Excel. Please refer to our SFO City Crime Analysis with R blog, where we cleansed the same San Francisco Incidents dataset using R.

In OpenRefine, the following operations are straightforward:

  • Import data in various formats like TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents
  • Export datasets into other formats (CSV, TSV, XML, HTML table, ODF spreadsheet, Excel) in a matter of seconds
  • Apply basic and advanced cell transformations
  • Cluster cell data easily
  • Fetch cell data from external APIs or URLs
  • Filter and partition data easily using regular expressions
  • Perform advanced data operations with the General Refine Expression Language (GREL)

Use Case

Let’s reuse the same SFPD (San Francisco Police Department) Crime Incident Reporting system dataset for the calendar year 2013, as discussed in the SFO City Crime Analysis with R blog. This dataset contains close to 130K records, each described by the type of crime, the date and time of the incident, the day of the week, and the latitude and longitude of the incident.

In this use case, we cleanse the dataset with OpenRefine and then analyze it to extract some meaningful insights.

What we need to do:

  • Prerequisites
  • Download the Crime Incident Dataset
  • Data Extraction & Exploration
  • Data Manipulation
  • Export cleansed datasets and Data Visualization
  • How to use Macros

Solution

Prerequisites

  • Install JDK 1.6+
  • Download the stable version of the OpenRefine server
  • After the download completes, extract the zip file and run the OpenRefine server from the command prompt

Download The Crime Incident Dataset

  • To download and understand the SFPD Crime Incident Reporting system dataset for the calendar year 2013, follow the steps in the section listed below from our previous blog, “SFO City Crime Analysis with R”.
    • Download Crime Incident Dataset (ignore the Install Packages subtopic).

Data Extraction & Exploration

  • Open ‘OpenRefine‘ (http://127.0.0.1:3333/) in the browser.
  • From the OpenRefine home page, select Create Project -> choose the file “sfpd_incident_2013.csv” from the download location and click on Next. This leads to the project preview page.
  • On the project preview page, rename the project to “sfpd_incident_2013” and, if needed, configure the other parsing options. For this use case, nothing apart from renaming the project is required. Next, click on the “Create Project >>” button.
  • After you click on the Create Project button, the CSV file is loaded into heap memory and then stored in the workspace location.
  • Once the loading process completes, the project will open up in the browser as shown below.

Data Manipulation

Data Manipulation 1: Remove Duplicates – based on IncidntNum column

  • To sort IncidntNum in ascending order:
    # Sort rows based on the IncidntNum column
    IncidntNum -> Sort...
    Select "numbers" and "smallest first" in the popup menu, then click OK
    # Make the sorted row order permanent
    Select Sort -> Reorder rows permanently


  • To find and remove the duplicate rows based on IncidntNum:
    # Blank out duplicate values
    IncidntNum -> Edit cells -> Blank down
    # Separate the rows with blank values from the rows with numeric values
    IncidntNum -> Facet -> Customized facets -> Facet by blank
    # The facet shows two options, true and false; select the blank rows
    Click on true
    # Remove all selected rows
    All -> Edit rows -> Remove all matching rows


  • After removing the blank records, click on the false facet to see the remaining rows, one per unique IncidntNum.
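
For readers who want to cross-check the result outside OpenRefine, the same "keep one row per IncidntNum" idea can be sketched in Python. This is only an illustration, not part of the OpenRefine workflow: the function name and the sample rows are hypothetical, and the sketch keeps the first row seen for each incident number instead of sorting and blanking down.

```python
def drop_duplicate_incidents(rows, key="IncidntNum"):
    """Keep the first row seen for each incident number, preserving order."""
    seen = set()
    unique_rows = []
    for row in rows:
        incident = row[key]
        if incident in seen:
            continue  # a row with this incident number was already kept
        seen.add(incident)
        unique_rows.append(row)
    return unique_rows


# Hypothetical sample rows; the real dataset has many more columns.
sample = [
    {"IncidntNum": "130000001", "Category": "ASSAULT"},
    {"IncidntNum": "130000001", "Category": "ASSAULT"},
    {"IncidntNum": "130000002", "Category": "VANDALISM"},
]
deduped = drop_duplicate_incidents(sample)
print(len(deduped))
```

Comparing the row count from such a script with the row count shown in OpenRefine is a quick way to confirm that the Blank down step removed what we expected.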

Data Manipulation 2: Create a new incident_time column with the format (HH:mm:ss) based on the existing Time column

# Create incident_time column with format (HH:mm:ss) based on Time column
Time -> Edit column -> Add Column based on this column
Column Name: incident_time
GREL Expression: value+":00"


Data Manipulation 3: Create the new incident_date column with the format (yyyy-MM-dd) based on existing Date column

# Create incident_date column with format (yyyy-MM-dd) based on Date column
Date -> Edit column -> Add Column based on this column
Column Name: incident_date
GREL Expression: toString(toDate(value),"yyyy-MM-dd")
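
As a rough cross-check of this expression, the same reformatting can be sketched in Python. We assume here that the raw Date column holds values such as 01/31/2013 (month/day/year); that input format and the function name are assumptions, so adjust source_format if your export differs.

```python
from datetime import datetime


def to_incident_date(raw, source_format="%m/%d/%Y"):
    """Reformat a raw date string as yyyy-MM-dd (source format is an assumption)."""
    return datetime.strptime(raw, source_format).strftime("%Y-%m-%d")


print(to_incident_date("01/31/2013"))  # 2013-01-31
```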


Data Manipulation 4: Create new incident_date_time column by merging both incident_date and incident_time columns

# Create incident_date_time column by merging incident_date and incident_time
incident_date -> Edit column -> Add Column based on this column
Column Name: incident_date_time
GREL Expression: cells["incident_date"].value + " " + cells["incident_time"].value
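
The merge itself is plain string concatenation. A minimal Python sketch of the same step, with an added parse that rejects malformed values, could look like this (the function name is hypothetical, and the validation is an extra that the GREL expression does not perform):

```python
from datetime import datetime


def to_incident_date_time(incident_date, incident_time):
    """Join date and time with a space, as the GREL expression does."""
    combined = incident_date + " " + incident_time
    # Extra check, not in the GREL step: raises ValueError on a malformed timestamp.
    datetime.strptime(combined, "%Y-%m-%d %H:%M:%S")
    return combined


# "23:45" + ":00" mirrors how incident_time was built from the Time column.
print(to_incident_date_time("2013-01-31", "23:45" + ":00"))  # 2013-01-31 23:45:00
```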


Data Manipulation 5: Create an Address column based on the PdDistrict and Location columns

# Create the Address column in title case based on the PdDistrict and Location columns
PdDistrict -> Edit column -> Add Column based on this column
Column Name: Address
GREL Expression: cells["Location"].value.toTitlecase() + " - " + cells["PdDistrict"].value.toTitlecase()
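
To illustrate what this expression produces, here is a small Python sketch. It capitalizes each space-separated word, which is our assumption about how toTitlecase() behaves; the function names and the sample values are made up for the example.

```python
def title_case(text):
    """Capitalize each space-separated word and lower-case the rest of it."""
    return " ".join(word.capitalize() for word in text.split(" "))


def build_address(location, pd_district):
    """Mirror the GREL step: '<Location> - <PdDistrict>', both in title case."""
    return title_case(location) + " - " + title_case(pd_district)


print(build_address("800 Block of BRYANT ST", "SOUTHERN"))
# 800 Block Of Bryant St - Southern
```

We avoid Python's built-in str.title() here because it also capitalizes letters that follow digits (for example "16TH" becomes "16Th").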


Data Manipulation 6: Rename X and Y as longitude and latitude

# Rename X column name into longitude
X -> Edit column -> Rename this column
Column Name: longitude
# Rename Y column name into latitude
Y -> Edit column -> Rename this column
Column Name: latitude


Data Manipulation 7: Group the incidents by the incident_time column into a new column, incident_time_tag

#Grouping incidents by Early Morning (00:00 – 05:59), Morning (06:00 – 11:59), Evening (12:00 – 17:59), Night (18:00 – 23:59)
incident_time -> Edit column -> Add Column based on this column
Column Name: incident_time_tag
GREL Expression: if((value.match(/(0[0-5]:[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Early Morning",if((value.match(/((0[6-9]|1[01]):[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Morning",if((value.match(/(1[2-7]:[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Evening",if((value.match(/((1[8-9]|2[0-3]):[0-5][0-9]:[0-5][0-9])/)[0]!=null),"Night",value))))
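
The nested if() calls are hard to read, so here is the same bucketing logic sketched in Python. It uses the four time ranges listed in the comment above and, like the GREL expression, falls back to the original value when the text is not a valid HH:mm:ss time. The function name is hypothetical.

```python
import re

# Matches a full HH:mm:ss value and captures the hour.
_TIME_PATTERN = re.compile(r"([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]")


def incident_time_tag(incident_time):
    """Map an HH:mm:ss string to one of the four time-of-day tags."""
    match = _TIME_PATTERN.fullmatch(incident_time)
    if match is None:
        return incident_time  # same fallback as the GREL expression
    hour = int(match.group(1))
    if hour < 6:
        return "Early Morning"  # 00:00 to 05:59
    if hour < 12:
        return "Morning"        # 06:00 to 11:59
    if hour < 18:
        return "Evening"        # 12:00 to 17:59
    return "Night"              # 18:00 to 23:59


for sample in ["05:59:00", "06:00:00", "17:59:00", "18:00:00"]:
    print(sample, "->", incident_time_tag(sample))
```

Checking the boundary values (05:59, 06:00, 11:59, 12:00 and so on) against a text facet on incident_time_tag is a cheap way to verify the regular expressions.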


Data Manipulation 8: Group the Category column into a new column, crime_category

#Create new column crime_category by grouping the incident’s category
Category -> Edit column -> Add Column based on this column
Column Name: crime_category
GREL Expression: if(contains("MISSING PERSON, KIDNAPPING",value),"KIDNAPPING",if(contains("SEX OFFENSES, FORCIBLE, PROSTITUTION, SEX OFFENSES, NON FORCIBLE, PORNOGRAPHY/OBSCENE MAT",value),"Sex",if(contains("DRIVING UNDER THE INFLUENCE, DRUG/NARCOTIC, DRUNKENNESS, LIQUOR LAWS",value),"DRUGS",if(contains("FORGERY/COUNTERFEITING, FRAUD, BAD CHECKS",value),"FRAUD",if(contains("BURGLARY, ROBBERY, STOLEN PROPERTY, EXTORTION",value),"ROBBERY",if(contains("NON-CRIMINAL, SUICIDE",value),"NON-CRIMINAL",if(contains("BRIBERY, DISORDERLY CONDUCT, FAMILY OFFENSES, GAMBLING, LOITERING, RUNAWAY, OTHER OFFENSES, SUSPICIOUS OCC",value),"OTHER OFFENSES",if(contains("VANDALISM, ARSON",value),"ARSON",if(contains("LARCENY/THEFT, VEHICLE THEFT, RECOVERED VEHICLE, EMBEZZLEMENT",value),"THEFT",value)))))))))
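
The grouping is easier to review as a lookup table. The Python sketch below uses the same groups and labels as the expression, but with exact matching on the category name instead of the substring test that contains() performs, so it is an approximation and not a literal translation. The names CATEGORY_GROUPS and crime_category are ours.

```python
# Group label -> original Category values, taken from the GREL expression.
CATEGORY_GROUPS = {
    "KIDNAPPING": ["MISSING PERSON", "KIDNAPPING"],
    "Sex": ["SEX OFFENSES, FORCIBLE", "PROSTITUTION",
            "SEX OFFENSES, NON FORCIBLE", "PORNOGRAPHY/OBSCENE MAT"],
    "DRUGS": ["DRIVING UNDER THE INFLUENCE", "DRUG/NARCOTIC",
              "DRUNKENNESS", "LIQUOR LAWS"],
    "FRAUD": ["FORGERY/COUNTERFEITING", "FRAUD", "BAD CHECKS"],
    "ROBBERY": ["BURGLARY", "ROBBERY", "STOLEN PROPERTY", "EXTORTION"],
    "NON-CRIMINAL": ["NON-CRIMINAL", "SUICIDE"],
    "OTHER OFFENSES": ["BRIBERY", "DISORDERLY CONDUCT", "FAMILY OFFENSES",
                       "GAMBLING", "LOITERING", "RUNAWAY", "OTHER OFFENSES",
                       "SUSPICIOUS OCC"],
    "ARSON": ["VANDALISM", "ARSON"],
    "THEFT": ["LARCENY/THEFT", "VEHICLE THEFT", "RECOVERED VEHICLE",
              "EMBEZZLEMENT"],
}

# Invert the table once so each lookup is a single dictionary access.
_LOOKUP = {
    category: group
    for group, categories in CATEGORY_GROUPS.items()
    for category in categories
}


def crime_category(category):
    """Return the group label, or the category itself when it is not grouped."""
    return _LOOKUP.get(category, category)


print(crime_category("VEHICLE THEFT"))  # THEFT
print(crime_category("ASSAULT"))        # ASSAULT
```

Categories that appear in none of the lists, such as ASSAULT, keep their original value, which matches the final fallback in the GREL expression.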


Data Manipulation 9: Remove unwanted columns

# Remove the description, PdDistrict, Location, Date and Time columns
All -> Edit columns -> Re-order / remove columns
Drag description, PdDistrict, Location, Date and Time into "Drop columns here to remove"


Export cleansed datasets and Data Visualization

  • Export Data: OpenRefine can export the cleansed dataset in various file formats such as TSV, CSV and HTML table.
  • To export, select Export -> Comma-separated value (to export into another format, select the corresponding option from the drop-down menu).
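
Once exported, the CSV can be consumed by any downstream tool. As a small illustration of the file layout we expect, here is a Python sketch that writes a few cleansed columns to CSV text. The sample row is made up, and the column selection is only an example.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize the selected columns of each row as CSV text with a header."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys that are not in the column list.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical cleansed row using column names created in the steps above.
cleansed = [
    {"IncidntNum": "130000001", "crime_category": "THEFT",
     "incident_time_tag": "Night"},
]
csv_text = rows_to_csv(
    cleansed, ["IncidntNum", "crime_category", "incident_time_tag"]
)
print(csv_text)
```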


  • Data Visualization: OpenRefine provides a Scatterplot facet option to visualize data. It is useful only for columns that hold numeric values. For this use case, we plot the latitude and longitude columns.
  • To plot the visual, select IncidntNum -> Facet -> Scatterplot facet.
    Figure: longitude vs. latitude scatterplot

How to use Macros

  • In this blog, we have cleansed only the SFPD Crime Incident Report for the calendar year 2013.
  • We can apply the same operations to other years (e.g. SFPD Crime Reports for 2011, 2012, 2014) with a single macro, without repeating each step.
  • After completing the data manipulation, we can extract a macro covering all the steps above and apply it to other datasets with the same structure.
  • Applying the macro to a new dataset runs all the manipulations at once, which avoids repeating the steps by hand.

To extract the macro:

  • Click on Undo/Redo Tab -> Extract… on the current project page.
  • Select the steps to include by ticking their checkboxes, then copy the generated macro (JSON) from the text area to the clipboard.


To apply the macro:

  • Go to the new project whose dataset has the same structure as this one
  • Click on Undo/Redo Tab -> Apply
  • Paste the macro in the text area and click on Perform Operations.
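
Before pasting a macro into another project, it can help to review which steps it contains. The sketch below parses a macro with Python's json module and lists the steps. The JSON shown is a hand-written illustration of the two rename steps from Data Manipulation 6; the field names ("op", "oldColumnName", "newColumnName", "description") are our assumption about the extracted format, so compare them with the JSON your own OpenRefine version produces.

```python
import json

# Hand-written example; field names are an assumption, not copied from a real export.
extracted_macro = """
[
  {
    "op": "core/column-rename",
    "oldColumnName": "X",
    "newColumnName": "longitude",
    "description": "Rename column X to longitude"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Y",
    "newColumnName": "latitude",
    "description": "Rename column Y to latitude"
  }
]
"""

operations = json.loads(extracted_macro)
for number, step in enumerate(operations, start=1):
    print(number, step["op"], "-", step["description"])
```

If json.loads raises an error, the text was probably truncated while copying and should be extracted again.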


Conclusion

OpenRefine is designed for non-programmers and end users who need to cleanse datasets according to their own requirements. It is a popular tool that supports statistical analysis, text analysis, classification and clustering of data, as well as export into other formats.
There are many public datasets, across domains such as City Management and Ethics, Transportation, and Health and Social Services, that can be used for data manipulation with OpenRefine.

References

OpenRefine:
http://openrefine.org/
http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
http://casci.umd.edu/wp-content/uploads/2013/12/OpenRefine-tutorial-v1.5.pdf

Prepare SQL using OpenRefine:
http://googlerefine.blogspot.ca/2014/04/prepare-sql-query-using-openrefine.html

GREL Functions:
https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

OpenRefine Documentation:
Developer: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Developers
Users: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users

 

