Batch Streaming to Automatically Read in Data Counts

Working with large datasets is inevitable these days, so being able to process data in batches can significantly reduce analysis time. This program uses PySpark to import data from a directory and run counts.

The Data

All files were provided by the Bellevue University DSC650 GitHub repository.

Technologies

Completed in Jupyter Notebook. The following packages are used:

PySpark
shutil
time
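
As a rough illustration of how these three packages could fit together, here is a minimal sketch. It is not the notebook's actual code. It assumes the data files are plain text, that files are copied one at a time into a watched folder to simulate arriving batches, and that "counts" means a running count of each distinct line. The folder name `stream_input` and the helper names `stage_files`, `start_count_query`, and `main` are hypothetical.

```python
import shutil
import time
from pathlib import Path


def stage_files(source_dir, input_dir, delay_seconds=0.0):
    """Copy files one at a time into the watched folder, pausing between copies.

    Hypothetical helper: simulates data arriving in batches.
    """
    source = Path(source_dir)
    target = Path(input_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file():
            shutil.copy(path, target / path.name)
            copied.append(path.name)
            time.sleep(delay_seconds)
    return copied


def start_count_query(spark, input_dir):
    """Start a streaming query that keeps a running count of each distinct line.

    Assumes plain-text input; each line arrives in a column named "value".
    """
    lines = (
        spark.readStream
        .option("maxFilesPerTrigger", 1)  # read at most one new file per micro-batch
        .text(str(input_dir))
    )
    counts = lines.groupBy("value").count()
    return (
        counts.writeStream
        .outputMode("complete")  # reprint the full table of counts after each batch
        .format("console")
        .start()
    )


def main():
    # Imported here so the staging helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("BatchStreamingCounts").getOrCreate()
    Path("stream_input").mkdir(exist_ok=True)  # the watched folder must exist first
    query = start_count_query(spark, "stream_input")
    stage_files("data", "stream_input", delay_seconds=2)
    query.awaitTermination(30)  # let the last micro-batches finish (seconds)
    query.stop()
    spark.stop()
```

Calling `main()` from a notebook cell would start the query, feed it the contents of the data folder one file at a time, and print updated counts to the console as each file is picked up.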

Launch

All necessary code is included in the Jupyter notebook. The data files can be found in the data folder.
