Data Science and Developers
Amid today’s data science mania, here is a quick view of how data moves from its raw state to a useful report. As a developer, you can see where you or your skills fit in this flow.
Image courtesy : O'Reilly
1. Source / Scrape : Today's data source could be anything: wiki pages, a Splunk query, a server log, or any device that produces data in any format (.csv, .log, .json, .xml, .xls, etc.).
2. Data Cleansing : Pandas is a good option, but it is not the only one. The Community Edition of Pentaho PDI can help with data cleansing. SSIS (authored in SSDT) has transformations that help with data cleansing. MySQL with regular expressions (REGEXP) can also help clean data.
2a. Database : The database is key. Any RDBMS (such as Microsoft SQL Server, Postgres, MySQL, or Oracle) will do, but my vote is for MySQL 8, as it offers NoSQL capability with an API (the document store and X DevAPI).
3. Explore : Jupyter / Anaconda / IPython + Pandas + Matplotlib is a good combination. Spark with Zeppelin will also work, but setting up the cluster is too much work.
4. Deliver : A REST API is one option; accessing the data via the DB (using ORM tools such as SQLAlchemy) or direct SQL is a time-saving option.
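
To make the flow concrete, here is a minimal sketch of the stages above using only the Python standard library. It rests on several assumptions that are not part of the original post: the sample log data, the table name `requests`, and the helper names `cleanse`, `load`, `explore`, and `deliver` are all hypothetical. SQLite stands in for whichever RDBMS you prefer, and plain `csv` + `re` stand in for Pandas.

```python
import csv
import io
import json
import re
import sqlite3

# 1. Source: hypothetical raw server-log extract in CSV form (made-up data).
RAW_CSV = """host,status,bytes
 web-01 ,200,512
web-02,500,n/a
WEB-01,200, 1024
web-02,200,256
"""


def cleanse(raw_text):
    """2. Data cleansing: normalize host names, drop rows with non-numeric bytes."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        host = rec["host"].strip().lower()
        size = rec["bytes"].strip()
        if not re.fullmatch(r"\d+", size):
            continue  # skip rows such as "n/a"
        rows.append((host, int(rec["status"]), int(size)))
    return rows


def load(rows):
    """2a. Database: in-memory SQLite as a stand-in for MySQL, Postgres, etc."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (host TEXT, status INTEGER, bytes INTEGER)")
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def explore(conn):
    """3. Explore: aggregate per host with direct SQL."""
    cur = conn.execute(
        "SELECT host, COUNT(*), SUM(bytes) FROM requests "
        "GROUP BY host ORDER BY host"
    )
    return [{"host": h, "requests": n, "total_bytes": b} for h, n, b in cur]


def deliver(summary):
    """4. Deliver: serialize to JSON, the kind of payload a REST API might return."""
    return json.dumps(summary)


clean_rows = cleanse(RAW_CSV)
conn = load(clean_rows)
summary = explore(conn)
report = deliver(summary)
print(report)
```

In a real project each function would likely be swapped for the heavier tool named in the corresponding step (Pandas for cleansing, MySQL for storage, a notebook for exploration), but the hand-off between stages stays the same.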