- 7 Years of Big Data Engineering Experience
- An accomplished Big Data Engineer with over seven years of experience applying a strong working knowledge of big data system architecture and ETL pipeline construction to a variety of real-world business problems, yielding lean, actionable results and insights for improvement.
- A highly organized and efficient individual whose leadership and thorough, precise approach to projects have yielded excellent results.
- Expert Python / Java / Scala developer specializing in developing and deploying big data solutions.
- Always on top of the current trends in relevant technologies, shifts in the big data climate, and improvements in existing methodologies.
- Strong leadership skills with specific experience in the Agile framework.
- Ability to take data engineering beyond the proof-of-concept stage and into full production and deployment.
- Extensive experience with third-party cloud resources: AWS, Google Cloud, Azure.
- Expertise in all common data engineering techniques: A/B testing, data fusion and integration, data mining, machine learning, natural language processing, statistics.
- Strong proficiency with the Hadoop ecosystem, utilizing tools both on-premises and on cloud platforms.
- Proficiency with a variety of Python and Java libraries such as Boto3, NumPy, pandas, Matplotlib, PySpark, DeepLearning4j, JSAT, MLlib, and JDMP.
- Experience developing pipelines geared for scalability, performance, and maintainability, and creating monitoring and alerting systems.
- Expertise in batch and real-time processing, creating end-to-end pipeline solutions for various ecosystems including AWS, Azure, and on-premises platforms.
- Python (8 years)
- Java (6 years)
- Scala (6 years)
- Data Mining
- ETL Pipelines
- Fault tolerant system building
- Cloud Development
- Big Data Analytics
- Communication & Leadership
Libraries: kafka-python, PySpark, NumPy, pandas, DL4J, ND4J, JSAT, Java-ML, MLlib, RankLib, Retina, JDMP, Encog, PyMySQL, boto3
Big Data Tools: Kafka, Spark, Storm, Cassandra, Flink, Cloudera, Hortonworks, HPCC, Qubole, Statwing, CouchDB, Pentaho, OpenRefine, RapidMiner, DataCleaner, Hive, MapReduce, MongoDB, Flume, Elasticsearch, Hadoop, Xplenty, AWS Glue, Alooma, Talend, Stitch, InfoSphere, Airflow, Kubernetes, Neo4j, SAMOA, ZooKeeper, Avro, Apex, SQL, Pig, Sqoop
Big Data Methods: Batch and real-time data pipelines, Lambda and Step Functions architecture, authoring, scheduling, and monitoring workflows with DAGs (Apache Airflow), data transformation, HTTP / MQTT endpoints, map-reduce batch compute, stream computations, machine learning frameworks, low-latency data stores, deployment
Data Visualization: Tableau, Matplotlib, Seaborn, Altair, ggplot2, Plotly
NLP: NLTK, Gensim, AWS Transcribe, AWS Comprehend, GloVe, spaCy, OpenNLP, AllenNLP
Version Control: GitHub, Git, SVN, Mercurial, AWS CodeCommit, Azure DevOps Repos
IDE: Jupyter Notebook, PyCharm, Visual Studio, Spyder, Eclipse, Atom, IntelliJ IDEA
Big Data Ecosystems: Hadoop, Snowflake, Oracle Exadata, Vertica, Teradata, Pivotal Greenplum, SAP IQ
SQL RDBMS: Microsoft SQL Server, MySQL, Oracle DB, AWS RDS, T-SQL, PostgreSQL, IBM DB2, Amazon Aurora, Azure SQL, MariaDB, SQLite, Microsoft Access
NoSQL ONDMs: PyMongo, HappyBase, Boto3 (DynamoDB), EclipseLink, Hibernate
NoSQL Databases: MongoDB, Cassandra, Redis, HBase, Neo4j, Oracle NoSQL, Amazon DynamoDB, Couchbase, CouchDB
SR BIG DATA ENGINEER
Confidential, Redmond, WA
- Utilized Azure Kubernetes Service (AKS) to manage data ingestion clusters
- Worked with Azure Designer to design and upgrade existing data pipelines
- Automated key end-to-end dataflow transformations and load balancing
- Assisted in the creation of multiple endpoint APIs for Cortana services
- Created new API triggers using Azure Functions providing simple solutions for complex orchestration challenges
- Transformed data sent to Azure SQL Data Warehouse for easy accessibility
- Managed Docker containers via Kubernetes to ensure coordination of node clusters at scale in production
- Utilized NumPy and pandas for exploratory data analysis
- Used the NLTK, Gensim, and GloVe libraries for NLP preprocessing and embedding
- Utilized Apache Spark-based Azure Databricks with Azure Data Factory to ingest data in batches and in real time using Kafka
- Optimized dashboards on Power BI to ensure stable workflow and updated visualizations.
- Led a team of five to ensure proper work distribution and meet project deadlines
- Participated in daily scrum stand-up meetings, presenting my team's accomplishments and future goals
- Utilized Ingress Controllers in Azure to route HTTP traffic to different applications
- Made use of multiple cognitive APIs, including speech, language, Bing Search, and QnA services
- Optimized and redeployed core and value-add services surrounding Cortana on multiple platforms such as Windows, smartphones, Xbox consoles, the Edge browser, and VR headsets
- Managed the code repository using Git to keep the code base stable and ready to deploy at all times
BIG DATA ENGINEER
Confidential, Bloomington, IL
- Analyzed and processed complex data sets using advanced querying, visualization and analytics tools.
- Used AWS Kinesis for batch and real-time streaming of data
- Utilized Amazon Elastic MapReduce (EMR) for fast parallel computations
- Created workflows with AWS Lambda and Step Functions for efficient pipeline flow
- Worked in AWS Glue to create fully managed ETL pipelines for integration with Athena, Redshift and EMR
- Worked with SQL and NoSQL databases such as RDS and MongoDB for different data pipelines including text and customer data.
- Built a data virtualization layer (Denodo base and derived views), created data visualizations using Tableau, and accessed aggregations using the SQL clients PostgreSQL and SQL Workbench.
- Queried data from AWS RDS using Aurora Query Editor.
- Collaborated with data science team and e-commerce team to successfully deploy and integrate the models.
- Engineered an automated ETL pipeline for data ingestion and feature engineering using AWS SageMaker.
- Managed the code repository using Git to ensure the integrity of the code base was maintained at all times
- Used AWS tools such as Transcribe, Comprehend, and SageMaker to update and improve the framework of the Phone Virtual Assistant.
- Ensured system architecture met business requirements, working continually with different teams to ensure every aspect of the architecture benefited the company
BIG DATA ENGINEER
- Worked in a Cloudera Hadoop environment, utilizing the Apache tech stack
- Utilized Apache Kafka to stream sensor data from flight recorders
- Transformed and mapped data utilizing Spark and MapReduce for parallel computation
- Ran sensor data through several filters to eliminate noise for more accurate data modeling
- Stored relational data in Hive tables that data scientists could easily query
- Managed data flow with Apache Airflow, ensuring proper and efficient scheduling and task execution
- Assisted a data scientist in creating a Tableau dashboard for a dynamic maintenance scheduler
- Managed compute clusters using Kubernetes for efficient container orchestration
- Used Jenkins for continuous integration automation to ensure new flight recorder data streams could easily integrate with newly developed pipeline builds.
- Worked in an Agile Scrum environment, participating in daily scrum meetings and showcasing team contributions and accomplishments
- Built data ingestion workflows using Apache NiFi, Schema Registry, and Spark Streaming
- Worked extensively with shell scripting to ensure proper execution with Docker containers
- Created data management policies and procedures and set new standards to be used in future development
- Developed algorithms on high-performance systems, orchestrating workflows within a contained environment
- Worked with end user to ensure transformation of data to knowledge in very focused and meaningful ways
- Implemented and configured data pipelines, and tuned processes for performance and scalability
- Used the R package dplyr for data manipulation and analysis
- Maintained and contributed to many internal R packages used for building and diagnosing models and for automated reporting
- Used R to perform ad-hoc analyses and deeper drill-downs into spend categories of particular interest to clients on a project-to-project basis
- Performed large-scale data cleaning and preparation tasks using R and SQL, gathering information from disparate and incompatible data sources across a client’s entire enterprise to provide a complete view of all indirect spend
- Helped maintain a large database of commodity and vendor information using SQL
- Maintained various visualization tools and dashboards used to provide data-driven insights