
Data Engineer Resume

New York City, NY

SUMMARY:

  • Over 7 years of experience as a Spark and Hadoop developer using Scala and Python, delivering cross-platform Big Data solutions on the Cloudera and Hortonworks platforms.
  • In-depth knowledge of the Big Data stack, including the Hadoop ecosystem (Hadoop, MapReduce, YARN, Sqoop, Flume, Kafka) and Spark (DataFrames, Spark SQL, Spark Streaming, etc.).
  • Experienced in using Spark with Scala to improve the performance and optimization of existing algorithms in Hadoop, working with Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN (a brief sketch follows this list).
  • Good knowledge of importing and exporting data between HDFS and Relational Database Systems (RDBMS) using Sqoop.
  • Development experience with different IDEs such as Eclipse, IntelliJ, and STS.
  • Sound knowledge in data ingestion using Kafka and Flume.
  • Excellent understanding of relational databases: created normalized databases, wrote stored procedures, and used JDBC to communicate with databases. Experienced with MySQL and SQL Server.
  • Understanding of S3 and data storage buckets in AWS.
  • Good knowledge of AWS services such as EC2, S3, CloudFront, RDS, DynamoDB, and Elasticsearch.
  • Working knowledge of PostgreSQL and NoSQL databases such as Cassandra and HBase.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Experience in machine learning and data mining with large sets of structured and unstructured data, covering data acquisition, data validation, predictive modeling, and data visualization, using programming languages such as R and Python along with Big Data technologies like Hadoop and Spark.
  • Proficient in managing the entire data science project life cycle and actively involved in all of its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling, dimensionality reduction using Principal Component Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization.
  • Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
  • Experience using various packages in R and Python, such as scikit-learn, ggplot2, caret, dplyr, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, Beautiful Soup, and rpy2.
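
A minimal PySpark sketch illustrating the DataFrame and Spark SQL work referenced above; the session name, file path, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical names, paths, and columns -- illustrative only.
spark = SparkSession.builder.appName("summary-example").getOrCreate()

# Load a CSV file into a DataFrame and run the same aggregation two ways.
orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# DataFrame API: total amount per customer.
totals_df = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

# Spark SQL: equivalent query against a temporary view.
orders.createOrReplaceTempView("orders")
totals_sql = spark.sql(
    "SELECT customer_id, SUM(amount) AS total_amount FROM orders GROUP BY customer_id"
)

totals_df.show(5)
totals_sql.show(5)
```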

TECHNICAL SKILLS:

Hadoop Eco System: Hadoop, Spark, Impala, Hive, Oozie, Ambari, Sqoop, Map-Reduce, HDFS.

Machine Learning: Regression, Classification, Azure Machine Learning, PySpark, Spark MLlib.

Programming Languages: Python, Scala, R.

Reporting and Visualization: Tableau, Power BI

Databases and Query Languages: Cassandra, SQL and MySQL, Spark SQL, HiveQL.

Streaming Frameworks: Flume, Kafka, Spark Streaming.

Tools: R Studio, PyCharm, Jupyter Notebook, IntelliJ, Eclipse, NetBeans.

Platforms: Linux, Windows and OS X.

Methodologies: Agile and Waterfall Models.

WORK EXPERIENCE:

Confidential, New York City, NY

Data Engineer

Responsibilities:

  • Responsible for ingestion of data from various APIs and writing modules to store data in S3 buckets.
  • Transformed batch and streaming data to encrypt sensitive fields and stored the results in a data warehouse for ad-hoc query and analysis.
  • Validated data fields from downstream sources to ensure uniformity of the data.
  • Converted ingested data (CSV, XML, JSON) to compressed Parquet format (a brief sketch follows this list).
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
  • Worked on Hadoop ecosystem in PySpark on Amazon EMR and Databricks.
  • Responsible for writing unit tests and deploying production-level code using Git version control.
  • Built data pipelines that ingested data from disparate sources into a unified platform.
  • Constructed robust, high volume data pipelines and architecture to prepare data for analysis by client.
  • Designed a custom ETL and data warehouse solution to centrally store, associate and aggregate data from across multiple domains and analytics platforms.
  • Architected complete, scalable data warehouse and ETL pipelines to ingest and process millions of rows daily from 30+ data sources, allowing powerful insights and driving daily business decisions.
  • Worked on encrypting data and persisting data to S3 buckets.
  • Experience developing Airflow workflows for scheduling and orchestrating the ETL process (a minimal DAG sketch follows this list).
  • Deployed all scheduling, processing and warehousing to a full AWS stack for maximum uptime and reliability.
  • Implemented optimization techniques for data retrieval, storage, and data transfer.
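
A minimal PySpark sketch of the Parquet conversion and S3 persistence described above; the bucket names, paths, and compression choice (Snappy) are illustrative assumptions, not the exact production setup.

```python
from pyspark.sql import SparkSession

# Hypothetical bucket names and paths -- shown only to illustrate the conversion step.
spark = SparkSession.builder.appName("ingest-to-parquet").getOrCreate()

# Read raw ingested files (CSV shown; JSON and XML readers follow the same pattern).
raw = spark.read.csv("s3a://raw-bucket/ingest/2020-01-01/", header=True, inferSchema=True)

# Write back as Snappy-compressed Parquet for ad-hoc query and analysis.
(raw.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("s3a://curated-bucket/parquet/orders/"))
```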

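A minimal Airflow DAG sketch for the ETL scheduling mentioned above, assuming Airflow 2.x-style imports; the DAG id, task names, and commands are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: three ETL steps run once per day in sequence.
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    transform = BashOperator(task_id="transform", bash_command="spark-submit transform.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    # Downstream dependencies: extract -> transform -> load.
    extract >> transform >> load
```
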
Environment: Microsoft Azure, Spark 1.6, HBase 1.2, Tableau 10, Power BI, Python 3.4, Scala, PySpark, HDFS, Flume 1.6, Cloudera Manager, MongoDB, SQL, GitHub, Linux, Spark SQL, Kafka, Sqoop 1.4.6, AWS (S3).

Confidential, New York, NY

Big Data Developer/ Scientist

Responsibilities:

  • Created a Data Lake by extracting customer data from various data sources, including Teradata, mainframes, RDBMS, CSV, and Excel.
  • Developed the code for Importing and exporting data into HDFS and Hive using Sqoop.
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Load real time data into HDFS using Kafka and structured batch data using Sqoop.
  • Also involved in managing the entire data science project life cycle and actively involved in all of its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling, dimensionality reduction using Principal Component Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization.
  • Involved in design and development of Data transformation framework components to support ETL process, which gets the Single Complete Actionable View of a customer.
  • Developed an ingestion module to ingest data into HDFS from heterogeneous data sources.
  • Built distributed in-memory applications using Spark and Spark SQL to do analytics efficiently on huge data sets.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (a brief sketch follows this list).
  • These applications were built using the Spark Scala API and the PySpark API.
  • Efficiently used Spark transformations and actions to build both simple and complex ETL applications.
  • Worked on improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Designed and developed state-of-the-art deep-learning / machine-learning algorithms for analyzing image and video data, among others.
  • Developed Spark scripts using Scala shell commands as per the requirements.
  • Also involved in model development and in generating reports in Tableau and Power BI per client requirements.
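
A minimal sketch of rewriting a HiveQL query as Spark DataFrame transformations. The table and column names are hypothetical, and PySpark is shown here for brevity even though the work described above was done with the Scala API.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical table and column names, illustrating one way a HiveQL query
# can be expressed as DataFrame transformations instead.
spark = SparkSession.builder.appName("hive-to-spark").enableHiveSupport().getOrCreate()

# Original-style HiveQL query, run through Spark SQL.
hive_result = spark.sql("""
    SELECT region, COUNT(*) AS cnt
    FROM sales
    WHERE amount > 100
    GROUP BY region
""")

# Equivalent DataFrame transformations.
df_result = (spark.table("sales")
             .filter(F.col("amount") > 100)
             .groupBy("region")
             .agg(F.count(F.lit(1)).alias("cnt")))
```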

Environment: Microsoft Azure, Spark 1.6, HBase 1.2, Tableau 10, Power BI, Python 3.4, Scala, PySpark, HDFS, Flume 1.6, Cloudera Manager, MongoDB, SQL, GitHub, Linux, Spark SQL, Kafka, Sqoop 1.4.6, AWS (S3).

Confidential, Chattanooga, Tennessee

Big Data and Analytics Developer

Responsibilities:

  • Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
  • Developed Impala queries to pre-process the data required for running the business process.
  • Actively involved in design analysis, coding and strategy development.
  • Developed Hive scripts for implementing dynamic partitions and buckets for history data.
  • Developed Spark scripts by using Scala per the requirement to read/write JSON files.
  • Involved in converting SQL queries into Spark transformations using Spark RDDs and Scala.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Worked on creating data ingestion pipelines to ingest large volumes of streaming and customer application data into Hadoop in various file formats, including raw text, CSV, and ORC.
  • Worked extensively on integrating Kafka (data ingestion) with Spark Streaming to achieve a high-performance real-time processing system (a brief sketch follows this list).
  • Applied various machine learning algorithms such as decision trees, regression models, neural networks, SVM, and clustering to identify fraudulent profiles using the scikit-learn package in Python. Used the K-Means clustering technique to identify outliers and to classify unlabeled data (see the sketch after this list).
  • Used Spark API using Scala over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Created a Hadoop design that replicates the current system design.
  • Developed Scala scripts using both DataFrames/SQL and RDDs/MapReduce in Spark for data aggregation and queries.
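
A minimal PySpark sketch of the Kafka-to-Spark-Streaming integration described above, assuming the legacy spark-streaming-kafka package from the Spark 1.x era listed in this environment; the topic and broker names are hypothetical.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Hypothetical topic and broker names -- illustrative only.
sc = SparkContext(appName="kafka-streaming-example")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Direct stream from Kafka; each record arrives as a (key, value) pair.
stream = KafkaUtils.createDirectStream(
    ssc, ["clickstream"], {"metadata.broker.list": "broker1:9092"}
)

# Count events per micro-batch and print the result to the driver log.
counts = stream.map(lambda kv: kv[1]).count()
counts.pprint()

ssc.start()
ssc.awaitTermination()
```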

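A minimal scikit-learn sketch of using K-Means distances to flag outliers, as described above; the feature matrix, cluster count, and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature matrix standing in for real profile features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))

# Cluster the unlabeled data.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Distance of each point to its assigned cluster centre.
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the farthest 1% of points as potential outliers (assumed threshold).
threshold = np.quantile(distances, 0.99)
outliers = np.where(distances > threshold)[0]
print(f"Flagged {len(outliers)} potential outliers")
```
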
Environment: R 3.0, Hadoop, MapReduce, HDFS, Hive, Sqoop, HBase, Flume, Spark, Spark Streaming, Kafka, RStudio, AWS, Tableau 8, MS Excel, Apache Spark MLlib, TensorFlow, Amazon Machine Learning (AML), Python.

Confidential

Big Data Engineer

Responsibilities:

  • Extracted the data from Teradata into HDFS using Sqoop.
  • Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created several types of data visualizations using Python and Tableau.
  • Used Flume to collect, aggregate and store the web log data from different sources like web servers, mobile and network devices and pushed into HDFS.
  • Worked on Apache Hive to transform, sort, group, and process the HDFS data for enrichment.
  • Involved in creating Hive tables and in loading and analyzing data using Hive queries.
  • Created aggregates and analyzed large data sets by running Hive queries, Spark, and Spark SQL.
  • Used Spark to handle multiple joins within subject area tables and across subject areas.
  • Developed Scala scripts using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, transformations, and other optimizations during the ingestion process itself (a broadcast-join sketch follows this list).
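
A minimal PySpark sketch of a broadcast join, as referenced above; the table paths and join key are hypothetical, and PySpark is shown for brevity although the work described used Scala.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Hypothetical paths and join key -- illustrative only.
spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

facts = spark.read.parquet("hdfs:///warehouse/transactions/")   # large fact table
dims = spark.read.parquet("hdfs:///warehouse/branch_lookup/")   # small lookup table

# Broadcasting the small table ships it to every executor and avoids a shuffle.
joined = facts.join(broadcast(dims), on="branch_id", how="left")
joined.write.mode("overwrite").parquet("hdfs:///warehouse/transactions_enriched/")
```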

Environment: Hadoop, MapReduce, HDFS, Hive, Sqoop, HBase, Flume, Spark, Spark Streaming, Kafka, RStudio, Python 2.7, Tableau 7, Oracle 11g.
