
Data Engineer Resume


Boise, Idaho

SUMMARY

  • 7+ years of professional experience as a Big Data Developer with expertise in Python, Spark, the Hadoop ecosystem, and cloud services.
  • Development, implementation, deployment, and maintenance of complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, HBase, NiFi, and Ambari.
  • Extensive experience in developing applications that perform data processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.
  • Experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, as well as collecting, aggregating, and moving data using Apache Flume, Kafka, and Power BI.
  • Acquired profound knowledge in developing production-ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, and Spark ML.
  • Profound experience in creating real-time data streaming solutions using PySpark/Spark Streaming and Kafka.
  • Worked on NoSQL databases including HBase, Cassandra, and MongoDB.
  • Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions: Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
  • In-depth Knowledge of Hadoop Architecture and its components such as HDFS, Yarn, Resource Manager, Node Manager, Job History Server, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce.
  • Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
  • Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate data transfer from HBase.
  • Expertise in working with AWS cloud services like EMR, S3, Redshift, Lambda, DynamoDB, RDS, SNS, SQS, Glue, Data Pipeline, and Athena for big data development.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics.
  • Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.
  • Worked on data processing, transformations, and actions in Spark using Python (PySpark); a minimal PySpark sketch follows this summary.
  • Experienced in orchestrating, scheduling, and monitoring jobs with tools like Oozie and Airflow.
  • Extensive experience utilizing Sqoop to ingest data from RDBMSs such as Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
  • Worked with different ingestion services for batch and real-time data handling using Spark Streaming, Kafka, Storm, Flume, and Sqoop.
  • Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications on RDBMS.
  • Expertise in Python scripting and shell scripting.
  • Acquired experience writing Spark scripts in Python, Scala, and SQL for development and data analysis.
  • Proficient in building PySpark and Scala applications for interactive analysis, batch processing, and stream processing.
  • Involved in all the phases of Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
  • Strong understanding of data modeling and ETL processes in data warehouse environments, including star and snowflake schemas.
  • Developed mappings in Informatica to load data, including facts and dimensions, from various sources into the data warehouse using transformations such as Source Qualifier, Java, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
  • Strong working knowledge across the technology stack including ETL, data analysis, data cleansing, data matching, data quality, audit, and design.
  • Experienced with continuous integration and build tools such as Jenkins, and with Git and SVN for version control.
  • Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy.
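
Illustrative example (not taken from any specific engagement above): a minimal PySpark batch ETL sketch of the read-cleanse-write pattern referenced in this summary. The bucket, paths, and column names are placeholders.

    # Minimal PySpark batch ETL sketch; paths and column names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-etl-sketch").getOrCreate()

    # Read raw delimited data and apply basic cleansing/standardization.
    raw = spark.read.option("header", "true").csv("s3a://example-bucket/raw/orders/")
    cleaned = (raw
               .dropDuplicates(["order_id"])
               .withColumn("order_ts", F.to_timestamp("order_ts"))
               .withColumn("order_date", F.to_date("order_ts"))
               .filter(F.col("amount").cast("double").isNotNull()))

    # Write out as partitioned Parquet for downstream analytics.
    cleaned.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3a://example-bucket/curated/orders/")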

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, Spark, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Airflow, Zookeeper, Amazon Web Services.

Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP

Languages: Python, Scala, Java, Pig Latin, HiveQL, Shell Scripting.

Software Methodologies: Agile, Waterfall (SDLC).

Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL Server, Snowflake.

NoSQL: HBase, MongoDB, Cassandra.

ETL/BI: Power BI, Tableau, Informatica.

Version control: GIT, SVN, Bitbucket.

Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.

Cloud Technologies: Amazon Web Services (EC2, S3, SQS, SNS, Lambda, EMR, CodeBuild, CloudWatch), Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory).

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer

Responsibilities:

  • Migrated terabytes of data from the data warehouse into the cloud environment incrementally.
  • Created data pipelines with Airflow to schedule PySpark jobs for incremental loads and used Flume for web server log data; wrote Airflow scheduling scripts in Python (a minimal DAG sketch follows this list).
  • Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, and standardization, and then applied transformations per the use cases.
  • Created and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Developed a high-speed BI layer on the Hadoop platform with PySpark and Python.
  • Performed data cleansing and applied transformations using Databricks and Spark for data analysis.
  • Developed Spark applications in PySpark and Scala to perform cleansing, transformation, and enrichment of the data.
  • Utilized Spark-Scala API to implement batch processing of jobs.
  • Developed Spark-Streaming applications to consume the data from Snowflake and to insert the processed streams to DynamoDB.
  • Utilized Spark in-memory capabilities to handle large datasets.
  • Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Fine-tuned Spark applications/jobs to improve efficiency and overall processing time of the pipelines.
  • Created Hive tables and loaded and analyzed data using Hive scripts.
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables.
  • Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro, JSON, and XML files; proficient with columnar formats like ORC and Parquet.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks.
  • Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
  • Developed automated job flows in Oozie, run daily and on demand, which execute MapReduce jobs internally.
  • Extracted tables and exported data from Teradata through Sqoop and loaded it into Cassandra.
  • Used the Spark JDBC API to import/export data between Snowflake and S3.
  • Experienced in working with the EMR cluster and S3 in the AWS cloud.
  • Automated the data pipeline to ETL all datasets, with both full and incremental loads.
  • Utilized AWS services such as EMR, S3, the Glue metastore, and Athena extensively for building data applications.
  • Involved in creating Hive external tables to perform ETL on data produced on a daily basis.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
  • Validated the data being ingested into Hive for further filtering and cleansing.
  • Used Jenkins pipelines to drive all microservice builds out to the Docker registry and then deployed them to Kubernetes; created and managed pods using Kubernetes.
  • Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of Git repositories and access control strategies.
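
Illustrative example (not actual project code): a minimal Airflow DAG of the kind used to schedule the PySpark incremental loads mentioned above, assuming Airflow 2.x with a spark-submit step. The DAG id, script name, and schedule are placeholders.

    # Minimal Airflow 2.x DAG sketch for a daily PySpark incremental load.
    # DAG id, script path, and schedule are illustrative assumptions.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {"retries": 1, "retry_delay": timedelta(minutes=10)}

    with DAG(
        dag_id="incremental_load_sketch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        incremental_load = BashOperator(
            task_id="spark_incremental_load",
            # Pass the logical date so the job processes only the new partition.
            bash_command=(
                "spark-submit --master yarn incremental_load.py "
                "--run-date {{ ds }}"
            ),
        )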

Environment: Python, SQL, Oracle, Hive, Scala, Power BI, Docker, MongoDB, Kubernetes, SQS, PySpark, Kafka, Data Warehouse, Big Data, MS SQL, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, Lambda, Glue, ETL, Databricks, Snowflake, AWS Data Pipeline.

Confidential - Boise, Idaho

Data Engineer

Responsibilities:

  • Hands-on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper, and Flume.
  • Designed and developed data warehouse and business intelligence architecture; designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules.
  • Designed and created Azure Data Factory (ADF) pipelines extensively for ingesting data from different source systems, both relational and non-relational, to meet business functional requirements.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics.
  • Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Created and provisioned numerous Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
  • Developed ADF pipelines to load data from on-premises sources to Azure cloud storage and databases.
  • Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
  • Worked extensively with SparkContext, Spark SQL, RDD transformations, actions, and DataFrames.
  • Developed custom ETL solutions, batch processing, and real-time data ingestion pipelines to move data in and out of Hadoop using PySpark and shell scripting.
  • Ingested a huge volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2 and Azure cluster services.
  • Created Spark RDDs from data files and then performed transformations and actions to other RDDs.
  • Created Hive tables with dynamic and static partitioning, including buckets, for efficiency; also created external tables in Hive for staging purposes.
  • Loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created a customized BI tool for management teams to perform query analytics using HiveQL.
  • Wrote UDFs in Scala and PySpark to meet specific business requirements (see the UDF sketch after this list).
  • Experience in developing Spark applications using Spark SQL in EMR for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Utilized Spark in-memory capabilities to handle large datasets.
  • Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing.
  • Experienced in working with the EMR cluster and S3 in the AWS cloud.
  • Created Hive tables and loaded and analyzed data using Hive scripts.
  • Implemented partitioning (both dynamic and static partitions) and bucketing in Hive.
  • Involved in continuous integration of the application using Jenkins.
  • Led the installation, integration, and configuration of Jenkins CI/CD, including installation of Jenkins plugins.
  • Implemented a CI/CD pipeline with Docker, Jenkins, and GitHub by virtualizing the servers using Docker for the Dev and Test environments and automating configuration through containerization.
  • Installed, configured, and administered the Jenkins CI tool using Chef on AWS EC2 instances.
  • Performed code reviews and was responsible for design, code, and test signoff.
  • Worked on designing and developing a real-time tax computation engine using Oracle, StreamSets, Kafka, and Spark Structured Streaming.
  • Validated data transformations and performed End-to-End data validations for ETL workflows loading data from XMLs to EDW.
  • Extensively utilized Informatica to create the complete ETL process and load data into the database used by Reporting Services.
  • Created Tidal Job events to schedule the ETL extract workflows and to modify the tier point notifications.
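
Illustrative example (not actual project code): a minimal PySpark UDF sketch of the kind referenced above, applied to data in Azure Data Lake Gen2. The storage account, container, column names, and mapping logic are placeholders, and the abfss paths assume cluster credentials are already configured.

    # Minimal PySpark UDF sketch; account, container, and column names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    # Standardize free-form country values before writing to the curated zone.
    def normalize_country(value):
        if value is None:
            return None
        v = value.strip().lower()
        return {"us": "United States", "usa": "United States",
                "uk": "United Kingdom"}.get(v, value.strip().title())

    normalize_country_udf = F.udf(normalize_country, StringType())

    df = spark.read.parquet("abfss://raw@exampleaccount.dfs.core.windows.net/customers/")
    curated = df.withColumn("country", normalize_country_udf(F.col("country")))
    curated.write.mode("overwrite").parquet(
        "abfss://curated@exampleaccount.dfs.core.windows.net/customers/")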

Environment: Python, SQL, Oracle, Hive, Scala, Power BI, Azure Data Factory, Data Lake, Docker, MongoDB, Kubernetes, PySpark, SNS, Kafka, Data Warehouse, Sqoop, Pig, Zookeeper, Flume, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, GCP, Lambda, Glue, ETL, Databricks, Snowflake, AWS Data Pipeline.

Confidential - St. Louis, MO

Big Data Developer

Responsibilities:

  • Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3.
  • Compared data at the leaf level across various databases when data transformation or loading took place.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirements.
  • Monitored the daily, weekly, and monthly jobs and provided support in case of failures/issues.
  • Worked on analyzing the Hadoop cluster and different big data analytics tools.
  • Worked with various HDFS file formats like Avro, SequenceFile, and JSON.
  • Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming (see the streaming sketch after this list).
  • Developed Spark scripts using Scala shell commands as per the requirements.
  • Imported data from Cassandra databases and stored it in AWS.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Used the AWS CLI for data transfers to and from Amazon S3 buckets.
  • Executed Hadoop/Spark jobs on AWS EMR, with data stored in S3 buckets.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Implemented Spark RDD transformations and actions to carry out business analysis.
  • Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
  • Worked on ingesting high volumes of tuning events generated by client set-top boxes, from Elasticsearch in batch mode and from Amazon Kinesis Streams in real time via Kafka brokers, into the enterprise data lake using Python and NiFi.
  • Devised PL/SQL stored procedures, functions, triggers, views, and packages; made use of indexing, aggregation, and materialized views to optimize query performance.
  • Developed data pipelines with NiFi that can consume any format of real-time data from a Kafka topic and push it into the enterprise Hadoop environment.
  • Used the Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
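
Illustrative example (not actual project code): a minimal PySpark Structured Streaming sketch of the Kafka-to-Spark log pipeline described above. Broker addresses, the topic, and S3 paths are placeholders, and the job assumes the Spark Kafka connector package is available on the cluster.

    # Minimal Structured Streaming sketch reading server logs from Kafka.
    # Brokers, topic, and output/checkpoint paths are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    logs = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
            .option("subscribe", "server-logs")
            .option("startingOffsets", "latest")
            .load())

    # Kafka values arrive as bytes; cast to string and keep the event timestamp.
    parsed = logs.select(
        F.col("value").cast("string").alias("log_line"),
        F.col("timestamp").alias("event_time"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/streams/server-logs/")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/server-logs/")
             .outputMode("append")
             .start())
    query.awaitTermination()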

Environment: HDFS, MapReduce, Snowflake, Pig, NiFi, Hive, Kafka, Spark, PL/SQL, AWS, S3 Buckets, EMR, Scala, SQL Server, Cassandra, Oozie.

Confidential - Dania Beach, FL

Big Data Engineer

Responsibilities:

  • Developed a Spark Streaming model that takes transactional data as input from multiple sources, creates batches, and processes them against an already trained fraud detection model while capturing error records.
  • Extensive knowledge in Data transformations, Mapping, Cleansing, Monitoring, Debugging, performance tuning and troubleshooting Hadoop clusters.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Developed DDL and DML scripts in SQL and HQL to create tables and analyze the data in RDBMS and Hive.
  • Used Sqoop to import and export data between HDFS and RDBMS.
  • Created Hive tables and was involved in data loading and writing Hive UDFs.
  • Exported the analyzed data to the relational database MySQL using Sqoop for visualization and report generation.
  • Loaded the flat files data using Informatica to the staging area.
  • Researched and recommended suitable technology stack for Hadoop migration considering current enterprise architecture.
  • Worked on an ETL process to clean and load large data extracted from several websites (JSON/CSV files) into SQL Server.
  • Performed Data Profiling, Data pipelining, and Data Mining, validating, and analyzing data (Exploratory analysis / Statistical analysis) and generating reports.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
  • Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis. Used Sqoop to transfer data between relational databases and Hadoop.
  • Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Queried both managed and external tables created by Hive using Impala; developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
  • Handled importing data from different data sources into HDFS using Sqoop and performing transformations using Hive and then loading data into HDFS.
  • Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift, and validated data sets by implementing Spark components (a minimal sketch follows this list).
  • Worked as an ETL and Tableau developer, heavily involved in designing, developing, and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations, and complex calculations to manipulate data using Tableau Desktop.
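
Illustrative example (not actual project code): a minimal PySpark sketch of the S3-analysis-to-Redshift load mentioned above. The bucket, table, connection details, and driver class are placeholders; in practice a dedicated Redshift connector or an S3 COPY is often used instead of a plain JDBC write.

    # Minimal sketch: analyze S3 data with Spark SQL, then load the result to Redshift via JDBC.
    # All connection details are placeholders; credentials should come from a secrets store.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-to-redshift-sketch").getOrCreate()

    events = spark.read.json("s3a://example-bucket/raw/events/")
    events.createOrReplaceTempView("events")

    daily = spark.sql("""
        SELECT event_date, device_type, COUNT(*) AS event_count
        FROM events
        GROUP BY event_date, device_type
    """)

    (daily.write
        .format("jdbc")
        .option("url", "jdbc:redshift://example-cluster:5439/analytics")
        .option("dbtable", "public.daily_event_counts")
        .option("user", "etl_user")
        .option("password", "REPLACE_ME")
        .option("driver", "com.amazon.redshift.jdbc42.Driver")
        .mode("append")
        .save())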

Environment: Spark, Hive, Python, HDFS, Sqoop, Tableau, HBase, Scala, MySQL, Impala, AWS, S3, EC2, Redshift, Informatica

Confidential

Software Developer

Responsibilities:

  • Involved in the complete SDLC: requirement analysis, development, testing, and deployment.
  • Involved in the business meetings to develop the application and make it work effectively for the important business segment of the client.
  • Used the Spring framework and J2EE components to develop action classes, backend processes, complex reports, and database interactions; configured and worked with the Apache Tomcat server.
  • Developed JUnit test cases to unit test the business logic; added constraints and indexes to the database design and developed business objects and other components based on the database tables.
  • Wrote PL/SQL stored procedures, views, and queries using SQL Developer to archive data for daily and monthly reports, and scheduled the job using Spring Scheduler.
  • Used Apache Maven as the build tool to automate the build process for the entire application and Hudson for continuous integration.
  • Ensure that coding standards are maintained throughout the development process by all developers.
  • Followed agile methodology that included iterative application development, weekly Sprints and daily stand-up meetings.
  • Involved in project documentation, status reporting and presentation.
  • Implemented Log4j for debug and error logging.
  • Worked along with the Development team & QA team to resolve the issues in SIT/UAT/Production environments.
  • Used Oracle, MySQL database for storing user information.
  • Implementation of this project included scalable coding using Java, JDBC, and JMS with Spring.
  • Involved in implementation of the presentation layer GUI for the application using JSF, HTML, XHTML, CSS, and JavaScript.

Environment: WebLogic 9.2, Oracle 10g, Java 1.6, PL/SQL, JMS, Unix Shell Scripting, JavaScript, Apache Maven, Hudson, SLF4J, Log4j, REST Web Services, Oracle SQL Developer
