Sr. Data Engineer Resume

Dania Beach, FL

SUMMARY

  • Passionate and enthusiastic professional with over 8 years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Hadoop Developer.
  • More than 6 years of industry experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, and ZooKeeper.
  • Strong experience in writing scripts using the Python, PySpark, and Spark APIs for analyzing data (see the PySpark sketch following this summary).
  • Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, including creating DataFrames in Spark to work with Snowflake.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
  • Migrated an existing on-premises application to AWS, using EC2 and S3 for small data set processing and storage; experienced in maintaining Hadoop clusters on AWS EMR.
  • Used EMR with Hive to handle lower-priority bulk ETL jobs.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
  • Experience integrating various data sources such as Oracle SE2, SQL Server, flat files, unstructured files, MongoDB, and Cassandra into a data warehouse.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Experience in development, support, and maintenance of ETL (Extract, Transform, Load) processes using Talend Integration Suite and Informatica.
  • Experience in data processing like collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Solid experience working with CSV, text, sequence file, Avro, Parquet, ORC, and JSON data formats.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
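
A minimal PySpark sketch of the kind of data-analysis scripting summarized above; the paths, column names, and aggregation are illustrative assumptions, not details from any specific engagement:

    # Minimal PySpark sketch: read raw CSV data, clean it, and aggregate it.
    # All paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-analysis").getOrCreate()

    # Read raw CSV files with a header row, inferring the schema.
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("s3a://example-bucket/raw/sales/"))

    # Basic cleaning: drop rows missing key fields and normalize a string column.
    cleaned = (raw
               .dropna(subset=["order_id", "amount"])
               .withColumn("region", F.upper(F.trim(F.col("region")))))

    # Aggregate daily revenue per region.
    daily_revenue = (cleaned
                     .groupBy("region", F.to_date("order_ts").alias("order_date"))
                     .agg(F.sum("amount").alias("revenue")))

    # Persist as Parquet, partitioned by date, for downstream Hive/Spark SQL queries.
    (daily_revenue.write
     .mode("overwrite")
     .partitionBy("order_date")
     .parquet("s3a://example-bucket/curated/daily_revenue/"))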

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, ZooKeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

Hadoop Distribution: Cloudera CDH, Hortonworks HDP, Apache, AWS

Machine Learning Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbour (KNN), Principal Component Analysis

Languages: Python (NumPy, SciPy, Pandas, Gensim, Keras), Scala, Java and R

Web Technologies: HTML, CSS, and JavaScript.

Operating Systems: Windows (98/2000/XP/7/8/10), Mac OS, UNIX, Linux (Ubuntu, CentOS).

Version Control: Git, GitHub

IDE & Tools / Design: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL Workbench, Tableau

Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL Database (HBase, MongoDB).

Cloud Technologies: Amazon Web Services (AWS), MS Azure

Data Engineering / Big Data / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, ZooKeeper; AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, Google Shell, Linux, Bash Shell, Unix; Tableau, Power BI, SAS, Crystal Reports, Dashboard Design.

PROFESSIONAL EXPERIENCE

Confidential, Dania Beach, FL

Sr. Data Engineer

Responsibilities:

  • Performed data analysis and developed analytic solutions; investigated data to discover correlations and trends and to explain them.
  • Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, visualizations).
  • Developed data ingestion modules (both real-time and batch data loads) to load data into various layers in S3, Redshift, and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda, and AWS Step Functions.
  • Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversions, and data cleansing.
  • Responsible for importing data from PostgreSQL into HDFS and Hive using the Sqoop tool.
  • Implemented Avro and parquet data formats for Apache Hive computations to handle custom business requirements.
  • Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats, using Kafka integrated with Spark Streaming.
  • Authored Python (PySpark) scripts and custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks; migrated data from on-premises systems to AWS storage buckets.
  • Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables.
  • Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
  • Wrote AWS Lambda code in Python to process nested JSON files (converting, comparing, sorting, etc.).
  • Involved in Data Extraction for various Databases & Files using Talend. Created Talend jobs using the dynamic schema feature.
  • Worked on custom component design and embedded the components in Talend Studio.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala.
  • Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
  • Created PySpark DataFrames to bring data from PostgreSQL to Amazon S3 (see the sketch following this list).
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Collected data using Spark Streaming from an AWS S3 bucket in near-real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
  • Worked on Talend Enterprise Studio for ingesting data into the Hadoop Data Lake.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Used Spark Streaming to receive real time data from the Kafka and store the stream data to HDFS using Python.
  • Data Extraction, aggregations, and consolidation of data within AWS Glue using PySpark.
  • Used Lambda functions in Python and invoked the scripts for data transformations on large data sets in EMR clusters.
  • Experience converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deploying via Terraform and AWS CloudFormation templates.
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables from Spark for faster processing of data.
  • Strong understanding of AWS components such as EC2 and S3.
  • Participated in a collaborative team designing software and developing a Snowflake data warehouse within AWS.
  • Expertise in using Docker to run and deploy applications in multiple containers, working with Docker Swarm and Docker Wave.
  • Created YAML files for each data source, including Athena table stack creation.
  • Developed complex Talend ETL jobs to migrate data from flat files to databases.
  • Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
  • Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
  • Developed merge scripts to UPSERT data into Snowflake from an ETL source.
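
A minimal sketch of the PostgreSQL-to-S3 PySpark load referenced above, assuming a JDBC source and an S3 landing bucket; host, credentials, table, and bucket names are placeholders:

    # Minimal PySpark sketch: load a PostgreSQL table over JDBC and land it in S3 as Parquet.
    # Host, credentials, table, and bucket names are placeholders; the PostgreSQL JDBC
    # driver is assumed to be on the classpath (e.g. via --packages).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("postgres-to-s3").getOrCreate()

    # Read the source table over JDBC; partitioning options can be added for large tables.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://example-host:5432/sales_db")
              .option("dbtable", "public.orders")
              .option("user", "etl_user")
              .option("password", "REDACTED")
              .option("driver", "org.postgresql.Driver")
              .load())

    # Write to S3 as Parquet, partitioned by a date column for efficient downstream reads.
    (orders.write
     .mode("append")
     .partitionBy("order_date")
     .parquet("s3a://example-landing-bucket/orders/"))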

Environment: Hadoop, MapReduce, HDFS, Hive, Sqoop, Oozie, Kafka, Spark, AWS, GitHub, Docker, Talend Big Data Integration, Lambda, SQL, Unix, PostgreSQL, Snowflake, Terraform, Glue, S3, EC2, Kubernetes, Redshift, Shell Scripting

Confidential, Ohio

Big Data Engineer

Responsibilities:

  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (see the Airflow sketch following this list).
  • Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management.
  • Created Hive fact tables on top of raw data from Confidential retailers, partitioned by time dimension key, retailer name, and data supplier name, which were further processed and pulled by the analytics service engine.
  • Modeled Hive partitions extensively for data separation and faster data processing, and followed Hive best practices for tuning.
  • As part of data migration, wrote many SQL scripts to reconcile mismatched data and worked on loading the history data from Teradata SQL to Snowflake.
  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data to uncover insights into the customer usage patterns.
  • Worked on setting up high availability for major production cluster and designed automatic failover control using zookeeper and quorum journal nodes.
  • Produced unit tests for Spark transformations and helper methods.
  • Wrote multiple batch jobs for processing hourly and daily data received through multiple sources such as Adobe and NoSQL databases. Experienced in managing Azure Data Lake Storage (ADLS) and Data Lake Analytics and in integrating them with other Azure services; knowledgeable in U-SQL and how it can be used for data transformation as part of a cloud data integration strategy.
  • Used Azure Event Grid, a managed event service, to easily manage events across many different Azure services and applications.
  • Used Azure Service Bus to decouple applications and services from each other, providing benefits such as load-balancing work across competing workers.
  • Leveraged Delta Lake for scalable metadata handling and for unifying streaming and batch processing.
  • Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Used Azure Data Catalog, which helps in organizing data assets and getting more value from existing investments.
  • Compiled data from various sources to perform complex analysis for actionable results. Created Azure SQL database, performed monitoring and restoring of Azure SQL database. Performed migration of Microsoft SQL server to Azure SQL database.
  • Prepared associated documentation for specifications, requirements, and testing.
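
A minimal Apache Airflow sketch of the authoring/scheduling pattern referenced above; the DAG id, schedule, and task callables are illustrative placeholders:

    # Minimal Airflow sketch: a daily extract-transform-load DAG.
    # DAG id, schedule, and task bodies are illustrative placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract(**context):
        print("extracting data for", context["ds"])    # placeholder extract step


    def transform(**context):
        print("transforming data for", context["ds"])  # placeholder transform step


    def load(**context):
        print("loading data for", context["ds"])       # placeholder load step


    with DAG(
        dag_id="daily_ingest",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> transform_task >> load_task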

Environment: Hadoop, Hive, Spark, Databricks, Azure Event Grid, Azure Synapse Analytics, Azure Data Catalog, Service Bus, ADF, Delta Lake, Blob Storage, Cosmos DB, Python, PySpark, Scala, SQL, Sqoop, Kafka, Airflow, Oozie, HBase, Terraform, Tableau, Git.

Confidential, Bowie, MD

Data Engineer

Responsibilities:

  • Gathered business requirements, worked on the definition and design of the data sourcing, and worked with the data warehouse architect on the development of logical data models.
  • Created sophisticated visualizations, calculated columns, and custom expressions, and developed map charts, cross tables, bar charts, treemaps, and complex reports involving property controls and custom expressions.
  • Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
  • Automated the diagnosis of blood loss during emergencies by developing a machine learning algorithm to diagnose blood loss.
  • Created several types of data visualizations using Python and Tableau; extracted large data sets from Azure using SQL queries to create reports.
  • Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
  • Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models. Developed a data pipeline using Kafka to store data in HDFS.
  • Used pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms; expertise in R, MATLAB, Python, and their respective libraries.
  • Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (scikit-learn).
  • Hands-on experience implementing Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, and Principal Component Analysis.
  • Implemented various statistical techniques to manipulate the data, such as missing data imputation, principal component analysis, and sampling.
  • Developed Talend ESB services and deployed them on ESB servers on different instances.
  • Developed data validation rule in the Talend MDM to confirm the golden record.
  • Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
  • Responsible for the design and development of Python programs/scripts to prepare, transform, and harmonize data sets in preparation for modeling.
  • Worked with Market Mix Modeling to strategize the advertisement investments to better balance the ROI on advertisements.
  • Used Python 3.X (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop variety of models and algorithms for analytic purposes.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python and built models using deep learning frameworks.
  • Applied various machine learning algorithms and statistical models, such as Decision Tree, text analytics, sentiment analysis, Naive Bayes, Logistic Regression, and Linear Regression, using Python to determine the accuracy of each model (see the scikit-learn sketch following this list).
  • Implemented Univariate, Bivariate, and Multivariate Analysis on the cleaned data for getting actionable insights on the 500-product sales data by using visualization techniques in Matplotlib, Seaborn, Bokeh, and created reports in Power BI.
  • Used PCA to reduce dimensionality by computing eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease.
  • Processed the image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
  • Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data; understood the current production state of the application and determined the impact of the new implementation on existing business processes.
  • Created Pipelines in ADF using Linked Services/Datasets/Pipeline/ to Extract, Transform, and load data from different sources like Azure SQL, Blob storage, Azure SQL Data warehouse, write-back tool and backwards.
  • Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data to uncover insights into the customer usage patterns.
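
A minimal scikit-learn sketch of the model-building workflow described above (train/test split, feature scaling, fitting, accuracy comparison); the CSV file, feature columns, and target column are hypothetical:

    # Minimal scikit-learn sketch: train and evaluate a logistic regression classifier.
    # The CSV path, feature columns, and target column are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Load and clean a tabular data set.
    df = pd.read_csv("example_data.csv").dropna()
    X = df[["feature_1", "feature_2", "feature_3"]]
    y = df["target"]

    # Hold out a test set for an honest accuracy estimate.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Scale features and fit the classifier in a single pipeline.
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)

    # Report held-out accuracy, the metric used to compare candidate models.
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))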

Environment: Spark, YARN, Hive, Pig, Scala, Mahout, TDD, Python, Hadoop, Azure, DynamoDB, Kibana, NoSQL, Sqoop, MySQL.

Confidential

Data Analyst / Hadoop Developer

Responsibilities:

  • Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations, and integrated with the Tableau viz.
  • Worked with stakeholders to communicate campaign results, strategy, issues or needs.
  • Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
  • Understood the business requirements thoroughly and came up with a test strategy based on business rules.
  • Wrote SQL queries to extract data from the sales data marts as per the requirements.
  • Developed Tableau data visualization using Scatter Plots, Geographic Map, Pie Charts and Bar Charts and Density Chart.
  • Designed and deployed rich graphic visualizations with drill-down and drop-down menu options, parameterized using Tableau.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Explored traffic data from databases, connected it with transaction data, and presented as well as wrote reports for every campaign, providing suggestions for future promotions.
  • Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis.
  • Performed data cleaning, merging, and exporting of the dataset in Tableau Prep.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data (see the sketch following this list).
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirements.
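
A minimal Spark SQL sketch of the campaign-analysis processing referenced above, joining traffic with transaction data to compute CTR and conversion rate; the table and column names are placeholders:

    # Minimal Spark SQL sketch: join campaign traffic with transactions and compute
    # CTR and conversion rate per campaign. Table and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("campaign-metrics").enableHiveSupport().getOrCreate()

    traffic = spark.table("marketing.traffic")         # impressions/clicks per campaign and day
    transactions = spark.table("sales.transactions")   # orders per campaign and day

    metrics = (traffic.join(transactions, ["campaign_id", "event_date"], "left")
               .groupBy("campaign_id")
               .agg(F.sum("impressions").alias("impressions"),
                    F.sum("clicks").alias("clicks"),
                    F.sum("orders").alias("orders"))
               .withColumn("ctr", F.col("clicks") / F.col("impressions"))
               .withColumn("conversion_rate", F.col("orders") / F.col("clicks")))

    metrics.show(truncate=False)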

Environment: Snowflake, Hadoop, MapReduce, Spark SQL, Python, Pig, GitHub, Tableau, Metadata, Teradata, SQL Server, Apache Spark, Sqoop

Confidential

Java/Hadoop Developer

Responsibilities:

  • Involved in review of functional and non-functional requirements.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Wrote MapReduce job using Pig Latin. Involved in ETL, Data Integration and Migration.
  • Imported data using Sqoop to load data from Oracle to HDFS on a regular basis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Wrote Hive queries for data analysis to meet the business requirements (see the sketch following this list).
  • Created Hive tables and worked on them using HiveQL; experienced in defining job flows.
  • Utilized various utilities like Struts Tag Libraries, JSP, JavaScript, HTML, & CSS.
  • Built and deployed WAR files on WebSphere Application Server.
  • Implemented Patterns such as Singleton, Factory, Facade, Prototype, Decorator, Business Delegate and MVC.
  • Involved in frequent meetings with clients to gather business requirements and convert them to technical specifications for the development team.
  • Adopted agile methodology with pair programming technique and addressed issues during system testing.
  • Involved in the bug fixing and enhancement phase; used the FindBugs tool.
  • Version Controlled using SVN.
  • Developed the application in Eclipse IDE; experience in developing Spring Boot applications for transformations.
  • Primarily involved in front-end UI using HTML5, CSS3, JavaScript, jQuery, and AJAX.
  • Used struts framework to build MVC architecture and separate presentation from business logic.
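
A minimal sketch of the kind of HiveQL analysis query referenced above, run through PySpark for consistency with the other examples in this document; the table and column names are hypothetical:

    # Minimal sketch: a HiveQL analysis query executed via PySpark with Hive support.
    # The table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-analysis").enableHiveSupport().getOrCreate()

    # Aggregate the cleaned log data loaded into Hive by the ingestion jobs described above.
    result = spark.sql("""
        SELECT event_date,
               COUNT(*)                AS events,
               COUNT(DISTINCT user_id) AS unique_users
        FROM   web_logs_cleaned
        GROUP  BY event_date
        ORDER  BY event_date
    """)

    result.show()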

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Linux, XML, Java 6, Eclipse, Oracle 10g, PL/SQL, MongoDB, Toad
