
Data Engineer Resume


Detroit, MI

SUMMARY

  • 7+ years of IT experience with a solid background in Big Data (Hive, Pig, Kubernetes), ETL tools (Informatica PowerCenter, Informatica Cloud with Salesforce for customer data), data modeling, data warehousing, and ETL data integration.
  • Good knowledge of the Software Development Life Cycle (SDLC) and Software Testing Life Cycle (STLC) across Agile Scrum, Waterfall, and V-Model environments.
  • Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R and Python.
  • Predictive Modelling Algorithms: Logistic Regression, Linear Regression, Decision Trees, K-Nearest Neighbors, Bootstrap Aggregation (Bagging), Naive Bayes Classifier, Random Forests, Boosting, Support Vector Machines.
  • Good experience with a focus on Big Data, Deep Learning, Machine Learning, Image Processing, and AI.
  • Very good hands-on experience working with large datasets and Deep Learning algorithms using Apache Spark and TensorFlow.
  • Strong experience using the Informatica Data Quality (IDQ) toolkit: analysis, data cleansing, data matching, data conversion, exception handling, and IDQ's reporting and monitoring capabilities.
  • Implemented Agile Methodology for building an internal application.
  • Worked with Spark to improve efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, pair RDDs, and Spark on YARN.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala.
  • Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
  • Expertise working with AWS cloud services like EMR, S3, Redshift, CloudWatch, Auto Scaling, DynamoDB, and Route 53 for big data development.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, and Power BI.
  • Hands-on experience with Hadoop architecture and its various components, such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, DataNode, and Hadoop MapReduce programming.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases like Oracle, SQL Server, Hive, and MongoDB using Python.
  • Experience in extracting files from MongoDB through Sqoop and placing them in HDFS.
  • Expertise in Python and Scala, writing PySpark/Spark user-defined functions (UDFs) for Hive and Pig in Python (see the sketch after this list).
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
  • Hands-on with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
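
A minimal illustration of the PySpark UDF work called out above, registering a Python function for use in Spark SQL over Hive tables (the table, column, and function names are hypothetical, not taken from any client project):

# Minimal sketch: registering a Python UDF for use in Spark SQL over Hive tables.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-example")
         .enableHiveSupport()
         .getOrCreate())

def normalize_phone(raw):
    """Strip non-digits so phone numbers compare consistently."""
    return "".join(ch for ch in (raw or "") if ch.isdigit())

# Register the UDF so it can be called from SQL against Hive tables.
spark.udf.register("normalize_phone", normalize_phone, StringType())

result = spark.sql("""
    SELECT customer_id, normalize_phone(phone) AS phone_digits
    FROM customers
""")
result.show(5)

Registering the UDF by name is what makes it callable from HiveQL-style statements rather than only from the DataFrame API.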

TECHNICAL SKILLS

Big Data Tools: Hadoop Ecosystem (Hadoop 3.0, MapReduce), Spark 2.3/3.1, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, and UNIX.

Machine Learning Libraries: scikit-learn, Pandas, NumPy, SciPy, Matplotlib, Seaborn, NLTK

Deep Learning Frameworks: TensorFlow, PyTorch, Keras.

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS, Azure, Google Cloud.

Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena; Azure Services - Azure Data Factory, Azure Data Lake, Azure Databricks

RDBMS: Oracle 12c/11g/10g, MySQL, SQL Server

NoSQL Databases: MongoDB, Cassandra, HBase

OLAP Tools: Tableau, SSAS, Business Objects and Crystal Reports 9

ETL/Data warehouse Tools: Informatica 9.6/9.1 and Tableau.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Detroit, MI

Data Engineer

Responsibilities:

  • Developed Apache Spark applications using Scala for data processing from various streaming sources.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data to DynamoDB using Scala (illustrative sketch after this list).
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target snowflake database.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Utilized AWS services focused on big data architecture, analytics, and business intelligence solutions to ensure optimal design, scalability, flexibility, availability, and performance, and to provide relevant and valuable data for improved decision-making.
  • Extracted, transformed, and loaded data between various heterogeneous sources and destinations using AWS Redshift.
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures, Views, and Triggers.
  • Created, modified, and executed DDL on AWS Redshift and Snowflake tables to load data.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed PySpark/Spark scripts to encrypt raw data by applying hashing algorithms to client-specified columns (see the sketch after this list).
  • Developed data pipelines using PySpark/Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded them into AWS Redshift.
  • Expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with data governance and data quality teams to design various models and processes.
  • Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Created a Lambda deployment function and configured it to receive events from S3 buckets (illustrative handler sketch after this list).
  • Built large-scale data processing systems in data warehousing solutions and worked with unstructured data mining on NoSQL databases like MongoDB, HBase, and Cassandra.
  • Performed ETL testing activities such as running jobs, extracting data from databases with the necessary queries, transforming it, and loading it into the data warehouse servers.
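
As referenced above, an illustrative sketch of the Kafka-to-DynamoDB streaming pattern, written here in PySpark rather than the Scala used on the project; the broker address, topic, schema, table, region, and checkpoint path are hypothetical:

# Illustrative sketch of a Kafka-to-DynamoDB stream; all names are placeholders.
import boto3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("kafka-to-dynamodb").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_batch(batch_df, batch_id):
    # Write each micro-batch to DynamoDB with boto3 (kept on the driver for brevity).
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("events")
    for row in batch_df.collect():
        table.put_item(Item=row.asDict())

(events.writeStream
 .foreachBatch(write_batch)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/events")
 .start()
 .awaitTermination())

Collecting each micro-batch to the driver keeps the sketch short; a production job would write through a partition-level batch writer instead.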
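
A minimal sketch of the column-hashing approach mentioned above, assuming SHA-256 via Spark's built-in sha2 function; the S3 paths and column list are hypothetical:

# Hash client-specified columns before publishing the data set (names are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.appName("hash-pii").getOrCreate()

raw = spark.read.parquet("s3://my-bucket/raw/customers/")

# Columns the client asked to protect; each is replaced with its SHA-256 digest.
pii_columns = ["ssn", "email", "phone"]
hashed = raw
for c in pii_columns:
    hashed = hashed.withColumn(c, sha2(col(c).cast("string"), 256))

hashed.write.mode("overwrite").parquet("s3://my-bucket/secure/customers/")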
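
A hedged sketch of an S3-triggered Lambda handler of the kind described above; the processing step here is only a placeholder, and the bucket/object names come from the event itself:

# Minimal S3-event Lambda handler sketch; real processing logic would replace the print.
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each record describes one object that landed in the bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3.head_object(Bucket=bucket, Key=key)
        print(f"Received {key} ({head['ContentLength']} bytes) in {bucket}")
    return {"statusCode": 200, "body": json.dumps("processed")}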

Environment: Apache Spark, Kafka, Scala, AWS, EC2, Redshift, Lambda, DynamoDB, S3 buckets, CloudWatch, Pig, Impala, Python, Pandas, PySpark, Star, Snowflake, PL/SQL, Tableau, Oracle 12c, SQL Server, Spark SQL, OpenShift, PostgreSQL, Talend, MongoDB, HBase, Cassandra, Zookeeper, Oozie.

Confidential, Westlake, TX

Data Engineer

Responsibilities:

  • Achieved near-real-time reporting by adopting an event-based processing approach instead of micro-batching to deal with data coming from Kafka.
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
  • Implemented Spark solutions to generate reports and to fetch and load data in Hive.
  • Implemented Spark using Scala and Python, utilizing DataFrames and the Spark SQL API for faster data processing.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats like text, JSON, and Parquet. Involved in loading data from the Linux file system to HDFS.
  • Implemented Copy activity, Custom Azure Data Factory Pipeline Activities.
  • Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
  • Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
  • Built a KPI calculator sheet and maintained it within SharePoint.
  • Primarily involved in data migration using SQL, Azure SQL, Azure Storage, and Azure Data Factory.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data in both directions between sources like Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
  • Extracted the raw data from Microsoft Dynamics CRM to staging tables using Informatica Cloud.
  • Tuned existing Informatica mappings to maximize performance and reduce run times.
  • Data integration: ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Migrated on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Created and published multiple dashboards and reports using Tableau Server, and worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
  • Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
  • Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB); an illustrative load sketch follows this list.
  • Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms (see the sketch after this list).
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
  • Developed visualizations and dashboards using Power BI.
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures to extract data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
  • Worked on analyzing Hadoop Cluster and different big data analytic tools including Pig, Hive.
  • Performed all necessary day-to-day Git support for different projects, and was responsible for the design and maintenance of the Git repositories and access control strategies.
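
An illustrative Databricks-style load of the sort referenced above, reading curated Parquet from ADLS Gen2 and appending it to an Azure SQL DW table over JDBC; the storage account, container, server, table, and credentials are placeholders:

# Sketch: ADLS Gen2 -> Azure SQL DW over JDBC; assumes the SQL Server JDBC driver
# is available on the cluster. All names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-to-sqldw").getOrCreate()

source_path = "abfss://curated@mystorageacct.dfs.core.windows.net/sales/2021/"
df = spark.read.parquet(source_path)

jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=mydw;encrypt=true")

(df.write
 .format("jdbc")
 .option("url", jdbc_url)
 .option("dbtable", "dbo.sales_curated")
 .option("user", "etl_user")            # credentials would come from a secret scope
 .option("password", "<from-key-vault>")
 .mode("append")
 .save())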
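
A minimal sketch of the imbalanced-fraud handling mentioned above, using random undersampling plus a cost-sensitive ensemble with scikit-learn; the input file and column names are hypothetical:

# Undersample the majority class, then train a class-weighted random forest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")            # hypothetical extract
majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Undersample the majority class to roughly a 5:1 ratio before training.
balanced = pd.concat([majority.sample(len(minority) * 5, random_state=42), minority])

X = balanced.drop(columns=["is_fraud"])
y = balanced["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight keeps the remaining imbalance cost-sensitive.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))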

Environment: Hadoop, Hive, Pig, Spark, Zookeeper, Kafka, Flume, Impala, Sqoop, Informatica, Azure, Azure Data Factory, Azure Databricks, HDInsight, Azure Data Lake, Power BI, PL/SQL, Oracle 11g, SQL Server, DB2, MongoDB, Python, YARN, Git.

Confidential, Los Angeles, CA

Data Engineer

Responsibilities:

  • Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Exported data into Snowflake by creating staging tables to load data files of different types from Amazon S3 (illustrative sketch after this list).
  • Compared data at the leaf level across various databases whenever data transformation or data loading took place.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
  • Monitored daily, weekly, and monthly jobs and provided support in case of failures or issues.
  • Worked on analyzing Hadoop Cluster and different big data analytic tools.
  • Worked with NiFi and various HDFS file formats like Avro, SequenceFile, and JSON, as well as compression formats like Snappy and bzip2.
  • Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Developed Spark scripts using Scala shell commands as per requirements.
  • Imported data from Cassandra databases and stored it in AWS.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs (see the sketch after this list).
  • Used the AWS CLI for data transfers to and from Amazon S3 buckets.
  • Executed Hadoop/Spark jobs on AWS EMR, with data stored in S3 buckets.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Implemented Spark RDD transformations and actions to implement business analysis.
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts.
  • Used Spark Streaming APIs to perform necessary transformations and actions on data received from Kafka.
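
An illustrative example of the Hive-to-Spark conversion work noted above, rewriting a simple HiveQL aggregate as DataFrame transformations; the table and columns are hypothetical:

# Sketch: a Hive query rewritten as Spark DataFrame transformations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()
         .getOrCreate())

# Original HiveQL (for reference):
#   SELECT region, SUM(amount) AS total
#   FROM sales WHERE sale_date >= '2020-01-01'
#   GROUP BY region ORDER BY total DESC;
sales = spark.table("sales")
totals = (sales
          .filter(F.col("sale_date") >= "2020-01-01")
          .groupBy("region")
          .agg(F.sum("amount").alias("total"))
          .orderBy(F.col("total").desc()))
totals.show()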
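
A hedged sketch of the S3-to-Snowflake staging load described above, assuming an external stage already points at the bucket; the connection details, stage, and table names are placeholders:

# Load S3 files into a Snowflake staging table via COPY INTO (all names hypothetical).
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount", user="etl_user", password="***",
    warehouse="LOAD_WH", database="ANALYTICS", schema="STAGING",
)
try:
    cur = conn.cursor()
    # Land the raw files into a staging table; downstream jobs merge from here.
    cur.execute("""
        COPY INTO STAGING.LEGACY_ORDERS
        FROM @legacy_s3_stage/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
finally:
    conn.close()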

Environment: HDFS, MapReduce, Snowflake, Pig, Hive, Kafka, Spark, PL/SQL, AWS, S3 buckets, Scala, SQL Server, Cassandra, Oozie.

Confidential

Data & Reporting Analyst

Responsibilities:

  • Researched and recommended suitable technology stack for Hadoop migration considering current enterprise architecture.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Selected and generated data into csv files and stored them into AWS S3 by using AWS EC2 and then structured and stored in AWS Redshift.
  • Collected and aggregated large amounts of log data and staged it in HDFS for further analysis (illustrative sketch after this list).
  • Managed and reviewed Hadoop log files.
  • Used Sqoop to transfer data between relational databases and Hadoop.
  • Worked on HDFS to store and access huge datasets within Hadoop.
  • Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Queried both managed and external tables created by Hive using Impala. Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
  • Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive, and then loaded the transformed data back into HDFS.
  • Good hands-on experience with GitHub.
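
A minimal PySpark sketch of the log collection and aggregation described above; the HDFS paths and the assumed log line layout are hypothetical:

# Aggregate staged application logs in HDFS into a daily error count.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-aggregation").getOrCreate()

logs = spark.read.text("hdfs:///staging/app_logs/2019-06-*")

# Assumes lines like: "2019-06-01 10:15:04 ERROR payment timeout"
parsed = logs.select(
    F.split("value", " ").getItem(0).alias("log_date"),
    F.split("value", " ").getItem(2).alias("level"),
)

daily_errors = (parsed.filter(F.col("level") == "ERROR")
                .groupBy("log_date")
                .count()
                .orderBy("log_date"))
daily_errors.write.mode("overwrite").csv("hdfs:///reports/daily_errors")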

Environment: Hive, Python, HDFS, Tableau, HBase, MySQL, Impala, AWS, Redshift, GitHub.

Confidential

Data Analyst

Responsibilities:

  • Extracted, manipulated, and analyzed data and created reports using T-SQL.
  • Set up pivot tables in Excel to create multiple reports based on data from a SQL query.
  • Involved in requirement gathering, analysis, documentation, follow-ups, reporting and coordination between the business owners and technical team.
  • Developed stored procedures and SQL scripts for performing automation.
  • Validated the data by using SQL queries extensively.
  • Worked on an ETL process to clean and load large data extracts from several websites (JSON/CSV files) into SQL Server (see the sketch after this list).
  • Performed Data Profiling, Data pipelining, and Data Mining, validating and analyzing data (Exploratory analysis / Statistical analysis) and generating reports.
  • Used Microsoft SSIS and Informatica for extracting, transforming, and loading (ETL) data from spreadsheets, database tables, and other sources.
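
An illustrative version of the website-extract ETL mentioned above, cleaning JSON/CSV files with pandas and loading them into SQL Server via SQLAlchemy; the file names, target table, and connection string are placeholders:

# Clean scraped JSON/CSV extracts and append them to a SQL Server table.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://etl_user:***@dbserver/Reporting?driver=ODBC+Driver+17+for+SQL+Server"
)

frames = [pd.read_json("listings.json"), pd.read_csv("listings.csv")]
df = pd.concat(frames, ignore_index=True)

# Basic cleaning before load: drop duplicates and normalize column names.
df = df.drop_duplicates()
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

df.to_sql("web_listings", engine, schema="dbo", if_exists="append", index=False)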

Environment: T-SQL, MS Excel, MS SQL Server, PowerPoint, Microsoft SSIS, Informatica.
