Data Engineer Resume
New York, NY
SUMMARY
- 8+ years of IT experience across a variety of industries working with Big Data technologies, including the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
- Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Adept at configuring and installing Hadoop/Spark Ecosystem Components.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN (a minimal PySpark sketch follows this summary).
- Experience integrating various data sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing such as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
- Hands-on experience with Hadoop architecture and various components such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, DataNode, and Hadoop MapReduce programming.
- Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering, and data aggregation. Also possess detailed knowledge of the MapReduce framework.
- Used IDEs such as Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
- Seasoned practice in Machine Learning algorithms and Predictive Modeling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
- Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
- Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
- Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
- Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
- Capable of processing large sets (Gigabytes) of structured, semi-structured or unstructured data.
- Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
- Experience working with GitHub/Git 2.12 source and version control systems.
- Strong in core Java concepts including Object-Oriented Design (OOD) and Java components like Collections Framework, Exception handling, I/O system.
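A minimal PySpark sketch of the Spark SQL / DataFrame transformation work summarized above; the paths, table names, and column names are illustrative assumptions, not taken from any specific engagement.

```python
# Illustrative PySpark sketch: read raw records, cleanse, aggregate with Spark SQL.
# All paths, view names, and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

# Ingest semi-structured data (e.g. JSON landed on HDFS) into a DataFrame.
raw = spark.read.json("hdfs:///data/raw/transactions/")

# Basic cleansing: drop malformed rows and derive a date column from the timestamp.
clean = (raw
         .dropna(subset=["customer_id", "amount"])
         .withColumn("txn_date", F.to_date("txn_ts")))

# Register as a temp view so the aggregation can be expressed in Spark SQL.
clean.createOrReplaceTempView("transactions")
daily = spark.sql("""
    SELECT txn_date, customer_id, SUM(amount) AS daily_amount
    FROM transactions
    GROUP BY txn_date, customer_id
""")

# Persist the aggregate back to HDFS, partitioned by date for efficient reads.
daily.write.mode("overwrite").partitionBy("txn_date").parquet("hdfs:///data/curated/daily_sales/")
```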
TECHNICAL SKILLS
Hadoop/Big Data Technologies: HDFS, Hive, Pig, Sqoop, Yarn, Spark, Spark SQL, Kafka
Hadoop Distributions: Hortonworks and Cloudera
Languages: C, C++, Python, Scala, UNIX Shell Script, COBOL, SQL and PL/SQL
Tools: Teradata SQL Assistant, PyCharm, Autosys
Operating Systems: Linux, UNIX, z/OS and Windows
Databases: Teradata, Oracle 9i/10g, DB2, SQL Server, MySQL 4.x/5.x
ETL Tools: IBM InfoSphere Information Server V8, V8.5 & V9.1
Reporting: Tableau
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, New York, NY
Responsibilities:
- Analyzed and cleansed raw data using HiveQL.
- Performed data transformations using MapReduce and Hive for different file formats.
- Involved in converting Hive/SQL queries into transformations using Python.
- Performed complex joins on Hive tables with various optimization techniques.
- Created internal and external Hive tables per requirements, defined with appropriate static and dynamic partitions for efficiency (see the sketch following this role).
- Worked extensively with Hive DDLs and the Hive Query Language (HQL).
- Involved in loading data from edge node to HDFS using shell scripting.
- Understood and managed Hadoop log files.
- Managed Hadoop infrastructure with Cloudera Manager.
- Created and maintained technical documentation for launching Hadoop cluster and for executing Hive queries.
- Built integrations between applications, primarily Salesforce.
- Extensive work in Informatica Cloud.
- Expertise in Informatica Cloud apps: Data Synchronization (DS), Data Replication (DR), Task Flows, Mapping Configurations, and real-time apps such as Process Designer and Process Developer.
- Worked extensively with flat files, loading them into on-premise applications and retrieving data from applications into files.
- Developed Informatica Cloud Real Time (ICRT) processes.
- Worked with WSDL and SoapUI for APIs.
- Wrote SOQL queries and created test data in Salesforce for unit testing of Informatica Cloud mappings.
- Prepared TDDs and test case documents after each process was developed.
- Identified and validated data between source and target applications.
- Verified data consistency between systems.
Technologies Used: Big Data ecosystem, Hadoop, HDFS, Hive, Pig, Cloudera, MapReduce, Python, Informatica Cloud Services, Salesforce, UNIX scripts, flat files, XML files
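A minimal sketch of the dynamically partitioned Hive table work described in this role, expressed through PySpark's Hive support; the database, table, column names, and HDFS paths are illustrative assumptions rather than project specifics.

```python
# Illustrative sketch: create a dynamically partitioned external Hive table and
# load cleansed data into it. Names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioned-load")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partitions so partition values come from the data itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# External table keeps data at a managed HDFS location, partitioned by load date.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.orders (
        order_id    BIGINT,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/warehouse/staging/orders'
""")

# Insert cleansed records (raw.orders_cleansed is assumed to exist),
# letting Hive derive the load_dt partition dynamically from the data.
spark.sql("""
    INSERT OVERWRITE TABLE staging.orders PARTITION (load_dt)
    SELECT order_id, customer_id, amount, load_dt
    FROM raw.orders_cleansed
""")
```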
Data Engineer
Confidential, Cincinnati, OH
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (a minimal DAG sketch follows this role)
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
- Strong understanding of AWS components such as EC2 and S3
- Performed Data Migration to GCP
- Responsible for data services and data movement infrastructures
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
- Loaded application analytics data into data warehouse in regular intervals of time
- Designed & build infrastructure for the Google Cloud environment from scratch
- Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
- Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP
- Worked with Confluence and Jira
- Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Compiled data from various sources to perform complex analysis for actionable results
- Measured efficiency of the Hadoop/Hive environment, ensuring SLAs were met
- Optimized the TensorFlow model for efficiency
- Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes
- Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS
- Built performant, scalable ETL processes to load, cleanse and validate data
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
- Collaborated with team members and stakeholders in the design and development of the data environment
- Prepared associated documentation for specifications, requirements, and testing
Environment: AWS, GCP, BigQuery, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, PostgreSQL, SQL Server, Python, Scala, Spark, Hive, Spark SQL.
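A minimal Airflow DAG sketch of the kind of scheduled ETL pipeline described in this role; the DAG id, task callables, schedule, and default arguments are illustrative placeholders.

```python
# Illustrative Airflow DAG: a daily extract -> transform -> load pipeline.
# Callables, schedule, and owner are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # Pull the day's records from the source system (placeholder).
    print("extracting source data")

def transform(**context):
    # Apply cleansing and business rules (placeholder).
    print("transforming data")

def load(**context):
    # Write the curated data to the warehouse (placeholder).
    print("loading into warehouse")

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies define the DAG: extract must finish before transform, then load.
    t_extract >> t_transform >> t_load
```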
Data Engineer
Confidential, Chevy Chase, Maryland
Responsibilities:
- Implemented reporting Data Warehouse with online transaction system data.
- Developed and maintained data warehouse for PSN project.
- Provided reports and publications to Third Parties for Royalty payments.
- Managed user account, group, and workspace creation for different users in PowerCenter.
- Wrote complex UNIX/Windows scripts for file transfers and emailing tasks over FTP/SFTP (an illustrative Python sketch follows this role).
- Worked with PL/SQL procedures and used them in Stored Procedure Transformations.
- Extensively worked on Oracle and SQL Server; wrote complex SQL queries against the ERP system for data analysis purposes.
- Worked on the most critical finance projects and served as the go-to person for team members on any data-related issues.
- Migrated ETL code from Talend to Informatica. Involved in development, testing and post production for the entire migration project.
- Tuned ETL jobs in the new environment after fully understanding the existing code.
- Maintained Talend admin console and provided quick assistance on production jobs.
- Involved in designing Business Objects universes and creating reports.
- Built ad hoc reports using stand-alone tables.
- Involved in creating and modifying new and existing Web Intelligence reports.
- Created publications that split into separate reports per vendor.
- Wrote Custom SQL for some complex reports.
- Worked with internal and external business partners during requirements gathering.
- Worked closely with Business Analyst and report developers in writing the source to target specifications for Data warehouse tables based on the business requirement needs.
- Exported data into Excel for business meetings, which made discussions easier while reviewing the data.
- Performed analysis after requirements gathering and walked team through major impacts.
- Provided and debugged crucial reports for finance teams during month end period.
- Addressed issues reported by business users in standard reports by identifying the root cause.
- Resolved reporting issues by determining whether they were report-related or source-related.
- Created ad hoc reports per users' needs.
- Investigated and analyzed any discrepancies found in the data and resolved them.
Technologies Used: Informatica PowerCenter 9.1/9.0, Talend 4.x & Integration Suite, Business Objects XI, Oracle 10g/11g, Oracle ERP, EDI, SQL Server 2005, UNIX, Windows Scripting, JIRA
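The UNIX/Windows scripts in this role handled file transfers and email notifications over FTP/SFTP; the sketch below renders the same transfer-and-notify pattern in Python (using paramiko and smtplib) purely for illustration. Host names, paths, credentials, and addresses are placeholders, not details from the project.

```python
# Illustrative Python rendering of the transfer-and-notify pattern the
# UNIX/Windows scripts implemented. Hosts, paths, and addresses are placeholders.
import logging
import smtplib
from email.message import EmailMessage

import paramiko

logging.basicConfig(filename="transfer.log", level=logging.INFO)

def notify(subject, body):
    """Send a plain-text notification email (relay host is a placeholder)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "etl@example.com"
    msg["To"] = "finance-team@example.com"
    msg.set_content(body)
    with smtplib.SMTP("mailrelay.example.com") as smtp:
        smtp.send_message(msg)

def sftp_put(host, user, key_path, local_file, remote_file):
    """Upload one file over SFTP, logging success and emailing on failure."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(hostname=host, username=user, key_filename=key_path)
        sftp = client.open_sftp()
        sftp.put(local_file, remote_file)
        logging.info("transferred %s to %s:%s", local_file, host, remote_file)
    except Exception:
        logging.exception("transfer of %s failed", local_file)
        notify("SFTP transfer failed", f"Could not deliver {local_file} to {host}")
        raise
    finally:
        client.close()

if __name__ == "__main__":
    sftp_put("sftp.example.com", "etluser", "/home/etl/.ssh/id_rsa",
             "/data/out/royalty_report.csv", "/incoming/royalty_report.csv")
```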
Spark Developer
Confidential
Responsibilities:
- Imported required modules such as Keras and NumPy in the Spark session, and created directories for data and output.
- Read training and test data into the data directory as well as into Spark variables for easy access, and trained the model following a sample submission.
- Stored all images as NumPy arrays for easier data manipulation and display.
- Created a validation set using Keras2DML to test whether the trained model was working as intended.
- Defined multiple helper functions used while running the neural network in a session, along with placeholders and the number of neurons in each layer.
- Created the neural network's computational graph after defining weights and biases (a minimal sketch follows this role).
- Created a TensorFlow session used to run the neural network and validate the model's accuracy on the validation set.
- After executing the program and achieving acceptable validation accuracy, created a submission stored in the submission directory.
- Executed multiple Spark SQL queries after building the database to gather specific data corresponding to an image.
Environment: Scala, Python, PySpark, Spark, Spark MLlib, Spark SQL, TensorFlow, NumPy, Keras, Power BI
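A minimal sketch of the training-and-validation workflow described in this role. The project used low-level TensorFlow placeholders, sessions, and Keras2DML; this sketch substitutes the higher-level Keras Sequential API as a simplified stand-in, and the file names, layer sizes, and image shapes are illustrative assumptions.

```python
# Minimal sketch of the described workflow: images held as flattened NumPy arrays,
# a small dense network trained with a held-out validation split, then a submission.
# Shapes, layer sizes, and file names are illustrative, not from the project.
import numpy as np
from tensorflow import keras

# Images stored as flattened NumPy arrays (e.g. 28x28 grayscale -> 784 features).
x_train = np.load("data/train_images.npy").astype("float32") / 255.0
y_train = np.load("data/train_labels.npy")

model = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),    # hidden layer; neuron count is arbitrary here
    keras.layers.Dense(10, activation="softmax"),  # one output per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hold out 20% of the training data as a validation set to confirm the trained
# model behaves as intended before producing a submission.
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

# Score the test images and write the submission file to the submission directory.
x_test = np.load("data/test_images.npy").astype("float32") / 255.0
preds = model.predict(x_test).argmax(axis=1)
np.savetxt("submission/predictions.csv", preds, fmt="%d")
```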
ETL/Data Warehouse Developer
Confidential
Responsibilities:
- Gathered requirements from Business and documented for project development.
- Coordinated design reviews, ETL code reviews with teammates.
- Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
- Extensively worked with Informatica transformations.
- Created data maps in Informatica to extract data from sequential files.
- Extensively worked on UNIX Shell Scripting for file transfer and error logging.
- Scheduled processes in ESP Job Scheduler.
- Performed Unit, Integration and System testing of various jobs.
Technologies Used: Informatica PowerCenter 8.6, Oracle 10g, SQL Server 2005, UNIX Shell Scripting, ESP job scheduler